[jira] Commented: (TIKA-7) Lius Lite remove all lucene dependencies from Lius and use Nutch office parsers

Chris A. Mattmann (JIRA) Wed, 13 Jun 2007 07:06:49 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-7?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12504246
 ]


Chris A. Mattmann commented on TIKA-7:
--------------------------------------

Jukka,

 Thanks for taking the lead on this. I think it's important to note that 
patches to Tika should follow the kind of standard laid out here, e.g., use the 
org.apache.tika namespace, and make sure that unit tests and resources are 
placed in the right locations. I am traveling right now, but I will 
take a look at this patch as soon as I get back to Los Angeles.

 One question I have is whether we have standardized on the following issues 
(I know they were discussed at the ApacheCon BoF, as I've seen conversation 
about it on the dev list; however, I wasn't there :) ):

1. standardization of Parser interface?
2. control flow of Tika parsers (e.g., similar to Bertrand's email 
http://www.nabble.com/-RT--Tika-framework-usage-scenario-tf3913308.html)
3. major features that we want for 0.1 release

 I think that these questions need to be answered before we move forward with 
more code development. I realize that I've been out of the loop for a while; 
however, I now have some time to get back into it :) So, let's discuss. Here 
are my proposals for issues 1-3 above:

1. I like Bertrand's idea of a pipeline-based Tika framework. I think that the 
"ContentFilter" that he proposes is essentially this Parser interface that we 
are talking about. Immediate questions that come to mind are:
  a. Could the ContentFilter be run in single filter mode, e.g., from the 
command line? I think that a use case for Tika should be that all parsers are 
executable in some fashion (even if only for testing) from the command line. 
The parsed content should be returned as some form of a Metadata object, in 
which the user can inspect the parsed information. Perhaps other information 
should be returned as well, but that's what I thought of off the top of my head.
  b. Would this pipeline model still support the use cases for Nutch and the 
other initial projects that we were targeting as customers of Tika? Nutch's 
parse plugins currently each parse a single content type; however, I think 
they could still be handled by this pipeline framework. I just want to get 
everyone else's opinion on it.
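
To make the questions in 1a and 1b concrete, here is a minimal sketch of what a 
pipeline of filters that can also run in single filter mode from the command 
line might look like. Nothing here is an agreed Tika API: the ContentFilter and 
Metadata types, the run() helper, and the two sample filters are all 
hypothetical names made up for illustration, and the sketch uses current Java 
syntax rather than Java 1.5.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PipelineSketch {

    /** Hypothetical Metadata object: each name maps to one or more values. */
    public static final class Metadata {
        private final Map<String, List<String>> values = new LinkedHashMap<>();

        public void add(String name, String value) {
            values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }

        public List<String> get(String name) {
            List<String> found = values.get(name);
            return found == null ? Collections.<String>emptyList() : found;
        }

        @Override
        public String toString() {
            return values.toString();
        }
    }

    /** Hypothetical pipeline stage; it records what it finds in the Metadata. */
    public interface ContentFilter {
        void filter(String content, Metadata metadata);
    }

    /** Sample filter (assumption): treats the first line as the title. */
    public static final ContentFilter TITLE =
            (content, metadata) -> metadata.add("title", content.split("\\R", 2)[0].trim());

    /** Sample filter (assumption): records the character count. */
    public static final ContentFilter LENGTH =
            (content, metadata) -> metadata.add("length", String.valueOf(content.length()));

    /** Runs the filters in order; a one-element list is "single filter mode". */
    public static Metadata run(String content, List<ContentFilter> filters) {
        Metadata metadata = new Metadata();
        for (ContentFilter filter : filters) {
            filter.filter(content, metadata);
        }
        return metadata;
    }

    public static void main(String[] args) {
        String content = "Hello Tika\nbody text";
        // Passing "title" as the first argument runs only that one filter.
        boolean single = args.length > 0 && args[0].equals("title");
        List<ContentFilter> filters =
                single ? Arrays.asList(TITLE) : Arrays.asList(TITLE, LENGTH);
        System.out.println(run(content, filters));
    }
}
```

Under these assumptions, a Nutch-style single content parser would simply be a 
pipeline of length one, which is why I think 1b could work out.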

2. See my questions in #1 above
3. I think that we should plan to have the following features in the 0.1 
release of Tika:
   a. Basic parsing capability, +1 for using pipelining, but we need to 
standardize the interfaces for those/talk about architecture
   b. Content Type identification (e.g., MimeType identification)
   c. Basic metadata extraction capabilities
   d. Parsing for a limited set of known content types, e.g., HTML and PDF
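
As a rough illustration of item 3b for the two types named in 3d, content type 
identification could start from the leading bytes of a document. This is only 
a sketch under my own assumptions: the TypeSniffer class and its detect() 
method are hypothetical, and a real detector would also need to consider file 
extensions, declared types, and many more signatures.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class TypeSniffer {

    /**
     * Guesses a MIME type from the first bytes of a document.
     * Hypothetical helper, not an existing Tika or Nutch API.
     */
    public static String detect(byte[] head) {
        // Only the first 512 bytes are inspected; ISO-8859-1 maps bytes 1:1 to chars.
        int length = Math.min(head.length, 512);
        String start = new String(head, 0, length, StandardCharsets.ISO_8859_1);

        // PDF files begin with the "%PDF-" marker.
        if (start.startsWith("%PDF-")) {
            return "application/pdf";
        }

        // Very rough HTML check: leading whitespace is skipped, case is ignored.
        String lower = start.trim().toLowerCase(Locale.ROOT);
        if (lower.startsWith("<!doctype html") || lower.startsWith("<html")) {
            return "text/html";
        }

        // Fallback for anything this sketch does not recognize.
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.4\n".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(detect(sample));
    }
}
```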


What does everyone else think?

> Lius Lite remove all lucene dependencies from Lius  and use Nutch office 
> parsers
> --------------------------------------------------------------------------------
>
>                 Key: TIKA-7
>                 URL: https://issues.apache.org/jira/browse/TIKA-7
>             Project: Tika
>          Issue Type: New Feature
>          Components: general
>         Environment: Java 1.5
>            Reporter: Rida Benjelloun
>         Attachments: liuslite.patch, liusLite.zip
>
>
> Hi,
> This is a work in progress of Lius. The release removes all Lucene 
> dependencies and uses the Nutch Office parsers because they are based on 
> Apache POI.
> Lius Lite offers 4 ways of extracting content:
> - Document fulltext extraction
> - XPath extraction
> - Regex extraction
> - Document metadata extraction (not implemented for all parsers)
> Lius Lite uses an XML config file to configure the parsers and the 
> information to extract. Please see config.xml in the config folder.
> See also the JUnit tests.
> Here is an example of XML parsing:
> 1- XML Config
>               <parser name="text-xml" class="liuslite.parser.xml.XMLParser">
>                               <namespace>http://purl.org/dc/elements/1.1/</namespace>
>                               <mime>application/xml</mime>
>                               <extract>
>                                       <content name="title" xpathSelect="//dc:title"/>
>                                       <content name="subject" xpathSelect="//dc:subject"/>
>                                       <content name="creator" xpathSelect="//dc:creator"/>
>                                       <content name="description" xpathSelect="//dc:description"/>
>                                       <content name="publisher" xpathSelect="//dc:publisher"/>
>                                       <content name="contributor" xpathSelect="//dc:contributor"/>
>                                       <content name="type" xpathSelect="//dc:type"/>
>                                       <content name="format" xpathSelect="//dc:format"/>
>                                       <content name="identifier" xpathSelect="//dc:identifier"/>
>                                       <content name="language" xpathSelect="//dc:language"/>
>                                       <content name="rights" xpathSelect="//dc:rights"/>
>                                       <content name="outLinks">
>                                               <regexSelect>
>                                                       <![CDATA[
>                                                               
> ([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/](([A-Za-z0-9$_.+!*,;/?:@&~=-])|%[A-Fa-f0-9]{2}){1,333}(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@&~=%-]{0,1000}))?)
>                                                       ]]>
>                                               </regexSelect>
>                                       </content>
>                               </extract>                      
>               </parser>
> 2- XML Document
> <oaidc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" 
> xmlns:oaidc="http://www.openarchives.org/OAI/2.0/oai_dc/">
>       <dc:title>Archimède et Lius</dc:title>
>       <dc:creator>Rida Benjelloun</dc:creator>
>       <dc:subject>Java</dc:subject>
>       <dc:subject>XML</dc:subject>
>       <dc:subject>XSLT</dc:subject>
>       <dc:subject>JDOM</dc:subject>
>       <dc:subject>Indexation</dc:subject>
>       <dc:description>Framework d'indexation des documents XML, HTML, PDF 
> etc.. </dc:description>
>       <dc:publisher>Doculibre</dc:publisher>
>       <dc:identifier>http://www.apache.org</dc:identifier>
>       <dc:date>2000-12</dc:date>
>       <dc:type>test</dc:type>
>       <dc:format>application/msword</dc:format>
>       <dc:language>Fr</dc:language>
>       <dc:rights>Non restreint</dc:rights>    
> </oaidc:dc>
> 3- Java Code 
> LiusConfig lc = LiusConfig.getInstance(configPathString);
> LiusLogger.setLoggerConfigFile(log4jPathString);
> File testFile = new File("test.xml");
> try {
>     // Get a parser for the file, using the loaded configuration
>     Parser parser = ParserFactory.getParser(testFile, lc);
>     String fullText = parser.getContentStr();
>
>     // Single-valued content
>     Content title = parser.getContent("title");
>     String titleStr = title.getValue();
>
>     // Multi-valued content
>     Content subject = parser.getContent("subject");
>     String[] subjects = subject.getValues();
>     // etc ...
>
>     // Or, to get all configured contents at once:
>     List<Content> contents = parser.getContents();
> } catch (MimeInfoException e) {
>     e.printStackTrace();
> } catch (IOException e) {
>     e.printStackTrace();
> } catch (LiusException e) {
>     e.printStackTrace();
> }
> best regards
> Rida Benjelloun
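
For readers without the patch at hand, the XPath extraction described in the 
quoted config can be approximated with the standard javax.xml.xpath API. This 
is a sketch only and not how liuslite.parser.xml.XMLParser is implemented: the 
XPathExtractSketch class and its extract() method are hypothetical, and the 
"dc" prefix binding is hard-coded here instead of being read from the 
<namespace> element of the config.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathExtractSketch {

    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    /** Returns the text of every node matched by the expression (hypothetical helper). */
    public static List<String> extract(String xml, String expression) throws Exception {
        // The parser must be namespace-aware for prefixed XPath steps to match.
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind the "dc" prefix used in expressions such as //dc:title.
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "dc".equals(prefix) ? DC_NS : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String uri) {
                return DC_NS.equals(uri) ? "dc" : null;
            }
            public Iterator<String> getPrefixes(String uri) {
                String prefix = getPrefix(uri);
                return prefix == null
                        ? Collections.<String>emptyList().iterator()
                        : Collections.singletonList(prefix).iterator();
            }
        });

        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<oaidc:dc xmlns:dc=\"" + DC_NS + "\" "
                + "xmlns:oaidc=\"http://www.openarchives.org/OAI/2.0/oai_dc/\">"
                + "<dc:title>Archimède et Lius</dc:title>"
                + "<dc:subject>Java</dc:subject><dc:subject>XML</dc:subject>"
                + "</oaidc:dc>";
        System.out.println(extract(xml, "//dc:title"));
        System.out.println(extract(xml, "//dc:subject"));
    }
}
```

A multi-valued field such as "subject" comes back as several list entries, 
which matches the getValues() call in the quoted Java example.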

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
