[jira] Updated: (TIKA-7) Lius Lite remove all lucene dependencies from Lius and use Nutch office parsers

Rida Benjelloun (JIRA) Sat, 09 Jun 2007 22:30:49 -0700

     [ 
https://issues.apache.org/jira/browse/TIKA-7?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Rida Benjelloun updated TIKA-7:
-------------------------------

    Attachment: liusLite.zip

Lius Lite Source Code

> Lius Lite remove all lucene dependencies from Lius  and use Nutch office 
> parsers
> --------------------------------------------------------------------------------
>
>                 Key: TIKA-7
>                 URL: https://issues.apache.org/jira/browse/TIKA-7
>             Project: Tika
>          Issue Type: New Feature
>          Components: general
>         Environment: Java 1.5
>            Reporter: Rida Benjelloun
>         Attachments: liusLite.zip
>
>
> Hi,
> This is a work-in-progress version of Lius. This release removes all Lucene
> dependencies and uses the Nutch Office parsers, because they are based on
> Apache POI.
> Lius Lite offers four ways to extract content:
> - Document fulltext extraction
> - XPath extraction
> - Regex extraction
> - Document metadata extraction (not implemented for all parsers)
> Lius Lite uses an XML config file to configure the parsers and the
> information to extract. Please see config.xml in the config folder.
> See also the JUnit tests.
> Here is an example of XML parsing:
> 1- XML Config
> <parser name="text-xml" class="liuslite.parser.xml.XMLParser">
>   <namespace>http://purl.org/dc/elements/1.1/</namespace>
>   <mime>application/xml</mime>
>   <extract>
>     <content name="title" xpathSelect="//dc:title"/>
>     <content name="subject" xpathSelect="//dc:subject"/>
>     <content name="creator" xpathSelect="//dc:creator"/>
>     <content name="description" xpathSelect="//dc:description"/>
>     <content name="publisher" xpathSelect="//dc:publisher"/>
>     <content name="contributor" xpathSelect="//dc:contributor"/>
>     <content name="type" xpathSelect="//dc:type"/>
>     <content name="format" xpathSelect="//dc:format"/>
>     <content name="identifier" xpathSelect="//dc:identifier"/>
>     <content name="language" xpathSelect="//dc:language"/>
>     <content name="rights" xpathSelect="//dc:rights"/>
>     <content name="outLinks">
>       <regexSelect>
>         <![CDATA[
>           ([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/](([A-Za-z0-9$_.+!*,;/?:@&~=-])|%[A-Fa-f0-9]{2}){1,333}(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@&~=%-]{0,1000}))?)
>         ]]>
>       </regexSelect>
>     </content>
>   </extract>
> </parser>
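
The outLinks rule in the config above selects content with a regular expression instead of an XPath. As a rough illustration of what such a rule does, here is a minimal sketch that applies the same pattern with the JDK's java.util.regex classes. The class name OutLinkSketch and the method extractLinks are hypothetical names chosen for this example; they are not part of Lius Lite, and this is not necessarily how Lius Lite applies the pattern internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a Lius Lite class.
class OutLinkSketch {
    // Same pattern as the outLinks regexSelect entry in the config.
    static final Pattern LINK = Pattern.compile(
        "([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/]"
        + "(([A-Za-z0-9$_.+!*,;/?:@&~=-])|%[A-Fa-f0-9]{2}){1,333}"
        + "(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@&~=%-]{0,1000}))?)");

    /** Returns every substring of text that matches the link pattern. */
    static List<String> extractLinks(String text) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(text);
        while (m.find()) {
            links.add(m.group(1)); // group 1 is the whole link
        }
        return links;
    }

    public static void main(String[] args) {
        for (String link : extractLinks("See http://www.apache.org for details")) {
            System.out.println(link);
        }
    }
}
```

Note that the pattern accepts trailing punctuation such as "." or "," as part of a link, so text like "see http://www.apache.org." would yield a link ending in a period.
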
> 2- XML Document
> <oaidc:dc xmlns:dc="http://purl.org/dc/elements/1.1/"
>           xmlns:oaidc="http://www.openarchives.org/OAI/2.0/oai_dc/">
>       <dc:title>Archimède et Lius</dc:title>
>       <dc:creator>Rida Benjelloun</dc:creator>
>       <dc:subject>Java</dc:subject>
>       <dc:subject>XML</dc:subject>
>       <dc:subject>XSLT</dc:subject>
>       <dc:subject>JDOM</dc:subject>
>       <dc:subject>Indexation</dc:subject>
>       <dc:description>Framework d'indexation des documents XML, HTML, PDF etc.. </dc:description>
>       <dc:publisher>Doculibre</dc:publisher>
>       <dc:identifier>http://www.apache.org</dc:identifier>
>       <dc:date>2000-12</dc:date>
>       <dc:type>test</dc:type>
>       <dc:format>application/msword</dc:format>
>       <dc:language>Fr</dc:language>
>       <dc:rights>Non restreint</dc:rights>    
> </oaidc:dc>
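
The xpathSelect entries in the config use the dc prefix, so whatever evaluates them has to bind that prefix to the Dublin Core namespace given in the namespace element. The following is a minimal sketch of that idea using only the JDK's javax.xml.xpath API. The class DcXPathSketch and its select method are hypothetical names for this example; Lius Lite's own XMLParser may work quite differently.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not a Lius Lite class.
class DcXPathSketch {
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    /** Returns the text of every node selected by the XPath expression. */
    static List<String> select(String xml, String expression) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // needed for prefixed names such as dc:title
            Document doc = dbf.newDocumentBuilder()
                              .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Bind the "dc" prefix used in the expressions to the Dublin Core namespace.
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    return "dc".equals(prefix) ? DC_NS : XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) {
                    return DC_NS.equals(uri) ? "dc" : null;
                }
                public Iterator<String> getPrefixes(String uri) {
                    List<String> prefixes = new ArrayList<>();
                    if (DC_NS.equals(uri)) prefixes.add("dc");
                    return prefixes.iterator();
                }
            });

            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("XPath extraction failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<oaidc:dc xmlns:dc=\"" + DC_NS + "\" "
            + "xmlns:oaidc=\"http://www.openarchives.org/OAI/2.0/oai_dc/\">"
            + "<dc:subject>Java</dc:subject><dc:subject>XML</dc:subject></oaidc:dc>";
        System.out.println(select(xml, "//dc:subject"));
    }
}
```

A repeated element such as dc:subject comes back as several values, while an expression that matches nothing returns an empty list.
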
> 3- Java Code 
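> // Load the parser configuration and the log4j settings.
> LiusConfig lc = LiusConfig.getInstance(configPathString);
> LiusLogger.setLoggerConfigFile(log4jPathString);
> File testFile = new File("test.xml");
> try {
>     // Obtain a parser for the file using the configuration.
>     Parser parser = ParserFactory.getParser(testFile, lc);
>     // Full text of the document.
>     String fullText = parser.getContentStr();
>
>     // A single-valued field configured in config.xml.
>     Content title = parser.getContent("title");
>     String titleStr = title.getValue();
>
>     // A field that can hold several values.
>     Content subject = parser.getContent("subject");
>     String[] subjects = subject.getValues();
>     // ... and so on for the other configured fields.
>
>     // Alternatively, retrieve all configured fields at once:
>     List<Content> contents = parser.getContents();
> } catch (MimeInfoException e) {
>     e.printStackTrace();
> } catch (IOException e) {
>     e.printStackTrace();
> } catch (LiusException e) {
>     e.printStackTrace();
> }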
> Best regards,
> Rida Benjelloun

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
