[ https://issues.apache.org/jira/browse/TIKA-7?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12504153 ]
Jukka Zitting commented on TIKA-7:
----------------------------------
Looks good! Some comments:
* Could you move the code to the org.apache.tika.* namespace?
* I don't think we need binaries at this point, so you could just upload the
sources.
* Do you have a build script (Ant, Maven, etc.) for the code? I guess we should
integrate the build with the current Maven 2 setup in Tika.
* (personal wish) Would it be possible to use spaces for indentation?
> Lius Lite: remove all Lucene dependencies from Lius and use Nutch office
> parsers
> --------------------------------------------------------------------------------
>
> Key: TIKA-7
> URL: https://issues.apache.org/jira/browse/TIKA-7
> Project: Tika
> Issue Type: New Feature
> Components: general
> Environment: Java 1.5
> Reporter: Rida Benjelloun
> Attachments: liusLite.zip
>
>
> Hi,
> This is a work-in-progress version of Lius. This release removes all Lucene
> dependencies and uses the Nutch Office parsers, because they are based on
> Apache POI.
> Lius Lite offers four ways to extract content:
> - Document fulltext extraction
> - XPath extraction
> - Regex extraction
> - Document metadata extraction (not implemented for all parsers)
> Lius Lite uses an XML config file to configure the parsers and the information
> to extract. Please see config.xml in the config folder.
> See also the JUnit tests.
> Here is an example of XML parsing:
> 1- XML Config
> <parser name="text-xml" class="liuslite.parser.xml.XMLParser">
>
>   <namespace>http://purl.org/dc/elements/1.1/</namespace>
>   <mime>application/xml</mime>
>   <extract>
>     <content name="title" xpathSelect="//dc:title"/>
>     <content name="subject" xpathSelect="//dc:subject"/>
>     <content name="creator" xpathSelect="//dc:creator"/>
>     <content name="description" xpathSelect="//dc:description"/>
>     <content name="publisher" xpathSelect="//dc:publisher"/>
>     <content name="contributor" xpathSelect="//dc:contributor"/>
>     <content name="type" xpathSelect="//dc:type"/>
>     <content name="format" xpathSelect="//dc:format"/>
>     <content name="identifier" xpathSelect="//dc:identifier"/>
>     <content name="language" xpathSelect="//dc:language"/>
>     <content name="rights" xpathSelect="//dc:rights"/>
>     <content name="outLinks">
>       <regexSelect>
>         <![CDATA[
>
> ([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/](([A-Za-z0-9$_.+!*,;/?:@&~=-])|%[A-Fa-f0-9]{2}){1,333}(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@&~=%-]{0,1000}))?)
>         ]]>
>       </regexSelect>
>     </content>
>   </extract>
> </parser>
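For readers trying out the outLinks rule, here is a rough sketch of how that regexSelect pattern behaves when applied with plain java.util.regex. This is not the Lius Lite implementation: the class name OutLinksSketch and the method findLinks are made up for illustration, and it assumes the whitespace around the pattern inside the CDATA section is trimmed before compilation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lius Lite.
class OutLinksSketch {
    // Pattern copied from the regexSelect element in the config above,
    // split over three string literals only for line length.
    static final Pattern LINK = Pattern.compile(
        "([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/]"
        + "(([A-Za-z0-9$_.+!*,;/?:@&~=-])|%[A-Fa-f0-9]{2}){1,333}"
        + "(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@&~=%-]{0,1000}))?)");

    // Returns every substring of the text that matches the link pattern,
    // in order of appearance.
    static List<String> findLinks(String text) {
        List<String> links = new ArrayList<String>();
        Matcher matcher = LINK.matcher(text);
        while (matcher.find()) {
            links.add(matcher.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String text = "See http://www.apache.org and "
            + "https://issues.apache.org/jira/browse/TIKA-7 for details";
        for (String link : findLinks(text)) {
            System.out.println(link);
        }
    }
}
```

Note that the character class in the pattern includes '.' and ',', so punctuation directly after a URL in running text would be captured as part of the link.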
> 2- XML Document
> <oaidc:dc xmlns:dc="http://purl.org/dc/elements/1.1/"
> xmlns:oaidc="http://www.openarchives.org/OAI/2.0/oai_dc/">
> <dc:title>Archimède et Lius</dc:title>
> <dc:creator>Rida Benjelloun</dc:creator>
> <dc:subject>Java</dc:subject>
> <dc:subject>XML</dc:subject>
> <dc:subject>XSLT</dc:subject>
> <dc:subject>JDOM</dc:subject>
> <dc:subject>Indexation</dc:subject>
> <dc:description>Framework d'indexation des documents XML, HTML, PDF
> etc.. </dc:description>
> <dc:publisher>Doculibre</dc:publisher>
> <dc:identifier>http://www.apache.org</dc:identifier>
> <dc:date>2000-12</dc:date>
> <dc:type>test</dc:type>
> <dc:format>application/msword</dc:format>
> <dc:language>Fr</dc:language>
> <dc:rights>Non restreint</dc:rights>
> </oaidc:dc>
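To make the xpathSelect rules concrete, here is a minimal sketch of evaluating expressions such as //dc:subject against a document like the one above, using only the JDK's javax.xml.xpath API. It is an illustration under assumptions, not how Lius Lite does it internally: the class DcXPathSketch, its select method and the trimmed SAMPLE string are hypothetical, and the "dc" prefix is bound by hand to the namespace URI given in the config.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Lius Lite.
class DcXPathSketch {
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    // Trimmed copy of the sample document (creator and two subjects only).
    static final String SAMPLE =
        "<oaidc:dc xmlns:dc=\"" + DC_NS + "\""
        + " xmlns:oaidc=\"http://www.openarchives.org/OAI/2.0/oai_dc/\">"
        + "<dc:creator>Rida Benjelloun</dc:creator>"
        + "<dc:subject>Java</dc:subject>"
        + "<dc:subject>XML</dc:subject>"
        + "</oaidc:dc>";

    // Binds only the "dc" prefix, mirroring the <namespace> config element.
    static final NamespaceContext DC_CONTEXT = new NamespaceContext() {
        public String getNamespaceURI(String prefix) {
            return "dc".equals(prefix) ? DC_NS : XMLConstants.NULL_NS_URI;
        }
        public String getPrefix(String uri) {
            return DC_NS.equals(uri) ? "dc" : null;
        }
        public Iterator<String> getPrefixes(String uri) {
            String prefix = getPrefix(uri);
            return prefix == null
                ? Collections.<String>emptyIterator()
                : Collections.singletonList(prefix).iterator();
        }
    };

    // Returns the text content of every node selected by the expression.
    static List<String> select(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for prefixed steps like dc:title
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(DC_CONTEXT);
            NodeList nodes =
                (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<String>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("XPath extraction failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(select(SAMPLE, "//dc:creator"));
        System.out.println(select(SAMPLE, "//dc:subject"));
    }
}
```

A multi-valued field such as subject simply yields several nodes, which lines up with the getValues() call in the Java snippet that follows.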
> 3- Java Code
> LiusConfig lc = LiusConfig.getInstance(configPathString);
> LiusLogger.setLoggerConfigFile(log4jPathString);
> File testFile = new File("test.xml");
> try {
>     // Obtain a parser for the file from the loaded config.
>     Parser parser = ParserFactory.getParser(testFile, lc);
>     // Full text of the document.
>     String fullText = parser.getContentStr();
>
>     // A single-valued field declared in config.xml.
>     Content title = parser.getContent("title");
>     String titleStr = title.getValue();
>
>     // A multi-valued field.
>     Content subject = parser.getContent("subject");
>     String[] subjects = subject.getValues();
>     // ... and so on for the other configured fields.
>
>     // Or fetch all configured fields at once:
>     List<Content> contents = parser.getContents();
> } catch (MimeInfoException e) {
>     e.printStackTrace();
> } catch (IOException e) {
>     e.printStackTrace();
> } catch (LiusException e) {
>     e.printStackTrace();
> }
> Best regards,
> Rida Benjelloun
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.