[
https://issues.apache.org/jira/browse/TIKA-7?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12504246
]
Chris A. Mattmann commented on TIKA-7:
--------------------------------------
Jukka,
Thanks for taking the lead on this. I think it's important to note that
patches to Tika should follow the standard laid out here, e.g., use the
org.apache.tika namespace, make sure that unit tests and resources are placed
in the right locations, etc. I am traveling right now, but I will
take a look at this patch as soon as I get back to Los Angeles.
One question I have is whether we have standardized on the following issues (I
know they were discussed at the BoF at ApacheCon, as I've seen conversation
about it on the dev list; however, I wasn't there :) ):
1. standardization of Parser interface?
2. control flow of Tika parsers (e.g., similar to Bertrand's email
http://www.nabble.com/-RT--Tika-framework-usage-scenario-tf3913308.html)
3. major features that we want for 0.1 release
I think that these questions need to be answered before we move forward with
more code development. I realize that I've been out of the loop for a while;
however, I'm starting to have some time now to get back into it :)
So, let's discuss. Here are my propositions for issues 1-3 above:
1. I like Bertrand's idea of a pipeline-based Tika framework. I think that the
"ContentFilter" that he proposes is essentially this Parser interface that we
are talking about. Immediate questions that come to mind are:
a. Could the ContentFilter be run in single filter mode, e.g., from the
command line? I think that a use case for Tika should be that all parsers are
executable in some fashion (even if only for testing) from the command line.
The parsed content should be returned as some form of Metadata object, in
which the user can inspect the parsed information. Perhaps other information
should be returned as well, but that's what I thought of off the top of my head.
b. Would this pipeline model still support the use cases for Nutch, and other
initial projects that we were targeting as customers of Tika? Nutch's parse
plugins are currently closer to single-content parsing plugins; however, I
think they could still be handled by this pipeline framework. I just want to
get everyone else's opinion on it.
2. See my questions in #1 above
3. I think that we should plan to have the following features in the 0.1
release of Tika:
a. Basic parsing capability, +1 for using pipelining, but we need to
standardize the interfaces for those/talk about architecture
b. Content Type identification (e.g., MimeType identification)
c. Basic metadata extraction capabilities
d. Parsing for a limited set of known content types, e.g., HTML and PDF
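To make 1a a bit more concrete, here is a minimal Java sketch of a pipeline
whose stages can also be run one at a time from the command line. Everything in
it is hypothetical: ContentFilter is only the name from Bertrand's proposal, and
the interface shape, the TypeFilter and TitleFilter classes, and the plain Map
standing in for a Metadata object are assumptions for illustration, not an
agreed Tika API.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PipelineSketch {

    /** Hypothetical pipeline stage; not an existing Tika interface. */
    public interface ContentFilter {
        void filter(String content, Map<String, String> metadata);
    }

    /** Very naive content type guess based on the leading characters. */
    public static class TypeFilter implements ContentFilter {
        public void filter(String content, Map<String, String> metadata) {
            String head = content.trim().toLowerCase();
            boolean html = head.startsWith("<html") || head.startsWith("<!doctype html");
            metadata.put("content-type", html ? "text/html" : "text/plain");
        }
    }

    /** Extracts the text of the first <title> element, if there is one. */
    public static class TitleFilter implements ContentFilter {
        private static final Pattern TITLE = Pattern.compile(
                "<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

        public void filter(String content, Map<String, String> metadata) {
            Matcher m = TITLE.matcher(content);
            if (m.find()) {
                metadata.put("title", m.group(1).trim());
            }
        }
    }

    /** Runs the filters in order; a one-element list is "single filter mode". */
    public static Map<String, String> run(String content, List<ContentFilter> filters) {
        // The Map stands in for the Metadata object discussed in 1a.
        Map<String, String> metadata = new LinkedHashMap<String, String>();
        for (ContentFilter f : filters) {
            f.filter(content, metadata);
        }
        return metadata;
    }

    public static void main(String[] args) {
        // Use the first argument as the document, or fall back to a small sample.
        String doc = args.length > 0 ? args[0]
                : "<html><head><title>Hello Tika</title></head><body>Hi</body></html>";
        List<ContentFilter> pipeline =
                Arrays.<ContentFilter>asList(new TypeFilter(), new TitleFilter());
        System.out.println(run(doc, pipeline));
    }
}
```

Run with no arguments, this prints {content-type=text/html, title=Hello Tika}.
Passing a one-element list to run() gives the single filter mode from 1a, which
is also how a Nutch-style single-content parse plugin could sit inside the same
framework.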
What does everyone else think?
> Lius Lite remove all lucene dependencies from Lius and use Nutch office
> parsers
> --------------------------------------------------------------------------------
>
> Key: TIKA-7
> URL: https://issues.apache.org/jira/browse/TIKA-7
> Project: Tika
> Issue Type: New Feature
> Components: general
> Environment: Java 1.5
> Reporter: Rida Benjelloun
> Attachments: liuslite.patch, liusLite.zip
>
>
> Hi,
> This is a work-in-progress release of Lius. It removes all Lucene
> dependencies and uses the Nutch Office parsers because they are based on
> Apache POI.
> Lius Lite offers four ways of extracting content:
> - Document fulltext extraction
> - XPath extraction
> - Regex extraction
> - Document metadata extraction (not implemented for all parsers)
> Lius Lite uses an XML config file to configure the parsers and the
> information to extract. Please see config.xml in the config folder.
> See also the JUnit tests.
> Here is an example of XML parsing :
> 1- XML Config
> <parser name="text-xml" class="liuslite.parser.xml.XMLParser">
>
>
> <namespace>http://purl.org/dc/elements/1.1/</namespace>
> <mime>application/xml</mime>
> <extract>
> <content name="title"
> xpathSelect="//dc:title"/>
> <content name="subject"
> xpathSelect="//dc:subject"/>
> <content name="creator"
> xpathSelect="//dc:creator"/>
> <content name="description"
> xpathSelect="//dc:description"/>
> <content name="publisher"
> xpathSelect="//dc:publisher"/>
> <content name="contributor"
> xpathSelect="//dc:contributor"/>
> <content name="type"
> xpathSelect="//dc:type"/>
> <content name="format"
> xpathSelect="//dc:format"/>
> <content name="identifier"
> xpathSelect="//dc:identifier"/>
> <content name="language"
> xpathSelect="//dc:language"/>
> <content name="rights"
> xpathSelect="//dc:rights"/>
> <content name="outLinks">
> <regexSelect>
> <![CDATA[
>
> ([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/](([A-Za-z0-9$_.+!*,;/?:@&~=-])|%[A-Fa-f0-9]{2}){1,333}(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@&~=%-]{0,1000}))?)
> ]]>
> </regexSelect>
> </content>
> </extract>
> </parser>
> 2- XML Document
> <oaidc:dc xmlns:dc="http://purl.org/dc/elements/1.1/"
> xmlns:oaidc="http://www.openarchives.org/OAI/2.0/oai_dc/">
> <dc:title>Archimède et Lius</dc:title>
> <dc:creator>Rida Benjelloun</dc:creator>
> <dc:subject>Java</dc:subject>
> <dc:subject>XML</dc:subject>
> <dc:subject>XSLT</dc:subject>
> <dc:subject>JDOM</dc:subject>
> <dc:subject>Indexation</dc:subject>
> <dc:description>Framework d'indexation des documents XML, HTML, PDF
> etc.. </dc:description>
> <dc:publisher>Doculibre</dc:publisher>
> <dc:identifier>http://www.apache.org</dc:identifier>
> <dc:date>2000-12</dc:date>
> <dc:type>test</dc:type>
> <dc:format>application/msword</dc:format>
> <dc:language>Fr</dc:language>
> <dc:rights>Non restreint</dc:rights>
> </oaidc:dc>
> 3- Java Code
> LiusConfig lc = LiusConfig.getInstance(configPathString);
> LiusLogger.setLoggerConfigFile(log4jPathString);
> File testFile = new File("test.xml");
> try {
> Parser parser = ParserFactory.getParser(testFile, lc);
> String fullText = parser.getContentStr();
>
> Content title = parser.getContent("title");
> String titleStr = title.getValue();
>
> Content subject = parser.getContent("subject");
> String[] subjects = subject.getValues();
> // ... and so on for the other configured fields.
> // Alternatively, fetch every extracted field at once:
> List<Content> contents = parser.getContents();
>
> } catch (MimeInfoException e) {
> e.printStackTrace();
> } catch (IOException e) {
> e.printStackTrace();
> } catch (LiusException e) {
> e.printStackTrace();
> }
> best regards
> Rida Benjelloun