You may be able to use Tika directly, letting it pick the parser, instead of choosing the specific parser classes yourself. That said, the specific classes may give you exactly the data you need without the extra overhead.

You could take a look at the Solr Extracting Request Handler source for an example:
http://svn.apache.org/viewvc/lucene/dev/tags/lucene_solr_4_1_0/solr/contrib/extraction/src/java/org/apache/solr/handler/extraction/

Basically, Tika extracts a set of "metadata" values along with the text, and you then have to add the metadata you care about to your Lucene documents yourself. "content" is the main document body text.
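
As a rough sketch of that selection step only: the snippet below uses plain Java collections so it runs without Tika or Lucene on the classpath. It assumes the extracted metadata has already been copied into a Map, and the class name, the method name, and the sample metadata names are all made up for illustration. In a real indexer each resulting entry would become a field on the Lucene document.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Tika, Lucene, or Solr.
final class MetadataFieldSelector {

    // Builds an ordered field map holding the body text under "content"
    // plus only the metadata entries named in `wanted`. Metadata that is
    // missing or empty is skipped; everything not listed is dropped.
    static Map<String, String> select(String content,
                                      Map<String, String> extracted,
                                      List<String> wanted) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("content", content);
        for (String name : wanted) {
            String value = extracted.get(name);
            if (value != null && !value.isEmpty()) {
                fields.put(name, value);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Sample values standing in for whatever the extractor returned.
        Map<String, String> extracted = new LinkedHashMap<>();
        extracted.put("title", "Quarterly report");
        extracted.put("Content-Type", "application/pdf");
        extracted.put("producer", "some PDF library");

        Map<String, String> fields = select(
                "body text of the document",
                extracted,
                Arrays.asList("title", "Content-Type", "Author"));

        // Each entry here would become one field on the index document.
        for (Map.Entry<String, String> e : fields.entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

Keeping the list of wanted names explicit means the index schema does not change just because a particular file happens to carry unusual metadata.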

You could try Solr itself to see how it works:
http://wiki.apache.org/solr/ExtractingRequestHandler

-- Jack Krupansky

-----Original Message-----
From: Adrien Grand
Sent: Sunday, January 27, 2013 12:53 PM
To: [email protected]
Subject: Re: Readers for extracting textual info from pd/doc/excel for indexing the actual content

Have you tried using the PDFParser [1] and the OfficeParser [2]
classes from Tika?

This question seems more appropriate for the Tika user mailing list [3].

[1] http://tika.apache.org/1.3/api/org/apache/tika/parser/pdf/PDFParser.html#parse(java.io.InputStream, org.xml.sax.ContentHandler, org.apache.tika.metadata.Metadata, org.apache.tika.parser.ParseContext)
[2] http://tika.apache.org/1.3/api/org/apache/tika/parser/microsoft/OfficeParser.html#parse(java.io.InputStream, org.xml.sax.ContentHandler, org.apache.tika.metadata.Metadata, org.apache.tika.parser.ParseContext)
[3] http://tika.apache.org/mail-lists.html

--
Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]