Start by looking at the Tika code that integrates PDFBox, since that is exactly 
where you want to end up if you plan to integrate your code with Tika and 
SolrCell.

http://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/
 

Replacing PDFBox inside Tika for SolrCell is one route. If instead you want to 
feed the output of your extractor directly to Solr from your own client 
application, see the Solr XML update format and the SolrJ interface. 
Either way, your extractor will ultimately produce two things: 1) extracted 
content or body text, and 2) metadata, all of which are simply “fields” in a 
“Solr input document.”

http://wiki.apache.org/solr/UpdateXmlMessages 
http://wiki.apache.org/solr/Solrj
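
To make the "fields in a Solr input document" idea concrete, here is a minimal sketch that turns a map of extracted values into a Solr XML update message of the form <add><doc><field name="...">...</field></doc></add>. It uses only the JDK and leaves out the iText extraction itself. The class name, the helper names, and the field names (id, title, content) are assumptions for illustration; the real field names must match whatever your Solr schema defines.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Solr, SolrJ, or Tika.
public class SolrUpdateXmlSketch {

    // Escape the characters that are special in XML text and attribute values.
    static String escapeXml(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    // Build one <add><doc>...</doc></add> message; each map entry becomes one field.
    static String toAddXml(Map<String, String> fields) {
        StringBuilder xml = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            xml.append("<field name=\"").append(escapeXml(e.getKey())).append("\">")
               .append(escapeXml(e.getValue()))
               .append("</field>");
        }
        return xml.append("</doc></add>").toString();
    }

    public static void main(String[] args) {
        // Assumed field names; body text and metadata are both just fields.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", "doc-1");
        fields.put("title", "Q&A notes");
        fields.put("content", "Body text pulled from the PDF");
        System.out.println(toAddXml(fields));
    }
}
```

The resulting string would be posted to Solr's update handler and followed by a commit. With SolrJ you would instead populate a SolrInputDocument with the same name/value pairs and let the client handle serialization.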

-- Jack Krupansky

From: Roland Ucker 
Sent: Tuesday, June 12, 2012 2:32 AM
To: [email protected] 
Subject: Text Extraction Using iText

Hello,

I would like to write my own PDF text/metadata extraction module using iText 
instead of Tika/PDFBox.

Where to start? Any hints?

Regards,
Roland
 
