Start by looking at the Tika code that integrates PDFBox, since that is exactly where you want to end up if you want to integrate your code with Tika and SolrCell:
http://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/

If you are going to replace PDFBox in Tika for SolrCell, that is one thing, but if you want to feed the output of your extractor directly to Solr from your own client application, see the Solr XML format and the SolrJ interface. Ultimately, your extractor will produce two things: 1) extracted content or body text, and 2) metadata, all of which are simply “fields” in a “Solr input document.”

http://wiki.apache.org/solr/UpdateXmlMessages
http://wiki.apache.org/solr/Solrj

-- Jack Krupansky

From: Roland Ucker
Sent: Tuesday, June 12, 2012 2:32 AM
To: [email protected]
Subject: Text Extraction Using iText

Hello,

I would like to write my own pdf text/metadata extraction module using iText instead of tika/pdfbox. Where to start? Any hints?

Regards,
Roland
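
To illustrate the point in the reply that extracted body text and metadata both end up as plain fields in a Solr input document, here is a minimal sketch that builds a Solr XML update message using only the Java standard library. The class name `SolrUpdateXmlSketch`, its methods, and the field names `id`, `title`, and `text` are assumptions made for illustration; real field names have to match your Solr schema, and the map stands in for whatever your iText-based extractor actually produces.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: turns extractor output (field name -> value) into a
// Solr XML update message of the form <add><doc><field name="...">...</field></doc></add>.
class SolrUpdateXmlSketch {

    // Escape the characters that are special in XML text and attribute values.
    // Note: this sketch does not strip control characters that are illegal in XML,
    // which PDF text extraction can occasionally produce.
    static String escapeXml(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    // Body text and metadata are treated identically: each entry becomes one <field>.
    static String toUpdateXml(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append("<field name=\"").append(escapeXml(e.getKey())).append("\">")
              .append(escapeXml(e.getValue()))
              .append("</field>");
        }
        return sb.append("</doc></add>").toString();
    }

    public static void main(String[] args) {
        // Assumed extractor output; field names are placeholders for your schema.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", "doc-1");
        fields.put("title", "Q&A notes");
        fields.put("text", "Body text from the extractor");
        System.out.println(toUpdateXml(fields));
    }
}
```

The resulting string is what you would POST to Solr's update handler, followed by a commit. With SolrJ the same idea applies, except that you add each field to a `SolrInputDocument` and let the client handle serialization.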
