You can have a look here: http://solr.pl/en/2011/04/04/indexing-files-like-doc-pdf-solr-and-tika-integration/
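
In case it helps, here is a rough sketch of one possible approach to page-level granularity. It is not necessarily what the article above describes. The idea is to get the Tika XHTML yourself (for example by calling the ExtractingRequestHandler with extractOnly=true), split it on the client into one Solr document per page <div>, and post those documents through the normal update handler. The function name split_pages and the field names parent_id, page and paragraph are made up for illustration; they would need matching fields in your schema, with paragraph multivalued. It also assumes the XHTML parses as well-formed XML and that each page is a <div class="page">, which is what Tika's PDF parser normally emits.

```python
import xml.etree.ElementTree as ET

# Namespace used by XHTML elements in Tika's output.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def split_pages(xhtml, doc_id):
    """Turn Tika-style XHTML into one dict per page.

    Each dict is a candidate Solr document; the field names are
    hypothetical and must match whatever your schema defines.
    """
    root = ET.fromstring(xhtml)
    page_divs = [d for d in root.iter(XHTML_NS + "div")
                 if d.get("class") == "page"]
    docs = []
    for page_no, div in enumerate(page_divs, start=1):
        paragraphs = []
        for p in div.iter(XHTML_NS + "p"):
            # Collapse internal whitespace and line breaks.
            text = " ".join("".join(p.itertext()).split())
            if text:  # skip empty <p/> elements
                paragraphs.append(text)
        docs.append({
            "id": f"{doc_id}_p{page_no}",   # unique key per page
            "parent_id": doc_id,            # links pages back to the PDF
            "page": page_no,
            "paragraph": paragraphs,        # multivalued field
        })
    return docs


# Small hand-written sample standing in for real Tika output.
sample = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div class="page"><p>First paragraph.</p><p>Second
paragraph.</p></div>
<div class="page"><p>Page two text.</p><p/></div>
</body></html>"""

docs = split_pages(sample, "report")
```

The resulting list can be serialised as JSON and sent to the update handler. Keeping parent_id on every page document lets you group or filter results by the original PDF.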

2013/10/10 Peter Bleackley <bleackl...@zooey.co.uk>

> I'm trying to index a set of PDF documents with Solr 4.5.0. So far I can
> get Solr to ingest the entire document as one long string, stored in the
> index as "content". However, I want to index structure within the documents.
>
> I know that the ExtractingRequestHandler uses Apache Tika to convert the
> documents to XHTML. I've used the Tika GUI to look at the XHTML
> representation, and I can see that each page is represented as a <div>
> element, and that structure within pages is represented by <p> elements.
> How do I configure Solr to index documents at this level of granularity?
>
> Dr Peter J Bleackley
> Computational Linguistics Contractor
> Playful Technology Ltd