Tika creates document-level metadata and text from the input file. That's it. If you want to use PDFBox directly, you need to write your own Solr plugin.
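
For reference, here is a rough sketch of the kind of folding being asked about (diacritics and ligatures), using only the JDK. It is not PDFBox's TextNormalize, and it is not what Tika or Solr do internally. The class and method names are made up for illustration. Inside Solr, the analyzer-side route Markus describes would normally be an analysis filter such as solr.ASCIIFoldingFilterFactory on the field type, not hand-written code like this.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration only; not part of Tika, PDFBox, or Solr.
class FoldSketch {

    static String fold(String input) {
        // NFKD splits accented letters into base letter + combining mark and
        // expands compatibility characters such as the "fi" ligature (U+FB01).
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFKD);
        // Drop the combining marks (Unicode general category M).
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "caf\u00e9" is "cafe" with an acute accent; "\ufb01nal" starts with the fi ligature.
        System.out.println(fold("caf\u00e9 \ufb01nal")); // prints "cafe final"
    }
}
```

Characters that have no Unicode decomposition (for example "ø") pass through unchanged, so this covers less than a full folding filter would.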

On 4/13/11, Markus Jelsma <markus.jel...@openindex.io> wrote:
> Hi,
>
> I'm not sure how Solr allows for adjusting these Tika settings to get the
> desired output. At least a few desirable Tika subsystems cannot be called
> from the ExtractingRequestHandler such as Tika's BoilerPlateContentHandler. I'm
> also not really sure if it's a good idea to normalize diacritics in Tika
> output, this way the stored data would also be normalized which is not
> desirable.
>
> You can, however, normalize diacritics in your field analyzer. This way your
> search is normalized but the returned data still holds diacritics which is
> good.
>
> Cheers,
>
>> Hi all,
>>
>> I'm wondering if there are any knobs or levers i can set in
>> solrconfig.xml that affect how pdfbox text extraction is performed by
>> the extraction handler. I would like to take advantage of pdfbox's
>> ability to normalize diacritics and ligatures [1], but that doesn't
>> seem to be the default behavior. Is there a way to enable this?
>>
>> Thanks,
>> --jay
>>
>> [1]
>> http://pdfbox.apache.org/apidocs/index.html?org/apache/pdfbox/util/TextNormalize.html

--
Lance Norskog
goks...@gmail.com