Re: tika/pdfbox knobs & levers

Lance Norskog Thu, 14 Apr 2011 18:32:07 -0700

Tika creates document-level metadata and text from the input file.
That's it. If you want to use PDFBox directly, you need to write your
own Solr plugin.
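
As a rough idea of what such a plugin could do to the extracted text,
here is a minimal sketch. It does not call PDFBox's TextNormalize; it
uses the JDK's java.text.Normalizer as a stand-in, and the class name
FoldSketch and the method fold are made up for illustration.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of Tika, PDFBox, or Solr.
final class FoldSketch {

    // Assumption: NFKD decomposition followed by dropping combining marks is
    // "good enough" normalization. NFKD splits compatibility characters such
    // as the "fi" ligature (U+FB01) into plain letters and splits accented
    // letters into a base letter plus combining marks.
    static String fold(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFKD);
        // \p{M} matches the combining marks left behind by the decomposition.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // U+FB01 is the "fi" ligature, U+00E9 is e with acute accent.
        System.out.println(fold("\uFB01anc\u00E9"));
        // U+00C5 is A with ring above, U+00F6 is o with diaeresis.
        System.out.println(fold("\u00C5ngstr\u00F6m"));
    }
}
```

Letters that have no Unicode decomposition pass through unchanged, so
this is only an approximation of what a dedicated normalizer does.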


On 4/13/11, Markus Jelsma <markus.jel...@openindex.io> wrote:
> Hi,
>
> I'm not sure how Solr allows for adjusting these Tika settings to get the
> desired output. At least a few desirable Tika subsystems, such as Tika's
> BoilerpipeContentHandler, cannot be called from the
> ExtractingRequestHandler. I'm also not really sure it's a good idea to
> normalize diacritics in Tika output: the stored data would then be
> normalized as well, which is not desirable.
>
> You can, however, normalize diacritics in your field analyzer. This way your
> search is normalized, but the returned data still holds diacritics, which is
> good.
>
> Cheers,
>
>> Hi all,
>>
>> I'm wondering if there are any knobs or levers I can set in
>> solrconfig.xml that affect how pdfbox text extraction is performed by
>> the extraction handler. I would like to take advantage of pdfbox's
>> ability to normalize diacritics and ligatures [1], but that doesn't
>> seem to be the default behavior. Is there a way to enable this?
>>
>> Thanks,
>> --jay
>>
>> [1]
>> http://pdfbox.apache.org/apidocs/index.html?org/apache/pdfbox/util/TextNormalize.html
>


-- 
Lance Norskog
goks...@gmail.com
