Re: Tika Extractor - extract document as (X)HTML not as textonly

2020-01-03 Thread Jörn Franke
Thanks a lot, Karl. I will look into this.

> On 03.01.2020 at 10:04, Karl Wright wrote:
>
> The reason plain text is used is because otherwise standard text processing
> inside Lucene will index tags as terms, which is definitely not what you
> usually want.
>
> If you want the Tika

Re: Tika Extractor - extract document as (X)HTML not as textonly

2020-01-03 Thread Karl Wright
Plain text is used because otherwise standard text processing inside Lucene will index tags as terms, which is definitely not what you usually want. If you want the Tika Extractor to be able to optionally generate XHTML output, that sounds like an additional operating mode for
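
To illustrate the point about tags ending up as index terms, here is a minimal sketch. It does not use Lucene: `tokenize` is a hypothetical, deliberately simplified stand-in for an analyzer (split on non-alphanumeric characters, lowercase), and `stripTags` is a crude regex used only for this demonstration, not how Tika produces plain text. The class name and sample markup are likewise assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class TagTermsDemo {

    // Hypothetical stand-in for an analyzer: split on anything that is
    // not a letter or digit, then lowercase. Real Lucene analyzers are
    // more sophisticated, but treat markup as ordinary text unless a
    // filter removes it first.
    static List<String> tokenize(String input) {
        List<String> terms = new ArrayList<>();
        for (String piece : input.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                terms.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    // Crude tag removal, for illustration only.
    static String stripTags(String xhtml) {
        return xhtml.replaceAll("<[^>]*>", " ");
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><p class=\"intro\">Hello Tika</p></body></html>";

        // Element and attribute names show up as terms:
        // [html, body, p, class, intro, hello, tika, p, body, html]
        System.out.println(tokenize(xhtml));

        // With the markup removed, only the content remains:
        // [hello, tika]
        System.out.println(tokenize(stripTags(xhtml)));
    }
}
```

With the raw XHTML, a search for "body" or "class" would match every document, which is the effect described above. An XHTML mode would therefore presumably need the markup removed, or handled by a later pipeline stage, before indexing.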

Tika Extractor - extract document as (X)HTML not as textonly

2020-01-02 Thread Jörn Franke
Hi, is it possible to get (X)HTML output from the Tika Extractor (Manifold version, not the extract handler) instead of the plain-text output? How to achieve this in Tika itself is pretty clear: https://tika.apache.org/1.8/examples.html#Picking_different_output_formats

Reason: We need to