date:20200103

Re: Tika Extractor - extract document as (X)HTML not as textonly

2020-01-03 Thread Jörn Franke

Thanks a lot, Karl. I will look into this.

> On 03.01.2020 at 10:04, Karl Wright wrote:
>
> Plain text is used because otherwise standard text processing
> inside Lucene will index tags as terms, which is definitely not what you
> usually want.
>
> If you want the Tika

Re: Tika Extractor - extract document as (X)HTML not as textonly

2020-01-03 Thread Karl Wright

Plain text is used because otherwise standard text processing inside Lucene will index tags as terms, which is definitely not what you usually want. If you want the Tika Extractor to be able to optionally generate an XHTML format, that sounds like an additional operating mode for
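
To illustrate the point about tags being indexed as terms, here is a minimal sketch in Python. It does not use Lucene or Tika: `naive_terms` is a hypothetical stand-in for a simple analyzer that lowercases and splits on non-alphanumeric characters, and `strip_markup` is a hypothetical helper that keeps only the character data. The sample document is made up for the illustration.

```python
import re
from html.parser import HTMLParser


def naive_terms(text):
    """Hypothetical stand-in for a simple analyzer (not Lucene's):
    lowercase the input and split on anything that is not a-z or 0-9."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


class _TextOnly(HTMLParser):
    """Collects character data and ignores all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_markup(xhtml):
    """Return only the text content of an (X)HTML string."""
    parser = _TextOnly()
    parser.feed(xhtml)
    parser.close()
    return " ".join(parser.chunks)


# Made-up sample document for the illustration.
xhtml = "<html><body><p>Quarterly report</p></body></html>"

# Tokenizing the raw markup: tag names end up as terms.
terms_from_markup = naive_terms(xhtml)

# Tokenizing the extracted plain text: only the content is indexed.
terms_from_text = naive_terms(strip_markup(xhtml))

print(terms_from_markup)
print(terms_from_text)
```

With the raw markup, tag names such as `html`, `body`, and `p` appear alongside the real words, so a search for "body" would match every document. With the plain text, only `quarterly` and `report` remain.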