Thanks a lot, Karl. I will look into this.
> On 03.01.2020 at 10:04, Karl Wright wrote:
>
>
> The reason plain text is used is because otherwise standard text processing
> inside Lucene will index tags as terms, which is definitely not what you
> usually want.
>
> If you want the Tika
The reason plain text is used is because otherwise standard text processing
inside Lucene will index tags as terms, which is definitely not what you
usually want.
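To illustrate the point about tags ending up as terms, here is a minimal sketch in plain Java. It does not use Lucene: the `tokenize` helper is a hypothetical stand-in for an analyzer that lowercases text and splits on anything that is not a letter or digit, and the sample strings are made up. Real Lucene analyzers differ in detail, but unless markup is stripped first the effect should be similar.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class TagTermsSketch {

    // Hypothetical stand-in for an analyzer (not a Lucene API): lowercases the
    // input and splits on every run of characters that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String piece : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                terms.add(piece);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // Made-up sample content: the same text as plain text and as XHTML.
        String plain = "Quarterly report";
        String xhtml = "<div class=\"page\"><p>Quarterly report</p></div>";

        // Plain text yields only the words of the document.
        System.out.println(tokenize(plain));
        // prints [quarterly, report]

        // Markup yields element and attribute names as terms too.
        System.out.println(tokenize(xhtml));
        // prints [div, class, page, p, quarterly, report, p, div]
    }
}
```

With the markup version, a search for "div" or "page" would match every document, which is why plain text is the safer default for indexing.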
If you want the Tika Extractor to be able to optionally generate an XHTML
format, that sounds like an additional operating mode for
Hi,
Is it possible to get (X)HTML output from the Tika Extractor (the Manifold
version, not the extract handler) instead of the plain-text output?
How to achieve this in Tika itself is pretty clear:
https://tika.apache.org/1.8/examples.html#Picking_different_output_formats
Reason: We need to