Thanks a lot, Karl. I will look into this.
> On 03.01.2020 at 10:04, Karl Wright wrote:
>
>
> Plain text is used because otherwise standard text processing inside
> Lucene will index tags as terms, which is definitely not what you
> usually want.
>
> If you want the Tika Extractor to be able to optionally generate XHTML
> output, that sounds like an additional operating mode for the Tika Extractor.
> To do that you'd need to add a flag, probably to the Output Specification,
> with associated UI components, and be sure to maintain backwards
> compatibility.
>
> Karl
>
>
>> On Thu, Jan 2, 2020 at 4:30 PM Jörn Franke wrote:
>> Hi,
>>
>> Is there a way to get (X)HTML output from the Tika Extractor (Manifold
>> version, not the extract handler) instead of the plain text output?
>> How to achieve this in Tika itself is pretty clear:
>> https://tika.apache.org/1.8/examples.html#Picking_different_output_formats
>>
>> Reason: We need to extract very specific chapters from a Word document and
>> index them as dedicated Solr documents (the latter part probably still has
>> to be done in an update chain). We currently already extract the
>> (sub-)chapters we need from the HTML version of the Word document that Tika
>> creates.
>>
>> thank you.
>>
>> best regards
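
For illustration only, the chapter extraction described in the original question could look roughly like the sketch below. It is not code from this thread or from ManifoldCF. It assumes the XHTML is well-formed, has no DOCTYPE or HTML-only entities such as &nbsp;, and marks chapter titles with h1..h6 elements directly under body. The class name ChapterSplitter and the method split are hypothetical, and only JDK classes are used.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: splits an XHTML string into chapters keyed by heading text.
public class ChapterSplitter {

    /**
     * Returns a map from heading text to the text of the elements that follow
     * that heading, up to the next heading. Insertion order is preserved.
     * Assumption: headings are h1..h6 elements that are direct children of body.
     * Note: two headings with identical text overwrite each other in this sketch.
     */
    public static Map<String, String> split(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        Map<String, String> chapters = new LinkedHashMap<>();
        Node body = doc.getElementsByTagName("body").item(0);
        if (body == null) {
            return chapters; // nothing to split
        }

        String currentTitle = null;
        StringBuilder text = new StringBuilder();
        NodeList children = body.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace and comments between elements
            }
            if (child.getNodeName().matches("h[1-6]")) {
                // A new heading closes the previous chapter, if there was one.
                if (currentTitle != null) {
                    chapters.put(currentTitle, text.toString().trim());
                }
                currentTitle = child.getTextContent().trim();
                text.setLength(0);
            } else if (currentTitle != null) {
                // Content before the first heading is ignored.
                text.append(child.getTextContent().trim()).append('\n');
            }
        }
        if (currentTitle != null) {
            chapters.put(currentTitle, text.toString().trim());
        }
        return chapters;
    }
}
```

Each map entry could then be turned into its own Solr document, for example in an update chain as mentioned above. Plain text output cannot support this kind of split because the heading markup is already gone by the time the content reaches the index.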