Thanks a lot, Karl. I will look into this.

> On 03.01.2020 at 10:04, Karl Wright <[email protected]> wrote:
>
> The reason plain text is used is that otherwise standard text processing
> inside Lucene will index tags as terms, which is definitely not what you
> usually want.
>
> If you want the Tika Extractor to be able to optionally generate an XHTML
> format, that sounds like an additional operating mode for the Tika Extractor.
> To do that you'd need to add a flag, probably to the Output Specification,
> with associated UI components, and be sure to maintain backwards
> compatibility.
>
> Karl
>
>> On Thu, Jan 2, 2020 at 4:30 PM Jörn Franke <[email protected]> wrote:
>> Hi,
>>
>> Is it possible to get the (X)HTML output from the Tika Extractor
>> (Manifold version, not the extract handler) instead of the text output?
>> How one can achieve this in Tika is pretty clear:
>> https://tika.apache.org/1.8/examples.html#Picking_different_output_formats
>>
>> Reason: We need to extract very specific chapters from a Word document and
>> index them as dedicated Solr documents (the latter part is probably still to
>> be done in an update chain). There, we already extract the (sub-)chapters we
>> need from the HTML version of the Word document that Tika creates.
>>
>> Thank you.
>>
>> Best regards
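
For illustration, the chapter extraction described in the original question could look roughly like the sketch below. It is not taken from ManifoldCF or from the poster's setup. It assumes that the XHTML is well-formed, uses the XHTML namespace, and exposes chapter titles as `h1`/`h2` elements that are direct children of `body`. The function name `split_chapters`, the record keys `title` and `text`, and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: the XHTML uses the standard XHTML namespace.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"
# Assumption: chapter and sub-chapter titles arrive as h1/h2 elements.
HEADING_TAGS = {XHTML_NS + "h1", XHTML_NS + "h2"}


def split_chapters(xhtml):
    """Return one {'title', 'text'} record per heading found in <body>.

    Only direct children of <body> are inspected; nested containers
    (for example <div> wrappers) would need a recursive walk instead.
    """
    body = ET.fromstring(xhtml).find(XHTML_NS + "body")
    chapters = []
    current = None
    for element in (body if body is not None else []):
        text = "".join(element.itertext()).strip()
        if element.tag in HEADING_TAGS:
            # A heading starts a new chapter record.
            current = {"title": text, "paragraphs": []}
            chapters.append(current)
        elif current is not None and text:
            # Any other element is attached to the most recent heading.
            current["paragraphs"].append(text)
    return [
        {"title": c["title"], "text": "\n".join(c["paragraphs"])}
        for c in chapters
    ]


# Hypothetical sample input, not real extractor output.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Sample</title></head>
<body>
<h1>1 Introduction</h1>
<p>Scope of the document.</p>
<h2>1.1 Background</h2>
<p>First paragraph.</p>
<p>Second paragraph.</p>
</body>
</html>"""

chapters = split_chapters(SAMPLE)
```

Each returned record could then be mapped to its own Solr document. This kind of splitting relies on the markup that the plain-text output drops, which is why the question asks for XHTML.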
