Re: Tika Extractor - extract document as (X)HTML not as textonly

Jörn Franke Fri, 03 Jan 2020 01:58:37 -0800

Thanks a lot, Karl. I will look into this.


> On 03.01.2020 at 10:04, Karl Wright <[email protected]> wrote:
> 
> 
> Plain text is used because otherwise the standard text processing inside 
> Lucene would index tags as terms, which is definitely not what you usually 
> want.
> 
> If you want the Tika Extractor to optionally generate XHTML output, that 
> sounds like an additional operating mode for the Tika Extractor. To do that 
> you'd need to add a flag, probably to the Output Specification, with 
> associated UI components, and be sure to maintain backwards compatibility.
> 
> Karl
> 
> 
>> On Thu, Jan 2, 2020 at 4:30 PM Jörn Franke <[email protected]> wrote:
>> Hi,
>> 
>> Is it possible to get (X)HTML output from the Tika Extractor (the Manifold 
>> version, not the extract handler) instead of the text output? How to 
>> achieve this in Tika itself is pretty clear:
>> https://tika.apache.org/1.8/examples.html#Picking_different_output_formats
>> 
>> Reason: We need to extract very specific chapters from a Word document and 
>> index them as dedicated Solr documents (the latter part probably still has 
>> to be done in an update chain). There we currently extract the 
>> (sub-)chapters we need from the HTML version of the Word document created 
>> by Tika.
>> 
>> Thank you.
>> 
>> Best regards
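
As an illustration of the chapter extraction described in the original question, here is a minimal sketch in Java that splits an XHTML string into chapters using only the JDK's DOM parser. It is not ManifoldCF or Tika code. The class name ChapterSplitter and the method split are hypothetical, and the sketch assumes that every chapter starts at an <h1> element that is a direct child of <body>, which may not match the XHTML Tika produces for a given Word document.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of ManifoldCF or Tika.
class ChapterSplitter {

    // Returns chapter title -> chapter text, in document order.
    // Assumption: each chapter begins at an <h1> directly under <body>.
    // Content that appears before the first <h1> is ignored.
    static Map<String, String> split(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        Map<String, String> chapters = new LinkedHashMap<>();
        NodeList bodies = doc.getElementsByTagName("body");
        if (bodies.getLength() == 0) {
            return chapters;
        }

        String title = null;
        StringBuilder text = new StringBuilder();
        NodeList children = bodies.item(0).getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node node = children.item(i);
            if (node.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace and comments between elements
            }
            if ("h1".equals(node.getNodeName())) {
                // A new heading closes the previous chapter, if any.
                if (title != null) {
                    chapters.put(title, text.toString().trim());
                }
                title = node.getTextContent().trim();
                text.setLength(0);
            } else if (title != null) {
                text.append(node.getTextContent().trim()).append(' ');
            }
        }
        if (title != null) {
            chapters.put(title, text.toString().trim());
        }
        return chapters;
    }
}
```

Each map entry could then be turned into its own Solr document, for example in an update chain as mentioned in the question. Two chapters with the same title would overwrite each other in this sketch, so a real implementation would need a different key.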
