I updated our documentation on our wiki:
https://cwiki.apache.org/confluence/display/TIKA/TikaServer

It looks like we had already documented the maxEmbeddedResources header for
/rmeta, but the documentation now covers X-Tika-Skip-Embedded for /tika as well.
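
For anyone who finds this thread later, here is a minimal sketch of the two
options discussed below, using only the Python standard library. The base URL
(localhost on tika-server's default port 9998) and the function names are
assumptions for illustration; the header names and endpoints are the ones
mentioned in this thread.

```python
import urllib.request

# Assumption: tika-server is reachable here; 9998 is its default port.
TIKA_URL = "http://localhost:9998"


def tika_body_only_request(data: bytes, base_url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT to /tika that asks for plain text and skips embedded documents."""
    return urllib.request.Request(
        base_url + "/tika",
        data=data,
        method="PUT",
        headers={"Accept": "text/plain", "X-Tika-Skip-Embedded": "true"},
    )


def rmeta_text_body_only_request(data: bytes, base_url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT to /rmeta/text with maxEmbeddedResources set to 0."""
    return urllib.request.Request(
        base_url + "/rmeta/text",
        data=data,
        method="PUT",
        headers={"maxEmbeddedResources": "0"},
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send a prepared request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With a running server, something like
`send(tika_body_only_request(open("mail.msg", "rb").read()))` should return
only the text of the message body ("mail.msg" is a placeholder file name).
The /rmeta/text variant still returns JSON, so the caller has to pull the
content field out of the response.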


On Fri, Jul 14, 2023 at 11:57 AM Tim Allison <talli...@apache.org> wrote:

> Two follow ups...
>
> 1) TIKA-3227 was Dave Meikle's addition to skip embedded documents for
> /tika.  Add a header X-Tika-Skip-Embedded with value 'true'.
> 2) With /rmeta, you can get just the text content via /rmeta/text.
>
>
> On Thu, Jul 13, 2023 at 4:30 PM Tim Allison <talli...@apache.org> wrote:
>
>> Sorry for my delay.
>> For /tika, I thought we had a way to tell it to parse only the primary
>> document and skip the attachments, but I can't figure out how to do that
>> quickly now.  I'll look around some more.
>>
>> With /rmeta, try setting a header `maxEmbeddedResources:0`
>>
>> On Fri, Jul 7, 2023 at 5:06 AM Willy T. Koch <t...@kochkonsult.no> wrote:
>>
>>> Hi,
>>> We're using the /tika Docker endpoint with text/plain to extract file
>>> content for indexing in Elastic.
>>>
>>> Say I have a 10 MB .msg file with 10 .docx and PDF attachments. I only
>>> need to extract the text from the .msg body, not any of the attachments, as
>>> these are extracted from the .msg and handled separately. It currently
>>> times out because there is a massive amount of text to process.
>>>
>>> I can't find any good examples for this for /tika, even on the excellent
>>> wiki at  https://cwiki.apache.org/confluence/display/TIKA/TikaServer
>>>
>>> Everything I try results in the attachments being part of the file
>>> extraction output.
>>> I see there is a POST to /tika/form/main which sounds promising, but I
>>> can't get that to work.
>>>
>>> Using /rmeta returns it as JSON/HTML, but ideally we only need the file
>>> content as plain text.
>>>
>>> Any ideas would be greatly appreciated!
>>>
>>> Regards,
>>> Willy Koch
>>>
>>>
