I updated the documentation on our wiki: https://cwiki.apache.org/confluence/display/TIKA/TikaServer
It looks like we had already documented the maxEmbeddedResources header for /rmeta, but the documentation now covers X-Tika-Skip-Embedded as well.

On Fri, Jul 14, 2023 at 11:57 AM Tim Allison <talli...@apache.org> wrote:

> Two follow ups...
>
> 1) TIKA-3227 was Dave Meikle's addition to skip embedded for /tika. Add a
> header X-Tika-Skip-Embedded with value 'true'.
> 2) You can get just the text content with /rmeta via /rmeta/text
>
>
> On Thu, Jul 13, 2023 at 4:30 PM Tim Allison <talli...@apache.org> wrote:
>
>> Sorry for my delay.
>> For /tika, I thought we had a way to tell it to parse only the primary
>> document and skip the attachments, but I can't figure out how to do that
>> quickly now. I'll look around some more.
>>
>> With /rmeta, try setting a header `maxEmbeddedResources:0`
>>
>> On Fri, Jul 7, 2023 at 5:06 AM Willy T. Koch <t...@kochkonsult.no> wrote:
>>
>>> Hi,
>>> We're using /tika Docker endpoint with text/plain to extract file
>>> content for indexing in Elastic.
>>>
>>> Say I have a 10 MB .msg file with 10 .docx and PDF attachments. I only
>>> need to extract the text from the .msg body, not any of the attachments, as
>>> these are extracted from the .msg and handled separately. It now times out
>>> since it's massive amounts of text to process.
>>>
>>> I can't find any good examples for this for /tika, even on the excellent
>>> wiki at https://cwiki.apache.org/confluence/display/TIKA/TikaServer
>>>
>>> Everything I try results in the attachments being part of the file
>>> extraction output.
>>> I see there is a POST to /tika/form/main which sounds promising, but I
>>> can't get that to work.
>>>
>>> Using /rmeta does it as JSON/html, but we ideally only need the file
>>> content as plain text.
>>>
>>> Any ideas would be greatly appreciated!
>>>
>>> Regards,
>>> Willy Koch
>>>
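
For anyone finding this thread later, here is a rough Python sketch of the two requests discussed above: a PUT to /tika with X-Tika-Skip-Embedded set to 'true', and a PUT to /rmeta/text with maxEmbeddedResources set to 0. It is only an illustration, not tested against every Tika version. The base URL (localhost on port 9998) and the helper names build_tika_request, build_rmeta_text_request and send are assumptions made up for this example.

```python
import urllib.request

# Assumption: tika-server is reachable on localhost at port 9998.
TIKA_URL = "http://localhost:9998"


def build_tika_request(data: bytes, base_url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT to /tika that asks for plain text and skips embedded documents."""
    return urllib.request.Request(
        f"{base_url}/tika",
        data=data,
        method="PUT",
        headers={
            "Accept": "text/plain",
            # Header from TIKA-3227, as described in this thread.
            "X-Tika-Skip-Embedded": "true",
        },
    )


def build_rmeta_text_request(data: bytes, base_url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT to /rmeta/text that limits embedded resources to zero."""
    return urllib.request.Request(
        f"{base_url}/rmeta/text",
        data=data,
        method="PUT",
        # urllib normalizes the capitalization of header names; HTTP header
        # names are case-insensitive, so this should not matter to the server.
        headers={"maxEmbeddedResources": "0"},
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the response body decoded as UTF-8."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

To use it, read the file as bytes (for example a hypothetical message.msg) and call send(build_tika_request(data)) to get the body text only. The /rmeta/text variant still returns JSON, with the text content of the container document held in the metadata list.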