Thanks Tim/Josh,

My use case is to recursively get all three things: text (as plain text), metadata, and bytes. I made a mistake: /rmeta/all was wrong, and /rmeta/text does what I need. (I thought it would output only text and no metadata, but I see everything is still there.) Sorry.
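
For reference, a minimal Python sketch of the /rmeta/text usage described here. It assumes a Tika server at http://localhost:9998 and that each JSON object in the response carries its extracted text under the X-TIKA:content key, with the remaining keys being metadata. The helper names are hypothetical, and the request is only built, not sent:

```python
import json
import urllib.request

# Assumption: key under which /rmeta puts the extracted text of each document.
CONTENT_KEY = "X-TIKA:content"


def build_rmeta_text_request(data: bytes, base_url: str = "http://localhost:9998"):
    """Build (but do not send) a PUT request to /rmeta/text asking for JSON."""
    return urllib.request.Request(
        f"{base_url}/rmeta/text",
        data=data,
        method="PUT",
        headers={"Accept": "application/json"},
    )


def split_text_and_metadata(rmeta_json: str):
    """Return one (text, metadata) pair per document in an rmeta JSON list.

    The list is expected to hold the container document first, followed by
    one entry per embedded document.
    """
    pairs = []
    for doc in json.loads(rmeta_json):
        metadata = dict(doc)                     # copy so the input is untouched
        text = metadata.pop(CONTENT_KEY, "")     # plain text, if any was extracted
        pairs.append((text, metadata))
    return pairs
```

To actually call the server, pass the built request to urllib.request.urlopen and feed the decoded response body to split_text_and_metadata.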
Thanks for working on TIKA-4207. I am trying to adapt my pipeline to work asynchronously, but there are some architecture issues with my own code/infrastructure. If you create a synchronous endpoint, I'd suggest /runpack (for "recursive unpack", to go with /rmeta).

Thank you,
Samuel

On Thu, Mar 21, 2024 at 3:14 PM Tim Allison <[email protected]> wrote:

> I’m making progress on TIKA-4207, which will allow you to specify separate
> emitters for /rmeta like output and a separate emitter for the raw bytes
> from all embedded files.
>
> That uses the /pipes or /async endpoints.
>
> After I finish that, I’ll try to add another endpoint that returns a zip
> with embedded raw bytes and the rmeta content.
>
> Not sure what to call that endpoint. Recommendations?
>
> On Thu, Mar 21, 2024 at 6:10 PM Tim Allison <[email protected]> wrote:
>
>> If rmeta/text is not returning text extracted from embedded files that’s
>> a bug.
>>
>> I don’t think /rmeta/all is a thing.
>>
>> On Thu, Mar 21, 2024 at 5:21 PM Zig Zag <[email protected]> wrote:
>>
>>> Thanks Josh, thats correct but rmeta/text allows you to control this but
>>> it only returns one level of text (not documents embedded within others) -
>>> when you use the recursive interface rmeta/all it always returns content as
>>> HTML and similarly unpack/all returns meta as CSV.
>>>
>>> On Thu, Mar 21, 2024 at 1:40 PM Josh Burchard <[email protected]>
>>> wrote:
>>>
>>>> Samuel - Well, I use Tika server and I get my data back in JSON format
>>>> because I use the /rmeta/text endpoint and send the HTTP header
>>>> Accept:application/json. If you were to send Accept:text/plain would that
>>>> work for you? I've only done that in the context of the /tika endpoint and
>>>> that was long ago. Not sure how to do anything similar in the app because
>>>> I never use that.
>>>> By the way, in the context of using the server I find
>>>> this table very helpful:
>>>>
>>>> https://cwiki.apache.org/confluence/display/TIKA/TikaServerEndpointsCompared
>>>>
>>>> From: "Zig Zag" <[email protected]>
>>>> To: [email protected]
>>>> Date: 03/21/2024 03:49 PM
>>>> Subject: Re: Meta output format of tika server /unpack/all
>>>> ------------------------------
>>>>
>>>> Similarly is it possible to have /rmeta/all format content/text as text
>>>> instead of HTML?
>>>>
>>>> On Thu, Mar 21, 2024 at 9:50 AM Zig Zag <[email protected]> wrote:
>>>> Hi All,
>>>>
>>>> Is there a way to get the __META__ output of /unpack/all in a JSON
>>>> rather than CSV ?
>>>>
>>>> Thank you,
>>>> Samuel
