I still haven't opened a PR, but I'm working on the TIKA-4207 branch: https://github.com/apache/tika/tree/TIKA-4207
The initial integration will be with /pipes and /async. I'll try to add something to /v2/unpack (?) later.

On Tue, Mar 12, 2024 at 4:49 PM Zig Zag <ziganda...@gmail.com> wrote:

> That's great to hear, Tim, thank you! Will definitely provide feedback.
>
> While this gets into 3.0 officially, is there something I can prototype with
> /rmeta to help me get my other stuff working? Any suggestions on approach
> or a draft PR for the official feature would be very helpful.
>
> On Tue, Mar 12, 2024 at 5:53 AM Tim Allison <talli...@apache.org> wrote:
>
>> Stay tuned! Coming soon: https://issues.apache.org/jira/browse/TIKA-4207
>>
>> I think I'll be wiring this into the /pipes and /async endpoints. The
>> json request will specify that you want bytes AND text+metadata.
>>
>> There will be two options:
>> a) you specify two emitters: one for json and one for raw bytes
>> b) you specify one emitter, and the json and raw bytes are packaged in a
>> zip
>>
>> I'd really appreciate feedback on the design of this feature and any help
>> finding bugs!
>>
>> Best,
>>
>> Tim
>>
>> Cheers,
>>
>> Tim
>>
>> On Tue, Mar 12, 2024 at 3:17 AM Zig Zag <ziganda...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> I am trying to build a pipeline that needs to process content
>>> recursively and store the binary bytes of all embedded children in
>>> addition to their text and other metadata.
>>>
>>> I was looking at two options:
>>>
>>> 1. Using Tika's /rmeta API and having my code just call it synchronously
>>>    - Is there a way for me to get bytes for embedded children when doing
>>>      this? Basically, some way to smoosh together what /unpack/all does
>>>      into /rmeta.
>>>    - If it's not built in, any guidance on extending my own recursive
>>>      handler to do this? I'd like to keep tika-server as is and just
>>>      configure this extension so I can keep up with updates.
>>>
>>> 2. Using /async or /pipes. With this I had 2 questions:
>>>    - Is there an emitter configuration to commit both bytes and text for
>>>      all children?
>>>    - Is there a way for me to pass in input with my HTTP request, and use
>>>      an emitter only for storage (basically some sort of fetcher that
>>>      uses the input request stream)? This will help me avoid one external
>>>      request.
>>>
>>> Thank you for any help!
>>> Samuel