Still haven’t opened a PR, but I'm working on the TIKA-4207 branch:
https://github.com/apache/tika/tree/TIKA-4207

The initial integration will be with /pipes and /async

I’ll try to add something to /v2/unpack (?) later?
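
In the meantime, for prototyping against /rmeta, here is a minimal Python sketch of a client that sends a file to tika-server and lists the parent and each embedded child. It only covers text and metadata, not the raw bytes of the children. The URL (a local tika-server on its default port 9998), the function names, and the "/" placeholder used for the container document are assumptions for illustration, not part of the TIKA-4207 work.

```python
import json
import urllib.request
from typing import Dict, List

# Assumption: tika-server is running locally on its default port.
TIKA_RMETA_URL = "http://localhost:9998/rmeta"


def build_rmeta_request(data: bytes, url: str = TIKA_RMETA_URL) -> urllib.request.Request:
    """Build (but do not send) a PUT request carrying the file bytes to /rmeta."""
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Accept": "application/json"},
    )


def summarize_rmeta(rmeta_json: str) -> List[Dict]:
    """Turn the /rmeta JSON (a list of metadata objects, container first)
    into one small record per document."""
    rows = []
    for index, metadata in enumerate(json.loads(rmeta_json)):
        rows.append({
            "index": index,
            # "/" is a placeholder chosen here for entries without a path.
            "path": metadata.get("X-TIKA:embedded_resource_path", "/"),
            "content_type": metadata.get("Content-Type"),
            "text": metadata.get("X-TIKA:content", ""),
        })
    return rows


def fetch_rmeta(data: bytes, url: str = TIKA_RMETA_URL) -> List[Dict]:
    """Send the bytes to tika-server and summarize the response."""
    with urllib.request.urlopen(build_rmeta_request(data, url)) as response:
        return summarize_rmeta(response.read().decode("utf-8"))
```

Calling fetch_rmeta(open("some-file.docx", "rb").read()) against a running server would return one record per document; getting the children's bytes would still need a separate call such as /unpack/all until the new feature lands.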

On Tue, Mar 12, 2024 at 4:49 PM Zig Zag <ziganda...@gmail.com> wrote:

> That's great to hear, Tim, thank you! I will definitely provide feedback.
>
> While this makes its way into 3.0 officially, is there something I can
> prototype with /rmeta to help me get my other stuff working? Any suggestions
> on approach or a draft PR for the official feature would be very helpful.
>
> On Tue, Mar 12, 2024 at 5:53 AM Tim Allison <talli...@apache.org> wrote:
>
>> Stay tuned! Coming soon: https://issues.apache.org/jira/browse/TIKA-4207
>>
>> I think I'll be wiring this into the /pipes and /async endpoints. The
>> json request will specify that you want bytes AND text+metadata.
>>
>> There will be two options:
>> a) you specify two emitters: one for json and one for raw bytes
>> b) you specify one emitter, and the json and raw bytes are packaged in a
>> zip
>>
>> I'd really appreciate feedback on the design of this feature and any help
>> finding bugs!
>>
>> Best,
>>
>>         Tim
>>
>> Cheers,
>>
>>         Tim
>>
>> On Tue, Mar 12, 2024 at 3:17 AM Zig Zag <ziganda...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> I am trying to build a pipeline that needs to process content
>>> recursively and store the binary bytes of all embedded children in addition
>>> to their text and other metadata.
>>>
>>>  I was looking at two options:
>>>
>>> 1. using Tika's /rmeta API and having my code just call it synchronously
>>> - Is there a way for me to get bytes for embedded children when doing
>>> this? Basically, some way to smoosh together what /unpack/all does into /rmeta.
>>> - If it's not built in, is there any guidance on extending my own recursive
>>> handler to do this? I'd like to keep tika-server as is and just configure
>>> this extension so I can keep up with updates.
>>>
>>> 2. using /async or /pipes - with this I had 2 questions:
>>> - Is there an emitter configuration to commit both bytes and text for all
>>> children?
>>> - Is there a way for me to pass in input with my HTTP request and use an
>>> emitter only for storage? (Basically some sort of fetcher that uses the
>>> input request stream; this would help me avoid one external request.)
>>>
>>> Thank you for any help!
>>> Samuel
>>>
>>
