Hello Tim,

Thank you for the clarification. I have decided to stick with the /pipes
endpoint for now, because I can use the JSON response from the Tika API to
at least detect when file X did not process properly. I will definitely
explore the PipesReporter and the unit tests; if I end up with something
worth documenting, I will give you a shout on here.

Many thanks,
Georgi
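
[Editor's note: for anyone following the same route, a minimal sketch of
that "did file X process properly?" check is below. The `status` field
name and its values are assumptions, not a documented contract -- inspect
a real JSON reply from your /pipes endpoint and adjust the key and the
matched substrings accordingly.]

```python
import json


def pipes_reply_ok(reply_body: bytes) -> bool:
    """Return True when a /pipes JSON reply does not report a parse problem.

    The 'status' field name and its values are assumptions; check a real
    reply from your tika-server and adapt this check.
    """
    reply = json.loads(reply_body.decode("utf-8"))
    status = str(reply.get("status", "")).upper()
    # Treat anything mentioning an exception or timeout as a failure.
    return "EXCEPTION" not in status and "TIMEOUT" not in status
```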

On Tue, 3 Oct 2023, 20:38 Tim Allison, <talli...@apache.org> wrote:

> I'm sorry for my delay.
>
> At some point I was thinking about implementing /async/<task_id>, but I
> gave up. The problem was that I didn't want to tie caching/storing of
> status info into tika-server or the async processor -- so I created a
> configurable PipesReporter class... see below.
>
> If you set up logging carefully, and I need to document this better, you
> can get a log per async subprocess which will include that "id" key and any
> assorted stacktraces that were caught during the parse.  Those logs will
> not tell you when a subprocess timed out or crashed, but they do offer rich
> information.
>
> The other method (and I also have to document this better) is to specify a
> PipesReporter in the async config section of the tika-config.xml file. The
> pipes-reporter sits in the root process and is aware of both parse
> exceptions and fatal crashes/timeouts.  I've implemented a couple:
> JDBCPipesReporter (which I used quite a bit on some large processing jobs
> with postgresql) and there's the LoggingPipesReporter.
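
[Editor's note: for readers hunting for a starting point, the reporter
wiring in tika-config.xml looks roughly like the sketch below. The
element names, class name, and parameter names are from memory and may
not match your Tika version -- treat them as placeholders and verify
against the Tika source and wiki before use.]

```xml
<properties>
  <async>
    <!-- The reporter runs in the root process, so it sees parse
         exceptions as well as subprocess crashes/timeouts. -->
    <pipesReporter class="org.apache.tika.pipes.reporters.jdbc.JDBCPipesReporter">
      <params>
        <!-- Illustrative JDBC URL; point this at your own database. -->
        <connection>jdbc:postgresql://localhost:5432/tika_status</connection>
      </params>
    </pipesReporter>
  </async>
</properties>
```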
>
> If you let me know which direction you'd like to head, I can focus on
> documenting that portion. :D
>
> Or if you figure out what you need from our repo's unit tests and want to
> document your findings, we'd be happy to have help with the documentation!
>
> Best,
>
> Tim
>
>
>
> On Thu, Sep 28, 2023 at 7:08 AM Georgi Nikolov <nikolov3...@gmail.com>
> wrote:
>
>> Hello,
>>
>> I feel like I have been stuck on this for too long now, so I decided to
>> try to get some information here.
>>
>> I need Tika to extract content and metadata from thousands of files in
>> S3. What I want to do is run Tika as a standalone server and use the S3
>> fetchers and emitters in conjunction with /async.
>>
>> However, I am having some difficulty tracking what is going on on the
>> server side. My client code is in Python, using `python-tika`.
>>
>> A payload is built programmatically and sent to the /async endpoint,
>> but I need to be able to track either the whole async task or the
>> individual tuples in it -- have they failed, succeeded, or are they
>> still running? I am struggling to achieve that and could not find any
>> related information on whether it is possible. I came across some
>> information suggesting it can be done by checking
>> `/tika/async/<task_id>`, and that when you send a PUT the response
>> should contain an X-Tika-id header, but neither of these seems to
>> work. Additionally, from Confluence:
>>
>>> By default, the fetchKey is used as the id for logging. However, if
>>> users need a distinct task id for the request, they may add an id
>>> element:
>>>
>>> {
>>>     "id": "myTaskId",
>>>     "fetcher": "fsf",
>>>     "fetchKey": "hello_world.pdf",
>>>     "emitter": "fse",
>>>     "emitKey": "hello_world.pdf.json"
>>> }
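
[Editor's note: a short Python sketch of building that payload for the
/async endpoint, which takes a JSON array of such tuples. The fetcher and
emitter names `fsf`/`fse` come from the example above; the id scheme is
illustrative and up to you.]

```python
import json


def build_async_payload(fetch_keys):
    """Build the JSON array of fetch/emit tuples for the /async endpoint.

    Adds a distinct 'id' per tuple so per-task log lines can be matched
    back to the originating file.
    """
    tuples = []
    for i, key in enumerate(fetch_keys):
        tuples.append({
            "id": f"task-{i}-{key}",   # illustrative id scheme
            "fetcher": "fsf",          # fetcher name from tika-config.xml
            "fetchKey": key,
            "emitter": "fse",          # emitter name from tika-config.xml
            "emitKey": key + ".json",
        })
    return json.dumps(tuples)
```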
>>>
>> Is there a way to track the task id when running /async? It looks like
>> there is, from all I have seen so far, but I can't seem to figure out
>> how to actually achieve it. If I try a `GET` on /async/<task_id>,
>> nothing happens (no such resource); if I try /tika/async/<task_id>, I
>> get a 405.
>>
>> I have tried using /pipes, which captures errors etc. and is handy, but
>> what about /async? /async doesn't seem to throw any errors no matter
>> what actually happens in Tika, as long as the payload is valid;
>> processing errors or bad AWS credential errors just get skipped.
>>
>> Any pointers in the right direction will be welcome.
>>
>> Thanks,
>> Georgi
>>
>
