Tika server mode - /async task tracking

Georgi Nikolov Thu, 28 Sep 2023 04:08:51 -0700

Hello,

I have been stuck on this for too long now, so I decided to ask for some
information here.


I need Tika to extract content and metadata from thousands of files in S3.
What I want to do is run Tika as a standalone server and use S3 fetchers
and emitters in conjunction with /async.

However, I am having difficulty tracking what is going on on the server
side. My client code is in Python, using `python-tika`.

A payload is built programmatically and sent to the /async endpoint, but I
need to be able to track either the whole async task or the individual
tuples in it: have they failed, succeeded, or are they still running? I am
struggling to achieve that and could not find any information on whether it
is possible. I came across some suggestions that it can be done by checking
`/tika/async/<task_id>`, and that the response to a PUT should contain an
X-Tika-id header, but neither of these seems to work. Additionally, from
Confluence:

> As default the fetchKey is used as the id for logging.  However, if users
> need a distinct task id for the request, they may add an id element:
>
> {
>     "id": "myTaskId",
>     "fetcher": "fsf",
>     "fetchKey": "hello_world.pdf",
>     "emitter": "fse",
>     "emitKey": "hello_world.pdf.json"
> }
>
>
>
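For illustration, here is a minimal sketch of how a batch of such tuples, each with its own "id", could be built in Python so the client can at least map ids back to S3 keys later. It assumes /async accepts a JSON array of tuples shaped like the Confluence example above; the fetcher and emitter names `s3f` and `s3e`, the `.json` emit-key suffix, and both helper functions are hypothetical and not part of Tika or `python-tika`.

```python
import json
import uuid


def build_async_payload(keys, fetcher="s3f", emitter="s3e", id_prefix="task"):
    """Build one fetch-emit tuple per key, each with a unique "id".

    The fetcher/emitter names are placeholders; they must match whatever
    is configured in the server's tika-config.xml.
    """
    tuples = []
    for key in keys:
        tuples.append({
            "id": f"{id_prefix}-{uuid.uuid4().hex}",  # distinct id per tuple
            "fetcher": fetcher,
            "fetchKey": key,
            "emitter": emitter,
            "emitKey": key + ".json",  # assumed naming convention
        })
    return tuples


def index_by_id(tuples):
    """Map each tuple id to its fetchKey for client-side bookkeeping."""
    return {t["id"]: t["fetchKey"] for t in tuples}


payload = build_async_payload(["docs/a.pdf", "docs/b.docx"])
id_to_key = index_by_id(payload)
body = json.dumps(payload)  # assumed request body for a POST to /async
```

This only covers the client side: it records which id was assigned to which file, and does not by itself reveal whether the server finished or failed a given tuple.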
Is there a way to track the task id when running from /async? From
everything I have seen so far it looks like there is, but I can't figure out
how to actually do it. If I do a `GET` on /async/<task_id>, nothing happens
(no such resource); if I use /tika/async/<task_id>, I get a 405.

I have tried using /pipes, which captures errors and is handy, but what
about /async?
/async doesn't seem to report any errors no matter what actually happens in
Tika, as long as the payload is valid. Processing errors or bad AWS
credential errors, for example: everything just gets skipped.

Any pointers in the right direction will be welcome.

Thanks,
Georgi
