Hello Tim,

Thank you for the clarification. I have decided to stick with the /pipes endpoint for now, because I can use the JSON response from the Tika API to at least detect when file X did not process properly. I will definitely explore the PipesReporter and the unit tests; if I have something worth documenting in the end, I will give you a shout on here.
Many thanks,
Georgi

On Tue, 3 Oct 2023, 20:38 Tim Allison, <talli...@apache.org> wrote:

> I'm sorry for my delay.
>
> At some point, I was thinking about implementing /async/<task_id>, but I
> gave up. The problem was that I didn't want to have to tie caching/storing
> status info into tika-server or the async processor -- so I created a
> configurable PipesReporter class...see below.
>
> If you set up logging carefully (and I need to document this better), you
> can get a log per async subprocess which will include that "id" key and any
> assorted stacktraces that were caught during the parse. Those logs will
> not tell you when a subprocess timed out or crashed, but they do offer rich
> information.
>
> The other method (and I also have to document this better) is to specify a
> PipesReporter in the async config section of the tika-config.xml file. The
> pipes reporter sits in the root process and is aware of both parse
> exceptions and fatal crashes/timeouts. I've implemented a couple:
> JDBCPipesReporter (which I used quite a bit on some large processing jobs
> with PostgreSQL), and there's the LoggingPipesReporter.
>
> If you let me know which direction you'd like to head, I can focus on
> documenting that portion. :D
>
> Or if you figure out what you need from our repo's unit tests and want to
> document your findings, we'd be happy to have help with the documentation!
>
> Best,
>
> Tim
>
> On Thu, Sep 28, 2023 at 7:08 AM Georgi Nikolov <nikolov3...@gmail.com>
> wrote:
>
>> Hello,
>>
>> I have been stuck on this for too long now, so I decided to
>> try to get some information here.
>>
>> I need Tika to extract content and metadata from thousands of files
>> from S3.
>> What I wanted to do is have Tika running as a standalone server
>> and use the S3 fetchers and emitters in conjunction with /async.
>>
>> However, I am having difficulty tracking what is going on on the
>> server side; my client code is in Python, using `python-tika`.
>>
>> A payload is built programmatically and sent to the /async endpoint, but I
>> need to be able to track either the whole async task or the individual tuples
>> in it -- have they failed, succeeded, or are they still running? I am
>> struggling to achieve that and could not find any related information on
>> whether this is possible. I came across some information that it can be
>> achieved by checking /tika/async/<task_id>, and that when you send a PUT
>> the response should contain an X-Tika-id header, but neither of these seems to
>> work. Additionally, from Confluence:
>>
>>> By default the fetchKey is used as the id for logging. However, if
>>> users need a distinct task id for the request, they may add an id
>>> element:
>>>
>>> {
>>>   "id": "myTaskId",
>>>   "fetcher": "fsf",
>>>   "fetchKey": "hello_world.pdf",
>>>   "emitter": "fse",
>>>   "emitKey": "hello_world.pdf.json"
>>> }
>>
>> Is there a way to track the task id when running from /async? It looks like
>> there is, from all that I have seen so far, but I can't seem to figure out how
>> to actually achieve it. If I try a GET on /async/<task_id>, nothing
>> happens -- no resource; if I try /tika/async/<task_id>, I get a 405.
>>
>> I have tried using /pipes, which captures errors etc. and is handy, but
>> what about async?
>> /async doesn't seem to surface any errors no matter what actually happens
>> in Tika, as long as the payload is valid -- e.g. processing errors or bad AWS
>> credential errors all just get skipped.
>>
>> Any pointers in the right direction will be welcome.
>>
>> Thanks,
>> Georgi
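[Editor's note: for readers following the thread, a rough sketch of what Tim's PipesReporter suggestion might look like in tika-config.xml. The class path and the `connection` parameter name are assumptions based on the Tika 2.x pipes modules, not something confirmed in this thread; check the current Tika documentation before use.]

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <async>
    <!-- Assumed class name and parameter; verify against your Tika version. -->
    <pipesReporter class="org.apache.tika.pipes.reporters.jdbc.JDBCPipesReporter">
      <params>
        <connection>jdbc:postgresql://localhost:5432/tika</connection>
      </params>
    </pipesReporter>
  </async>
</properties>
```

The reporter sits in the root process, so it can record parse exceptions as well as subprocess crashes and timeouts, which per-subprocess logs alone cannot.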
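[Editor's note: a minimal sketch of building the /async payload discussed above, with an explicit "id" per tuple so each task can be found in the logs. The fetcher/emitter names, server URL, and `task-N` id scheme are illustrative assumptions, not confirmed configuration.]

```python
# Sketch, untested against a live tika-server: build an /async payload
# where every fetch/emit tuple carries its own "id", as in the quoted
# Confluence example. Fetcher/emitter names ("s3f"/"s3e") are placeholders.
import json
import urllib.request


def build_async_payload(keys, fetcher="s3f", emitter="s3e"):
    """Build one fetch/emit tuple per S3 key, each with a distinct task id."""
    return [
        {
            "id": f"task-{i}",          # appears in per-task logging
            "fetcher": fetcher,
            "fetchKey": key,
            "emitter": emitter,
            "emitKey": key + ".json",
        }
        for i, key in enumerate(keys)
    ]


def post_async(payload, base_url="http://localhost:9998"):
    """POST the payload to the /async endpoint (default tika-server port assumed)."""
    req = urllib.request.Request(
        base_url + "/async",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


payload = build_async_payload(["hello_world.pdf", "report.pdf"])
print(json.dumps(payload, indent=2))
```

Note that, as discussed in the thread, the HTTP response alone will not report per-tuple outcomes; a PipesReporter (or careful per-subprocess logging keyed on "id") is needed for that.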