zamf commented on PR #2769: URL: https://github.com/apache/tika/pull/2769#issuecomment-4261411334
> At a high level, we've added vlm inference hooks in 4.x: https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-ml/tika-inference/src/main/java/org/apache/tika/inference/OpenAIImageEmbeddingParser.java
>
> And we also have vlm parsers with a "give me all the text" prompt that should yield similar results to OCR: https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-ml/tika-vlm/src/main/java/org/apache/tika/parser/vlm/OpenAIVLMParser.java
>
> The other thing we've added is recursive embedded file extraction so that you can get the `List<Metadata>`/json back and aim an emitter at a file share or s3, and Tika will write the bytes for embedded files there. You can configure it to output only images (I think?).
>
> I understand that you might want to do post-processing/inference at a different stage, though, and this looks decent to me on a quick glance.

Interesting, I was not aware. It sounds like you have done a lot of work to enable image processing outside Tika. I wanted something that:

1. allows post-processing at a different stage;
2. does not use HTTP, so that it frees up memory quickly and does not have to wait for a VLM to reply;
3. is order-preserving, so that it is easy to re-assemble the full document later;
4. does not rely on Tika having access to storage.

What I found is that base64 encoding is still not as fast as I would have hoped, so if I could change the text interface to reply with protobuf, that would be a substantial speed improvement.
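To make the base64 point concrete: base64 maps every 3 payload bytes to 4 output characters, so large embedded images pay roughly 33% size inflation on top of the encode/decode CPU cost. A minimal stand-alone sketch with the plain JDK (not Tika code, just an illustration of the overhead):

```java
import java.util.Base64;

public class Base64Overhead {
    public static void main(String[] args) {
        // Stand-in for an extracted embedded image (3 MiB of bytes).
        byte[] raw = new byte[3 * 1024 * 1024];

        // Base64 encodes 3 input bytes as 4 output characters.
        String encoded = Base64.getEncoder().encodeToString(raw);

        System.out.println("raw bytes:    " + raw.length);       // 3145728
        System.out.println("base64 chars: " + encoded.length()); // 4194304 (~33% larger)

        // Decoding round-trips, but both directions cost CPU and allocations.
        byte[] decoded = Base64.getDecoder().decode(encoded);
        System.out.println("round-trips:  " + (decoded.length == raw.length)); // true
    }
}
```

A protobuf `bytes` field carries the raw octets directly on the wire, so it avoids both the ~33% inflation and the encode/decode passes, which is where the speed improvement would come from.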
