tballison commented on PR #2769:
URL: https://github.com/apache/tika/pull/2769#issuecomment-4261293428

At a high level, we've added VLM inference hooks in 4.x:
https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-ml/tika-inference/src/main/java/org/apache/tika/inference/OpenAIImageEmbeddingParser.java

We also have VLM parsers with a "give me all the text" prompt that should yield results similar to OCR:
https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-ml/tika-vlm/src/main/java/org/apache/tika/parser/vlm/OpenAIVLMParser.java

The other thing we've added is recursive embedded file extraction, so that you can get the List<Metadata> back and aim an emitter at a file share or S3; Tika will then write the bytes of the embedded files there. You can configure it to output only images (I think?).

I understand that you might want to do post-processing/inference at a different stage, though, and this looks decent to me on a quick glance.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
