Re: [PR] Add OCR encode parser module [tika]

via GitHub Thu, 16 Apr 2026 08:29:03 -0700


tballison commented on PR #2769:
URL: https://github.com/apache/tika/pull/2769#issuecomment-4261293428


   At a high level, we've added VLM inference hooks in 4.x: 
https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-ml/tika-inference/src/main/java/org/apache/tika/inference/OpenAIImageEmbeddingParser.java
   
   We also have VLM parsers with a "give me all the text" prompt that 
should yield results similar to OCR: 
https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-ml/tika-vlm/src/main/java/org/apache/tika/parser/vlm/OpenAIVLMParser.java
   
   The other thing we've added is recursive embedded file extraction: you 
get the List<Metadata> back, and you can aim an emitter at a file share or S3 
so that Tika writes the bytes of the embedded files there. You can configure 
it to output only images (I think?).
   
   I understand that you might want to do post-processing/inference at a 
different stage, though, and this looks decent to me at a quick glance.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
