zamf commented on PR #2769: URL: https://github.com/apache/tika/pull/2769#issuecomment-4261411334
> At a high level, we've added vlm inference hooks in 4.x: https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-ml/tika-inference/src/main/java/org/apache/tika/inference/OpenAIImageEmbeddingParser.java
>
> And we also have vlm parsers with a "give me all the text" prompt that should yield similar results to OCR: https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-ml/tika-vlm/src/main/java/org/apache/tika/parser/vlm/OpenAIVLMParser.java
>
> The other thing we've added is recursive embedded file extraction so that you can get the `List<Metadata>`/json back and aim an emitter at a file share or s3, and Tika will write the bytes for embedded files there. You can configure it to output only images (I think?).
>
> I understand that you might want to do post-processing/inference at a different stage, though, and this looks decent to me on a quick glance.

Interesting, I was not aware. It sounds like you have done a lot of work to enable image processing outside Tika. I wanted something that:

1. allows post-processing at a different stage;
2. does not use HTTP, so that it frees up memory quickly and does not have to wait for a VLM to reply;
3. is order-preserving, so that it is easy to re-assemble the full document later;
4. does not rely on Tika having access to storage.

What I found is that base64 encoding is still not as fast as I would have hoped, so if I could change the text interface to reply with protobuf, that would be a substantial speed improvement.
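To make the base64 point concrete: base64 maps every 3 payload bytes to 4 output characters, so large embedded images pay roughly 33% size inflation on top of the encode/decode CPU cost. A minimal stand-alone sketch with the plain JDK (not Tika code, just an illustration of the overhead):

```java
import java.util.Base64;

public class Base64Overhead {
    public static void main(String[] args) {
        // Stand-in for an extracted embedded image (3 MiB of bytes).
        byte[] raw = new byte[3 * 1024 * 1024];

        // Base64 encodes 3 input bytes as 4 output characters.
        String encoded = Base64.getEncoder().encodeToString(raw);

        System.out.println("raw bytes:    " + raw.length);       // 3145728
        System.out.println("base64 chars: " + encoded.length()); // 4194304 (~33% larger)

        // Decoding round-trips, but both directions cost CPU and allocations.
        byte[] decoded = Base64.getDecoder().decode(encoded);
        System.out.println("round-trips:  " + (decoded.length == raw.length)); // true
    }
}
```

A protobuf `bytes` field carries the raw octets directly on the wire, so it avoids both the ~33% inflation and the encode/decode passes, which is where the speed improvement would come from.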
