janhoy commented on PR #3670:
URL: https://github.com/apache/solr/pull/3670#issuecomment-3337472814

   The last commit adds recursive parsing as an option, `&recursive=true`, for 
the `tikaserver` backend. I moved the failing image test to the Local test 
class, since PDFs with images are parsed differently in 1.x and 3.x: in 1.x, 
embedded images left traces in the extracted text by default; in 3.x they no 
longer do. I also added a test to the tikaserver backend test class that 
extracts the same PDF with recursive enabled and with a special HTTP header 
sent to TikaServer, `X-Tika-PDFextractInlineImages=true`, which does extract 
the images and adds their names to the text. This recursive support is highly 
experimental. It uses a different endpoint, `/rmeta`, which returns JSON that 
needs to be buffered entirely in memory on both the TikaServer side and the 
Solr side. Also, all the embedded documents are returned concatenated together 
in the content, and I believe that only the metadata from the main object is 
retained?
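
   For readers unfamiliar with the recursive endpoint, here is a minimal sketch 
of how such a request could be built with the JDK's `java.net.http` client. It 
is not the code from this PR: the class name `TikaRmetaRequestSketch`, the 
method `buildRmetaRequest`, the 60 second timeout and the 
`http://localhost:9998` base URL are all assumptions made for illustration. 
Only the `/rmeta` endpoint and the `X-Tika-PDFextractInlineImages` header come 
from the comment above.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

// Hypothetical helper, not part of the PR: builds (but does not send) a
// recursive-parse request for a TikaServer instance.
final class TikaRmetaRequestSketch {

  static HttpRequest buildRmetaRequest(
      String tikaBaseUrl, byte[] document, boolean extractInlineImages) {
    // Strip trailing slashes so the path is always exactly "/rmeta".
    String base = tikaBaseUrl.replaceAll("/+$", "");
    HttpRequest.Builder builder =
        HttpRequest.newBuilder(URI.create(base + "/rmeta"))
            .timeout(Duration.ofSeconds(60)) // assumed value, tune as needed
            .header("Accept", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofByteArray(document));
    if (extractInlineImages) {
      // Header named in the comment above; asks Tika to extract inline PDF images.
      builder.header("X-Tika-PDFextractInlineImages", "true");
    }
    return builder.build();
  }

  public static void main(String[] args) {
    // Assumed base URL; adjust to wherever TikaServer is listening.
    HttpRequest req =
        buildRmetaRequest("http://localhost:9998", new byte[] {1, 2, 3}, true);
    System.out.println(req.method() + " " + req.uri());
  }
}
```

   Sending the request with `HttpClient.send(req, BodyHandlers.ofString())` 
would buffer the whole JSON response in memory, which matches the limitation 
described above.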
   
   All tests are now green. However, there is still a thread leak in the 
tikaserver test. I think some HttpClient resources are not being released.
   
   Other TODO:
   
   - [ ] Lots of code from GenAI, which needs review and rewrite / 
simplification.
   - [ ] There may be debug prints and TODOs left here and there
   - [ ] The back-compat metadata map is AI generated and by no means complete 
or even correct 🤣
   - [ ] Lack of JavaDoc everywhere
   - [ ] Perhaps some code requires Java 21, making backporting a challenge
   - [ ] Not much attention has been given to exception handling, retrying, 
timeout values etc
   - [ ] Should probably use Jetty HTTP client.
   - [ ] Throughput with many update requests? Right now the request thread 
blocks on the Tika response and parsing.
   - [ ] If TikaServer supports HTTPS we'd probably need to handle self-signed 
SSL certificates, either through a truststore or custom config like we did for 
JWT-auth.
   - [ ] If TikaServer supports auth, that must be considered
   - [ ] Complete refGuide docs
   - [ ] Split the huge PR into stages, i.e. first only support for pluggable 
backend, then add the tikaserver backend
   
   That concludes the "POC", proving that a drop-in replacement for users is 
doable.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

