janhoy commented on PR #3670: URL: https://github.com/apache/solr/pull/3670#issuecomment-3337472814
Last commit adds recursive parsing as an option `&recursive=true` for backend `tikaserver`.

I moved the failing image test to the Local test class, since PDFs with images are parsed differently in Tika 1.x and 3.x. In 1.x, embedded images would leave traces in the extracted text by default. Not anymore. But I added a test to the tikaserver backend test class that extracts the same PDF with recursive enabled and a special HTTP header to TikaServer, `X-Tika-PDFextractInlineImages=true`, which indeed extracts the images and adds their names to the text.

This recursive business is highly experimental. It uses a different endpoint, `/rmeta`, which returns JSON that needs to be buffered entirely in memory on both the TikaServer side and the Solr side. Also, all the embedded documents are returned concatenated together in the content, and I believe only the metadata from the main object is retained?

All tests are now green. However, there is still a thread leak in the tikaserver test. I think some HttpClient resources are not being released.

Other TODO:

- [ ] Lots of code from GenAI, which needs review and rewrite / simplification.
- [ ] There may be debug prints and TODOs left here and there
- [ ] The back-compat metadata map is AI generated and by no means complete or even correct 🤣
- [ ] Lack of JavaDoc everywhere
- [ ] Perhaps some code is Java 21, making backport a challenge
- [ ] Not much attention has been given to exception handling, retrying, timeout values etc.
- [ ] Should probably use Jetty HTTP client.
- [ ] Throughput with many update requests? Right now the request thread blocks while waiting for and parsing the Tika response.
- [ ] If TikaServer supports HTTPS we'd probably need to handle self-signed SSL either through truststore, or custom config like we did for JWT-auth.
- [ ] If TikaServer supports Auth, that must be thought of
- [ ] Complete refGuide docs
- [ ] Split the huge PR into stages, i.e. first only support for pluggable backend, then add the tikaserver backend

That concludes the "POC", proving that it is doable to do a drop-in replacement for users.
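
For anyone who wants to poke at the recursive path by hand, the `/rmeta` request described above could look roughly like the sketch below. This is not the code in the PR. The class name, method name, base URL and the 30-second timeout are all assumptions made for illustration, and it uses the JDK `java.net.http` client rather than whatever client the PR ends up with. It only builds the request and does not send it.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

/** Hypothetical sketch: builds a recursive-extraction request for a Tika Server. */
public class TikaRmetaRequestSketch {

  /**
   * Builds (but does not send) a PUT request to the /rmeta endpoint.
   *
   * @param tikaBaseUrl base URL without trailing slash (assumed, e.g. http://localhost:9998)
   * @param document raw bytes of the file to extract
   * @param extractInlineImages if true, adds the X-Tika-PDFextractInlineImages header
   */
  public static HttpRequest buildRmetaRequest(
      String tikaBaseUrl, byte[] document, boolean extractInlineImages) {
    HttpRequest.Builder builder =
        HttpRequest.newBuilder()
            .uri(URI.create(tikaBaseUrl + "/rmeta"))
            .timeout(Duration.ofSeconds(30)) // assumed value, not taken from the PR
            .header("Accept", "application/json") // /rmeta answers with a JSON array
            .PUT(HttpRequest.BodyPublishers.ofByteArray(document));
    if (extractInlineImages) {
      // Header mentioned in the comment above; asks the PDF parser to extract inline images
      builder.header("X-Tika-PDFextractInlineImages", "true");
    }
    return builder.build();
  }

  public static void main(String[] args) {
    HttpRequest request = buildRmetaRequest("http://localhost:9998", new byte[] {1, 2, 3}, true);
    System.out.println(request.method() + " " + request.uri());
    System.out.println(request.headers().map());
  }
}
```

Sending it with `HttpClient.send(request, HttpResponse.BodyHandlers.ofString())` would buffer the whole JSON response in memory, which is the limitation noted above.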
