Re: [PR] SOLR-7632 TikaServer as pluggable backend to existing extraction handler [solr]

via GitHub Thu, 25 Sep 2025 15:54:20 -0700


janhoy commented on PR #3670:
URL: https://github.com/apache/solr/pull/3670#issuecomment-3336173646


   > Are you seeing any issues in how TikaServer works that maybe are better 
fixed there? Some great progress!
   
   I think not really. The only "quirk" I saw was that if you ingest a plain txt 
document, you get back an XML with a title like `<title>&#0;</title>`. It is 
parsed by `TextAndCSVParser`. When feeding that XML into a SAX parser, it bails 
out on the invalid character, so I inserted an XML sanitizer.
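
For illustration only, here is a minimal sketch of what such a sanitizer could look like. It is not the code from this PR; the class name `XmlSanitizerSketch` and its methods are hypothetical. It assumes the input is the XML text returned by Tika Server, and it drops both numeric character references (such as `&#0;`) and raw characters that fall outside the XML 1.0 `Char` production.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not the sanitizer used in the PR.
final class XmlSanitizerSketch {
    // Numeric character references, decimal (&#0;) or hexadecimal (&#x1F;).
    private static final Pattern NUMERIC_REF =
        Pattern.compile("&#(x[0-9a-fA-F]+|[0-9]+);");

    // Code points allowed by the XML 1.0 "Char" production.
    static boolean isValidXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    static String sanitize(String xml) {
        // Pass 1: remove numeric references that point at illegal code points.
        Matcher m = NUMERIC_REF.matcher(xml);
        StringBuilder refsCleaned = new StringBuilder();
        while (m.find()) {
            String digits = m.group(1);
            int cp;
            try {
                cp = digits.startsWith("x")
                    ? Integer.parseInt(digits.substring(1), 16)
                    : Integer.parseInt(digits);
            } catch (NumberFormatException e) {
                cp = -1; // too large to be a code point, treat as illegal
            }
            String replacement = isValidXml10Char(cp) ? m.group() : "";
            m.appendReplacement(refsCleaned, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(refsCleaned);

        // Pass 2: remove raw illegal characters (including unpaired surrogates).
        StringBuilder out = new StringBuilder(refsCleaned.length());
        refsCleaned.codePoints()
            .filter(XmlSanitizerSketch::isValidXml10Char)
            .forEach(out::appendCodePoint);
        return out.toString();
    }
}
```

With this sketch, `XmlSanitizerSketch.sanitize("<title>&#0;</title>")` yields `<title></title>`, which a SAX parser should then accept. Buffering the whole response as a string is a simplification; a streaming filter may be preferable for large documents.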
   
   Other than that, I think Tika Server has what we need. It accepts a password 
as an HTTP header. And it accepts some PDF parser config through headers as well. 
But more advanced parser config should be done on the TikaServer side, and the 
good thing is that the user will have 100% control over their TikaServer and can 
configure it as they wish, much more than you could with SolrCell.
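
As a rough sketch of the header-based configuration described above, the request to Tika Server could be built like this with the JDK HTTP client. The class name `TikaRequestSketch`, the base URL parameter, and the choice of the `/tika` endpoint are assumptions for illustration. The header names `Password` and `X-Tika-PDFextractInlineImages` are the ones I recall from the Tika Server documentation and should be checked against the Tika version in use.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper, not code from the PR.
final class TikaRequestSketch {
    static HttpRequest build(String baseUrl, byte[] document, String password) {
        HttpRequest.Builder builder = HttpRequest
            .newBuilder(URI.create(baseUrl + "/tika"))
            // Ask for XML output (assumption: the server honors this Accept value).
            .header("Accept", "text/xml")
            // Example of PDF parser config passed as a header (assumed header name).
            .header("X-Tika-PDFextractInlineImages", "true")
            .PUT(HttpRequest.BodyPublishers.ofByteArray(document));
        if (password != null) {
            // Password for encrypted documents, sent as a plain HTTP header.
            builder.header("Password", password);
        }
        return builder.build();
    }
}
```

The request would then be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`. Anything beyond a few such headers belongs in the Tika Server's own configuration.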
   
   We should probably start using the `/rmeta` (recursive metadata) endpoint, 
since our current SolrCell parser is recursive. But that may mean some more 
advanced response parsing? I have not really checked.
   
   I'm fairly optimistic on getting the remaining tests passing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


