Hey Eric,

I gave it a shot and made some changes to the extraction module,
successfully upgrading it to Tika 3.2.3 (I also did an intermediate
upgrade to 2.9.4, see commit history). In a previous discussion about
upgrading Apache Tika [1], people mentioned delegating to a Tika server,
and something about Tika Pipes. I am not familiar enough with those,
but maybe the upgrade to 3.2.3 would allow us to continue support for both
modules and reduce vulnerabilities, at least in Solr 10? Removing them
without enough feedback is a bit concerning to me (not that I am affected
in any way if they are removed).

Some dependencies like commons-io and log4j had to be updated, but besides
that I hit no dependency constraints that prevented the upgrade. I don't
know yet whether the upgrade leaves some parsers out, since they have been
split across multiple modules now. If it does, we could add dependencies on
the other parser collections so that the auto-detect parser can find them
too.

The changes are definitely breaking, and many of the extracted metadata
fields now have different names due to the standardization of field names
in Tika 2. But let me know if I should finalize the PR for Solr 10.

You can find the PR here: https://github.com/apache/solr/pull/3674

Best,
Christos

[1] https://lists.apache.org/thread/yo21ocoyb373yoqpovdvs7jb339jwmgq

On Thu, Sep 18, 2025 at 8:22 PM David Eric Pugh <[email protected]>
wrote:

> Hey all, I've got a client who would like a CVE for Tika fixed in Solr:
> https://nvd.nist.gov/vuln/detail/CVE-2025-54988
> They use the langid module but don't use the extraction module. So if
> the CVE is fixed in langid and they delete the extraction module from their
> deployments, they no longer trip the CVE scanners.
> I took a stab at upgrading to Tika 3, and was able to do it for the langid
> module, but had a very hard time updating the extraction module.
> It's on Tika 1, so upgrading it is the same as redoing it from scratch,
> and I'd rather focus on getting an extraction module that delegates Tika
> processing to a separate Tika Server.
> Before I try to press ahead with having Tika 1 and Tika 3 in Solr 9 and
> 10, I thought I would look for feedback on how crazy this would be. It
> appears that even if we migrate extraction to Tika Server, we'll still have
> a Tika jar to support how we are using it in the langid module....
> Eric
