Alternatively, just as we did with the DataImportHandler (DIH) [1], we could migrate the Tika integration to an independent project on GitHub, and people could install it if they need it. Like the DIH, Solr's Tika integration is widely used, so I expect it would be maintained rather than abandoned. At that point, whether it's migrated to Tika Server or something else is up to whoever the maintainers are. I suppose proceeding in this direction requires volunteers.
[1] https://github.com/SearchScale/dataimporthandler

On Mon, Aug 12, 2024 at 1:15 PM Christos Malliaridis <c.malliari...@gmail.com> wrote:
>
> I tried to find a java client for tika, but with no success so far.
>
> The version upgrade would reduce the vulnerabilities from about 21 CVEs to
> 6, so it would definitely be an improvement and probably worth the
> migration effort until a client is available.
>
> On Mon, 12 Aug 2024, 18:15 Jan Høydahl, <jan....@cominvent.com> wrote:
>
> > Hi
> >
> > Wrt Tika, I had been hoping that we could replace extracting handler with
> > a processor that delegates to Tika Server, but is otherwise feature parity.
> > It would remove tons of dependencies and attack surface from Solr.
> >
> > I tried a POC once but could not find a suitable Java client for Tika
> > Server REST API. Perhaps that exists now?
> >
> > Jan Høydahl
> >
> > > On 12 Aug 2024, at 16:20, Christos Malliaridis <c.malliari...@gmail.com> wrote:
> > >
> > > Hello everyone,
> > >
> > > I've been looking into the dependencies of the project and thought that
> > > we could update a couple of them, together with their license files
> > > (wherever necessary).
> > >
> > > I tried to start with Apache Tika and upgrade it from 1.28.5 to 2.9.2,
> > > which is a huge step due to some restructuring of Apache Tika. The
> > > affected modules are extraction and langid.
> > >
> > > There is a PR from solrbot <https://github.com/apache/solr/pull/2583>
> > > that requires some manual work that I have already picked up for
> > > learning purposes. I'd like to create a ticket for the upgrade, but also
> > > saw that there is also SOLR-13973
> > > <https://issues.apache.org/jira/browse/SOLR-13973> that is titled
> > > "Deprecate Tika". From the age and conversation on the ticket, it sounds
> > > like Tika will not be deprecated and the ticket can be closed. But I
> > > am not sure and would like to ask for your input on this.
> > >
> > > In the migration to 2.9.2 it seems that there are some conflicts with
> > > the way the title from documents is extracted. Some metadata tags have
> > > also been removed / replaced, which needs more attention. See Migrating
> > > to Tika 2.0.0
> > > <https://cwiki.apache.org/confluence/display/TIKA/Migrating+to+Tika+2.0.0>
> > > for more details.
> > >
> > > I'd be happy to create a PR for the upgrade and look into the fixes with
> > > someone that has already worked with Apache Tika 2.X or the affected
> > > modules (extraction/langid).
> > >
> > > Best,
> > > Christos
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org
> > For additional commands, e-mail: dev-h...@solr.apache.org