Alternatively, just like we did with the DataImportHandler (DIH)[1],
we could migrate the Tika integration to an independent project on
GitHub, and people would install it if they need it.  Like the DIH,
Solr's Tika integration is widely used, so I expect it would be
maintained rather than abandoned.  At that point, whether it migrates
to TikaServer or something else is up to whoever the maintainer(s)
are.  I suppose proceeding in this direction requires volunteers.

[1] https://github.com/SearchScale/dataimporthandler

On Mon, Aug 12, 2024 at 1:15 PM Christos Malliaridis
<c.malliari...@gmail.com> wrote:
>
> I tried to find a Java client for Tika, but with no success so far.
>
> The version upgrade would reduce the vulnerabilities from about 21 CVEs to
> 6, so it would definitely be an improvement and probably worth the
> migration effort until a client is available.
>
> On Mon, 12 Aug 2024, 18:15 Jan Høydahl, <jan....@cominvent.com> wrote:
>
> > Hi
> >
> > Wrt Tika, I had been hoping that we could replace the extracting handler
> > with a processor that delegates to Tika Server but otherwise has feature
> > parity. It would remove tons of dependencies and attack surface from Solr.
> >
> > I tried a POC once but could not find a suitable Java client for the Tika
> > Server REST API. Perhaps that exists now?
> >
> > Jan Høydahl
> >
> > > On 12 Aug 2024, at 16:20, Christos Malliaridis
> > > <c.malliari...@gmail.com> wrote:
> > >
> > > Hello everyone,
> > >
> > > I've been looking into the dependencies of the project and thought that
> > > we could update a couple of them, together with their license files
> > > (wherever necessary).
> > >
> > > I tried to start with Apache Tika and upgrade it from 1.28.5 to 2.9.2,
> > > which is a huge step due to some restructuring of Apache Tika. The
> > > affected modules are extraction and langid.
> > >
> > > There is a PR from solrbot <https://github.com/apache/solr/pull/2583>
> > > that requires some manual work, which I have already picked up for
> > > learning purposes. I'd like to create a ticket for the upgrade, but I saw
> > > that there is also SOLR-13973
> > > <https://issues.apache.org/jira/browse/SOLR-13973>, which is titled
> > > "Deprecate Tika". From the age of the ticket and the conversation on it,
> > > it sounds like Tika will not be deprecated and the ticket can be closed.
> > > But I am not sure and would like to ask for your input on this.
> > >
> > > In the migration to 2.9.2 it seems that there are some conflicts with the
> > > way the title is extracted from documents. Some metadata tags have also
> > > been removed or replaced, which needs more attention. See Migrating to
> > > Tika 2.0.0
> > > <https://cwiki.apache.org/confluence/display/TIKA/Migrating+to+Tika+2.0.0>
> > > for more details.
> > >
> > > I'd be happy to create a PR for the upgrade and look into the fixes with
> > > someone that has already worked with Apache Tika 2.X or the affected
> > > modules (extraction/langid).
> > >
> > > Best,
> > > Christos
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org
> > For additional commands, e-mail: dev-h...@solr.apache.org
> >
> >

