[jira] [Commented] (SOLR-7632) Change the ExtractingRequestHandler to use Tika-Server
[ https://issues.apache.org/jira/browse/SOLR-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18024669#comment-18024669 ]
Eric Pugh commented on SOLR-7632:
---
I had a great discussion with [~tallison] the other week, and he crystallized something about Tika Server versus Tika Pipes.
Tika Server is a perfectly fine solution in a distributed cloud environment, following our current architecture. You could imagine a pod of Tika Servers with a load balancer in front if you were ingesting at scale.
However! If you want to run extraction on a local server without the cloud infrastructure, this is where Tika Pipes comes in. It eliminates the challenges of our current "local" implementation: the Java process that does extraction is the same Java process that runs Solr, and all the jars that extraction needs have to ship with Solr.
Instead, with Tika Pipes, the Solr process talks to Tika Pipes, which spawns a completely NEW Java process that does extraction. The child process and Tika/Solr communicate via stdio, which means the classpath of Solr doesn't need any of the jars or dependencies that the child Tika process needs for extraction. They each have their own classpath. And if something goes wrong, well, the child process crashes or gets reaped, but Tika/Solr continues on its merry way.
To set up Tika Pipes, you do some config (like we do for anything else), and what [~tallison] and I spitballed is a Tika Pipes parameter pointing the child process to a download of the tika-standard-server-x.yx.jar file. So, to get all your Tika dependencies, you just go grab that massive 63 MB jar file and point to it. No more CVEs for the Solr project, since there is only a very small set of Tika libs we need for Tika/Solr. Want NLP capabilities? Just go grab that Tika jar and add it to the custom classpath for the child process.
Since this all runs on your local server, you don't need another complete process, and it may be more efficient depending on your workloads.
Did I capture this, [~tallison]?
So, in terms of our path, I think [~janhoy] that you are on the correct path. If we land the current PR, then we could in a separate PR migrate our "local" plugin to Tika Pipes, which will give us the best of all worlds!
> Change the ExtractingRequestHandler to use Tika-Server
> --
>
> Key: SOLR-7632
> URL: https://issues.apache.org/jira/browse/SOLR-7632
> Project: Solr
> Issue Type: Improvement
> Components: contrib - Solr Cell (Tika extraction)
>Reporter: Chris A. Mattmann
>Assignee: Jan Høydahl
>Priority: Major
> Labels: gsoc2017, memex, pull-request-available
> Time Spent: 8h
> Remaining Estimate: 0h
>
> It's a pain to upgrade Tika's jars all the times when we release, and if Tika
> fails it messes up the ExtractingRequestHandler (e.g., the document type
> caused Tika to fail, etc). A more reliable way and also separated, and easier
> to deploy version of the ExtractingRequestHandler would make a network call
> to the Tika JAXRS server, and then call Tika on the Solr server side, get the
> results and then index the information that way. I have a patch in the works
> from the DARPA Memex project and I hope to post it soon.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
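The process isolation described in Eric Pugh's comment above (a parent that hands a document to a freshly spawned child over stdio and keeps running if the child dies) can be sketched roughly as follows. This is only an illustration of the idea in Python, not Tika Pipes itself: the child program, the `extract_in_child` helper and the "CRASH" poison input are all hypothetical, and the real Tika Pipes wire protocol is not shown in this thread.

```python
import subprocess
import sys

# Hypothetical stand-in for the extraction child. It reads a document on
# stdin, writes "extracted" text on stdout, and dies on a poison input.
CHILD_SOURCE = r"""
import sys
data = sys.stdin.buffer.read()
if data.startswith(b"CRASH"):
    sys.exit(3)  # simulate a parser blowing up on a bad document
sys.stdout.buffer.write(data.upper())
"""


def extract_in_child(document: bytes, timeout_s: float = 10.0):
    """Run one extraction in a separate process; return text, or None on failure."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", CHILD_SOURCE],  # child has its own interpreter and imports
            input=document,                        # document goes in over stdin
            capture_output=True,                   # result comes back over stdout
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills and reaps the child before re-raising.
        return None
    if result.returncode != 0:
        # The child crashed; the parent simply reports the failure and carries on.
        return None
    return result.stdout.decode("utf-8")
```

A caller would see `None` for the document that crashed the child and could go straight on to the next document, which is the property the comment is after: the failure is contained in a process that shares neither memory nor classpath with the parent.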
[jira] [Commented] (SOLR-7632) Change the ExtractingRequestHandler to use Tika-Server
[ https://issues.apache.org/jira/browse/SOLR-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18025348#comment-18025348 ]
Tim Allison commented on SOLR-7632:
---
Looks great, [~epugh]. I wanted to emphasize that [~janhoy]'s patch is definitely the right direction, and a huge step forward. Thank you, both!
> Change the ExtractingRequestHandler to use Tika-Server
> --
>
> Key: SOLR-7632
> URL: https://issues.apache.org/jira/browse/SOLR-7632
> Project: Solr
> Issue Type: Improvement
> Components: contrib - Solr Cell (Tika extraction)
>Reporter: Chris A. Mattmann
>Assignee: Jan Høydahl
>Priority: Major
> Labels: gsoc2017, memex, pull-request-available
> Time Spent: 8h
> Remaining Estimate: 0h
[jira] [Commented] (SOLR-7632) Change the ExtractingRequestHandler to use Tika-Server
[ https://issues.apache.org/jira/browse/SOLR-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18030383#comment-18030383 ]
ASF subversion and git services commented on SOLR-7632:
---
Commit f5f1c854c10578b97d014242db7430ab3a7147a9 in solr's branch refs/heads/branch_10x from Jan Høydahl
[ https://gitbox.apache.org/repos/asf?p=solr.git;h=f5f1c854c10 ]
SOLR-7632 TikaServer as pluggable backend to existing extraction handler (#3670)
Co-authored-by: Eric Pugh
Co-authored-by: Copilot <[email protected]>
(cherry picked from commit cca45c79ca65ec7335ddb37e58bb7f2bdaae3166)
> Change the ExtractingRequestHandler to use Tika-Server
> --
>
> Key: SOLR-7632
> URL: https://issues.apache.org/jira/browse/SOLR-7632
> Project: Solr
> Issue Type: Improvement
> Components: contrib - Solr Cell (Tika extraction)
>Reporter: Chris A. Mattmann
>Assignee: Jan Høydahl
>Priority: Blocker
> Labels: gsoc2017, memex, pull-request-available
> Fix For: main (10.0)
>
> Time Spent: 15h
> Remaining Estimate: 0h
[jira] [Commented] (SOLR-7632) Change the ExtractingRequestHandler to use Tika-Server
[ https://issues.apache.org/jira/browse/SOLR-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18030480#comment-18030480 ]
ASF subversion and git services commented on SOLR-7632:
---
Commit 00a390c0606c348443aae72fafc87f74bfe8c78e in solr's branch refs/heads/branch_9x from Jan Høydahl
[ https://gitbox.apache.org/repos/asf?p=solr.git;h=00a390c0606 ]
SOLR-7632 SolrCell TikaServer support (branch_9x) (#3780)
Co-authored-by: Eric Pugh
Co-authored-by: Copilot <[email protected]>
(cherry picked from commit cca45c79ca65ec7335ddb37e58bb7f2bdaae3166)
> Change the ExtractingRequestHandler to use Tika-Server
> --
>
> Key: SOLR-7632
> URL: https://issues.apache.org/jira/browse/SOLR-7632
> Project: Solr
> Issue Type: Improvement
> Components: contrib - Solr Cell (Tika extraction)
>Reporter: Chris A. Mattmann
>Assignee: Jan Høydahl
>Priority: Blocker
> Labels: gsoc2017, memex, pull-request-available
> Fix For: main (10.0), 9.10
>
> Time Spent: 15h 40m
> Remaining Estimate: 0h
[jira] [Commented] (SOLR-7632) Change the ExtractingRequestHandler to use Tika-Server
[ https://issues.apache.org/jira/browse/SOLR-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18030481#comment-18030481 ]
Jan Høydahl commented on SOLR-7632:
---
Congrats everyone on landing a 10-year-old JIRA 🎉
> Change the ExtractingRequestHandler to use Tika-Server
> --
>
> Key: SOLR-7632
> URL: https://issues.apache.org/jira/browse/SOLR-7632
> Project: Solr
> Issue Type: Improvement
> Components: contrib - Solr Cell (Tika extraction)
>Reporter: Chris A. Mattmann
>Assignee: Jan Høydahl
>Priority: Blocker
> Labels: gsoc2017, memex, pull-request-available
> Fix For: main (10.0), 9.10
>
> Time Spent: 15h 40m
> Remaining Estimate: 0h
[jira] [Commented] (SOLR-7632) Change the ExtractingRequestHandler to use Tika-Server
[ https://issues.apache.org/jira/browse/SOLR-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18030384#comment-18030384 ]
ASF subversion and git services commented on SOLR-7632:
---
Commit 4649657145653591e95cd296993a36f6e19f1495 in solr's branch refs/heads/branch_10_0 from Jan Høydahl
[ https://gitbox.apache.org/repos/asf?p=solr.git;h=46496571456 ]
SOLR-7632 TikaServer as pluggable backend to existing extraction handler (#3670)
Co-authored-by: Eric Pugh
Co-authored-by: Copilot <[email protected]>
(cherry picked from commit cca45c79ca65ec7335ddb37e58bb7f2bdaae3166)
> Change the ExtractingRequestHandler to use Tika-Server
> --
>
> Key: SOLR-7632
> URL: https://issues.apache.org/jira/browse/SOLR-7632
> Project: Solr
> Issue Type: Improvement
> Components: contrib - Solr Cell (Tika extraction)
>Reporter: Chris A. Mattmann
>Assignee: Jan Høydahl
>Priority: Blocker
> Labels: gsoc2017, memex, pull-request-available
> Fix For: main (10.0)
>
> Time Spent: 15h
> Remaining Estimate: 0h
[jira] [Commented] (SOLR-7632) Change the ExtractingRequestHandler to use Tika-Server
[ https://issues.apache.org/jira/browse/SOLR-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18030381#comment-18030381 ]
ASF subversion and git services commented on SOLR-7632:
---
Commit cca45c79ca65ec7335ddb37e58bb7f2bdaae3166 in solr's branch refs/heads/main from Jan Høydahl
[ https://gitbox.apache.org/repos/asf?p=solr.git;h=cca45c79ca6 ]
SOLR-7632 TikaServer as pluggable backend to existing extraction handler (#3670)
Co-authored-by: Eric Pugh
Co-authored-by: Copilot <[email protected]>
> Change the ExtractingRequestHandler to use Tika-Server
> --
>
> Key: SOLR-7632
> URL: https://issues.apache.org/jira/browse/SOLR-7632
> Project: Solr
> Issue Type: Improvement
> Components: contrib - Solr Cell (Tika extraction)
>Reporter: Chris A. Mattmann
>Assignee: Jan Høydahl
>Priority: Blocker
> Labels: gsoc2017, memex, pull-request-available
> Fix For: main (10.0)
>
> Time Spent: 15h
> Remaining Estimate: 0h
[jira] [Commented] (SOLR-7632) Change the ExtractingRequestHandler to use Tika-Server
[ https://issues.apache.org/jira/browse/SOLR-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18029692#comment-18029692 ]
Jan Høydahl commented on SOLR-7632:
---
In my eyes, the PR is ready. It's got extra visibility on dev list and some review (although more would be useful). Plan to merge to main soon.
> Change the ExtractingRequestHandler to use Tika-Server
> --
>
> Key: SOLR-7632
> URL: https://issues.apache.org/jira/browse/SOLR-7632
> Project: Solr
> Issue Type: Improvement
> Components: contrib - Solr Cell (Tika extraction)
>Reporter: Chris A. Mattmann
>Assignee: Jan Høydahl
>Priority: Major
> Labels: gsoc2017, memex, pull-request-available
> Time Spent: 13h
> Remaining Estimate: 0h
[jira] [Commented] (SOLR-7632) Change the ExtractingRequestHandler to use Tika-Server
[ https://issues.apache.org/jira/browse/SOLR-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18024594#comment-18024594 ]
Jan Høydahl commented on SOLR-7632:
---
I'll be able to continue on this work next week. Would appreciate feedback on the overall plan. Any worries or review comments you may have are valuable.
> Change the ExtractingRequestHandler to use Tika-Server
> --
>
> Key: SOLR-7632
> URL: https://issues.apache.org/jira/browse/SOLR-7632
> Project: Solr
> Issue Type: Improvement
> Components: contrib - Solr Cell (Tika extraction)
>Reporter: Chris A. Mattmann
>Assignee: Jan Høydahl
>Priority: Major
> Labels: gsoc2017, memex, pull-request-available
> Time Spent: 8h
> Remaining Estimate: 0h
[jira] [Commented] (SOLR-7632) Change the ExtractingRequestHandler to use Tika-Server
[ https://issues.apache.org/jira/browse/SOLR-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18023061#comment-18023061 ]
Jan Høydahl commented on SOLR-7632:
---
Let me draw attention to my PR [https://github.com/apache/solr/pull/3670] which introduces the concept of Extraction Backends to the extraction handler. It implements the existing "local" code path as one backend. And it adds a new "tikaserver" backend implementation that delegates parsing to a remote Tika Server, eliminating the resource-hungry and dangerous in-process parsing in Solr itself.
The PR is still a draft with many rough edges, but proves that it can serve as a drop-in replacement for the "local" backend, supporting "extractOnly", "capture", "xPath" and the rest of the update-params users use today.
The plan is to first make the PR production ready, then split it into parts.
# The first part is the backend interface and local backend, with no new features. Merge it to branch_9x as well.
# The second part is adding the new code and backporting it to 9x, while at the same time deprecating the local backend in 9x.
# The third part is to remove the entire Tika Java dependency in the main branch, leaving only the new implementation.
Thoughts?
> Change the ExtractingRequestHandler to use Tika-Server
> --
>
> Key: SOLR-7632
> URL: https://issues.apache.org/jira/browse/SOLR-7632
> Project: Solr
> Issue Type: Improvement
> Components: contrib - Solr Cell (Tika extraction)
>Reporter: Chris A. Mattmann
>Assignee: Jan Høydahl
>Priority: Major
> Labels: gsoc2017, memex, pull-request-available
> Time Spent: 7h 40m
> Remaining Estimate: 0h
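The "Extraction Backends" idea in Jan Høydahl's comment above (one handler, with the existing in-process path and a remote Tika Server path behind a common interface) might look roughly like the sketch below. It is written in Python purely for illustration; every class, method and parameter name here is hypothetical and is not taken from the actual PR, and the remote call is injected as a plain function so the example runs without a server.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class ExtractionResult:
    content: str
    metadata: dict = field(default_factory=dict)


class ExtractionBackend(ABC):
    """Hypothetical common interface that every backend implements."""

    name: str

    @abstractmethod
    def extract(self, data: bytes, content_type: str) -> ExtractionResult:
        ...


class LocalBackend(ExtractionBackend):
    """Stand-in for the existing in-process code path."""

    name = "local"

    def extract(self, data, content_type):
        # Placeholder "parse": a real local backend would run the parser in-process.
        text = data.decode("utf-8", errors="replace")
        return ExtractionResult(text, {"Content-Type": content_type, "backend": self.name})


class TikaServerBackend(ExtractionBackend):
    """Stand-in for a backend that delegates parsing to a remote server."""

    name = "tikaserver"

    def __init__(self, send):
        # send(data, content_type) -> (text, metadata); injected so it can be faked.
        self._send = send

    def extract(self, data, content_type):
        text, metadata = self._send(data, content_type)
        return ExtractionResult(text, {**metadata, "backend": self.name})


class ExtractionHandler:
    """Picks a backend by name; the rest of the request handling stays the same."""

    def __init__(self, backends, default):
        self._backends = {b.name: b for b in backends}
        self._default = default

    def handle(self, data, content_type, backend=None):
        chosen = backend or self._default
        if chosen not in self._backends:
            raise ValueError(f"unknown extraction backend: {chosen}")
        return self._backends[chosen].extract(data, content_type)
```

Because callers only ever see `ExtractionResult`, swapping the default from "local" to "tikaserver" does not change what the handler returns, which is the drop-in property the comment claims for the PR.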
[jira] [Commented] (SOLR-7632) Change the ExtractingRequestHandler to use Tika-Server
[ https://issues.apache.org/jira/browse/SOLR-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18021264#comment-18021264 ]
Eric Pugh commented on SOLR-7632:
---
I have done this as part of SOLR-13973.
> Change the ExtractingRequestHandler to use Tika-Server
> --
>
> Key: SOLR-7632
> URL: https://issues.apache.org/jira/browse/SOLR-7632
> Project: Solr
> Issue Type: Improvement
> Components: contrib - Solr Cell (Tika extraction)
>Reporter: Chris A. Mattmann
>Assignee: Jan Høydahl
>Priority: Major
> Labels: gsoc2017, memex, pull-request-available
> Time Spent: 1h 10m
> Remaining Estimate: 0h
[jira] [Commented] (SOLR-7632) Change the ExtractingRequestHandler to use Tika-Server
[ https://issues.apache.org/jira/browse/SOLR-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18017308#comment-18017308 ]
Jan Høydahl commented on SOLR-7632:
---
+1 to mark it as deprecated now.
> Change the ExtractingRequestHandler to use Tika-Server
> --
>
> Key: SOLR-7632
> URL: https://issues.apache.org/jira/browse/SOLR-7632
> Project: Solr
> Issue Type: Improvement
> Components: contrib - Solr Cell (Tika extraction)
>Reporter: Chris A. Mattmann
>Assignee: Jan Høydahl
>Priority: Major
> Labels: gsoc2017, memex, pull-request-available
> Time Spent: 40m
> Remaining Estimate: 0h
[jira] [Commented] (SOLR-7632) Change the ExtractingRequestHandler to use Tika-Server
[ https://issues.apache.org/jira/browse/SOLR-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18017109#comment-18017109 ]
David Smiley commented on SOLR-7632:
---
I suggest marking it deprecated, and adding upgrade guide notes to clarify what's going on (or what _may_ happen). Hopefully it will *move* to another place, as some of our other plugins have. I don't think it's a big concern to eject a module in a minor version.
> Change the ExtractingRequestHandler to use Tika-Server
> --
>
> Key: SOLR-7632
> URL: https://issues.apache.org/jira/browse/SOLR-7632
> Project: Solr
> Issue Type: Improvement
> Components: contrib - Solr Cell (Tika extraction)
>Reporter: Chris A. Mattmann
>Assignee: Jan Høydahl
>Priority: Major
> Labels: gsoc2017, memex, pull-request-available
> Time Spent: 40m
> Remaining Estimate: 0h
[jira] [Commented] (SOLR-7632) Change the ExtractingRequestHandler to use Tika-Server
[ https://issues.apache.org/jira/browse/SOLR-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18017089#comment-18017089 ]
Eric Pugh commented on SOLR-7632:
---
Is there any way to mark Tika as "deprecated" in Solr 9, so that WHEN we land a replacement in Solr 10, we can remove it in the 10x line? And NOT have to support it through Solr 11.
I'm thinking about the fact that Tika does bring a lot of dependencies, and that the big win of separating it out is no longer having that dependency management work. It would be nice to experience fewer dependencies in Solr 10, and not wait till Solr 11...
> Change the ExtractingRequestHandler to use Tika-Server
> --
>
> Key: SOLR-7632
> URL: https://issues.apache.org/jira/browse/SOLR-7632
> Project: Solr
> Issue Type: Improvement
> Components: contrib - Solr Cell (Tika extraction)
>Reporter: Chris A. Mattmann
>Assignee: Jan Høydahl
>Priority: Major
> Labels: gsoc2017, memex, pull-request-available
> Time Spent: 40m
> Remaining Estimate: 0h
[jira] [Commented] (SOLR-7632) Change the ExtractingRequestHandler to use Tika-Server
[
https://issues.apache.org/jira/browse/SOLR-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17953398#comment-17953398
]
Jan Høydahl commented on SOLR-7632:
---
Guys, this effort has been dormant for 2 years. I had a branch where I started
some experimentation but got lost by the lack of a java-client for Tika Server
and somewhat poor documentation.
Yesterday I watched the Google IO keynote and started playing with the Jules
tool. So in the search for a problem to throw at it, I figured why not find a
Solr issue that is non-trivial, and see what it can make of it. So I prompted
it as follows:
{quote}Read https://issues.apache.org/jira/browse/SOLR-7632 which proposes to
deprecate the old "extraction" module, and replace it with an api-compatible
new module that instead of parsing rich text documents in-process with Tika,
will delegate to an externally running Tika-Server. More discussion can be
found in https://lists.apache.org/thread/lbm6wb88gd1cfktgs6sfvw5xf73o8trd.
Do not focus on deprecating the old module yet. Just make a working PR for the
new module. You can assume that the user has provisioned a TikaServer on some
URL. When writing tests for the module, a good idea could be to look at the
existing tests for "extraction" handler. You can choose whether you mock
TikaServer API in tests or spin up a TikaServer using TestContainers.
The PullRequest should also add reference guide documentation for the new
feature.
{quote}
At first it used Apache HTTPClient and gave the new module an awkward name, so
I prompted it to change those two things with an additional prompt:
{quote}Please don't use Apache httpClient. Use Jetty httpclient instead, or JDK
httpclient. Please name the module "tika"
{quote}
And this PR is what it came up with after about 30min:
[https://github.com/apache/solr/pull/3361]
I have just skimmed the code and not tested it at all, but thought the
experiment was so interesting that I wanted to share it in a (Draft) PR which
will also run the tests. I do not have hopes of any production ready code, and
perhaps there will be push-back on the legality of using such a large
contribution from AI. But it gives a glimpse into how the future of software
development may change.
Next I'll look more into the code and tests and make up my own mind as to
whether this is good stuff and a possible starting point for the new module.
> Change the ExtractingRequestHandler to use Tika-Server
> --
>
> Key: SOLR-7632
> URL: https://issues.apache.org/jira/browse/SOLR-7632
> Project: Solr
> Issue Type: Improvement
> Components: contrib - Solr Cell (Tika extraction)
>Reporter: Chris A. Mattmann
>Assignee: Jan Høydahl
>Priority: Major
> Labels: gsoc2017, memex, pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
[jira] [Commented] (SOLR-7632) Change the ExtractingRequestHandler to use Tika-Server
[ https://issues.apache.org/jira/browse/SOLR-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705138#comment-17705138 ]
Jan Høydahl commented on SOLR-7632:
---
Following a [DISCUSS thread|https://lists.apache.org/thread/lbm6wb88gd1cfktgs6sfvw5xf73o8trd] on dev I'm working on a POC for this, in the hope of ridding the Solr distribution of the weight, complexity and inflexibility of bundled Tika. I already have code that forwards the input stream to Tika-Server and receives and parses the JSON response of the parsed doc. Will post a PR eventually.
Once the extraction module no longer depends on all the Tika parsers, we can also move XLSXResponseWriter (depending on POI) out to a separate module.
> Change the ExtractingRequestHandler to use Tika-Server
> --
>
> Key: SOLR-7632
> URL: https://issues.apache.org/jira/browse/SOLR-7632
> Project: Solr
> Issue Type: Improvement
> Components: contrib - Solr Cell (Tika extraction)
>Reporter: Chris A. Mattmann
>Priority: Major
> Labels: gsoc2017, memex
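The POC step mentioned in the comment above (forward the input stream to Tika-Server, then parse the JSON that comes back) could be sketched along these lines. The thread does not say which endpoint the POC used; this sketch assumes Tika Server's `/rmeta/text` endpoint on its default port 9998, which returns a JSON array of metadata objects with the extracted text under the `X-TIKA:content` key. The function names are made up for the example.

```python
import json
import urllib.request


def parse_rmeta(body: bytes):
    """Split an /rmeta-style JSON response into (text, metadata) for the main document."""
    docs = json.loads(body.decode("utf-8"))
    if not docs:
        return "", {}
    # The first entry describes the container document; later entries
    # (embedded attachments) are ignored in this sketch.
    main = dict(docs[0])
    content = main.pop("X-TIKA:content", "") or ""
    return content.strip(), main


def extract_with_tika_server(stream, content_type, base_url="http://localhost:9998"):
    """PUT the document to an already-running Tika Server and parse the reply.

    Assumption: the server is provisioned separately and reachable at base_url.
    """
    request = urllib.request.Request(
        base_url.rstrip("/") + "/rmeta/text",
        data=stream.read(),  # buffers the whole stream; fine for a sketch
        method="PUT",
        headers={"Content-Type": content_type, "Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return parse_rmeta(response.read())
```

Keeping the parsing in its own function means it can be exercised against canned JSON without a running server, while `extract_with_tika_server` stays a thin transport wrapper.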
