[jira] [Commented] (SOLR-7632) Change the ExtractingRequestHandler to use Tika-Server

2025-10-18 Thread Eric Pugh (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18024669#comment-18024669
 ] 

Eric Pugh commented on SOLR-7632:
-

I had a great discussion with [~tallison] the other week, and he crystallized 
something about Tika Server versus Tika Pipes.

Tika Server is a perfectly fine solution in a distributed cloud environment. 
Following our current architecture, you could imagine a pod of Tika Servers 
with a load balancer in front if you were ingesting at scale.

However! If you want to run extraction on a local server without the 
cloud infrastructure, this is where Tika Pipes comes in. It eliminates the 
main challenge of our current "local" implementation: the Java process that 
does the extraction is the same Java process that runs Solr, and all the jars 
extraction needs have to ship with Solr.

Instead, with Tika Pipes, the Solr process talks to Tika Pipes, which spawns a 
completely NEW Java process that does the extraction. The child process and 
Tika/Solr communicate via stdio, which means the classpath of Solr doesn't need 
any of the jars or dependencies that the child Tika process needs for 
extraction. They each have their own classpath. And if something goes 
wrong, well, the child process crashes or gets reaped, but Tika/Solr continues 
on its merry way.
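
The isolation described above can be illustrated with a minimal sketch. This is 
not Tika Pipes itself: it uses Python's subprocess module as a stand-in for a 
parent that hands bytes to a freshly spawned child over stdin/stdout. The names 
CHILD_SOURCE and extract_in_child are hypothetical, and the "extraction" is just 
an upper-casing of the input. The only point is that a crash in the child is 
contained while the parent carries on.

```python
import subprocess
import sys

# Hypothetical stand-in for the extraction child. It reads a document from
# stdin, writes the "extracted" text to stdout, and deliberately exits with a
# non-zero status on input starting with b"BAD" to simulate a parser crash.
CHILD_SOURCE = r"""
import sys
data = sys.stdin.buffer.read()
if data.startswith(b"BAD"):
    raise SystemExit(42)  # simulate the child crashing
sys.stdout.buffer.write(data.upper())
"""


def extract_in_child(document: bytes, timeout_s: float = 10.0):
    """Run the 'extraction' in a separate process; return None if it fails."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", CHILD_SOURCE],
            input=document,
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills a child that exceeds the timeout.
        return None
    if result.returncode != 0:
        # The child failed; the parent process is unaffected.
        return None
    return result.stdout


if __name__ == "__main__":
    print(extract_in_child(b"hello"))
    print(extract_in_child(b"BAD input"))
```

Because the child is a separate interpreter, nothing it imports or loads ends up 
in the parent, which mirrors the separate-classpath argument made above.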

 

To set up Tika Pipes, you do some config (like we do for anything else), and 
what [~tallison] and I spitballed is a Tika Pipes parameter pointing the child 
process to a downloaded tika-standard-server-x.yx.jar file. So, to get all 
your Tika dependencies, you just go grab that massive 63 MB jar file and point 
to it. No more CVEs for the Solr project from those dependencies, since there 
is only a very small set of Tika libs we need for Tika/Solr. Want NLP 
capabilities? Just go grab that Tika jar and add it to the custom classpath 
for the child process.
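
As a rough sketch of what "point the child process at a jar" could mean in 
practice, the helper below assembles a java command line whose classpath belongs 
only to the child. Everything here is an assumption for illustration: the 
function name build_child_command, the jar paths, and the placeholder main class 
example.ChildMain are hypothetical and are not taken from Tika or Solr.

```python
import os


def build_child_command(tika_jar, extra_jars=(), main_class="example.ChildMain",
                        java_bin="java"):
    """Assemble a java command whose classpath is private to the child.

    main_class is a placeholder; the real entry point is not specified in
    this thread.
    """
    # os.pathsep is ':' on Unix-like systems and ';' on Windows, matching
    # the separator the JVM expects in -cp.
    classpath = os.pathsep.join([tika_jar, *extra_jars])
    return [java_bin, "-cp", classpath, main_class]


if __name__ == "__main__":
    # Hypothetical paths: one big Tika jar plus an optional add-on jar.
    print(build_child_command("/opt/tika/tika.jar", ["/opt/tika/nlp.jar"]))
```

Adding a capability then amounts to appending one more jar to extra_jars, 
without touching the parent's own classpath.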

Since this all runs on your local server, you don't need another complete 
process, and it may be more efficient depending on your workloads.

Did I capture this [~tallison] ?

 

So, in terms of our path, I think [~janhoy] that you are on the correct path. 
If we land the current PR, then we could, in a separate PR, migrate our "local" 
plugin to Tika Pipes, which will give us the best of all worlds!

> Change the ExtractingRequestHandler to use Tika-Server
> --
>
> Key: SOLR-7632
> URL: https://issues.apache.org/jira/browse/SOLR-7632
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - Solr Cell (Tika extraction)
>Reporter: Chris A. Mattmann
>Assignee: Jan Høydahl
>Priority: Major
>  Labels: gsoc2017, memex, pull-request-available
>  Time Spent: 8h
>  Remaining Estimate: 0h
>
> It's a pain to upgrade Tika's jars every time we release, and if Tika 
> fails it messes up the ExtractingRequestHandler (e.g., when a document type 
> causes Tika to fail). A more reliable, better separated, and easier-to-deploy 
> version of the ExtractingRequestHandler would make a network call 
> to the Tika JAXRS server, and then call Tika on the Solr server side, get the 
> results, and index the information that way. I have a patch in the works 
> from the DARPA Memex project and I hope to post it soon.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] [Commented] (SOLR-7632) Change the ExtractingRequestHandler to use Tika-Server

2025-10-18 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18025348#comment-18025348
 ] 

Tim Allison commented on SOLR-7632:
---

Looks great, [~epugh] . I wanted to emphasize that [~janhoy]'s patch is 
definitely the right direction, and a huge step forward. Thank you, both!

> Change the ExtractingRequestHandler to use Tika-Server
> --
>
> Key: SOLR-7632
> URL: https://issues.apache.org/jira/browse/SOLR-7632
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - Solr Cell (Tika extraction)
>Reporter: Chris A. Mattmann
>Assignee: Jan Høydahl
>Priority: Major
>  Labels: gsoc2017, memex, pull-request-available
>  Time Spent: 8h
>  Remaining Estimate: 0h



[jira] [Commented] (SOLR-7632) Change the ExtractingRequestHandler to use Tika-Server

2025-10-18 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18030383#comment-18030383
 ] 

ASF subversion and git services commented on SOLR-7632:
---

Commit f5f1c854c10578b97d014242db7430ab3a7147a9 in solr's branch 
refs/heads/branch_10x from Jan Høydahl
[ https://gitbox.apache.org/repos/asf?p=solr.git;h=f5f1c854c10 ]

SOLR-7632 TikaServer as pluggable backend to existing extraction handler (#3670)

Co-authored-by: Eric Pugh 
Co-authored-by: Copilot <[email protected]>
(cherry picked from commit cca45c79ca65ec7335ddb37e58bb7f2bdaae3166)


> Change the ExtractingRequestHandler to use Tika-Server
> --
>
> Key: SOLR-7632
> URL: https://issues.apache.org/jira/browse/SOLR-7632
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - Solr Cell (Tika extraction)
>Reporter: Chris A. Mattmann
>Assignee: Jan Høydahl
>Priority: Blocker
>  Labels: gsoc2017, memex, pull-request-available
> Fix For: main (10.0)
>
>  Time Spent: 15h
>  Remaining Estimate: 0h



[jira] [Commented] (SOLR-7632) Change the ExtractingRequestHandler to use Tika-Server

2025-10-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18030480#comment-18030480
 ] 

ASF subversion and git services commented on SOLR-7632:
---

Commit 00a390c0606c348443aae72fafc87f74bfe8c78e in solr's branch 
refs/heads/branch_9x from Jan Høydahl
[ https://gitbox.apache.org/repos/asf?p=solr.git;h=00a390c0606 ]

SOLR-7632 SolrCell TikaServer support (branch_9x) (#3780)

Co-authored-by: Eric Pugh 
Co-authored-by: Copilot <[email protected]>

(cherry picked from commit cca45c79ca65ec7335ddb37e58bb7f2bdaae3166)


> Change the ExtractingRequestHandler to use Tika-Server
> --
>
> Key: SOLR-7632
> URL: https://issues.apache.org/jira/browse/SOLR-7632
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - Solr Cell (Tika extraction)
>Reporter: Chris A. Mattmann
>Assignee: Jan Høydahl
>Priority: Blocker
>  Labels: gsoc2017, memex, pull-request-available
> Fix For: main (10.0), 9.10
>
>  Time Spent: 15h 40m
>  Remaining Estimate: 0h



[jira] [Commented] (SOLR-7632) Change the ExtractingRequestHandler to use Tika-Server

2025-10-17 Thread Jira


[ 
https://issues.apache.org/jira/browse/SOLR-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18030481#comment-18030481
 ] 

Jan Høydahl commented on SOLR-7632:
---

Congrats, everyone, on landing a 10-year-old JIRA 🎉

> Change the ExtractingRequestHandler to use Tika-Server
> --
>
> Key: SOLR-7632
> URL: https://issues.apache.org/jira/browse/SOLR-7632
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - Solr Cell (Tika extraction)
>Reporter: Chris A. Mattmann
>Assignee: Jan Høydahl
>Priority: Blocker
>  Labels: gsoc2017, memex, pull-request-available
> Fix For: main (10.0), 9.10
>
>  Time Spent: 15h 40m
>  Remaining Estimate: 0h



[jira] [Commented] (SOLR-7632) Change the ExtractingRequestHandler to use Tika-Server

2025-10-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18030384#comment-18030384
 ] 

ASF subversion and git services commented on SOLR-7632:
---

Commit 4649657145653591e95cd296993a36f6e19f1495 in solr's branch 
refs/heads/branch_10_0 from Jan Høydahl
[ https://gitbox.apache.org/repos/asf?p=solr.git;h=46496571456 ]

SOLR-7632 TikaServer as pluggable backend to existing extraction handler (#3670)

Co-authored-by: Eric Pugh 
Co-authored-by: Copilot <[email protected]>
(cherry picked from commit cca45c79ca65ec7335ddb37e58bb7f2bdaae3166)


> Change the ExtractingRequestHandler to use Tika-Server
> --
>
> Key: SOLR-7632
> URL: https://issues.apache.org/jira/browse/SOLR-7632
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - Solr Cell (Tika extraction)
>Reporter: Chris A. Mattmann
>Assignee: Jan Høydahl
>Priority: Blocker
>  Labels: gsoc2017, memex, pull-request-available
> Fix For: main (10.0)
>
>  Time Spent: 15h
>  Remaining Estimate: 0h



[jira] [Commented] (SOLR-7632) Change the ExtractingRequestHandler to use Tika-Server

2025-10-16 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18030381#comment-18030381
 ] 

ASF subversion and git services commented on SOLR-7632:
---

Commit cca45c79ca65ec7335ddb37e58bb7f2bdaae3166 in solr's branch 
refs/heads/main from Jan Høydahl
[ https://gitbox.apache.org/repos/asf?p=solr.git;h=cca45c79ca6 ]

SOLR-7632 TikaServer as pluggable backend to existing extraction handler (#3670)

Co-authored-by: Eric Pugh 
Co-authored-by: Copilot <[email protected]>

> Change the ExtractingRequestHandler to use Tika-Server
> --
>
> Key: SOLR-7632
> URL: https://issues.apache.org/jira/browse/SOLR-7632
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - Solr Cell (Tika extraction)
>Reporter: Chris A. Mattmann
>Assignee: Jan Høydahl
>Priority: Blocker
>  Labels: gsoc2017, memex, pull-request-available
> Fix For: main (10.0)
>
>  Time Spent: 15h
>  Remaining Estimate: 0h



[jira] [Commented] (SOLR-7632) Change the ExtractingRequestHandler to use Tika-Server

2025-10-14 Thread Jira


[ 
https://issues.apache.org/jira/browse/SOLR-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18029692#comment-18029692
 ] 

Jan Høydahl commented on SOLR-7632:
---

In my eyes, the PR is ready. It's got extra visibility on dev list and some 
review (although more would be useful). Plan to merge to main soon.

> Change the ExtractingRequestHandler to use Tika-Server
> --
>
> Key: SOLR-7632
> URL: https://issues.apache.org/jira/browse/SOLR-7632
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - Solr Cell (Tika extraction)
>Reporter: Chris A. Mattmann
>Assignee: Jan Høydahl
>Priority: Major
>  Labels: gsoc2017, memex, pull-request-available
>  Time Spent: 13h
>  Remaining Estimate: 0h



[jira] [Commented] (SOLR-7632) Change the ExtractingRequestHandler to use Tika-Server

2025-10-03 Thread Jira


[ 
https://issues.apache.org/jira/browse/SOLR-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18024594#comment-18024594
 ] 

Jan Høydahl commented on SOLR-7632:
---

I’ll be able to continue on this work next week. Would appreciate feedback on 
the overall plan. Any worries or review comments you may have are valuable.

> Change the ExtractingRequestHandler to use Tika-Server
> --
>
> Key: SOLR-7632
> URL: https://issues.apache.org/jira/browse/SOLR-7632
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - Solr Cell (Tika extraction)
>Reporter: Chris A. Mattmann
>Assignee: Jan Høydahl
>Priority: Major
>  Labels: gsoc2017, memex, pull-request-available
>  Time Spent: 8h
>  Remaining Estimate: 0h



[jira] [Commented] (SOLR-7632) Change the ExtractingRequestHandler to use Tika-Server

2025-09-26 Thread Jira


[ 
https://issues.apache.org/jira/browse/SOLR-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18023061#comment-18023061
 ] 

Jan Høydahl commented on SOLR-7632:
---

Let me draw attention to my PR [https://github.com/apache/solr/pull/3670] which 
introduces the concept of Extraction Backends to the extraction handler.

It implements the "local" existing code path as one backend.

And it adds a new "tikaserver" backend implementation that delegates parsing to 
a remote Tika Server, eliminating the resource-hungry and dangerous in-process 
parsing in Solr itself.
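
To make the backend idea concrete, here is a minimal sketch of the pattern, not 
the code in the PR. All class and function names (ExtractionBackend, 
LocalBackend, TikaServerBackend, make_backend) are hypothetical, the "local" 
backend merely decodes bytes instead of parsing, and the remote call is injected 
as a plain function so the example runs without a server.

```python
from abc import ABC, abstractmethod
from typing import Callable, Optional


class ExtractionBackend(ABC):
    """Hypothetical interface: turn raw document bytes into extracted text."""

    @abstractmethod
    def extract(self, document: bytes, content_type: str) -> str:
        ...


class LocalBackend(ExtractionBackend):
    """Stand-in for in-process parsing (here it only decodes the bytes)."""

    def extract(self, document: bytes, content_type: str) -> str:
        return document.decode("utf-8", errors="replace")


class TikaServerBackend(ExtractionBackend):
    """Delegates parsing to a remote service through an injected transport."""

    def __init__(self, send: Callable[[bytes, str], str]):
        self._send = send  # e.g. a function that performs the HTTP call

    def extract(self, document: bytes, content_type: str) -> str:
        return self._send(document, content_type)


def make_backend(name: str,
                 send: Optional[Callable[[bytes, str], str]] = None
                 ) -> ExtractionBackend:
    """Pick a backend by name, as a request handler might do from config."""
    if name == "local":
        return LocalBackend()
    if name == "tikaserver":
        if send is None:
            raise ValueError("the tikaserver backend needs a transport")
        return TikaServerBackend(send)
    raise ValueError(f"unknown extraction backend: {name}")


if __name__ == "__main__":
    fake_remote = lambda doc, ctype: "remote:" + doc.decode("utf-8")
    for backend in (make_backend("local"),
                    make_backend("tikaserver", fake_remote)):
        print(type(backend).__name__, backend.extract(b"hi", "text/plain"))
```

Callers only see extract(), which is what would let one backend stand in for 
the other without changing the request parameters users send.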

The PR is still a draft with many rough edges, but proves that it can serve as 
a drop-in replacement for the "local" backend, supporting "extractOnly", 
"capture", "xPath" and the rest of the update-params users use today.

The plan is to first make the PR production ready, then split it into parts.
 # The first part is the backend interface and local backend, with no new 
features. Merge it to branch_9x as well.
 # The second part is adding the new code and backporting it to 9x, while at 
the same time deprecating the local backend in 9x.
 # The third part is to remove the entire Tika Java dependency in the main 
branch, leaving only the new implementation.

Thoughts?

> Change the ExtractingRequestHandler to use Tika-Server
> --
>
> Key: SOLR-7632
> URL: https://issues.apache.org/jira/browse/SOLR-7632
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - Solr Cell (Tika extraction)
>Reporter: Chris A. Mattmann
>Assignee: Jan Høydahl
>Priority: Major
>  Labels: gsoc2017, memex, pull-request-available
>  Time Spent: 7h 40m
>  Remaining Estimate: 0h



[jira] [Commented] (SOLR-7632) Change the ExtractingRequestHandler to use Tika-Server

2025-09-18 Thread Eric Pugh (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18021264#comment-18021264
 ] 

Eric Pugh commented on SOLR-7632:
-

I have done this as part of SOLR-13973.

> Change the ExtractingRequestHandler to use Tika-Server
> --
>
> Key: SOLR-7632
> URL: https://issues.apache.org/jira/browse/SOLR-7632
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - Solr Cell (Tika extraction)
>Reporter: Chris A. Mattmann
>Assignee: Jan Høydahl
>Priority: Major
>  Labels: gsoc2017, memex, pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h



[jira] [Commented] (SOLR-7632) Change the ExtractingRequestHandler to use Tika-Server

2025-08-31 Thread Jira


[ 
https://issues.apache.org/jira/browse/SOLR-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18017308#comment-18017308
 ] 

Jan Høydahl commented on SOLR-7632:
---

+1 to mark it as deprecated now.

> Change the ExtractingRequestHandler to use Tika-Server
> --
>
> Key: SOLR-7632
> URL: https://issues.apache.org/jira/browse/SOLR-7632
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - Solr Cell (Tika extraction)
>Reporter: Chris A. Mattmann
>Assignee: Jan Høydahl
>Priority: Major
>  Labels: gsoc2017, memex, pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h



[jira] [Commented] (SOLR-7632) Change the ExtractingRequestHandler to use Tika-Server

2025-08-29 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18017109#comment-18017109
 ] 

David Smiley commented on SOLR-7632:


I suggest marking it deprecated, and adding upgrade guide notes to clarify 
what's going on (or what _may_ happen).  Hopefully it will *move* to another 
place, as some of our other plugins have.  I don't think it's a big concern to 
eject a module in a minor version.

> Change the ExtractingRequestHandler to use Tika-Server
> --
>
> Key: SOLR-7632
> URL: https://issues.apache.org/jira/browse/SOLR-7632
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - Solr Cell (Tika extraction)
>Reporter: Chris A. Mattmann
>Assignee: Jan Høydahl
>Priority: Major
>  Labels: gsoc2017, memex, pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h



[jira] [Commented] (SOLR-7632) Change the ExtractingRequestHandler to use Tika-Server

2025-08-29 Thread Eric Pugh (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18017089#comment-18017089
 ] 

Eric Pugh commented on SOLR-7632:
-

Is there any way to mark Tika as "deprecated" in Solr 9, so that WHEN we land a 
replacement in Solr 10, we can remove it in the 10x line?   And NOT have to 
support it through Solr 11.

 

I'm thinking about the fact that Tika does bring a lot of dependencies, and 
that the big win of separating it out is no longer having that dependency 
management work.  It would be nice to experience fewer dependencies in Solr 10, 
and not wait till Solr 11...

> Change the ExtractingRequestHandler to use Tika-Server
> --
>
> Key: SOLR-7632
> URL: https://issues.apache.org/jira/browse/SOLR-7632
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - Solr Cell (Tika extraction)
>Reporter: Chris A. Mattmann
>Assignee: Jan Høydahl
>Priority: Major
>  Labels: gsoc2017, memex, pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h



[jira] [Commented] (SOLR-7632) Change the ExtractingRequestHandler to use Tika-Server

2025-05-22 Thread Jira


[ 
https://issues.apache.org/jira/browse/SOLR-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17953398#comment-17953398
 ] 

Jan Høydahl commented on SOLR-7632:
---

Guys, this effort has been dormant for 2 years. I had a branch where I started 
some experimentation but got lost by the lack of a java-client for Tika Server 
and somewhat poor documentation.

Yesterday I watched the Google I/O keynote and started playing with the Jules 
tool. So in the search for a problem to throw at it, I figured why not find a 
Solr issue that is non-trivial, and see what it can make of it. So I prompted 
it as follows:
{quote}Read https://issues.apache.org/jira/browse/SOLR-7632 which proposes to 
deprecate the old "extraction" module, and replace it with an api-compatible 
new module that instead of parsing rich text documents in-process with Tika, 
will delegate to an externally running Tika-Server. More discussion can be 
found in https://lists.apache.org/thread/lbm6wb88gd1cfktgs6sfvw5xf73o8trd.

Do not focus on deprecating the old module yet. Just make a working PR for the 
new module. You can assume that the user has provisioned a TikaServer on some 
URL. When writing tests for the module, a good idea could be to look at the 
existing tests for "extraction" handler. You can choose whether you mock 
TikaServer API in tests or spin up a TikaServer using TestContainers.

The PullRequest should also add reference guide documentation for the new 
feature. 
{quote}
At first it used Apache HTTPClient and gave the new module an awkward name, so 
I prompted it to change those two things with an additional prompt:
{quote}Please don't use Apache httpClient. Use Jetty httpclient instead, or JDK 
httpclient. Please name the module "tika"
{quote}
And this PR is what it came up with after about 30min: 
[https://github.com/apache/solr/pull/3361] 

I have just skimmed the code and not tested it at all, but thought the 
experiment was so interesting that I wanted to share it in a (Draft) PR which 
will also run the tests. I do not have hopes of any production ready code, and 
perhaps there will be push-back on the legality of using such a large 
contribution from AI. But it gives a glimpse into how the future of software 
development may change.

Next I'll look more into the code and tests and make up my own mind as to 
whether this is good stuff and a possible starting point for the new module.

> Change the ExtractingRequestHandler to use Tika-Server
> --
>
> Key: SOLR-7632
> URL: https://issues.apache.org/jira/browse/SOLR-7632
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - Solr Cell (Tika extraction)
>Reporter: Chris A. Mattmann
>Assignee: Jan Høydahl
>Priority: Major
>  Labels: gsoc2017, memex, pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h



[jira] [Commented] (SOLR-7632) Change the ExtractingRequestHandler to use Tika-Server

2023-03-26 Thread Jira


[ 
https://issues.apache.org/jira/browse/SOLR-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705138#comment-17705138
 ] 

Jan Høydahl commented on SOLR-7632:
---

Following a [DISCUSS 
thread|https://lists.apache.org/thread/lbm6wb88gd1cfktgs6sfvw5xf73o8trd] on dev, 
I'm working on a POC for this, in the hope of ridding the Solr distribution of 
the weight, complexity and inflexibility of bundled Tika.

I already have code that forwards the input stream to Tika-Server and receives 
and parses the JSON response of the parsed doc. Will post a PR eventually.
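
For readers unfamiliar with that round trip, here is a minimal client-side 
sketch, not the POC code. It assumes Tika Server's /rmeta/text endpoint, which 
accepts a PUT of the raw document and returns a JSON array of metadata objects 
with the extracted text under the "X-TIKA:content" key. The comment above does 
not say which endpoint the POC uses, so that choice, the function names, and the 
localhost:9998 URL in the usage lines are assumptions.

```python
import json
import urllib.request

CONTENT_KEY = "X-TIKA:content"  # key under which Tika Server returns the text


def build_rmeta_request(base_url: str, document: bytes, content_type: str):
    """Build (but do not send) a PUT request for Tika Server's /rmeta/text."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/rmeta/text",
        data=document,
        method="PUT",
        headers={"Content-Type": content_type, "Accept": "application/json"},
    )


def parse_rmeta(body: bytes):
    """Split the first metadata object into (text, remaining metadata)."""
    docs = json.loads(body)
    if not docs:
        return "", {}
    first = dict(docs[0])  # element 0 describes the container document
    text = first.pop(CONTENT_KEY, "") or ""
    return text, first


def extract(base_url: str, document: bytes, content_type: str):
    """Send the document to a running Tika Server and parse its reply."""
    request = build_rmeta_request(base_url, document, content_type)
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_rmeta(response.read())


if __name__ == "__main__":
    # Parsing can be tried without a server by feeding it a canned reply.
    canned = b'[{"X-TIKA:content": "Hello", "Content-Type": "text/plain"}]'
    print(parse_rmeta(canned))
    # With a Tika Server assumed to be listening locally:
    # print(extract("http://localhost:9998", b"Hello", "text/plain"))
```

Any further array elements would describe embedded documents; this sketch 
ignores them and only returns the container's text and metadata.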

Once the extraction module no longer depends on all the Tika parsers, we can 
also move XLSXResponseWriter (depending on POI) out to a separate module.

> Change the ExtractingRequestHandler to use Tika-Server
> --
>
> Key: SOLR-7632
> URL: https://issues.apache.org/jira/browse/SOLR-7632
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - Solr Cell (Tika extraction)
>Reporter: Chris A. Mattmann
>Priority: Major
>  Labels: gsoc2017, memex