[jira] [Commented] (SOLR-14973) Solr 8.6 is shipping libraries that are incompatible with each other

2020-11-18 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17234714#comment-17234714
 ] 

Tim Allison commented on SOLR-14973:


Backporting and confirming that I didn't break anything takes a day of 
intermittent work.  If there are plans to do another 8.6.x release, I'll do it. 
 Otherwise, onwards... Thank you [~krisden]!

> Solr 8.6 is shipping libraries that are incompatible with each other
> 
>
> Key: SOLR-14973
> URL: https://issues.apache.org/jira/browse/SOLR-14973
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: contrib - Solr Cell (Tika extraction)
>Affects Versions: 8.6
>Reporter: Samir Huremovic
>Priority: Major
>  Labels: tika-parsers
>
> Hi,
> since Solr 8.6 the version of {{tika-parsers}} was updated to {{1.24}}. This 
> version of {{tika-parsers}} needs the {{poi}} library in version {{4.1.2}} 
> (see https://issues.apache.org/jira/browse/TIKA-3047) 
> Solr has version {{4.1.1}} of poi included.
> This creates (at least) a problem for parsing {{.xls}} files. The following 
> exception gets thrown by trying to post an {{.xls}} file in the techproducts 
> example:
> {{java.lang.NoSuchMethodError: 
> org.apache.poi.hssf.record.common.UnicodeString.getExtendedRst()Lorg/apache/poi/hssf/record/common/ExtRst;}}
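
A mismatch like the one reported above can be spotted before runtime by comparing the jar versions that ship together. The following is only a minimal sketch: the `REQUIRED` mapping holds just the one pairing named in the report (tika-parsers 1.24 expecting poi 4.1.2), and the function names and the file-name convention `artifact-version.jar` are assumptions for illustration, not part of Solr or Tika.

```python
import re
from pathlib import Path

# Taken from the report above: tika-parsers 1.24 expects poi 4.1.2.
# This mapping is illustrative, not an official compatibility table.
REQUIRED = {"poi": "4.1.2"}

# Matches names such as "poi-4.1.1.jar" or "tika-parsers-1.24.jar".
JAR_NAME = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.]*)\.jar$")


def parse_jar_name(name):
    """Split a jar file name into (artifact, version), or return None."""
    match = JAR_NAME.match(name)
    if match is None:
        return None
    return match.group("artifact"), match.group("version")


def find_mismatches(jar_names, required=REQUIRED):
    """Return {artifact: (shipped, expected)} for pinned artifacts that differ."""
    mismatches = {}
    for name in jar_names:
        parsed = parse_jar_name(name)
        if parsed is None:
            continue
        artifact, version = parsed
        expected = required.get(artifact)
        if expected is not None and version != expected:
            mismatches[artifact] = (version, expected)
    return mismatches


def scan_lib_dir(lib_dir):
    """Hypothetical helper: check every *.jar directly inside lib_dir."""
    return find_mismatches(p.name for p in Path(lib_dir).glob("*.jar"))
```

Pointing `scan_lib_dir` at the directory that holds the Solr Cell jars (assumed here to be `contrib/extraction/lib` in an 8.x distribution) would report `poi` as shipped in 4.1.1 while 4.1.2 is expected.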



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14973) Solr 8.6 is shipping libraries that are incompatible with each other

2020-11-17 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233697#comment-17233697
 ] 

Tim Allison commented on SOLR-14973:


Y. I believe that it is fixed in 8.7.0, too.  It looks like the Tika versions 
were upgraded with SOLR-14367, but none of its dependencies.  My fault was in 
not reviewing the commits back then.  Sorry.

https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e4b3fae7 
https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=70d084c 

Should we backport SOLR-14439 to 8.6.x?

> Solr 8.6 is shipping libraries that are incompatible with each other
> 
>
> Key: SOLR-14973
> URL: https://issues.apache.org/jira/browse/SOLR-14973
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: contrib - Solr Cell (Tika extraction)
>Affects Versions: 8.6
>Reporter: Samir Huremovic
>Priority: Major
>  Labels: tika-parsers
>
> Hi,
> since Solr 8.6 the version of {{tika-parsers}} was updated to {{1.24}}. This 
> version of {{tika-parsers}} needs the {{poi}} library in version {{4.1.2}} 
> (see https://issues.apache.org/jira/browse/TIKA-3047) 
> Solr has version {{4.1.1}} of poi included.
> This creates (at least) a problem for parsing {{.xls}} files. The following 
> exception gets thrown by trying to post an {{.xls}} file in the techproducts 
> example:
> {{java.lang.NoSuchMethodError: 
> org.apache.poi.hssf.record.common.UnicodeString.getExtendedRst()Lorg/apache/poi/hssf/record/common/ExtRst;}}






[jira] [Commented] (SOLR-14973) Solr 8.6 is shipping libraries that are incompatible with each other

2020-11-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17228776#comment-17228776
 ] 

Tim Allison commented on SOLR-14973:


 Thank you [~krisden] for the ping.

> Solr 8.6 is shipping libraries that are incompatible with each other
> 
>
> Key: SOLR-14973
> URL: https://issues.apache.org/jira/browse/SOLR-14973
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: contrib - Solr Cell (Tika extraction)
>Affects Versions: 8.6
>Reporter: Samir Huremovic
>Priority: Major
>  Labels: tika-parsers
>
> Hi,
> since Solr 8.6 the version of {{tika-parsers}} was updated to {{1.24}}. This 
> version of {{tika-parsers}} needs the {{poi}} library in version {{4.1.2}} 
> (see https://issues.apache.org/jira/browse/TIKA-3047) 
> Solr has version {{4.1.1}} of poi included.
> This creates (at least) a problem for parsing {{.xls}} files. The following 
> exception gets thrown by trying to post an {{.xls}} file in the techproducts 
> example:
> {{java.lang.NoSuchMethodError: 
> org.apache.poi.hssf.record.common.UnicodeString.getExtendedRst()Lorg/apache/poi/hssf/record/common/ExtRst;}}






[jira] [Updated] (SOLR-14439) Upgrade to Tika 1.24.1

2020-09-11 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated SOLR-14439:
---
Fix Version/s: 8.7
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

> Upgrade to Tika 1.24.1
> --
>
> Key: SOLR-14439
> URL: https://issues.apache.org/jira/browse/SOLR-14439
> Project: Solr
>  Issue Type: Task
>  Components: contrib - DataImportHandler
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Major
> Fix For: 8.7
>
> Attachments: SOLR-14339.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> We recently released 1.24.1 with several fixes for DoS vulnerabilities we 
> found via fuzzing: CVE-2020-9489 https://seclists.org/oss-sec/2020/q2/69






[jira] [Commented] (SOLR-14439) Upgrade to Tika 1.24.1

2020-09-10 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17193732#comment-17193732
 ] 

Tim Allison commented on SOLR-14439:


I'll merge the PR tomorrow (Friday ET) against {{branch_8x}} if there aren't 
any objections.

> Upgrade to Tika 1.24.1
> --
>
> Key: SOLR-14439
> URL: https://issues.apache.org/jira/browse/SOLR-14439
> Project: Solr
>  Issue Type: Task
>  Components: contrib - DataImportHandler
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Major
> Attachments: SOLR-14339.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We recently released 1.24.1 with several fixes for DoS vulnerabilities we 
> found via fuzzing: CVE-2020-9489 https://seclists.org/oss-sec/2020/q2/69






[jira] [Commented] (SOLR-14439) Upgrade to Tika 1.24.1

2020-09-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17193151#comment-17193151
 ] 

Tim Allison commented on SOLR-14439:


Thank you [~erickerickson]!

> Upgrade to Tika 1.24.1
> --
>
> Key: SOLR-14439
> URL: https://issues.apache.org/jira/browse/SOLR-14439
> Project: Solr
>  Issue Type: Task
>  Components: contrib - DataImportHandler
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Major
> Attachments: SOLR-14339.patch
>
>
> We recently released 1.24.1 with several fixes for DoS vulnerabilities we 
> found via fuzzing: CVE-2020-9489 https://seclists.org/oss-sec/2020/q2/69






[jira] [Commented] (SOLR-14439) Upgrade to Tika 1.24.1

2020-09-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17193095#comment-17193095
 ] 

Tim Allison commented on SOLR-14439:


I'm currently working on this against the {{branch_8x}}.  I'll open a PR once I 
get a clean local build and local regression tests are favorable.  Should I 
backport to {{branch_8_6}} or is this too big of a change for that branch?

> Upgrade to Tika 1.24.1
> --
>
> Key: SOLR-14439
> URL: https://issues.apache.org/jira/browse/SOLR-14439
> Project: Solr
>  Issue Type: Task
>  Components: contrib - DataImportHandler
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Major
> Attachments: SOLR-14339.patch
>
>
> We recently released 1.24.1 with several fixes for DoS vulnerabilities we 
> found via fuzzing: CVE-2020-9489 https://seclists.org/oss-sec/2020/q2/69






[jira] [Updated] (SOLR-14439) Upgrade to Tika 1.24.1

2020-09-09 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated SOLR-14439:
---
Status: Patch Available  (was: Open)

> Upgrade to Tika 1.24.1
> --
>
> Key: SOLR-14439
> URL: https://issues.apache.org/jira/browse/SOLR-14439
> Project: Solr
>  Issue Type: Task
>  Components: contrib - DataImportHandler
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Major
> Attachments: SOLR-14339.patch
>
>
> We recently released 1.24.1 with several fixes for DoS vulnerabilities we 
> found via fuzzing: CVE-2020-9489 https://seclists.org/oss-sec/2020/q2/69






[jira] [Commented] (SOLR-13973) Deprecate Tika

2020-09-04 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190716#comment-17190716
 ] 

Tim Allison commented on SOLR-13973:


So that'd be SOLR-7632 as [~erickerickson] pointed out?

> Deprecate Tika
> --
>
> Key: SOLR-13973
> URL: https://issues.apache.org/jira/browse/SOLR-13973
> Project: Solr
>  Issue Type: Improvement
>Reporter: Ishan Chattopadhyaya
>Assignee: Ishan Chattopadhyaya
>Priority: Blocker
> Fix For: 8.7
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Solr's primary responsibility should be to focus on search and scalability. 
> Having to deal with the problems (CVEs) of Velocity, Tika etc. can slow us 
> down. I propose that we deprecate it going forward.
> Tika can be run outside Solr. Going forward, if someone wants to use these, 
> it should be possible to bring them into third-party packages and install 
> them via the package manager.
> The plan is just to throw warnings in logs and add deprecation notes in the 
> reference guide for now. Removal can be done in 9.0.






[jira] [Commented] (SOLR-13973) Deprecate Tika

2020-09-04 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190711#comment-17190711
 ] 

Tim Allison commented on SOLR-13973:


[~mkalkbrenner] I've been thinking about adding an "indexer" endpoint to Tika.  
You'd configure your Solr/ES connection info and error handling choices via 
json at startup and then send the bytes to tika-server's /indexer endpoint.  It 
would parse the file and forward the result to Solr.  Would that simplify 
anything?

I'm thoroughly on board with "don't break the user experience", but we've got 
to get Tika out of Solr's jvm.
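
To make the "configure via JSON at startup" idea above a little more concrete, here is a rough sketch of what such a configuration and its error handling might look like. Everything in it is hypothetical: the key names `solr_update_url` and `on_parse_error`, the two policies, and the `content` field are placeholders chosen for illustration, and no such endpoint or config format is described in this thread.

```python
import json
from dataclasses import dataclass

# Hypothetical error-handling choices; the comment above does not name any.
ERROR_POLICIES = {"skip", "fail"}


@dataclass(frozen=True)
class IndexerConfig:
    solr_update_url: str          # where parsed documents would be forwarded
    on_parse_error: str = "skip"  # what to do when a file cannot be parsed


def load_config(text):
    """Build an IndexerConfig from the JSON supplied at startup."""
    raw = json.loads(text)
    policy = raw.get("on_parse_error", "skip")
    if policy not in ERROR_POLICIES:
        raise ValueError(f"unknown on_parse_error policy: {policy!r}")
    return IndexerConfig(solr_update_url=raw["solr_update_url"],
                         on_parse_error=policy)


def to_solr_doc(config, doc_id, parse):
    """Run the parse callable; return a Solr document, or None if skipped."""
    try:
        text = parse()
    except Exception:
        if config.on_parse_error == "fail":
            raise
        return None  # "skip": drop the file and carry on
    return {"id": doc_id, "content": text}
```

The point of the split is that the caller decides once, at startup, whether a bad file should abort the run or be dropped, rather than each client handling parse failures on its own.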

> Deprecate Tika
> --
>
> Key: SOLR-13973
> URL: https://issues.apache.org/jira/browse/SOLR-13973
> Project: Solr
>  Issue Type: Improvement
>Reporter: Ishan Chattopadhyaya
>Assignee: Ishan Chattopadhyaya
>Priority: Blocker
> Fix For: 8.7
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Solr's primary responsibility should be to focus on search and scalability. 
> Having to deal with the problems (CVEs) of Velocity, Tika etc. can slow us 
> down. I propose that we deprecate it going forward.
> Tika can be run outside Solr. Going forward, if someone wants to use these, 
> it should be possible to bring them into third-party packages and install 
> them via the package manager.
> The plan is just to throw warnings in logs and add deprecation notes in the 
> reference guide for now. Removal can be done in 9.0.






[jira] [Commented] (SOLR-13973) Deprecate Tika

2020-07-15 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17158723#comment-17158723
 ] 

Tim Allison commented on SOLR-13973:


For ease of use with SolrJ and several other use cases(?), we could add a 
tika-client in the Tika project?

> Deprecate Tika
> --
>
> Key: SOLR-13973
> URL: https://issues.apache.org/jira/browse/SOLR-13973
> Project: Solr
>  Issue Type: Improvement
>Reporter: Ishan Chattopadhyaya
>Priority: Blocker
> Fix For: 8.7
>
>
> Solr's primary responsibility should be to focus on search and scalability. 
> Having to deal with the problems (CVEs) of Velocity, Tika etc. can slow us 
> down. I propose that we deprecate it going forward.
> Tika can be run outside Solr. Going forward, if someone wants to use these, 
> it should be possible to bring them into third-party packages and install 
> them via the package manager.
> The plan is just to throw warnings in logs and add deprecation notes in the 
> reference guide for now. Removal can be done in 9.0.






[jira] [Commented] (SOLR-13973) Deprecate Tika

2020-07-15 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17158719#comment-17158719
 ] 

Tim Allison commented on SOLR-13973:


I cannot express the joy that will come to me, whether I'm the one to do it or 
not, to take out the kitchen sink of dependencies that Tika has forced on Solr.

> Deprecate Tika
> --
>
> Key: SOLR-13973
> URL: https://issues.apache.org/jira/browse/SOLR-13973
> Project: Solr
>  Issue Type: Improvement
>Reporter: Ishan Chattopadhyaya
>Priority: Blocker
> Fix For: 8.7
>
>
> Solr's primary responsibility should be to focus on search and scalability. 
> Having to deal with the problems (CVEs) of Velocity, Tika etc. can slow us 
> down. I propose that we deprecate it going forward.
> Tika can be run outside Solr. Going forward, if someone wants to use these, 
> it should be possible to bring them into third-party packages and install 
> them via the package manager.
> The plan is just to throw warnings in logs and add deprecation notes in the 
> reference guide for now. Removal can be done in 9.0.






[jira] [Comment Edited] (SOLR-13973) Deprecate Tika

2020-07-15 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17158719#comment-17158719
 ] 

Tim Allison edited comment on SOLR-13973 at 7/15/20, 9:35 PM:
--

I cannot express the joy that will come to me, whether I'm the one to do it or 
not, to take out the kitchen sink of dependencies that Tika has forced on Solr.

If we do want a forwarding option within tika-server, please chime in on 
TIKA-3093.

Otherwise, please let me know how I can help.  I suspect [~epugh] has a better 
sense of how to get started, and I stand by to help him.


was (Author: talli...@mitre.org):
I cannot express the joy that will come to me, whether I'm the one to do it or 
not, to take out the kitchen sink of dependencies that Tika has forced on Solr.

> Deprecate Tika
> --
>
> Key: SOLR-13973
> URL: https://issues.apache.org/jira/browse/SOLR-13973
> Project: Solr
>  Issue Type: Improvement
>Reporter: Ishan Chattopadhyaya
>Priority: Blocker
> Fix For: 8.7
>
>
> Solr's primary responsibility should be to focus on search and scalability. 
> Having to deal with the problems (CVEs) of Velocity, Tika etc. can slow us 
> down. I propose that we deprecate it going forward.
> Tika can be run outside Solr. Going forward, if someone wants to use these, 
> it should be possible to bring them into third-party packages and install 
> them via the package manager.
> The plan is just to throw warnings in logs and add deprecation notes in the 
> reference guide for now. Removal can be done in 9.0.






[jira] [Commented] (SOLR-13973) Deprecate Tika

2020-07-15 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17158678#comment-17158678
 ] 

Tim Allison commented on SOLR-13973:


I think [~epugh] just volunteered for this! 

> Deprecate Tika
> --
>
> Key: SOLR-13973
> URL: https://issues.apache.org/jira/browse/SOLR-13973
> Project: Solr
>  Issue Type: Improvement
>Reporter: Ishan Chattopadhyaya
>Priority: Blocker
> Fix For: 8.7
>
>
> Solr's primary responsibility should be to focus on search and scalability. 
> Having to deal with the problems (CVEs) of Velocity, Tika etc. can slow us 
> down. I propose that we deprecate it going forward.
> Tika can be run outside Solr. Going forward, if someone wants to use these, 
> it should be possible to bring them into third-party packages and install 
> them via the package manager.
> The plan is just to throw warnings in logs and add deprecation notes in the 
> reference guide for now. Removal can be done in 9.0.






[jira] [Comment Edited] (SOLR-13973) Deprecate Tika

2020-07-15 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17158639#comment-17158639
 ] 

Tim Allison edited comment on SOLR-13973 at 7/15/20, 7:15 PM:
--

I've been toying with adding a forwarding capability to tika-server 
(TIKA-3093).  So, if you curl a document to tika-server {{/tika2solr}}, we'd 
parse it with Tika inside tika-server and forward the extracted text to Solr.  
This would keep the dangerous part (Tika parsing a document) out of the client 
code.


was (Author: talli...@mitre.org):
I've been toying with adding a forwarding capability to tika-server.  So, if 
you curl a document to tika-server {{/tika2solr}}, we'd parse it with Tika 
inside tika-server and forward the extracted text to Solr.  This would keep the 
dangerous part (Tika parsing a document) out of the client code.

> Deprecate Tika
> --
>
> Key: SOLR-13973
> URL: https://issues.apache.org/jira/browse/SOLR-13973
> Project: Solr
>  Issue Type: Improvement
>Reporter: Ishan Chattopadhyaya
>Priority: Blocker
> Fix For: 8.7
>
>
> Solr's primary responsibility should be to focus on search and scalability. 
> Having to deal with the problems (CVEs) of Velocity, Tika etc. can slow us 
> down. I propose that we deprecate it going forward.
> Tika can be run outside Solr. Going forward, if someone wants to use these, 
> it should be possible to bring them into third-party packages and install 
> them via the package manager.
> The plan is just to throw warnings in logs and add deprecation notes in the 
> reference guide for now. Removal can be done in 9.0.






[jira] [Commented] (SOLR-13973) Deprecate Tika

2020-07-15 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17158639#comment-17158639
 ] 

Tim Allison commented on SOLR-13973:


I've been toying with adding a forwarding capability to tika-server.  So, if 
you curl a document to tika-server {{/tika2solr}}, we'd parse it with Tika 
inside tika-server and forward the extracted text to Solr.  This would keep the 
dangerous part (Tika parsing a document) out of the client code.
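
Until something like {{/tika2solr}} exists, the same two hops can be done from a small script outside Solr's JVM. The sketch below is an approximation of that flow, not the proposed feature: it assumes a tika-server reachable at a URL you supply, uses its {{/tika}} endpoint for plain-text extraction, and posts the result to a Solr core's JSON {{/update}} handler. The function names and the {{content}} field are assumptions and would need to match your own schema.

```python
import json
import urllib.request


def tika_request(tika_url, data):
    """Build a PUT to tika-server's /tika endpoint asking for plain text."""
    return urllib.request.Request(
        tika_url.rstrip("/") + "/tika",
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def solr_request(solr_core_url, doc_id, text):
    """Build a POST to the core's JSON update handler with one document."""
    # "content" is an assumed field name; adjust it to the target schema.
    body = json.dumps([{"id": doc_id, "content": text}]).encode("utf-8")
    return urllib.request.Request(
        solr_core_url.rstrip("/") + "/update?commit=true",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def forward(tika_url, solr_core_url, doc_id, data):
    """Extract text with tika-server, then index it in Solr."""
    with urllib.request.urlopen(tika_request(tika_url, data)) as resp:
        text = resp.read().decode("utf-8")
    with urllib.request.urlopen(solr_request(solr_core_url, doc_id, text)) as resp:
        return resp.status
```

A call might look like {{forward("http://localhost:9998", "http://localhost:8983/solr/techproducts", "doc1", open("report.xls", "rb").read())}}, where both URLs and the file name are placeholders. A parser crash or runaway memory use then takes down tika-server rather than Solr.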

> Deprecate Tika
> --
>
> Key: SOLR-13973
> URL: https://issues.apache.org/jira/browse/SOLR-13973
> Project: Solr
>  Issue Type: Improvement
>Reporter: Ishan Chattopadhyaya
>Priority: Blocker
> Fix For: 8.7
>
>
> Solr's primary responsibility should be to focus on search and scalability. 
> Having to deal with the problems (CVEs) of Velocity, Tika etc. can slow us 
> down. I propose that we deprecate it going forward.
> Tika can be run outside Solr. Going forward, if someone wants to use these, 
> it should be possible to bring them into third-party packages and install 
> them via the package manager.
> The plan is just to throw warnings in logs and add deprecation notes in the 
> reference guide for now. Removal can be done in 9.0.






[jira] [Comment Edited] (SOLR-13973) Deprecate Tika

2020-07-15 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17158629#comment-17158629
 ] 

Tim Allison edited comment on SOLR-13973 at 7/15/20, 7:06 PM:
--

[~ichattopadhyaya], thank you for the ping!

Y, I might be able to find some time to work on this over the next few weeks.

How do I start?  Do I have the freedom to start from greenfields (use 
tika-server), or do we need seamless migration with the same capabilities?


was (Author: talli...@mitre.org):
Y, I might be able to find some time to work on this over the next few weeks.

How do I start?  Do I have the freedom to start from greenfields (use 
tika-server), or do we need seamless migration with the same capabilities?

> Deprecate Tika
> --
>
> Key: SOLR-13973
> URL: https://issues.apache.org/jira/browse/SOLR-13973
> Project: Solr
>  Issue Type: Improvement
>Reporter: Ishan Chattopadhyaya
>Priority: Blocker
> Fix For: 8.7
>
>
> Solr's primary responsibility should be to focus on search and scalability. 
> Having to deal with the problems (CVEs) of Velocity, Tika etc. can slow us 
> down. I propose that we deprecate it going forward.
> Tika can be run outside Solr. Going forward, if someone wants to use these, 
> it should be possible to bring them into third-party packages and install 
> them via the package manager.
> The plan is just to throw warnings in logs and add deprecation notes in the 
> reference guide for now. Removal can be done in 9.0.






[jira] [Commented] (SOLR-13973) Deprecate Tika

2020-07-15 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17158629#comment-17158629
 ] 

Tim Allison commented on SOLR-13973:


Y, I might be able to find some time to work on this over the next few weeks.

How do I start?  Do I have the freedom to start from greenfields (use 
tika-server), or do we need seamless migration with the same capabilities?

> Deprecate Tika
> --
>
> Key: SOLR-13973
> URL: https://issues.apache.org/jira/browse/SOLR-13973
> Project: Solr
>  Issue Type: Improvement
>Reporter: Ishan Chattopadhyaya
>Priority: Blocker
> Fix For: 8.7
>
>
> Solr's primary responsibility should be to focus on search and scalability. 
> Having to deal with the problems (CVEs) of Velocity, Tika etc. can slow us 
> down. I propose that we deprecate it going forward.
> Tika can be run outside Solr. Going forward, if someone wants to use these, 
> it should be possible to bring them into third-party packages and install 
> them via the package manager.
> The plan is just to throw warnings in logs and add deprecation notes in the 
> reference guide for now. Removal can be done in 9.0.






[jira] [Created] (SOLR-14439) Upgrade to Tika 1.24.1

2020-04-27 Thread Tim Allison (Jira)
Tim Allison created SOLR-14439:
--

 Summary: Upgrade to Tika 1.24.1
 Key: SOLR-14439
 URL: https://issues.apache.org/jira/browse/SOLR-14439
 Project: Solr
  Issue Type: Task
  Security Level: Public (Default Security Level. Issues are Public)
  Components: contrib - DataImportHandler
Reporter: Tim Allison
Assignee: Tim Allison


We recently released 1.24.1 with several fixes for DoS vulnerabilities we found 
via fuzzing: CVE-2020-9489 https://seclists.org/oss-sec/2020/q2/69






[jira] [Commented] (SOLR-14367) Upgrade Tika to 1.24

2020-03-27 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068672#comment-17068672
 ] 

Tim Allison commented on SOLR-14367:


Ha...ok, our posts passed in the ether.  I'll stand down.  I'm more than happy 
to take this, though.  Let me know if you have luck.

> Upgrade Tika to 1.24
> 
>
> Key: SOLR-14367
> URL: https://issues.apache.org/jira/browse/SOLR-14367
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 8.5
>Reporter: mibo
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Upgrade Apache Tika to the newly released 1.24 to handle 
> [CVE-2020-1950|https://nvd.nist.gov/vuln/detail/CVE-2020-1950].
> Created [PR #1383|https://github.com/apache/lucene-solr/pull/1383] but 
> afterwards I found https://issues.apache.org/jira/browse/SOLR-14054 and it 
> looks like an update is much more complicated.
> If someone supports me, I will update my contribution.






[jira] [Commented] (SOLR-14367) Upgrade Tika to 1.24

2020-03-27 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068666#comment-17068666
 ] 

Tim Allison commented on SOLR-14367:


Y, we upgraded a bunch of dependencies and had to do some awful forking for 
metadata-extractor.  I'll take this.

[~mirbo], I strongly, strongly encourage you and everyone to avoid using the 
Tika integration with Solr. 

https://cwiki.apache.org/confluence/display/TIKA/UpgradingTikaInSolr

> Upgrade Tika to 1.24
> 
>
> Key: SOLR-14367
> URL: https://issues.apache.org/jira/browse/SOLR-14367
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 8.5
>Reporter: mibo
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Upgrade Apache Tika to the newly released 1.24 to handle 
> [CVE-2020-1950|https://nvd.nist.gov/vuln/detail/CVE-2020-1950].
> Created [PR #1383|https://github.com/apache/lucene-solr/pull/1383] but 
> afterwards I found https://issues.apache.org/jira/browse/SOLR-14054 and it 
> looks like an update is much more complicated.
> If someone supports me, I will update my contribution.






[jira] [Commented] (SOLR-14367) Upgrade Tika to 1.24

2020-03-27 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068660#comment-17068660
 ] 

Tim Allison commented on SOLR-14367:


I'll take a look.  That was a one-off problem, but upgrading is always a 
nightmare, and I look forward to getting Tika out of Solr asap.

Here's my idiot's guide: 
https://cwiki.apache.org/confluence/display/TIKA/UpgradingTikaInSolr


> Upgrade Tika to 1.24
> 
>
> Key: SOLR-14367
> URL: https://issues.apache.org/jira/browse/SOLR-14367
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 8.5
>Reporter: mibo
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Upgrade Apache Tika to the newly released 1.24 to handle 
> [CVE-2020-1950|https://nvd.nist.gov/vuln/detail/CVE-2020-1950].
> Created [PR #1383|https://github.com/apache/lucene-solr/pull/1383] but 
> afterwards I found https://issues.apache.org/jira/browse/SOLR-14054 and it 
> looks like an update is much more complicated.
> If someone supports me, I will update my contribution.






[jira] [Commented] (SOLR-14054) Upgrade Tika to 1.23

2020-03-11 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17057147#comment-17057147
 ] 

Tim Allison commented on SOLR-14054:


Thank you!  I realize it is trivial for you.  Onward!

> Upgrade Tika to 1.23
> 
>
> Key: SOLR-14054
> URL: https://issues.apache.org/jira/browse/SOLR-14054
> Project: Solr
>  Issue Type: Task
>  Components: contrib - DataImportHandler
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 8.5
>
> Attachments: test-documents.7z, 
> tika-integration-example-9.0.0-SNAPSHOT.tgz
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We just released 1.23.  Let's upgrade Tika.






[jira] [Commented] (SOLR-14054) Upgrade Tika to 1.23

2020-03-11 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17057143#comment-17057143
 ] 

Tim Allison commented on SOLR-14054:


Y.  I think that'd be best for Solr 8.x. The problem disappears in master with 
Java > 8.  Would you be willing to take that, or should I give it a spin?

> Upgrade Tika to 1.23
> 
>
> Key: SOLR-14054
> URL: https://issues.apache.org/jira/browse/SOLR-14054
> Project: Solr
>  Issue Type: Task
>  Components: contrib - DataImportHandler
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 8.5
>
> Attachments: test-documents.7z, 
> tika-integration-example-9.0.0-SNAPSHOT.tgz
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We just released 1.23.  Let's upgrade Tika.






[jira] [Commented] (SOLR-14054) Upgrade Tika to 1.23

2020-03-10 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056413#comment-17056413
 ] 

Tim Allison commented on SOLR-14054:


Would something like this be acceptable?

https://stackoverflow.com/a/24497206

> Upgrade Tika to 1.23
> 
>
> Key: SOLR-14054
> URL: https://issues.apache.org/jira/browse/SOLR-14054
> Project: Solr
>  Issue Type: Task
>  Components: contrib - DataImportHandler
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 8.5
>
> Attachments: test-documents.7z, 
> tika-integration-example-9.0.0-SNAPSHOT.tgz
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We just released 1.23.  Let's upgrade Tika.






[jira] [Comment Edited] (SOLR-14054) Upgrade Tika to 1.23

2020-03-10 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056403#comment-17056403
 ] 

Tim Allison edited comment on SOLR-14054 at 3/10/20, 8:54 PM:
--

We use xerces 2.12.0 which brings in xml-apis 1.4.01, which is needed by Java 
8...see above.  In master, we get rid of xml-apis because we don't need it with 
Java > 8.  

Any recommendations for a fix in 8.x when building with Java > 8?

Is there an ant/ivy version of maven's profiles, activated by Java > 8, e.g.: 
https://github.com/apache/pdfbox/blob/trunk/parent/pom.xml#L176 ?
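
For illustration only: whatever build mechanism is used, the condition such a profile keys on is the Java feature version. A minimal Java sketch of that check, assuming the standard java.specification.version system property. The class and method names are hypothetical and are not part of the Solr or PDFBox builds.

```java
/** Sketch: decide whether a JVM is newer than Java 8 (hypothetical helper). */
final class JavaVersionCheck {

    /**
     * Parses a java.specification.version value into a feature number.
     * Java 8 and earlier report "1.8", "1.7", ...; Java 9+ report "9", "11", ...
     */
    static int featureVersion(String specVersion) {
        String v = specVersion.trim();
        if (v.startsWith("1.")) {
            v = v.substring(2);          // "1.8" -> "8"
        }
        int dot = v.indexOf('.');
        if (dot >= 0) {
            v = v.substring(0, dot);     // defensive: drop anything after a dot
        }
        return Integer.parseInt(v);
    }

    /** True for Java 9 and later, where the thread says xml-apis is not needed. */
    static boolean isNewerThanJava8(String specVersion) {
        return featureVersion(specVersion) > 8;
    }

    public static void main(String[] args) {
        String spec = System.getProperty("java.specification.version");
        System.out.println("java.specification.version = " + spec);
        System.out.println("newer than Java 8: " + isNewerThanJava8(spec));
    }
}
```

A build could use the same test to decide whether the extra xml-apis jar has to be put on the classpath.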


was (Author: talli...@mitre.org):
We use xerces 2.12.0 which brings in xml-apis 1.4.01, which is needed by Java 
8...see above.  In master, we get rid of xml-apis because we don't need it with 
Java > 8.  

Any recommendations for a fix in 8.x when building with Java > 8?

> Upgrade Tika to 1.23
> 
>
> Key: SOLR-14054
> URL: https://issues.apache.org/jira/browse/SOLR-14054
> Project: Solr
>  Issue Type: Task
>  Components: contrib - DataImportHandler
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 8.5
>
> Attachments: test-documents.7z, 
> tika-integration-example-9.0.0-SNAPSHOT.tgz
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We just released 1.23.  Let's upgrade Tika.






[jira] [Comment Edited] (SOLR-14054) Upgrade Tika to 1.23

2020-03-10 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056403#comment-17056403
 ] 

Tim Allison edited comment on SOLR-14054 at 3/10/20, 8:45 PM:
--

We use xerces 2.12.0 which brings in xml-apis 1.4.01, which is needed by Java 
8...see above.  In master, we get rid of xml-apis because we don't need it with 
Java > 8.  

Any recommendations for a fix in 8.x when building with Java > 8?


was (Author: talli...@mitre.org):
We use xerces 2.12.0 which brings in xml-apis 1.4.01, which is needed by Java 
8...see above.  In master, we get rid of xml-apis because we don't need it with 
Java > 8.  

Any recommendations for a fix in 8.x?

> Upgrade Tika to 1.23
> 
>
> Key: SOLR-14054
> URL: https://issues.apache.org/jira/browse/SOLR-14054
> Project: Solr
>  Issue Type: Task
>  Components: contrib - DataImportHandler
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 8.5
>
> Attachments: test-documents.7z, 
> tika-integration-example-9.0.0-SNAPSHOT.tgz
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We just released 1.23.  Let's upgrade Tika.






[jira] [Commented] (SOLR-14054) Upgrade Tika to 1.23

2020-03-10 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056403#comment-17056403
 ] 

Tim Allison commented on SOLR-14054:


We use xerces 2.12.0 which brings in xml-apis 1.4.01, which is needed by Java 
8...see above.  In master, we get rid of xml-apis because we don't need it with 
Java > 8.  

Any recommendations for a fix in 8.x?

> Upgrade Tika to 1.23
> 
>
> Key: SOLR-14054
> URL: https://issues.apache.org/jira/browse/SOLR-14054
> Project: Solr
>  Issue Type: Task
>  Components: contrib - DataImportHandler
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 8.5
>
> Attachments: test-documents.7z, 
> tika-integration-example-9.0.0-SNAPSHOT.tgz
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We just released 1.23.  Let's upgrade Tika.






[jira] [Commented] (SOLR-14318) Missing dependency on commons-lang in solr-cell 8.4.1

2020-03-10 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056344#comment-17056344
 ] 

Tim Allison commented on SOLR-14318:


Y.  Confirmed: we removed commons-lang from Tika in 1.23, so the fix arrives with Solr 8.5.
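
A quick way to check whether a given classpath is affected is to probe for the class named in the reported NoClassDefFoundError. A minimal sketch, assuming it is run with the same classpath as solr-cell. ClasspathProbe is a hypothetical helper and not part of Solr or Tika.

```java
/** Sketch: report whether a class can be resolved on the current classpath. */
final class ClasspathProbe {

    /** Returns true if the named class can be located without initializing it. */
    static boolean isPresent(String className) {
        try {
            Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // LinkageError covers NoClassDefFoundError from a half-present dependency.
            return false;
        }
    }

    public static void main(String[] args) {
        String[] names = {
            "java.lang.String",                       // always present, sanity check
            "org.apache.commons.lang.ObjectUtils"     // class named in the report
        };
        for (String name : names) {
            System.out.println(name + " -> " + (isPresent(name) ? "present" : "MISSING"));
        }
    }
}
```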

> Missing dependency on commons-lang in solr-cell 8.4.1
> -
>
> Key: SOLR-14318
> URL: https://issues.apache.org/jira/browse/SOLR-14318
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: contrib - Solr Cell (Tika extraction)
>Affects Versions: 8.4.1
>Reporter: Markus Günther
>Priority: Minor
>
> During a migration from Solr 7.x to Solr 8.4.1 we noticed that the 
> commons-lang:commons-lang:2.6 dependency has been removed, and thus, no 
> longer is part of org.apache.solr:solr-cell. solr-cell however comes bundled 
> with Apache Tika Parsers (org.apache.tika:tika-parsers) in version 1.19.1 
> which - although it is not an explicit dependency - does require 
> commons-lang:commons-lang:2.6.
> This raises an issue when trying to extract the content from Microsoft Access 
> database files using Tika. See the stacktrace below.
> {code:java}
> java.lang.NoClassDefFoundError: org/apache/commons/lang/ObjectUtils
>  at com.healthmarketscience.jackcess.util.SimpleColumnMatcher.equals(SimpleColumnMatcher.java:74)
>  at com.healthmarketscience.jackcess.util.SimpleColumnMatcher.matches(SimpleColumnMatcher.java:46)
>  at com.healthmarketscience.jackcess.util.CaseInsensitiveColumnMatcher.matches(CaseInsensitiveColumnMatcher.java:49)
>  at com.healthmarketscience.jackcess.impl.CursorImpl.currentRowMatchesImpl(CursorImpl.java:571)
>  at com.healthmarketscience.jackcess.impl.CursorImpl.findAnotherRowImpl(CursorImpl.java:627)
>  at com.healthmarketscience.jackcess.impl.CursorImpl.findAnotherRow(CursorImpl.java:517)
>  at com.healthmarketscience.jackcess.impl.CursorImpl.findFirstRow(CursorImpl.java:494)
>  at com.healthmarketscience.jackcess.impl.DatabaseImpl$FallbackTableFinder.findRow(DatabaseImpl.java:2376)
>  at com.healthmarketscience.jackcess.impl.DatabaseImpl$TableFinder.findObjectId(DatabaseImpl.java:2176)
>  at com.healthmarketscience.jackcess.impl.DatabaseImpl.readSystemCatalog(DatabaseImpl.java:879)
>  at com.healthmarketscience.jackcess.impl.DatabaseImpl.<init>(DatabaseImpl.java:534)
>  at com.healthmarketscience.jackcess.impl.DatabaseImpl.open(DatabaseImpl.java:401)
>  at com.healthmarketscience.jackcess.DatabaseBuilder.open(DatabaseBuilder.java:252)
>  at org.apache.tika.parser.microsoft.JackcessParser.parse(JackcessParser.java:94)
>  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>  at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>  at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
>  at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
>  at org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageParser.java:350)
>  at org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:287)
>  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>  at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>  at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)
>  at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
>  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:211)
>  at org.apache.solr.core.SolrCore.execute(SolrCore.java:2596)
>  at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:799)
>  at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:578)
>  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:419)
>  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:351)
>  at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1602)
>  at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)
>  at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)
>  at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
>  at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
>  at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
>  at 
> 

[jira] [Commented] (SOLR-14054) Upgrade Tika to 1.23

2020-03-10 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056259#comment-17056259
 ] 

Tim Allison commented on SOLR-14054:


Looking...

> Upgrade Tika to 1.23
> 
>
> Key: SOLR-14054
> URL: https://issues.apache.org/jira/browse/SOLR-14054
> Project: Solr
>  Issue Type: Task
>  Components: contrib - DataImportHandler
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 8.5
>
> Attachments: test-documents.7z, 
> tika-integration-example-9.0.0-SNAPSHOT.tgz
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We just released 1.23.  Let's upgrade Tika.






[jira] [Commented] (SOLR-14054) Upgrade Tika to 1.23

2019-12-20 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17001255#comment-17001255
 ] 

Tim Allison commented on SOLR-14054:


Y

> Upgrade Tika to 1.23
> 
>
> Key: SOLR-14054
> URL: https://issues.apache.org/jira/browse/SOLR-14054
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: contrib - DataImportHandler
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 8.5
>
> Attachments: test-documents.7z, 
> tika-integration-example-9.0.0-SNAPSHOT.tgz
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We just released 1.23.  Let's upgrade Tika.






[jira] [Commented] (SOLR-14066) Deprecate DIH

2019-12-19 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1778#comment-1778
 ] 

Tim Allison commented on SOLR-14066:


If I'm not tracking this when it happens, please ping me on Tika stuff.  I'm 
happy to chip in and thrilled to get Tika out of Solr.

> Deprecate DIH
> -
>
> Key: SOLR-14066
> URL: https://issues.apache.org/jira/browse/SOLR-14066
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: contrib - DataImportHandler
>Reporter: Ishan Chattopadhyaya
>Assignee: Ishan Chattopadhyaya
>Priority: Major
> Attachments: image-2019-12-14-19-58-39-314.png
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> DataImportHandler has outlived its utility. DIH doesn't need to remain inside 
> Solr anymore. Let us deprecate DIH in 8.4 (and remove it from the Solr distro 
> in 9x or 10x).






[jira] [Comment Edited] (SOLR-14054) Upgrade Tika to 1.23

2019-12-19 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1772#comment-1772
 ] 

Tim Allison edited comment on SOLR-14054 at 12/19/19 1:39 PM:
--

I think this is resolved now: 
[https://jenkins.thetaphi.de/view/Lucene-Solr/job/Lucene-Solr-8.x-Solaris/463/] 
and the other test failures look unrelated. Please re-open if there are any 
(more) surprises.  Thank you, [~krisden], [~hossman]  and [~dweiss]!


was (Author: talli...@mitre.org):
I think this is resolved now: [|https://jenkins.thetaphi.de/view/Lucene-Solr/] 
[https://jenkins.thetaphi.de/view/Lucene-Solr/job/Lucene-Solr-8.4-Linux/] and 
Windows.  Please re-open if there are any (more) surprises.  Thank you, 
[~krisden], [~hossman]  and [~dweiss]!

> Upgrade Tika to 1.23
> 
>
> Key: SOLR-14054
> URL: https://issues.apache.org/jira/browse/SOLR-14054
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: contrib - DataImportHandler
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 8.5
>
> Attachments: test-documents.7z, 
> tika-integration-example-9.0.0-SNAPSHOT.tgz
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We just released 1.23.  Let's upgrade Tika.






[jira] [Resolved] (SOLR-14054) Upgrade Tika to 1.23

2019-12-19 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved SOLR-14054.

Resolution: Fixed

I think this is resolved now: [https://jenkins.thetaphi.de/view/Lucene-Solr/] 
[https://jenkins.thetaphi.de/view/Lucene-Solr/job/Lucene-Solr-8.4-Linux/] and 
Windows.  Please re-open if there are any (more) surprises.  Thank you, 
[~krisden], [~hossman]  and [~dweiss]!

> Upgrade Tika to 1.23
> 
>
> Key: SOLR-14054
> URL: https://issues.apache.org/jira/browse/SOLR-14054
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: contrib - DataImportHandler
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 8.5
>
> Attachments: test-documents.7z, 
> tika-integration-example-9.0.0-SNAPSHOT.tgz
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We just released 1.23.  Let's upgrade Tika.






[jira] [Created] (SOLR-14113) Add more file types to DIH's unit tests to ensure dependency coverage

2019-12-18 Thread Tim Allison (Jira)
Tim Allison created SOLR-14113:
--

 Summary: Add more file types to DIH's unit tests to ensure 
dependency coverage
 Key: SOLR-14113
 URL: https://issues.apache.org/jira/browse/SOLR-14113
 Project: Solr
  Issue Type: Task
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Tim Allison


As part of SOLR-14054, [~dweiss] noted that the unit tests pass without the 
commons-csv dependency, which is, in fact, required if a csv file is sent to 
DIH.  Let's add several more file types to the unit tests to include dependency 
coverage.






[jira] [Commented] (SOLR-14054) Upgrade Tika to 1.23

2019-12-18 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16999295#comment-16999295
 ] 

Tim Allison commented on SOLR-14054:


Got it.  Thank you, [~dweiss]. I see that the Lucene benchmarks module also 
relies on xerces.  Should I add a dependency on xml-apis there, too?  Or, given 
that its unit tests pass, should we hope for the best?

> Upgrade Tika to 1.23
> 
>
> Key: SOLR-14054
> URL: https://issues.apache.org/jira/browse/SOLR-14054
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: contrib - DataImportHandler
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 8.5
>
> Attachments: test-documents.7z, 
> tika-integration-example-9.0.0-SNAPSHOT.tgz
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We just released 1.23.  Let's upgrade Tika.






[jira] [Commented] (SOLR-14054) Upgrade Tika to 1.23

2019-12-18 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16999254#comment-16999254
 ] 

Tim Allison commented on SOLR-14054:


[~dweiss], will do on a separate issue if that's ok. 

 

You can tell Tika to avoid loading the TextAndCSVParser and use the TXTParser 
instead via tika-config.xml.  If you'd prefer this behavior either offline or 
in Solr, I can show you how to do that.
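
As a rough sketch of what such a tika-config.xml might contain, the Java snippet below builds the file content and checks only that it is well-formed XML. The parser-exclude element and the class names org.apache.tika.parser.csv.TextAndCSVParser and org.apache.tika.parser.txt.TXTParser are assumptions based on Tika 1.x conventions and should be checked against the Tika version in use. TikaConfigSketch is a hypothetical name.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

/** Sketch: candidate tika-config.xml that swaps TextAndCSVParser for TXTParser. */
final class TikaConfigSketch {

    /** Returns the assumed config: default parsers minus the CSV parser, plus TXTParser. */
    static String configXml() {
        return String.join("\n",
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>",
            "<properties>",
            "  <parsers>",
            "    <!-- default parsers, minus the CSV-aware text parser (assumed syntax) -->",
            "    <parser class=\"org.apache.tika.parser.DefaultParser\">",
            "      <parser-exclude class=\"org.apache.tika.parser.csv.TextAndCSVParser\"/>",
            "    </parser>",
            "    <!-- plain-text parser, which does not need commons-csv -->",
            "    <parser class=\"org.apache.tika.parser.txt.TXTParser\"/>",
            "  </parsers>",
            "</properties>",
            "");
    }

    /** Checks well-formedness only; it does not validate against Tika itself. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.print(configXml());
        System.out.println("well-formed: " + isWellFormed(configXml()));
    }
}
```

Whether csv files then fall through to the TXTParser depends on how Tika resolves media types, so this needs to be verified with a real csv document.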

> Upgrade Tika to 1.23
> 
>
> Key: SOLR-14054
> URL: https://issues.apache.org/jira/browse/SOLR-14054
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: contrib - DataImportHandler
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 8.5
>
> Attachments: test-documents.7z, 
> tika-integration-example-9.0.0-SNAPSHOT.tgz
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We just released 1.23.  Let's upgrade Tika.






[jira] [Commented] (SOLR-14054) Upgrade Tika to 1.23

2019-12-18 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16999247#comment-16999247
 ] 

Tim Allison commented on SOLR-14054:


[~hossman] I can replicate this now.  I should have caught this before the 
commit.  I clearly tested with Java 11 when I thought I was testing with Java 
8.  This is my fault.

 

The problem is solved if we add the xml-apis dependency, which xerces requires. 
 It looks like the earlier version of xerces didn't happen to require xml-apis 
on the execution paths the unit tests were exercising.  I can't explain why 
this isn't a problem with Java 11.
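
One way to see which XML parser implementation a JVM actually picks up (the JDK's built-in one or an external xerces jar) is to ask JAXP directly. A minimal sketch using only javax.xml.parsers; XmlParserReport is a hypothetical name, and this does not by itself explain the Java 8 versus Java 11 difference.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

/** Sketch: report the JAXP implementation in use and exercise a DOM parse. */
final class XmlParserReport {

    /** Class name of the DocumentBuilderFactory that JAXP resolved. */
    static String factoryImplementation() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    /** Parses the given XML and returns the root element name. */
    static String rootElementName(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getNodeName();
        } catch (Exception e) {
            // Wrap checked parser exceptions so callers need no throws clause.
            throw new IllegalStateException("XML parse failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("DocumentBuilderFactory: " + factoryImplementation());
        System.out.println("root element: " + rootElementName("<root><child/></root>"));
    }
}
```

Running it under Java 8 and Java 11 with the same classpath shows whether both resolve the same factory class.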

> Upgrade Tika to 1.23
> 
>
> Key: SOLR-14054
> URL: https://issues.apache.org/jira/browse/SOLR-14054
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: contrib - DataImportHandler
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 8.5
>
> Attachments: test-documents.7z, 
> tika-integration-example-9.0.0-SNAPSHOT.tgz
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We just released 1.23.  Let's upgrade Tika.






[jira] [Commented] (SOLR-14054) Upgrade Tika to 1.23

2019-12-18 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16999151#comment-16999151
 ] 

Tim Allison commented on SOLR-14054:


[~dweiss], you are right.  I only found this issue when I ran all of Tika's 
unit test docs against the upgraded Solr.  I think users would be surprised to 
get a ClassNotFoundException when they send a csv file to DIH.  I can add unit 
tests for more file format coverage (including csv) or we can configure Tika to 
use only the TXTParser in Solr.  Let me know your preference.

> Upgrade Tika to 1.23
> 
>
> Key: SOLR-14054
> URL: https://issues.apache.org/jira/browse/SOLR-14054
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: contrib - DataImportHandler
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 8.5
>
> Attachments: test-documents.7z, 
> tika-integration-example-9.0.0-SNAPSHOT.tgz
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We just released 1.23.  Let's upgrade Tika.






[jira] [Commented] (SOLR-14054) Upgrade Tika to 1.23

2019-12-18 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16999034#comment-16999034
 ] 

Tim Allison commented on SOLR-14054:


[~hossman]...ugh.  Worked locally.  Will take a look.  Sorry and thank you.

> Upgrade Tika to 1.23
> 
>
> Key: SOLR-14054
> URL: https://issues.apache.org/jira/browse/SOLR-14054
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: contrib - DataImportHandler
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 8.5
>
> Attachments: test-documents.7z, 
> tika-integration-example-9.0.0-SNAPSHOT.tgz
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We just released 1.23.  Let's upgrade Tika.






[jira] [Commented] (SOLR-14054) Upgrade Tika to 1.23

2019-12-18 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16999033#comment-16999033
 ] 

Tim Allison commented on SOLR-14054:


[~dweiss], commons-csv is used in Tika's TextAndCSVParser, which is new 
since 1.19.1.

> Upgrade Tika to 1.23
> 
>
> Key: SOLR-14054
> URL: https://issues.apache.org/jira/browse/SOLR-14054
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: contrib - DataImportHandler
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 8.5
>
> Attachments: test-documents.7z, 
> tika-integration-example-9.0.0-SNAPSHOT.tgz
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We just released 1.23.  Let's upgrade Tika.






[jira] [Commented] (SOLR-14054) Upgrade Tika to 1.23

2019-12-17 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16998583#comment-16998583
 ] 

Tim Allison commented on SOLR-14054:


Let me know if I botched anything.  I _think_ we're good to go.

> Upgrade Tika to 1.23
> 
>
> Key: SOLR-14054
> URL: https://issues.apache.org/jira/browse/SOLR-14054
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: contrib - DataImportHandler
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Attachments: test-documents.7z, 
> tika-integration-example-9.0.0-SNAPSHOT.tgz
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We just released 1.23.  Let's upgrade Tika.






[jira] [Resolved] (SOLR-14054) Upgrade Tika to 1.23

2019-12-17 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved SOLR-14054.

Fix Version/s: 8.5
   Resolution: Fixed

Please reopen if I've broken anything.

> Upgrade Tika to 1.23
> 
>
> Key: SOLR-14054
> URL: https://issues.apache.org/jira/browse/SOLR-14054
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: contrib - DataImportHandler
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 8.5
>
> Attachments: test-documents.7z, 
> tika-integration-example-9.0.0-SNAPSHOT.tgz
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We just released 1.23.  Let's upgrade Tika.






[jira] [Commented] (SOLR-14054) Upgrade Tika to 1.23

2019-12-16 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16997435#comment-16997435
 ] 

Tim Allison commented on SOLR-14054:


Thank you, [~krisden]!!!  I'll take a look. 

Unrelated to commons-compress, I may have found the source of the PDFontType1 
problem I was seeing: PDFBOX-4715.  IIUC, we need to fix this in PDFBox and 
Tika so that we can safely build both with JDK > 8.

> Upgrade Tika to 1.23
> 
>
> Key: SOLR-14054
> URL: https://issues.apache.org/jira/browse/SOLR-14054
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: contrib - DataImportHandler
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Attachments: test-documents.7z, 
> tika-integration-example-9.0.0-SNAPSHOT.tgz
>
>
> We just released 1.23.  Let's upgrade Tika.






[jira] [Commented] (SOLR-14054) Upgrade Tika to 1.23

2019-12-13 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16995989#comment-16995989
 ] 

Tim Allison commented on SOLR-14054:


Seeing weird reproducibility issues...ugh.  Will pick up again on Monday.

> Upgrade Tika to 1.23
> 
>
> Key: SOLR-14054
> URL: https://issues.apache.org/jira/browse/SOLR-14054
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: contrib - DataImportHandler
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Attachments: test-documents.7z, 
> tika-integration-example-9.0.0-SNAPSHOT.tgz
>
>
> We just released 1.23.  Let's upgrade Tika.






[jira] [Commented] (SOLR-14054) Upgrade Tika to 1.23

2019-12-13 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16995951#comment-16995951
 ] 

Tim Allison commented on SOLR-14054:


Thank you, Robert!  If we can get confirmation that I'm not doing something 
stupid -- that this really is a problem -- I'll open a new ticket.  I need to 
do some more investigation.

> Upgrade Tika to 1.23
> 
>
> Key: SOLR-14054
> URL: https://issues.apache.org/jira/browse/SOLR-14054
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: contrib - DataImportHandler
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Attachments: test-documents.7z, 
> tika-integration-example-9.0.0-SNAPSHOT.tgz
>
>
> We just released 1.23.  Let's upgrade Tika.






[jira] [Comment Edited] (SOLR-14054) Upgrade Tika to 1.23

2019-12-13 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16995800#comment-16995800
 ] 

Tim Allison edited comment on SOLR-14054 at 12/13/19 6:20 PM:
--

# clone master or my personal issue branch: 
https://github.com/tballison/lucene-solr/tree/jira/SOLR-14054 (tukaani issue 
happens in both).
# cd solr ... ant package
# unzip the shiny new Solr
# put the attached collection conf where it belongs
# start solr 
# {{curl 
'http://localhost:8983/solr/tika-integration-example/update/extract?literal.id=doc1=true'
 -F "myfile=@test-documents.7z"}}


was (Author: talli...@mitre.org):
# checkout https://github.com/tballison/lucene-solr/tree/jira/SOLR-14054
# cd solr ... ant package
# unzip the shiny new Solr
# put the attached collection conf where it belongs
# start solr 
# {{curl 
'http://localhost:8983/solr/tika-integration-example/update/extract?literal.id=doc1=true'
 -F "myfile=@test-documents.7z"}}

> Upgrade Tika to 1.23
> 
>
> Key: SOLR-14054
> URL: https://issues.apache.org/jira/browse/SOLR-14054
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: contrib - DataImportHandler
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Attachments: test-documents.7z, 
> tika-integration-example-9.0.0-SNAPSHOT.tgz
>
>
> We just released 1.23.  Let's upgrade Tika.






[jira] [Commented] (SOLR-14054) Upgrade Tika to 1.23

2019-12-13 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16995800#comment-16995800
 ] 

Tim Allison commented on SOLR-14054:


# checkout https://github.com/tballison/lucene-solr/tree/jira/SOLR-14054
# cd solr ... ant package
# unzip the shiny new Solr
# put the attached collection conf where it belongs
# start solr 
# {{curl 'http://localhost:8983/solr/tika-integration-example/update/extract?literal.id=doc1&commit=true' -F "myfile=@test-documents.7z"}}

> Upgrade Tika to 1.23
> 
>
> Key: SOLR-14054
> URL: https://issues.apache.org/jira/browse/SOLR-14054
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: contrib - DataImportHandler
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Attachments: test-documents.7z, 
> tika-integration-example-9.0.0-SNAPSHOT.tgz
>
>
> We just released 1.23.  Let's upgrade Tika.






[jira] [Updated] (SOLR-14054) Upgrade Tika to 1.23

2019-12-13 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated SOLR-14054:
---
Attachment: tika-integration-example-9.0.0-SNAPSHOT.tgz

> Upgrade Tika to 1.23
> 
>
> Key: SOLR-14054
> URL: https://issues.apache.org/jira/browse/SOLR-14054
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: contrib - DataImportHandler
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Attachments: test-documents.7z, 
> tika-integration-example-9.0.0-SNAPSHOT.tgz
>
>
> We just released 1.23.  Let's upgrade Tika.






[jira] [Updated] (SOLR-14054) Upgrade Tika to 1.23

2019-12-13 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated SOLR-14054:
---
Attachment: test-documents.7z

> Upgrade Tika to 1.23
> 
>
> Key: SOLR-14054
> URL: https://issues.apache.org/jira/browse/SOLR-14054
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: contrib - DataImportHandler
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Attachments: test-documents.7z
>
>
> We just released 1.23.  Let's upgrade Tika.






[jira] [Commented] (SOLR-14054) Upgrade Tika to 1.23

2019-12-13 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995749#comment-16995749
 ] 

Tim Allison commented on SOLR-14054:


[~tilman]...please ignore...PDFBox issues appear to be spurious/user error.

[~krisden] will send reproduction steps shortly.  Thank you!


> Upgrade Tika to 1.23
> 
>
> Key: SOLR-14054
> URL: https://issues.apache.org/jira/browse/SOLR-14054
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: contrib - DataImportHandler
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
>
> We just released 1.23.  Let's upgrade Tika.






[jira] [Commented] (SOLR-14054) Upgrade Tika to 1.23

2019-12-13 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995724#comment-16995724
 ] 

Tim Allison commented on SOLR-14054:


> but /contrib/extraction/lib might not be in the core classloader?

Makes sense...I'm not able to replicate this problem in unit tests.  Do you 
know if classloading works differently in unit tests?
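
To make the question concrete, here is a minimal sketch of how class visibility depends on the loader hierarchy. It does not reproduce Solr's actual loader setup: {{LoaderVisibilityDemo}} and its methods are hypothetical names, and the two {{URLClassLoader}} instances only stand in for "a loader that delegates to the application loader" versus "a loader that does not". Unit tests typically run on one flat classpath, so a difference of this kind may simply never show up there.

```java
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical demo class; not part of Solr or Tika.
public class LoaderVisibilityDemo {

    // True if `loader` can resolve `className` (the class is not initialized).
    static boolean canSee(ClassLoader loader, String className) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // A child loader delegates to its parent, so it sees the parent's classes.
    static boolean childSeesParentClasses() throws Exception {
        ClassLoader parent = LoaderVisibilityDemo.class.getClassLoader();
        try (URLClassLoader child = new URLClassLoader(new URL[0], parent)) {
            return canSee(child, LoaderVisibilityDemo.class.getName());
        }
    }

    // A loader whose parent is the bootstrap loader (null) cannot see
    // classes that only the application loader knows about.
    static boolean isolatedSeesAppClasses() throws Exception {
        try (URLClassLoader isolated = new URLClassLoader(new URL[0], null)) {
            return canSee(isolated, LoaderVisibilityDemo.class.getName());
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("child sees parent's class: " + childSeesParentClasses());
        System.out.println("isolated sees app class:   " + isolatedSeesAppClasses());
    }
}
```

Visibility only runs from child to parent. A class loaded by a parent loader cannot link against a jar that was added only to a child loader, even though the jar is "loaded" as far as the logs are concerned.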

> Upgrade Tika to 1.23
> 
>
> Key: SOLR-14054
> URL: https://issues.apache.org/jira/browse/SOLR-14054
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: contrib - DataImportHandler
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
>
> We just released 1.23.  Let's upgrade Tika.






[jira] [Commented] (SOLR-14054) Upgrade Tika to 1.23

2019-12-13 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995683#comment-16995683
 ] 

Tim Allison commented on SOLR-14054:


I can replicate this reliably on Ubuntu 19.10, but I'm not seeing this issue on 
Mojave 10.14.6.

> Upgrade Tika to 1.23
> 
>
> Key: SOLR-14054
> URL: https://issues.apache.org/jira/browse/SOLR-14054
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: contrib - DataImportHandler
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
>
> We just released 1.23.  Let's upgrade Tika.






[jira] [Commented] (SOLR-14054) Upgrade Tika to 1.23

2019-12-13 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995675#comment-16995675
 ] 

Tim Allison commented on SOLR-14054:


[~tilman], I'm still getting this issue with PDFBox 2.0.17 when packaged in 
Solr.  Is this more likely to be a Solr issue or a PDFBox issue?

> Upgrade Tika to 1.23
> 
>
> Key: SOLR-14054
> URL: https://issues.apache.org/jira/browse/SOLR-14054
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: contrib - DataImportHandler
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
>
> We just released 1.23.  Let's upgrade Tika.






[jira] [Comment Edited] (SOLR-14054) Upgrade Tika to 1.23

2019-12-13 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995159#comment-16995159
 ] 

Tim Allison edited comment on SOLR-14054 at 12/13/19 2:36 PM:
--

I'm seeing similar behavior in Solr at least back to 8.3.1 and with other 
classes, e.g.:

{noformat}
java.lang.NoClassDefFoundError: Could not initialize class org.apache.pdfbox.pdmodel.font.PDType1Font
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:62) ~[?:?]
    at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146) ~[?:?]
    at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60) ~[?:?]
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848) ~[?:?]
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503) ~[?:?]
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477) ~[?:?]
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150) ~[?:?]
    at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139) ~[?:?]
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391) ~[?:?]
    at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147) ~[?:?]
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319) ~[?:?]
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266) ~[?:?]
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117) ~[?:?]
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172) ~[?:?]
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143) ~[?:?]
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228) ~[?:?]
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68) ~[?:?]
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:198) ~[?:?]
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2576) ~[?:?]
    at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:799) ~[?:?]
{noformat}
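
For context on the message itself: "Could not initialize class" generally means the class's static initializer already failed once earlier in the same JVM, and that first failure (an {{ExceptionInInitializerError}}) carries the real cause. A minimal sketch of this two-step behaviour, using a hypothetical {{Fragile}} class that has nothing to do with PDFBox:

```java
// Hypothetical demo; the class names are illustrative only.
public class InitFailureDemo {

    // Stand-in for a class whose static initializer fails.
    static class Fragile {
        // Not a compile-time constant, so reading it triggers class initialization.
        static final int VALUE = compute();

        static int compute() {
            throw new IllegalStateException("simulated failure in static initializer");
        }
    }

    // Touches Fragile and reports which Throwable type the JVM raised.
    static String touch() {
        try {
            return String.valueOf(Fragile.VALUE);
        } catch (Throwable t) {
            return t.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        // First access: the initializer runs and fails (ExceptionInInitializerError).
        System.out.println(touch());
        // Later accesses: the class is marked unusable (NoClassDefFoundError).
        System.out.println(touch());
    }
}
```

If the same pattern applies here, the log for the first PDF parsed after startup may contain an earlier {{ExceptionInInitializerError}} for {{PDType1Font}} with the underlying cause.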



[jira] [Commented] (SOLR-14054) Upgrade Tika to 1.23

2019-12-13 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995645#comment-16995645
 ] 

Tim Allison commented on SOLR-14054:


https://github.com/curationexperts/epigaea/issues/748 :P

> Upgrade Tika to 1.23
> 
>
> Key: SOLR-14054
> URL: https://issues.apache.org/jira/browse/SOLR-14054
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: contrib - DataImportHandler
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
>
> We just released 1.23.  Let's upgrade Tika.






[jira] [Comment Edited] (SOLR-14054) Upgrade Tika to 1.23

2019-12-13 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995634#comment-16995634
 ] 

Tim Allison edited comment on SOLR-14054 at 12/13/19 1:57 PM:
--

Interesting.  Thank you!  I'm reliably getting the PDType1Font and 
FontMapperImpl$DefaultFontProvider class loading issue back to 8.0.0.  How has 
this not been reported?! 


was (Author: talli...@mitre.org):
Interesting.  Thank you!  I'm reliably getting the PDType1Font class loading 
issue back to 8.0.0.  How has this not been reported?!  Will try different 
versions of Java.

> Upgrade Tika to 1.23
> 
>
> Key: SOLR-14054
> URL: https://issues.apache.org/jira/browse/SOLR-14054
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: contrib - DataImportHandler
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
>
> We just released 1.23.  Let's upgrade Tika.






[jira] [Commented] (SOLR-14054) Upgrade Tika to 1.23

2019-12-13 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995634#comment-16995634
 ] 

Tim Allison commented on SOLR-14054:


Interesting.  Thank you!  I'm reliably getting the PDType1Font class loading 
issue back to 8.0.0.  How has this not been reported?!  Will try different 
versions of Java.

> Upgrade Tika to 1.23
> 
>
> Key: SOLR-14054
> URL: https://issues.apache.org/jira/browse/SOLR-14054
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: contrib - DataImportHandler
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
>
> We just released 1.23.  Let's upgrade Tika.






[jira] [Commented] (SOLR-14054) Upgrade Tika to 1.23

2019-12-12 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995159#comment-16995159
 ] 

Tim Allison commented on SOLR-14054:


I'm seeing similar behavior in Solr at least back to 8.3.1 but with different 
classes, e.g.:

{noformat}
java.lang.NoClassDefFoundError: Could not initialize class org.apache.pdfbox.pdmodel.font.PDType1Font
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:62) ~[?:?]
    at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146) ~[?:?]
    at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60) ~[?:?]
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848) ~[?:?]
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503) ~[?:?]
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477) ~[?:?]
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150) ~[?:?]
    at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139) ~[?:?]
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391) ~[?:?]
    at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147) ~[?:?]
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319) ~[?:?]
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266) ~[?:?]
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117) ~[?:?]
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172) ~[?:?]
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143) ~[?:?]
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228) ~[?:?]
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68) ~[?:?]
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:198) ~[?:?]
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2576) ~[?:?]
    at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:799) ~[?:?]
{noformat}


> Upgrade Tika to 1.23
> 
>
> Key: SOLR-14054
> URL: https://issues.apache.org/jira/browse/SOLR-14054
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: contrib - DataImportHandler
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
>
> We just released 1.23.  Let's upgrade Tika.






[jira] [Commented] (SOLR-14054) Upgrade Tika to 1.23

2019-12-12 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994975#comment-16994975
 ] 

Tim Allison commented on SOLR-14054:


I can reproduce this in master without this patch.

> Upgrade Tika to 1.23
> 
>
> Key: SOLR-14054
> URL: https://issues.apache.org/jira/browse/SOLR-14054
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: contrib - DataImportHandler
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
>
> We just released 1.23.  Let's upgrade Tika.






[jira] [Comment Edited] (SOLR-14054) Upgrade Tika to 1.23

2019-12-12 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994927#comment-16994927
 ] 

Tim Allison edited comment on SOLR-14054 at 12/12/19 6:20 PM:
--

The tests all pass, and I can get a successful build locally; however, when I 
try a full integration test (package, unzip, deploy), I'm getting a 
NoClassDefFoundError:

{noformat}
Exception in thread "Thread-15" java.lang.NoClassDefFoundError: org/tukaani/xz/FilterOptions
    at org.apache.commons.compress.archivers.sevenz.Coders.<clinit>(Coders.java:47)
    at org.apache.commons.compress.archivers.sevenz.SevenZFile.buildDecoderStack(SevenZFile.java:1153)
    at org.apache.commons.compress.archivers.sevenz.SevenZFile.buildDecodingStream(SevenZFile.java:1106)
    at org.apache.commons.compress.archivers.sevenz.SevenZFile.getNextEntry(SevenZFile.java:405)
    at org.apache.tika.parser.pkg.PackageParser$SevenZWrapper.getNextEntry(PackageParser.java:424)
    at org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:285)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:165)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:267)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:517)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:233)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:424)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)
    at org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:466)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.ClassNotFoundException: org.tukaani.xz.FilterOptions
{noformat}

I can tell from the logs that the jars in contrib/extraction/lib are loading:
{noformat}
Added 44 libs to classloader, from paths: 
[/home/tim/work/solr-9.0.0-SNAPSHOT/contrib/extraction/lib, 
/home/tim/work/solr-9.0.0-SNAPSHOT/dist]
{noformat}

The xz.jar is where it belongs and it is the right version, and when I unzip 
that jar, the class is there.
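
A jar that contains the class does not by itself show that the loader doing the lookup can see that jar. A minimal sketch for checking both sides programmatically, assuming a hypothetical helper class {{JarClassCheck}} (not part of Solr, Tika, or commons-compress); the jar path and class name are supplied by the caller:

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.CodeSource;
import java.util.jar.JarFile;

// Hypothetical helper; names are illustrative only.
public class JarClassCheck {

    // True if the jar has an entry for the given binary class name,
    // e.g. "org.tukaani.xz.FilterOptions".
    static boolean jarContains(Path jarPath, String className) throws IOException {
        String entry = className.replace('.', '/') + ".class";
        try (JarFile jar = new JarFile(jarPath.toFile())) {
            return jar.getEntry(entry) != null;
        }
    }

    // Location a loaded class came from, or a placeholder when the JVM
    // reports no code source (typical for JDK classes).
    static String loadedFrom(Class<?> c) {
        CodeSource src = c.getProtectionDomain().getCodeSource();
        if (src == null || src.getLocation() == null) {
            return "(no code source)";
        }
        return src.getLocation().toString();
    }

    public static void main(String[] args) throws IOException {
        // Usage: java JarClassCheck <path-to-jar> <binary.class.Name>
        if (args.length == 2) {
            System.out.println(jarContains(Paths.get(args[0]), args[1]));
        }
        System.out.println(loadedFrom(JarClassCheck.class));
    }
}
```

{{jarContains}} mirrors the manual unzip check. {{loadedFrom}} reports where a class that did load came from, which can help show which jar, and therefore which loader's path, served the class that triggered the failing lookup.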

Any idea what might be going on?

Code here: https://github.com/tballison/lucene-solr/tree/jira/SOLR-14054



[jira] [Comment Edited] (SOLR-14054) Upgrade Tika to 1.23

2019-12-12 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994927#comment-16994927
 ] 

Tim Allison edited comment on SOLR-14054 at 12/12/19 6:07 PM:
--

The tests all pass, and I can get a successful build locally; however, when I 
try a full integration test (package, unzip, deploy), I'm getting a 
NoClassDefFoundError:

{noformat}
Exception in thread "Thread-15" java.lang.NoClassDefFoundError: org/tukaani/xz/FilterOptions
    at org.apache.commons.compress.archivers.sevenz.Coders.<clinit>(Coders.java:47)
    at org.apache.commons.compress.archivers.sevenz.SevenZFile.buildDecoderStack(SevenZFile.java:1153)
    at org.apache.commons.compress.archivers.sevenz.SevenZFile.buildDecodingStream(SevenZFile.java:1106)
    at org.apache.commons.compress.archivers.sevenz.SevenZFile.getNextEntry(SevenZFile.java:405)
    at org.apache.tika.parser.pkg.PackageParser$SevenZWrapper.getNextEntry(PackageParser.java:424)
    at org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:285)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:165)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:267)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:517)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:233)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:424)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)
    at org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:466)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.ClassNotFoundException: org.tukaani.xz.FilterOptions
{noformat}

I can tell from the logs that the jars in contrib/extraction/lib are loading:
{noformat}
Added 44 libs to classloader, from paths: 
[/home/tim/work/solr-9.0.0-SNAPSHOT/contrib/extraction/lib, 
/home/tim/work/solr-9.0.0-SNAPSHOT/dist]
{noformat}

The xz.jar is where it belongs and it is the right version, and when I unzip 
that jar, the class is there.

Any idea what might be going on?



[jira] [Commented] (SOLR-14054) Upgrade Tika to 1.23

2019-12-12 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994927#comment-16994927
 ] 

Tim Allison commented on SOLR-14054:


The tests all pass, and I can get a successful build locally; however, when I 
try a full integration test, I'm getting a NoClassDefFoundError:

{noformat}
Exception in thread "Thread-15" java.lang.NoClassDefFoundError: org/tukaani/xz/FilterOptions
    at org.apache.commons.compress.archivers.sevenz.Coders.<clinit>(Coders.java:47)
    at org.apache.commons.compress.archivers.sevenz.SevenZFile.buildDecoderStack(SevenZFile.java:1153)
    at org.apache.commons.compress.archivers.sevenz.SevenZFile.buildDecodingStream(SevenZFile.java:1106)
    at org.apache.commons.compress.archivers.sevenz.SevenZFile.getNextEntry(SevenZFile.java:405)
    at org.apache.tika.parser.pkg.PackageParser$SevenZWrapper.getNextEntry(PackageParser.java:424)
    at org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:285)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:165)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:267)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:517)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:233)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:424)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)
    at org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:466)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.ClassNotFoundException: org.tukaani.xz.FilterOptions
{noformat}

I can tell from the logs that the jars in contrib/extraction/lib are loading:
{noformat}
Added 44 libs to classloader, from paths: 
[/home/tim/work/solr-9.0.0-SNAPSHOT/contrib/extraction/lib, 
/home/tim/work/solr-9.0.0-SNAPSHOT/dist]
{noformat}

Any idea what might be going on?


> Upgrade Tika to 1.23
> 
>
> Key: SOLR-14054
> URL: https://issues.apache.org/jira/browse/SOLR-14054
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: contrib - DataImportHandler
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
>
> We just released 1.23.  Let's upgrade Tika.






[jira] [Commented] (SOLR-14054) Upgrade Tika to 1.23

2019-12-12 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994711#comment-16994711
 ] 

Tim Allison commented on SOLR-14054:


This shouldn't be a problem, but I noticed that we can't bump guava to 28.1-jre:

{noformat}
java.lang.NoClassDefFoundError com/google/common/util/concurrent/internal/InternalFutureFailureAccess
   [junit4]>at __randomizedtesting.SeedInfo.seed([EC9FF1FD80627747:E1D4DE448383E382]:0)
   [junit4]>at com.google.common.cache.LocalCache$LoadingValueReference.<init>(LocalCache.java:3472)
   [junit4]>at com.google.common.cache.LocalCache$LoadingValueReference.<init>(LocalCache.java:3476)
   [junit4]>at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2134)
   [junit4]>at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2045)
   [junit4]>at com.google.common.cache.LocalCache.get(LocalCache.java:3953)
   [junit4]>at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4873)
   [junit4]>at org.apache.solr.schema.AbstractSpatialFieldType.getStrategy(AbstractSpatialFieldType.java:430)
   [junit4]>at org.apache.solr.schema.AbstractSpatialFieldType.createFields(AbstractSpatialFieldType.java:236)
   [junit4]>at org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:65)
   [junit4]>at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:171)
   [junit4]>at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:109)
   [junit4]>at org.apache.solr.update.DirectUpdateHandler2.updateDocOrDocValues(DirectUpdateHandler2.java:969)
   [junit4]>at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:339)
   [junit4]>at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:286)
   [junit4]>at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:233)
   [junit4]>at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:76)
   [junit4]>at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
   [junit4]>at org.apache.solr.update.processor.NestedUpdateProcessorFactory$NestedUpdateProcessor.processAdd(NestedUpdateProcessorFactory.java:79)
   [junit4]>at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
   [junit4]>at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:259)
   [junit4]>at org.apache.solr.update.processor.DistributedUpdateProcessor.doVersionAdd(DistributedUpdateProcessor.java:489)
   [junit4]>at org.apache.solr.update.processor.DistributedUpdateProcessor.lambda$versionAdd$0(DistributedUpdateProcessor.java:339)
   [junit4]>at org.apache.solr.update.VersionBucket.runWithLock(VersionBucket.java:50)
   [junit4]>at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:339)
   [junit4]>at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:225)
   [junit4]>at org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:103)
   [junit4]>at org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:110)
   [junit4]>at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:332)
   [junit4]>at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readIterator(JavaBinUpdateRequestCodec.java:281)
   [junit4]>at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:338)
   [junit4]>at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:283)
   [junit4]>at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readNamedList(JavaBinUpdateRequestCodec.java:236)
   [junit4]>at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:303)
   [junit4]>at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:283)
   [junit4]>at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:196)
   [junit4]>at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:127)
   [junit4]>at 

[jira] [Commented] (SOLR-14066) Deprecate DIH

2019-12-12 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16994706#comment-16994706
 ] 

Tim Allison commented on SOLR-14066:


{quote}Then be “DIH is no longer part of Apache but thrives in its new home at 
github...”
{quote}
 

Is the notion that we'd break different components of DIH into different 
personal repos on GitHub, like [https://github.com/dadoonet/fscrawler]?

I worry about moving critical code to personal repos, even though it can be 
forked/maintained by others.  By "critical" I mean that, while I appreciate 
and completely agree with Jan's point that it should be "demo only", it is in 
fact used across the land in production. :(

 

I'm very much in favor of moving Tika, at least, out of Solr...but to where?

Smaller, less pressing question: does this mean green fields (start fresh) for 
https://issues.apache.org/jira/browse/SOLR-7632?

In short, rather than implementing SOLR-7632, should we start a side project 
that uses tika-server as the default?

 

> Deprecate DIH
> -
>
> Key: SOLR-14066
> URL: https://issues.apache.org/jira/browse/SOLR-14066
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: contrib - DataImportHandler
> Reporter: Ishan Chattopadhyaya
> Priority: Major
> Fix For: 8.4
>
>
> DataImportHandler has outlived its utility. DIH doesn't need to remain inside 
> Solr anymore. Let us deprecate DIH in 8.4 (and remove it from the Solr distro 
> in 9x or 10x).






[jira] [Commented] (SOLR-7632) Change the ExtractingRequestHandler to use Tika-Server

2019-12-11 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16993649#comment-16993649
 ] 

Tim Allison commented on SOLR-7632:
---

All, what's the current thinking on this?

Is there a model for plugins I can follow?  How do we test with external 
dependencies (e.g. a running tika-server)?

How far do we want to extricate Tika from Solr?  The farther the better, IMHO. 
:D

> Change the ExtractingRequestHandler to use Tika-Server
> --
>
> Key: SOLR-7632
> URL: https://issues.apache.org/jira/browse/SOLR-7632
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - Solr Cell (Tika extraction)
> Reporter: Chris A. Mattmann
> Priority: Major
>  Labels: gsoc2017, memex
>
> It's a pain to upgrade Tika's jars every time we release, and if Tika fails 
> it messes up the ExtractingRequestHandler (e.g., when a particular document 
> type causes Tika to fail). A more reliable, better separated, and 
> easier-to-deploy version of the ExtractingRequestHandler would make a network 
> call to the Tika JAXRS server, and then call Tika on the Solr server side, get 
> the results, and index the information that way. I have a patch in the works 
> from the DARPA Memex project and I hope to post it soon.
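
The network call described above could look roughly like the following. This is only a sketch, not the patch mentioned in the issue: it assumes a tika-server instance is already running at http://localhost:9998 (tika-server's default port) and uses its PUT /tika endpoint to get plain text back. The class and method names are hypothetical, and the Java 11 HttpClient is used purely for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch: send a file to a running tika-server and read back
// the extracted text, instead of parsing in-process with Tika's jars.
public class TikaServerExtractSketch {

    // Builds the PUT request for tika-server's /tika endpoint.
    static HttpRequest extractRequest(URI tikaBase, HttpRequest.BodyPublisher body) {
        return HttpRequest.newBuilder(tikaBase.resolve("/tika"))
                .header("Accept", "text/plain") // ask for plain text output
                .PUT(body)
                .build();
    }

    // Sends the file and returns the extracted text, or throws if the
    // server does not answer with HTTP 200.
    static String extractText(HttpClient client, URI tikaBase, Path file)
            throws IOException, InterruptedException {
        HttpRequest request = extractRequest(tikaBase, HttpRequest.BodyPublishers.ofFile(file));
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("tika-server returned HTTP " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Assumption: tika-server is listening on its default port on this host.
        URI tikaBase = URI.create("http://localhost:9998");
        String text = extractText(HttpClient.newHttpClient(), tikaBase, Paths.get(args[0]));
        System.out.println(text);
    }
}
```

With this shape, a parser crash or hang affects only the tika-server process, and the calling side sees an HTTP error or timeout that it can handle without Tika's jars on its own classpath.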






[jira] [Created] (SOLR-14054) Upgrade Tika to 1.23

2019-12-11 Thread Tim Allison (Jira)
Tim Allison created SOLR-14054:
--

 Summary: Upgrade Tika to 1.23
 Key: SOLR-14054
 URL: https://issues.apache.org/jira/browse/SOLR-14054
 Project: Solr
  Issue Type: Task
  Security Level: Public (Default Security Level. Issues are Public)
  Components: contrib - DataImportHandler
Reporter: Tim Allison
Assignee: Tim Allison


We just released 1.23.  Let's upgrade Tika.


