[jira] [Commented] (SOLR-12423) Upgrade to Tika 1.19.1 when available
[ https://issues.apache.org/jira/browse/SOLR-12423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16664112#comment-16664112 ]

Tim Allison commented on SOLR-12423:

Thank you [~ctargett]!

> Upgrade to Tika 1.19.1 when available
> -------------------------------------
> Key: SOLR-12423
> URL: https://issues.apache.org/jira/browse/SOLR-12423
> Project: Solr
> Issue Type: Task
> Security Level: Public (Default Security Level. Issues are Public)
> Reporter: Tim Allison
> Assignee: Erick Erickson
> Priority: Major
> Fix For: 7.6, master (8.0)
> Attachments: SOLR-12423.patch
> Time Spent: 50m
> Remaining Estimate: 0h
>
> In Tika 1.19, there will be the ability to call the ForkParser and specify a
> directory of jars from which to load the classes for the Parser in the child
> processes. This will allow us to remove all of the parser dependencies from
> Solr. We’ll still need tika-core, of course, but we could drop tika-app.jar
> in the child process’ bin directory and be done with the upgrade... no more
> fiddly dependency upgrades and the threat of jar hell.
> The ForkParser also protects against OOMs, infinite loops, and JVM crashes.
> W00t!
> This issue covers the basic upgrade to 1.19.1. For the migration to the
> ForkParser, see SOLR-11721.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
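The child-process isolation described in the issue — parser classes loaded in a forked JVM from a separate directory of jars — can be sketched with plain ProcessBuilder plumbing. This is a minimal illustration, not Solr's or Tika's actual wiring: the `tika-bin` directory name and heap flag are hypothetical, and although `org.apache.tika.fork.ForkServer` is tika-core's fork-package child entry point, treat the exact invocation as an assumption.

```java
import java.util.ArrayList;
import java.util.List;

public class ForkCommand {

    // Build the command line for a child JVM whose classpath is a directory
    // of parser jars (e.g. a dropped-in tika-app.jar) kept off the parent's
    // own classpath, so parser dependencies never leak into Solr.
    public static List<String> childCommand(String jarDir, String mainClass) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-Xmx512m");      // cap the child's heap: a parser OOM kills the child, not Solr
        cmd.add("-cp");
        cmd.add(jarDir + "/*");   // wildcard classpath over the jar directory
        cmd.add(mainClass);
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = childCommand("tika-bin", "org.apache.tika.fork.ForkServer");
        System.out.println(String.join(" ", cmd));
        // A real caller would launch it with new ProcessBuilder(cmd).start()
        // and restart the child on crash, timeout (infinite loop), or OOM.
    }
}
```

Because the parent only talks to the child over pipes, a hung or crashed parser costs a child restart rather than a Solr outage, which is the appeal of the ForkParser approach.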
[jira] [Commented] (SOLR-12423) Upgrade to Tika 1.19.1 when available
[ https://issues.apache.org/jira/browse/SOLR-12423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16653918#comment-16653918 ]

Tim Allison commented on SOLR-12423:

W00t! Thank you, [~erickerickson]!
[jira] [Commented] (SOLR-12423) Upgrade to Tika 1.19.1 when available
[ https://issues.apache.org/jira/browse/SOLR-12423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16653612#comment-16653612 ]

Tim Allison commented on SOLR-12423:

Y, I know...bummed I couldn't attend this year. No rush on my part. Thank you!
[jira] [Commented] (SOLR-12423) Upgrade to Tika 1.19.1 when available
[ https://issues.apache.org/jira/browse/SOLR-12423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16653585#comment-16653585 ]

Tim Allison commented on SOLR-12423:

Would a Solr committer be willing to help with this? Tika 1.19.1 fixes ~8 OOM/infinite-loop vulnerabilities: https://tika.apache.org/security.html
[jira] [Commented] (SOLR-12423) Upgrade to Tika 1.19.1 when available
[ https://issues.apache.org/jira/browse/SOLR-12423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16646796#comment-16646796 ]

Tim Allison commented on SOLR-12423:

I tested PR#468 against the ~650 unit test docs within Tika's project, and found no surprises.
[jira] [Updated] (SOLR-12423) Upgrade to Tika 1.19.1 when available
[ https://issues.apache.org/jira/browse/SOLR-12423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated SOLR-12423:
---
Description: In Tika 1.19, there will be the ability to call the ForkParser and specify a directory of jars from which to load the classes for the Parser in the child processes. This will allow us to remove all of the parser dependencies from Solr. We’ll still need tika-core, of course, but we could drop tika-app.jar in the child process’ bin directory and be done with the upgrade... no more fiddly dependency upgrades and threat of jar hell. The ForkParser also protects against OOMs, infinite loops, and JVM crashes. W00t! This issue covers the basic upgrade to 1.19.1. For the migration to the ForkParser, see SOLR-11721.

was: the same description without the closing sentences that scope this issue to the 1.19.1 upgrade and point to SOLR-11721 for the ForkParser migration.
[jira] [Updated] (SOLR-12423) Upgrade to Tika 1.19.1 when available
[ https://issues.apache.org/jira/browse/SOLR-12423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated SOLR-12423:
---
Summary: Upgrade to Tika 1.19.1 when available (was: Upgrade to Tika 1.19.1 when available and refactor to use the ForkParser)
[jira] [Resolved] (SOLR-12034) Replace TokenizerChain in Solr with Lucene's CustomAnalyzer
[ https://issues.apache.org/jira/browse/SOLR-12034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved SOLR-12034.
Resolution: Won't Fix

I can't see a way to implement this without wrecking the API for CustomAnalyzer's Builder(). Please re-open if there's a clean way to do this.

> Replace TokenizerChain in Solr with Lucene's CustomAnalyzer
> Key: SOLR-12034
> URL: https://issues.apache.org/jira/browse/SOLR-12034
> Project: Solr
> Issue Type: Task
> Security Level: Public (Default Security Level. Issues are Public)
> Reporter: Tim Allison
> Assignee: David Smiley
> Priority: Minor
> Time Spent: 1h 50m
> Remaining Estimate: 0h
>
> Solr's TokenizerChain was created before Lucene's CustomAnalyzer was added,
> and it duplicates much of CustomAnalyzer. Let's consider refactoring to
> remove TokenizerChain.
[jira] [Commented] (SOLR-12423) Upgrade to Tika 1.19.1 when available and refactor to use the ForkParser
[ https://issues.apache.org/jira/browse/SOLR-12423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16633961#comment-16633961 ]

Tim Allison commented on SOLR-12423:

Tika 1.19 fixed a number of vulnerabilities (https://tika.apache.org/security.html), but it has some issues. We should wait for 1.19.1. We'll be rolling rc2 as soon as PDFBox 2.0.12 is available, and the voting for PDFBox 2.0.12 should start today.
[jira] [Updated] (SOLR-12423) Upgrade to Tika 1.19.1 when available and refactor to use the ForkParser
[ https://issues.apache.org/jira/browse/SOLR-12423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated SOLR-12423:
---
Summary: Upgrade to Tika 1.19.1 when available and refactor to use the ForkParser (was: Upgrade to Tika 1.19 when available and refactor to use the ForkParser)
[jira] [Commented] (SOLR-12551) Upgrade to Tika 1.18
[ https://issues.apache.org/jira/browse/SOLR-12551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16543764#comment-16543764 ]

Tim Allison commented on SOLR-12551:

Yes. Why, yes I do. Thank you!

> Upgrade to Tika 1.18
> Key: SOLR-12551
> URL: https://issues.apache.org/jira/browse/SOLR-12551
> Project: Solr
> Issue Type: Task
> Security Level: Public (Default Security Level. Issues are Public)
> Reporter: Tim Allison
> Priority: Minor
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Until 1.19 is ready (SOLR-12423), let's upgrade to 1.18.
[jira] [Commented] (SOLR-12551) Upgrade to Tika 1.18
[ https://issues.apache.org/jira/browse/SOLR-12551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16543732#comment-16543732 ]

Tim Allison commented on SOLR-12551:

I did the full integration tests with this against all of Tika's test files (with ucar files removed).
[jira] [Created] (SOLR-12551) Upgrade to Tika 1.18
Tim Allison created SOLR-12551:
---
Summary: Upgrade to Tika 1.18
Key: SOLR-12551
URL: https://issues.apache.org/jira/browse/SOLR-12551
Project: Solr
Issue Type: Task
Security Level: Public (Default Security Level. Issues are Public)
Reporter: Tim Allison

Until 1.19 is ready (SOLR-12423), let's upgrade to 1.18.
[jira] [Updated] (SOLR-12423) Upgrade to Tika 1.19 when available and refactor to use the ForkParser
[ https://issues.apache.org/jira/browse/SOLR-12423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated SOLR-12423:
---
Environment: (was: in Tika 1.19)
[jira] [Updated] (SOLR-12422) Update Ref Guide to recommend against using the ExtractingRequestHandler in production
[ https://issues.apache.org/jira/browse/SOLR-12422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated SOLR-12422:
---
Description: [~elyograg] recently updated the wiki to include the hard-learned guidance that the ExtractingRequestHandler should not be used in production. [~ctargett] recommended updating the reference guide instead. Let’s update the ref guide.

...note to self...don't open issue on tiny screen...sorry for the clutter...
[jira] [Updated] (SOLR-12422) Update Ref Guide to recommend against using the ExtractingRequestHandler in production
[ https://issues.apache.org/jira/browse/SOLR-12422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated SOLR-12422:
---
Environment: (was: Shawn Heisey recently updated the wiki to include the hard-learned guidance that the ExtractingRequestHandler should not be used in production. Cassandra Targett recommended updating the reference guide instead. Let’s update the ref guide.)
[jira] [Created] (SOLR-12423) Upgrade to Tika 1.19 when available and refactor to use the ForkParser
Tim Allison created SOLR-12423:
---
Summary: Upgrade to Tika 1.19 when available and refactor to use the ForkParser
Key: SOLR-12423
URL: https://issues.apache.org/jira/browse/SOLR-12423
Project: Solr
Issue Type: Task
Security Level: Public (Default Security Level. Issues are Public)
Environment: in Tika 1.19
Reporter: Tim Allison

In Tika 1.19, there will be the ability to call the ForkParser and specify a directory of jars from which to load the classes for the Parser in the child processes. This will allow us to remove all of the parser dependencies from Solr. We’ll still need tika-core, of course, but we could drop tika-app.jar in the child process’ bin directory and be done with the upgrade... no more fiddly dependency upgrades and threat of jar hell.

The ForkParser also protects against OOMs, infinite loops, and JVM crashes. W00t!
[jira] [Created] (SOLR-12422) Update Ref Guide to recommend against using the ExtractingRequestHandler in production
Tim Allison created SOLR-12422:
---
Summary: Update Ref Guide to recommend against using the ExtractingRequestHandler in production
Key: SOLR-12422
URL: https://issues.apache.org/jira/browse/SOLR-12422
Project: Solr
Issue Type: Task
Security Level: Public (Default Security Level. Issues are Public)
Environment: Shawn Heisey recently updated the wiki to include the hard-learned guidance that the ExtractingRequestHandler should not be used in production. Cassandra Targett recommended updating the reference guide instead. Let’s update the ref guide.
Reporter: Tim Allison
[jira] [Commented] (SOLR-11976) TokenizerChain is overwriting, not chaining TokenFilters in normalize()
[ https://issues.apache.org/jira/browse/SOLR-11976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391705#comment-16391705 ]

Tim Allison commented on SOLR-11976:

Done.

> TokenizerChain is overwriting, not chaining TokenFilters in normalize()
> Key: SOLR-11976
> URL: https://issues.apache.org/jira/browse/SOLR-11976
> Project: Solr
> Issue Type: Bug
> Security Level: Public (Default Security Level. Issues are Public)
> Components: search
> Affects Versions: master (8.0)
> Reporter: Tim Allison
> Assignee: David Smiley
> Priority: Minor
> Time Spent: 10m
> Remaining Estimate: 0h
>
> TokenizerChain is overwriting, not chaining token filters in {{normalize}}.
> This doesn't currently break search because {{normalize}} is not being used
> at the Solr level (AFAICT); rather, TextField has its own
> {{analyzeMultiTerm()}} that duplicates code from the newer {{normalize}}.
> Code as is:
> {noformat}
> TokenStream result = in;
> for (TokenFilterFactory filter : filters) {
>   if (filter instanceof MultiTermAwareComponent) {
>     filter = (TokenFilterFactory) ((MultiTermAwareComponent) filter).getMultiTermComponent();
>     result = filter.create(in);
>   }
> }
> {noformat}
> The fix is simple:
> {noformat}
> -result = filter.create(in);
> +result = filter.create(result);
> {noformat}
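The overwrite-vs-chain bug in SOLR-11976 can be reproduced without any Lucene classes. In this minimal sketch (hypothetical names; a `UnaryOperator<String>` stands in for a token filter), the buggy loop applies every filter to the original input, so only the last filter's effect survives:

```java
import java.util.List;
import java.util.function.UnaryOperator;

public class ChainDemo {

    // Buggy shape: each filter is applied to the original input `in`,
    // so earlier filters' work is thrown away on every iteration.
    public static String buggy(String in, List<UnaryOperator<String>> filters) {
        String result = in;
        for (UnaryOperator<String> f : filters) {
            result = f.apply(in);      // overwrites instead of chaining
        }
        return result;
    }

    // Fixed shape: each filter wraps the previous result.
    public static String fixed(String in, List<UnaryOperator<String>> filters) {
        String result = in;
        for (UnaryOperator<String> f : filters) {
            result = f.apply(result);  // chains
        }
        return result;
    }

    public static void main(String[] args) {
        List<UnaryOperator<String>> filters =
                List.of(s -> s.toLowerCase(), s -> s.replace(" ", "_"));
        System.out.println(buggy("Hello World", filters)); // Hello_World (lowercasing lost)
        System.out.println(fixed("Hello World", filters)); // hello_world
    }
}
```

With the filters [lowercase, space-to-underscore], the buggy version returns "Hello_World" while the fixed version returns "hello_world"; the one-line fix in the issue, `result = filter.create(result)`, corresponds to the `f.apply(result)` change here.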
[jira] [Commented] (SOLR-11976) TokenizerChain is overwriting, not chaining TokenFilters in normalize()
[ https://issues.apache.org/jira/browse/SOLR-11976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16390212#comment-16390212 ]

Tim Allison commented on SOLR-11976:

This issue will be moot after SOLR-12034 is in place. The other issues (linked from SOLR-12034) are relevant but not blockers on this nor blocked by this. So, until SOLR-12034 is in place, this is valid and should be ready for 7.3 (although the PR is against master, of course).
[jira] [Created] (LUCENE-8193) Deprecate LowercaseTokenizer
Tim Allison created LUCENE-8193:
---
Summary: Deprecate LowercaseTokenizer
Key: LUCENE-8193
URL: https://issues.apache.org/jira/browse/LUCENE-8193
Project: Lucene - Core
Issue Type: Task
Components: modules/analysis
Reporter: Tim Allison

On LUCENE-8186, discussion favored deprecating and eventually removing LowercaseTokenizer.
[jira] [Comment Edited] (LUCENE-8186) CustomAnalyzer with a LowerCaseTokenizerFactory fails to normalize multiterms
[ https://issues.apache.org/jira/browse/LUCENE-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386042#comment-16386042 ]

Tim Allison edited comment on LUCENE-8186 at 3/5/18 1:19 PM:

[~thetaphi], it works because multiterms are normalized in {{TextField}}'s {{analyzeMultiTerm}}: https://github.com/tballison/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/schema/TextField.java#L168 , which uses the full analyzer including the tokenizer. AFAICT, {{TokenizerChain}}'s {{normalize()}} is never actually called at the moment, which, I'm guessing, is why no one found SOLR-11976 until I did when refactoring my code for SOLR-5410. :)

was (Author: talli...@mitre.org): the same comment, ending "...until I did in my code for SOLR-5410. :)"

> CustomAnalyzer with a LowerCaseTokenizerFactory fails to normalize multiterms
> Key: LUCENE-8186
> URL: https://issues.apache.org/jira/browse/LUCENE-8186
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Tim Allison
> Priority: Minor
> Attachments: LUCENE-8186.patch
>
> While working on SOLR-12034, a unit test that relied on the
> LowerCaseTokenizerFactory failed. After some digging, I was able to
> replicate this at the Lucene level.
> Unit test:
> {noformat}
> @Test
> public void testLCTokenizerFactoryNormalize() throws Exception {
>   Analyzer analyzer = CustomAnalyzer.builder().withTokenizer(LowerCaseTokenizerFactory.class).build();
>   //fails
>   assertEquals(new BytesRef("hello"), analyzer.normalize("f", "Hello"));
>
>   //now try an integration test with the classic query parser
>   QueryParser p = new QueryParser("f", analyzer);
>   Query q = p.parse("Hello");
>   //passes
>   assertEquals(new TermQuery(new Term("f", "hello")), q);
>   q = p.parse("Hello*");
>   //fails
>   assertEquals(new PrefixQuery(new Term("f", "hello")), q);
>   q = p.parse("Hel*o");
>   //fails
>   assertEquals(new WildcardQuery(new Term("f", "hel*o")), q);
> }
> {noformat}
> The problem is that the CustomAnalyzer iterates through the tokenfilters,
> but does not call the tokenizer, which, in the case of the
> LowerCaseTokenizer, does the filtering work.
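The failure mode in LUCENE-8186 — normalization that walks the token filters but skips a tokenizer that itself does the filtering work — can be mocked in a few lines. This is a hedged sketch with hypothetical names: a `UnaryOperator<String>` stands in for LowerCaseTokenizer's per-character lowercasing, and the filter chain is empty, as it is in the failing `CustomAnalyzer.builder().withTokenizer(LowerCaseTokenizerFactory.class).build()` case:

```java
import java.util.List;
import java.util.function.UnaryOperator;

public class NormalizeDemo {

    // Mock analyzer: the *tokenizer* lowercases (like LowerCaseTokenizer)
    // and there are no separate token filters.
    static final UnaryOperator<String> LOWERCASING_TOKENIZER = s -> s.toLowerCase();
    static final List<UnaryOperator<String>> FILTERS = List.of(); // empty chain

    // normalize() that only runs the filter chain, skipping the tokenizer —
    // the shape of the reported bug.
    public static String normalizeFiltersOnly(String term) {
        String result = term;
        for (UnaryOperator<String> f : FILTERS) result = f.apply(result);
        return result;
    }

    // normalize() that also applies the tokenizer's character-level work.
    public static String normalizeWithTokenizer(String term) {
        String result = LOWERCASING_TOKENIZER.apply(term);
        for (UnaryOperator<String> f : FILTERS) result = f.apply(result);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(normalizeFiltersOnly("Hello"));   // Hello — unchanged
        System.out.println(normalizeWithTokenizer("Hello")); // hello
    }
}
```

That unchanged "Hello" is why the wildcard assertions in the quoted unit test fail: the prefix and wildcard queries are built from the un-lowercased term and never match the lowercased index.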
[jira] [Comment Edited] (LUCENE-8186) CustomAnalyzer with a LowerCaseTokenizerFactory fails to normalize multiterms
[ https://issues.apache.org/jira/browse/LUCENE-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386042#comment-16386042 ]

Tim Allison edited comment on LUCENE-8186 at 3/5/18 1:18 PM:

[~thetaphi], it works because multiterms are normalized in {{TextField}}'s {{analyzeMultiTerm}}: https://github.com/tballison/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/schema/TextField.java#L168 , which uses the full analyzer including the tokenizer. AFAICT, {{TokenizerChain}}'s {{normalize()}} is never actually called at the moment, which, I'm guessing, is why no one found SOLR-11976 until I did in my code for SOLR-5410. :)

was (Author: talli...@mitre.org): the same comment, ending "...until I did in my custom code. :)"
[jira] [Comment Edited] (LUCENE-8186) CustomAnalyzer with a LowerCaseTokenizerFactory fails to normalize multiterms
[ https://issues.apache.org/jira/browse/LUCENE-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386042#comment-16386042 ] Tim Allison edited comment on LUCENE-8186 at 3/5/18 1:05 PM: - [~thetaphi], it works because multiterms are normalized in {{TextField}}'s {{analyzeMultiTerm}}: https://github.com/tballison/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/schema/TextField.java#L168 , which uses the full analyzer including the tokenizer. AFAICT, {{TokenizerChain}}'s {{normalize()}} is never actually called at the moment, which, I'm guessing, is why no one found SOLR-11976 until I did in my custom code. :) was (Author: talli...@mitre.org): [~thetaphi], it works because multiterms are normalized in {{TextField}}'s {{analyzeMultiTerm}}: https://github.com/tballison/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/schema/TextField.java#L168 , which uses the full analyzer including the tokenizer. AFAICT, {{TokenizerChain}}'s {{normalize()}} is never actually called at the moment, which, I'm guessing, is why no one found SOLR-11976 until I did in my custom code. > CustomAnalyzer with a LowerCaseTokenizerFactory fails to normalize multiterms > -- > > Key: LUCENE-8186 > URL: https://issues.apache.org/jira/browse/LUCENE-8186 > Project: Lucene - Core > Issue Type: Bug >Reporter: Tim Allison >Priority: Minor > Attachments: LUCENE-8186.patch > > > While working on SOLR-12034, a unit test that relied on the > LowerCaseTokenizerFactory failed. > After some digging, I was able to replicate this at the Lucene level. 
> Unit test: > {noformat} > @Test > public void testLCTokenizerFactoryNormalize() throws Exception { > Analyzer analyzer = > CustomAnalyzer.builder().withTokenizer(LowerCaseTokenizerFactory.class).build(); > //fails > assertEquals(new BytesRef("hello"), analyzer.normalize("f", "Hello")); > > //now try an integration test with the classic query parser > QueryParser p = new QueryParser("f", analyzer); > Query q = p.parse("Hello"); > //passes > assertEquals(new TermQuery(new Term("f", "hello")), q); > q = p.parse("Hello*"); > //fails > assertEquals(new PrefixQuery(new Term("f", "hello")), q); > q = p.parse("Hel*o"); > //fails > assertEquals(new WildcardQuery(new Term("f", "hel*o")), q); > } > {noformat} > The problem is that the CustomAnalyzer iterates through the tokenfilters, but > does not call the tokenizer, which, in the case of the LowerCaseTokenizer, > does the filtering work. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8186) CustomAnalyzer with a LowerCaseTokenizerFactory fails to normalize multiterms
[ https://issues.apache.org/jira/browse/LUCENE-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386042#comment-16386042 ] Tim Allison commented on LUCENE-8186: - [~thetaphi], it works because multiterms are normalized in {{TextField}}'s {{analyzeMultiTerm}}: https://github.com/tballison/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/schema/TextField.java#L168 , which uses the full analyzer including the tokenizer. AFAICT, {{TokenizerChain}}'s {{normalize()}} is never actually called at the moment, which, I'm guessing, is why no one found SOLR-11976 until I did in my custom code. > CustomAnalyzer with a LowerCaseTokenizerFactory fails to normalize multiterms > -- > > Key: LUCENE-8186 > URL: https://issues.apache.org/jira/browse/LUCENE-8186 > Project: Lucene - Core > Issue Type: Bug >Reporter: Tim Allison >Priority: Minor > Attachments: LUCENE-8186.patch > > > While working on SOLR-12034, a unit test that relied on the > LowerCaseTokenizerFactory failed. > After some digging, I was able to replicate this at the Lucene level. > Unit test: > {noformat} > @Test > public void testLCTokenizerFactoryNormalize() throws Exception { > Analyzer analyzer = > CustomAnalyzer.builder().withTokenizer(LowerCaseTokenizerFactory.class).build(); > //fails > assertEquals(new BytesRef("hello"), analyzer.normalize("f", "Hello")); > > //now try an integration test with the classic query parser > QueryParser p = new QueryParser("f", analyzer); > Query q = p.parse("Hello"); > //passes > assertEquals(new TermQuery(new Term("f", "hello")), q); > q = p.parse("Hello*"); > //fails > assertEquals(new PrefixQuery(new Term("f", "hello")), q); > q = p.parse("Hel*o"); > //fails > assertEquals(new WildcardQuery(new Term("f", "hel*o")), q); > } > {noformat} > The problem is that the CustomAnalyzer iterates through the tokenfilters, but > does not call the tokenizer, which, in the case of the LowerCaseTokenizer, > does the filtering work. 
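[Editorial note] The LUCENE-8186 failure mode above can be sketched without Lucene on the classpath: plain string functions stand in for the tokenizer and token filters, so the names below are illustrative, not Lucene's actual API. With a CustomAnalyzer built from only a LowerCaseTokenizerFactory there are no token filters at all, so a normalize() that walks just the filter list returns the term unchanged, while one that also applies the tokenizer's own normalization lowercases it.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Sketch of the LUCENE-8186 bug: normalization that only consults the token
// filters misses work done by the tokenizer itself (LowerCaseTokenizer both
// splits and lowercases). These are stand-ins, not Lucene classes.
public class NormalizeDemo {

    // Buggy shape: only the token filters participate in normalization.
    static String filtersOnly(String term, List<UnaryOperator<String>> filters) {
        String result = term;
        for (UnaryOperator<String> f : filters) {
            result = f.apply(result);
        }
        return result;
    }

    // Fixed shape: the tokenizer's own normalization (here, lowercasing, as
    // LowerCaseTokenizer would do) runs first, then the filters chain on top.
    static String withTokenizer(String term,
                                UnaryOperator<String> tokenizerNorm,
                                List<UnaryOperator<String>> filters) {
        return filtersOnly(tokenizerNorm.apply(term), filters);
    }

    public static void main(String[] args) {
        List<UnaryOperator<String>> noFilters = List.of();
        // Mirrors the failing assertEquals(new BytesRef("hello"), ...) above:
        System.out.println(filtersOnly("Hello", noFilters));
        System.out.println(withTokenizer("Hello", s -> s.toLowerCase(), noFilters));
    }
}
```

This is also why the wildcard and prefix cases in the unit test fail while the plain TermQuery case passes: the full analyzer (tokenizer included) runs for ordinary terms, but multiterm normalization takes the filters-only path.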
[jira] [Commented] (SOLR-12048) Cannot index formatted mail
[ https://issues.apache.org/jira/browse/SOLR-12048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382640#comment-16382640 ] Tim Allison commented on SOLR-12048: Sorry...didn't realize MailEntityProcessor is not using Tika for the main body processing...looking through MEP now... > Cannot index formatted mail > --- > > Key: SOLR-12048 > URL: https://issues.apache.org/jira/browse/SOLR-12048 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 7.1 >Reporter: Dimitris >Priority: Major > Attachments: index_no_content.txt, index_success.txt > > > Using /example/example-DIH/solr/mail/ configuration, a gmail mailbox has been > indexed. Nevertheless, only plain text mails are indexed. Formatted content > is not indexed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (SOLR-12048) Cannot index formatted mail
[ https://issues.apache.org/jira/browse/SOLR-12048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382624#comment-16382624 ] Tim Allison edited comment on SOLR-12048 at 3/1/18 8:55 PM: Or probably u...@tika.apache.org :) +1 to closing this issue and moving the discussion to the Solr user list. In Tika <=1.17, these alternate bodies were treated as attachments, and we've fixed this for 1.18. Make sure to change {{processAttachement}} to true if you haven't! from {{mail-data-config.xml}} {noformat} {noformat} was (Author: talli...@mitre.org): Or probably u...@tika.apache.org :) In Tika <=1.17, these alternate bodies were treated as attachments, and we've fixed this for 1.18. Make sure to change {{processAttachement}} to true if you haven't! from {{mail-data-config.xml}} {noformat} {noformat} > Cannot index formatted mail > --- > > Key: SOLR-12048 > URL: https://issues.apache.org/jira/browse/SOLR-12048 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 7.1 >Reporter: Dimitris >Priority: Major > Attachments: index_no_content.txt, index_success.txt > > > Using /example/example-DIH/solr/mail/ configuration, a gmail mailbox has been > indexed. Nevertheless, only plain text mails are indexed. Formatted content > is not indexed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-12048) Cannot index formatted mail
[ https://issues.apache.org/jira/browse/SOLR-12048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382624#comment-16382624 ] Tim Allison commented on SOLR-12048: Or probably u...@tika.apache.org :) In Tika <=1.17, these alternate bodies were treated as attachments, and we've fixed this for 1.18. Make sure to change {{processAttachement}} to true if you haven't! from {{mail-data-config.xml}} {noformat} {noformat} > Cannot index formatted mail > --- > > Key: SOLR-12048 > URL: https://issues.apache.org/jira/browse/SOLR-12048 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 7.1 >Reporter: Dimitris >Priority: Major > Attachments: index_no_content.txt, index_success.txt > > > Using /example/example-DIH/solr/mail/ configuration, a gmail mailbox has been > indexed. Nevertheless, only plain text mails are indexed. Formatted content > is not indexed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
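[Editorial note] The {noformat} block in the comment above arrived empty in the archive, so the actual XML was lost. The following is a hedged reconstruction of what the relevant entity in {{mail-data-config.xml}} might look like; apart from {{processAttachement}} (the attribute named in the comment, spelled as MailEntityProcessor spells it), the attribute names and values are taken from the stock example-DIH mail configuration and may differ by Solr version.

{noformat}
<dataConfig>
  <document>
    <!-- illustrative values; replace user/password/host with your own -->
    <entity processor="MailEntityProcessor"
            user="someone@gmail.com"
            password="..."
            host="imap.gmail.com"
            protocol="imaps"
            processAttachement="true"/>
  </document>
</dataConfig>
{noformat}

With {{processAttachement}} left at false on Tika <= 1.17, the HTML alternate body of a formatted mail is treated as an attachment and skipped, which matches the "formatted content is not indexed" symptom reported in this issue.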
[jira] [Updated] (SOLR-12035) ExtendedDismaxQParser fails to include charfilters in nostopanalyzer
[ https://issues.apache.org/jira/browse/SOLR-12035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated SOLR-12035: --- Affects Version/s: master (8.0) > ExtendedDismaxQParser fails to include charfilters in nostopanalyzer > > > Key: SOLR-12035 > URL: https://issues.apache.org/jira/browse/SOLR-12035 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: query parsers >Affects Versions: master (8.0) >Reporter: Tim Allison >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > In some circumstances, the ExtendedDismaxQParser tries to remove stop filters > from the TokenizerChain. When building the new analyzer without the stop > filters, the charfilters from the original TokenizerChain are not copied over. > The fix is trivial. > {noformat} > - TokenizerChain newa = new TokenizerChain(tcq.getTokenizerFactory(), > newtf); > + TokenizerChain newa = new TokenizerChain(tcq.getCharFilterFactories(), > tcq.getTokenizerFactory(), newtf); > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-12035) ExtendedDismaxQParser fails to include charfilters in nostopanalyzer
[ https://issues.apache.org/jira/browse/SOLR-12035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated SOLR-12035: --- Component/s: query parsers > ExtendedDismaxQParser fails to include charfilters in nostopanalyzer > > > Key: SOLR-12035 > URL: https://issues.apache.org/jira/browse/SOLR-12035 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: query parsers >Affects Versions: master (8.0) >Reporter: Tim Allison >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > In some circumstances, the ExtendedDismaxQParser tries to remove stop filters > from the TokenizerChain. When building the new analyzer without the stop > filters, the charfilters from the original TokenizerChain are not copied over. > The fix is trivial. > {noformat} > - TokenizerChain newa = new TokenizerChain(tcq.getTokenizerFactory(), > newtf); > + TokenizerChain newa = new TokenizerChain(tcq.getCharFilterFactories(), > tcq.getTokenizerFactory(), newtf); > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-12035) ExtendedDismaxQParser fails to include charfilters in nostopanalyzer
Tim Allison created SOLR-12035: -- Summary: ExtendedDismaxQParser fails to include charfilters in nostopanalyzer Key: SOLR-12035 URL: https://issues.apache.org/jira/browse/SOLR-12035 Project: Solr Issue Type: Bug Security Level: Public (Default Security Level. Issues are Public) Reporter: Tim Allison In some circumstances, the ExtendedDismaxQParser tries to remove stop filters from the TokenizerChain. When building the new analyzer without the stop filters, the charfilters from the original TokenizerChain are not copied over. The fix is trivial. {noformat} - TokenizerChain newa = new TokenizerChain(tcq.getTokenizerFactory(), newtf); + TokenizerChain newa = new TokenizerChain(tcq.getCharFilterFactories(), tcq.getTokenizerFactory(), newtf); {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
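[Editorial note] The one-line diff above is easy to gloss over, so here is a minimal stand-in sketch of why the three-argument rebuild matters. These classes are not Solr's real TokenizerChain API; a chain is modeled as a flat array of component names purely to show that the two-argument style silently drops the char filters.

```java
import java.util.Arrays;

// Sketch of the SOLR-12035 fix: when rebuilding an analyzer chain without its
// stop filters, the char filters from the old chain must be carried over.
public class RebuildDemo {

    // Buggy rebuild (two-argument style): char filters silently dropped.
    static String[] rebuildBuggy(String[] charFilters, String tokenizer, String[] newFilters) {
        return concat(new String[0], tokenizer, newFilters);
    }

    // Fixed rebuild (three-argument style): char filters carried over.
    static String[] rebuildFixed(String[] charFilters, String tokenizer, String[] newFilters) {
        return concat(charFilters, tokenizer, newFilters);
    }

    private static String[] concat(String[] cfs, String tok, String[] tfs) {
        String[] out = new String[cfs.length + 1 + tfs.length];
        System.arraycopy(cfs, 0, out, 0, cfs.length);
        out[cfs.length] = tok;
        System.arraycopy(tfs, 0, out, cfs.length + 1, tfs.length);
        return out;
    }

    public static void main(String[] args) {
        String[] cfs = {"htmlStrip"};       // char filter on the original chain
        String[] tfs = {"lowercase"};       // token filters with stop filter removed
        System.out.println(Arrays.toString(rebuildBuggy(cfs, "standard", tfs)));
        System.out.println(Arrays.toString(rebuildFixed(cfs, "standard", tfs)));
    }
}
```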
[jira] [Updated] (LUCENE-8186) CustomAnalyzer with a LowerCaseTokenizerFactory fails to normalize multiterms
[ https://issues.apache.org/jira/browse/LUCENE-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated LUCENE-8186: Description: While working on SOLR-12034, a unit test that relied on the LowerCaseTokenizerFactory failed. After some digging, I was able to replicate this at the Lucene level. Unit test: {noformat} @Test public void testLCTokenizerFactoryNormalize() throws Exception { Analyzer analyzer = CustomAnalyzer.builder().withTokenizer(LowerCaseTokenizerFactory.class).build(); //fails assertEquals(new BytesRef("hello"), analyzer.normalize("f", "Hello")); //now try an integration test with the classic query parser QueryParser p = new QueryParser("f", analyzer); Query q = p.parse("Hello"); //passes assertEquals(new TermQuery(new Term("f", "hello")), q); q = p.parse("Hello*"); //fails assertEquals(new PrefixQuery(new Term("f", "hello")), q); q = p.parse("Hel*o"); //fails assertEquals(new WildcardQuery(new Term("f", "hel*o")), q); } {noformat} The problem is that the CustomAnalyzer iterates through the tokenfilters, but does not call the tokenizer, which, in the case of the LowerCaseTokenizer, does the filtering work. was: While working on SOLR-12034, a unit test that relied on the LowerCaseTokenizerFactory failed. After some digging, I was able to replicate this at the Lucene level. 
Unit test: {noformat} @Test public void testLCTokenizerFactoryNormalize() throws Exception { Analyzer analyzer = CustomAnalyzer.builder().withTokenizer(new LowerCaseTokenizerFactory(Collections.EMPTY_MAP)).build(); //fails assertEquals(new BytesRef("hello"), analyzer.normalize("f", "Hello")); //now try an integration test with the classic query parser QueryParser p = new QueryParser("f", analyzer); Query q = p.parse("Hello"); //passes assertEquals(new TermQuery(new Term("f", "hello")), q); q = p.parse("Hello*"); //fails assertEquals(new PrefixQuery(new Term("f", "hello")), q); q = p.parse("Hel*o"); //fails assertEquals(new WildcardQuery(new Term("f", "hel*o")), q); } {noformat} The problem is that the CustomAnalyzer iterates through the tokenfilters, but does not call the tokenizer, which, in the case of the LowerCaseTokenizer, does the filtering work. > CustomAnalyzer with a LowerCaseTokenizerFactory fails to normalize multiterms > -- > > Key: LUCENE-8186 > URL: https://issues.apache.org/jira/browse/LUCENE-8186 > Project: Lucene - Core > Issue Type: Bug >Reporter: Tim Allison >Priority: Minor > > While working on SOLR-12034, a unit test that relied on the > LowerCaseTokenizerFactory failed. > After some digging, I was able to replicate this at the Lucene level. 
> Unit test: > {noformat} > @Test > public void testLCTokenizerFactoryNormalize() throws Exception { > Analyzer analyzer = > CustomAnalyzer.builder().withTokenizer(LowerCaseTokenizerFactory.class).build(); > //fails > assertEquals(new BytesRef("hello"), analyzer.normalize("f", "Hello")); > > //now try an integration test with the classic query parser > QueryParser p = new QueryParser("f", analyzer); > Query q = p.parse("Hello"); > //passes > assertEquals(new TermQuery(new Term("f", "hello")), q); > q = p.parse("Hello*"); > //fails > assertEquals(new PrefixQuery(new Term("f", "hello")), q); > q = p.parse("Hel*o"); > //fails > assertEquals(new WildcardQuery(new Term("f", "hel*o")), q); > } > {noformat} > The problem is that the CustomAnalyzer iterates through the tokenfilters, but > does not call the tokenizer, which, in the case of the LowerCaseTokenizer, > does the filtering work. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-8186) CustomAnalyzer with a LowerCaseTokenizerFactory fails to normalize multiterms
[ https://issues.apache.org/jira/browse/LUCENE-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated LUCENE-8186: Description: While working on SOLR-12034, a unit test that relied on the LowerCaseTokenizerFactory failed. After some digging, I was able to replicate this at the Lucene level. Unit test: {noformat} @Test public void testLCTokenizerFactoryNormalize() throws Exception { Analyzer analyzer = CustomAnalyzer.builder().withTokenizer(new LowerCaseTokenizerFactory(Collections.EMPTY_MAP)).build(); //fails assertEquals(new BytesRef("hello"), analyzer.normalize("f", "Hello")); //now try an integration test with the classic query parser QueryParser p = new QueryParser("f", analyzer); Query q = p.parse("Hello"); //passes assertEquals(new TermQuery(new Term("f", "hello")), q); q = p.parse("Hello*"); //fails assertEquals(new PrefixQuery(new Term("f", "hello")), q); q = p.parse("Hel*o"); //fails assertEquals(new WildcardQuery(new Term("f", "hel*o")), q); } {noformat} The problem is that the CustomAnalyzer iterates through the tokenfilters, but does not call the tokenizer, which, in the case of the LowerCaseTokenizer, does the filtering work. was: While working on SOLR-12034, a unit test that relied on the LowerCaseTokenizerFactory failed. After some digging, I was able to replicate this at the Lucene level. 
Unit test: {noformat} @Test public void testLCTokenizerFactoryNormalize() throws Exception { Analyzer analyzer = CustomAnalyzer.builder().withTokenizer(new LowerCaseTokenizerFactory(Collections.EMPTY_MAP)).build(); //fails assertEquals(new BytesRef("hello"), analyzer.normalize("f", "Hello")); //now try an integration test with the classic query parser QueryParser p = new QueryParser("f", analyzer); Query q = p.parse("Hello"); //passes assertEquals(new TermQuery(new Term("f", "hello")), q); q = p.parse("Hello*"); //fails assertEquals(new PrefixQuery(new Term("f", "hello")), q); q = p.parse("Hel*o"); //fails assertEquals(new WildcardQuery(new Term("f", "hel*o")), q); } {noformat} The problem is that the CustomAnalyzer iterates through the tokenfilters, but does not call the tokenizer, which, in the case of the LowerCaseAnalyzer, does the filtering work. > CustomAnalyzer with a LowerCaseTokenizerFactory fails to normalize multiterms > -- > > Key: LUCENE-8186 > URL: https://issues.apache.org/jira/browse/LUCENE-8186 > Project: Lucene - Core > Issue Type: Bug >Reporter: Tim Allison >Priority: Minor > > While working on SOLR-12034, a unit test that relied on the > LowerCaseTokenizerFactory failed. > After some digging, I was able to replicate this at the Lucene level. 
> Unit test: > {noformat} > @Test > public void testLCTokenizerFactoryNormalize() throws Exception { > Analyzer analyzer = CustomAnalyzer.builder().withTokenizer(new > LowerCaseTokenizerFactory(Collections.EMPTY_MAP)).build(); > //fails > assertEquals(new BytesRef("hello"), analyzer.normalize("f", "Hello")); > > //now try an integration test with the classic query parser > QueryParser p = new QueryParser("f", analyzer); > Query q = p.parse("Hello"); > //passes > assertEquals(new TermQuery(new Term("f", "hello")), q); > q = p.parse("Hello*"); > //fails > assertEquals(new PrefixQuery(new Term("f", "hello")), q); > q = p.parse("Hel*o"); > //fails > assertEquals(new WildcardQuery(new Term("f", "hel*o")), q); > } > {noformat} > The problem is that the CustomAnalyzer iterates through the tokenfilters, but > does not call the tokenizer, which, in the case of the LowerCaseTokenizer, > does the filtering work. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-8186) CustomAnalyzer with a LowerCaseTokenizerFactory fails to normalize multiterms
Tim Allison created LUCENE-8186: --- Summary: CustomAnalyzer with a LowerCaseTokenizerFactory fails to normalize multiterms Key: LUCENE-8186 URL: https://issues.apache.org/jira/browse/LUCENE-8186 Project: Lucene - Core Issue Type: Bug Reporter: Tim Allison While working on SOLR-12034, a unit test that relied on the LowerCaseTokenizerFactory failed. After some digging, I was able to replicate this at the Lucene level. Unit test: {noformat} @Test public void testLCTokenizerFactoryNormalize() throws Exception { Analyzer analyzer = CustomAnalyzer.builder().withTokenizer(new LowerCaseTokenizerFactory(Collections.EMPTY_MAP)).build(); //fails assertEquals(new BytesRef("hello"), analyzer.normalize("f", "Hello")); //now try an integration test with the classic query parser QueryParser p = new QueryParser("f", analyzer); Query q = p.parse("Hello"); //passes assertEquals(new TermQuery(new Term("f", "hello")), q); q = p.parse("Hello*"); //fails assertEquals(new PrefixQuery(new Term("f", "hello")), q); q = p.parse("Hel*o"); //fails assertEquals(new WildcardQuery(new Term("f", "hel*o")), q); } {noformat} The problem is that the CustomAnalyzer iterates through the tokenfilters, but does not call the tokenizer, which, in the case of the LowerCaseAnalyzer, does the filtering work. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11976) TokenizerChain is overwriting, not chaining TokenFilters in normalize()
[ https://issues.apache.org/jira/browse/SOLR-11976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16376975#comment-16376975 ] Tim Allison commented on SOLR-11976: Y, I started on it...uncovered at least one other bug...it is a pretty big undertaking. Opened SOLR-12034. > TokenizerChain is overwriting, not chaining TokenFilters in normalize() > --- > > Key: SOLR-11976 > URL: https://issues.apache.org/jira/browse/SOLR-11976 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: search >Affects Versions: master (8.0) >Reporter: Tim Allison >Assignee: David Smiley >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > TokenizerChain is overwriting, not chaining tokenfilters in {{normalize}}. > This doesn't currently break search because {{normalize}} is not being used > at the Solr level (AFAICT); rather, TextField has its own > {{analyzeMultiTerm()}} that duplicates code from the newer {{normalize}}. > Code as is: > {noformat} > TokenStream result = in; > for (TokenFilterFactory filter : filters) { > if (filter instanceof MultiTermAwareComponent) { > filter = (TokenFilterFactory) ((MultiTermAwareComponent) > filter).getMultiTermComponent(); > result = filter.create(in); > } > } > {noformat} > The fix is simple: > {noformat} > -result = filter.create(in); > +result = filter.create(result); > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-12034) Replace TokenizerChain in Solr with Lucene's CustomAnalyzer
Tim Allison created SOLR-12034: -- Summary: Replace TokenizerChain in Solr with Lucene's CustomAnalyzer Key: SOLR-12034 URL: https://issues.apache.org/jira/browse/SOLR-12034 Project: Solr Issue Type: Task Security Level: Public (Default Security Level. Issues are Public) Reporter: Tim Allison Solr's TokenizerChain was created before Lucene's CustomAnalyzer was added, and it duplicates much of CustomAnalyzer. Let's consider refactoring to remove TokenizerChain. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11976) TokenizerChain is overwriting, not chaining TokenFilters in normalize()
[ https://issues.apache.org/jira/browse/SOLR-11976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16374303#comment-16374303 ] Tim Allison commented on SOLR-11976: Thank you. I'll update the PR. Should we also get rid of the special handling of multiterm analysis in TextField? Or, separate issue? > TokenizerChain is overwriting, not chaining TokenFilters in normalize() > --- > > Key: SOLR-11976 > URL: https://issues.apache.org/jira/browse/SOLR-11976 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: search >Affects Versions: master (8.0) >Reporter: Tim Allison >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > TokenizerChain is overwriting, not chaining tokenfilters in {{normalize}}. > This doesn't currently break search because {{normalize}} is not being used > at the Solr level (AFAICT); rather, TextField has its own > {{analyzeMultiTerm()}} that duplicates code from the newer {{normalize}}. > Code as is: > {noformat} > TokenStream result = in; > for (TokenFilterFactory filter : filters) { > if (filter instanceof MultiTermAwareComponent) { > filter = (TokenFilterFactory) ((MultiTermAwareComponent) > filter).getMultiTermComponent(); > result = filter.create(in); > } > } > {noformat} > The fix is simple: > {noformat} > -result = filter.create(in); > +result = filter.create(result); > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11976) TokenizerChain is overwriting, not chaining TokenFilters in normalize()
[ https://issues.apache.org/jira/browse/SOLR-11976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372022#comment-16372022 ] Tim Allison commented on SOLR-11976: Ping...any committer interested in this or a larger PR to swap out {{TokenizerChain}} for {{CustomAnalyzer}}? > TokenizerChain is overwriting, not chaining TokenFilters in normalize() > --- > > Key: SOLR-11976 > URL: https://issues.apache.org/jira/browse/SOLR-11976 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: search >Affects Versions: master (8.0) >Reporter: Tim Allison >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > TokenizerChain is overwriting, not chaining tokenfilters in {{normalize}}. > This doesn't currently break search because {{normalize}} is not being used > at the Solr level (AFAICT); rather, TextField has its own > {{analyzeMultiTerm()}} that duplicates code from the newer {{normalize}}. > Code as is: > {noformat} > TokenStream result = in; > for (TokenFilterFactory filter : filters) { > if (filter instanceof MultiTermAwareComponent) { > filter = (TokenFilterFactory) ((MultiTermAwareComponent) > filter).getMultiTermComponent(); > result = filter.create(in); > } > } > {noformat} > The fix is simple: > {noformat} > -result = filter.create(in); > +result = filter.create(result); > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11976) TokenizerChain is overwriting, not chaining TokenFilters in normalize()
[ https://issues.apache.org/jira/browse/SOLR-11976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16364010#comment-16364010 ] Tim Allison commented on SOLR-11976: Better yet, swap out Solr's {{TokenizerChain}} for Lucene's {{CustomAnalyzer}} and deprecate {{TokenizerChain}} in 7.x? Happy to submit PR if a committer is willing to work with me on this. > TokenizerChain is overwriting, not chaining TokenFilters in normalize() > --- > > Key: SOLR-11976 > URL: https://issues.apache.org/jira/browse/SOLR-11976 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: search >Affects Versions: master (8.0) >Reporter: Tim Allison >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > TokenizerChain is overwriting, not chaining tokenfilters in {{normalize}}. > This doesn't currently break search because {{normalize}} is not being used > at the Solr level (AFAICT); rather, TextField has its own > {{analyzeMultiTerm()}} that duplicates code from the newer {{normalize}}. > Code as is: > {noformat} > TokenStream result = in; > for (TokenFilterFactory filter : filters) { > if (filter instanceof MultiTermAwareComponent) { > filter = (TokenFilterFactory) ((MultiTermAwareComponent) > filter).getMultiTermComponent(); > result = filter.create(in); > } > } > {noformat} > The fix is simple: > {noformat} > -result = filter.create(in); > +result = filter.create(result); > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-11976) TokenizerChain is overwriting, not chaining TokenFilters in normalize()
[ https://issues.apache.org/jira/browse/SOLR-11976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated SOLR-11976: --- Description: TokenizerChain is overwriting, not chaining tokenfilters in {{normalize}}. This doesn't currently break search because {{normalize}} is not being used at the Solr level (AFAICT); rather, TextField has its own {{analyzeMultiTerm()}} that duplicates code from the newer {{normalize}}. Code as is: {noformat} TokenStream result = in; for (TokenFilterFactory filter : filters) { if (filter instanceof MultiTermAwareComponent) { filter = (TokenFilterFactory) ((MultiTermAwareComponent) filter).getMultiTermComponent(); result = filter.create(in); } } {noformat} The fix is simple: {noformat} -result = filter.create(in); +result = filter.create(result); {noformat} was: TokenizerChain is overwriting, not chaining tokenfilters in {{normalize}}. This doesn't currently break search because {{normalize}} is not currently being used at the Solr level (AFAICT); rather, TextField has its own {{analyzeMultiTerm()}} that duplicates codes from the newer {{normalize}}. Code as is: {noformat} TokenStream result = in; for (TokenFilterFactory filter : filters) { if (filter instanceof MultiTermAwareComponent) { filter = (TokenFilterFactory) ((MultiTermAwareComponent) filter).getMultiTermComponent(); result = filter.create(in); } } {noformat} The fix is simple: {noformat} -result = filter.create(in); +result = filter.create(result); {noformat} > TokenizerChain is overwriting, not chaining TokenFilters in normalize() > --- > > Key: SOLR-11976 > URL: https://issues.apache.org/jira/browse/SOLR-11976 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: search >Affects Versions: master (8.0) >Reporter: Tim Allison >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > TokenizerChain is overwriting, not chaining tokenfilters in {{normalize}}. 
> This doesn't currently break search because {{normalize}} is not being used > at the Solr level (AFAICT); rather, TextField has its own > {{analyzeMultiTerm()}} that duplicates code from the newer {{normalize}}. > Code as is: > {noformat} > TokenStream result = in; > for (TokenFilterFactory filter : filters) { > if (filter instanceof MultiTermAwareComponent) { > filter = (TokenFilterFactory) ((MultiTermAwareComponent) > filter).getMultiTermComponent(); > result = filter.create(in); > } > } > {noformat} > The fix is simple: > {noformat} > -result = filter.create(in); > +result = filter.create(result); > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
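The overwrite-vs-chain difference is easy to see in a self-contained sketch. The helpers below (`applyOverwriting`, `applyChained`) are hypothetical stand-ins for `TokenFilterFactory#create`, not Lucene's actual classes; each "filter" just wraps a string so the effect of passing `in` versus `result` is visible:

```java
import java.util.List;
import java.util.function.UnaryOperator;

public class ChainDemo {

    // Buggy loop: every filter wraps the ORIGINAL input, so each iteration
    // discards the previous filter's output. Only the last filter survives.
    public static String applyOverwriting(String in, List<UnaryOperator<String>> filters) {
        String result = in;
        for (UnaryOperator<String> f : filters) {
            result = f.apply(in);      // mirrors: result = filter.create(in);
        }
        return result;
    }

    // Fixed loop: each filter wraps the PREVIOUS filter's output, building a chain.
    public static String applyChained(String in, List<UnaryOperator<String>> filters) {
        String result = in;
        for (UnaryOperator<String> f : filters) {
            result = f.apply(result);  // mirrors: result = filter.create(result);
        }
        return result;
    }

    public static void main(String[] args) {
        List<UnaryOperator<String>> filters =
                List.of(s -> "lower(" + s + ")", s -> "ascii(" + s + ")");
        System.out.println(applyOverwriting("in", filters)); // ascii(in): lower() is lost
        System.out.println(applyChained("in", filters));     // ascii(lower(in))
    }
}
```

With two filters configured, the buggy version applies only the last one, which is exactly why a multi-filter analysis chain silently loses normalization steps.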
[jira] [Updated] (SOLR-11976) TokenizerChain is overwriting, not chaining TokenFilters in normalize()
[ https://issues.apache.org/jira/browse/SOLR-11976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated SOLR-11976: --- Summary: TokenizerChain is overwriting, not chaining TokenFilters in normalize() (was: TokenizerChain is overwriting, not chaining in normalize()) > TokenizerChain is overwriting, not chaining TokenFilters in normalize() > --- > > Key: SOLR-11976 > URL: https://issues.apache.org/jira/browse/SOLR-11976 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: search >Affects Versions: master (8.0) >Reporter: Tim Allison >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > TokenizerChain is overwriting, not chaining tokenfilters in {{normalize}}. > This doesn't currently break search because {{normalize}} is not currently > being used at the Solr level (AFAICT); rather, TextField has its own > {{analyzeMultiTerm()}} that duplicates codes from the newer {{normalize}}. > Code as is: > {noformat} > TokenStream result = in; > for (TokenFilterFactory filter : filters) { > if (filter instanceof MultiTermAwareComponent) { > filter = (TokenFilterFactory) ((MultiTermAwareComponent) > filter).getMultiTermComponent(); > result = filter.create(in); > } > } > {noformat} > The fix is simple: > {noformat} > -result = filter.create(in); > +result = filter.create(result); > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11976) TokenizerChain is overwriting, not chaining in normalize()
[ https://issues.apache.org/jira/browse/SOLR-11976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361452#comment-16361452 ] Tim Allison commented on SOLR-11976: I'm happy to open a separate issue/PR to factor out {{TextField}}'s {{analyzeMultiTerm}} in favor of {{Analyzer#normalize()}}. > TokenizerChain is overwriting, not chaining in normalize() > -- > > Key: SOLR-11976 > URL: https://issues.apache.org/jira/browse/SOLR-11976 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: search >Affects Versions: master (8.0) >Reporter: Tim Allison >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > TokenizerChain is overwriting, not chaining tokenfilters in {{normalize}}. > This doesn't currently break search because {{normalize}} is not currently > being used at the Solr level (AFAICT); rather, TextField has its own > {{analyzeMultiTerm()}} that duplicates codes from the newer {{normalize}}. > Code as is: > {noformat} > TokenStream result = in; > for (TokenFilterFactory filter : filters) { > if (filter instanceof MultiTermAwareComponent) { > filter = (TokenFilterFactory) ((MultiTermAwareComponent) > filter).getMultiTermComponent(); > result = filter.create(in); > } > } > {noformat} > The fix is simple: > {noformat} > -result = filter.create(in); > +result = filter.create(result); > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-11976) TokenizerChain is overwriting, not chaining in normalize()
[ https://issues.apache.org/jira/browse/SOLR-11976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated SOLR-11976: --- Priority: Minor (was: Major) > TokenizerChain is overwriting, not chaining in normalize() > -- > > Key: SOLR-11976 > URL: https://issues.apache.org/jira/browse/SOLR-11976 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: search >Affects Versions: master (8.0) >Reporter: Tim Allison >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > TokenizerChain is overwriting, not chaining tokenfilters in {{normalize}}. > This doesn't currently break search because {{normalize}} is not currently > being used at the Solr level (AFAICT); rather, TextField has its own > {{analyzeMultiTerm()}} that duplicates codes from the newer {{normalize}}. > Code as is: > {noformat} > TokenStream result = in; > for (TokenFilterFactory filter : filters) { > if (filter instanceof MultiTermAwareComponent) { > filter = (TokenFilterFactory) ((MultiTermAwareComponent) > filter).getMultiTermComponent(); > result = filter.create(in); > } > } > {noformat} > The fix is simple: > {noformat} > -result = filter.create(in); > +result = filter.create(result); > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-11976) TokenizerChain is overwriting, not chaining in normalize()
[ https://issues.apache.org/jira/browse/SOLR-11976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated SOLR-11976: --- Affects Version/s: (was: 7.2) master (8.0) > TokenizerChain is overwriting, not chaining in normalize() > -- > > Key: SOLR-11976 > URL: https://issues.apache.org/jira/browse/SOLR-11976 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: search >Affects Versions: master (8.0) >Reporter: Tim Allison >Priority: Major > > TokenizerChain is overwriting, not chaining tokenfilters in {{normalize}}. > This doesn't currently break search because {{normalize}} is not currently > being used at the Solr level (AFAICT); rather, TextField has its own > {{analyzeMultiTerm()}} that duplicates codes from the newer {{normalize}}. > Code as is: > {noformat} > TokenStream result = in; > for (TokenFilterFactory filter : filters) { > if (filter instanceof MultiTermAwareComponent) { > filter = (TokenFilterFactory) ((MultiTermAwareComponent) > filter).getMultiTermComponent(); > result = filter.create(in); > } > } > {noformat} > The fix is simple: > {noformat} > -result = filter.create(in); > +result = filter.create(result); > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-11976) TokenizerChain is overwriting, not chaining in normalize()
Tim Allison created SOLR-11976: -- Summary: TokenizerChain is overwriting, not chaining in normalize() Key: SOLR-11976 URL: https://issues.apache.org/jira/browse/SOLR-11976 Project: Solr Issue Type: Bug Security Level: Public (Default Security Level. Issues are Public) Components: search Affects Versions: 7.2 Reporter: Tim Allison TokenizerChain is overwriting, not chaining tokenfilters in {{normalize}}. This doesn't currently break search because {{normalize}} is not currently being used at the Solr level (AFAICT); rather, TextField has its own {{analyzeMultiTerm()}} that duplicates code from the newer {{normalize}}. Code as is:
{noformat}
TokenStream result = in;
for (TokenFilterFactory filter : filters) {
  if (filter instanceof MultiTermAwareComponent) {
    filter = (TokenFilterFactory) ((MultiTermAwareComponent) filter).getMultiTermComponent();
    result = filter.create(in);
  }
}
{noformat}
The fix is simple:
{noformat}
-result = filter.create(in);
+result = filter.create(result);
{noformat}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11701) Upgrade to Tika 1.17 when available
[ https://issues.apache.org/jira/browse/SOLR-11701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16310271#comment-16310271 ] Tim Allison commented on SOLR-11701: Finally back to keyboard. Doh, and thank you!!! > Upgrade to Tika 1.17 when available > --- > > Key: SOLR-11701 > URL: https://issues.apache.org/jira/browse/SOLR-11701 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Tim Allison >Assignee: Erick Erickson > Fix For: 7.3 > > Attachments: SOLR-11701.patch, SOLR-11701.patch > > > Kicking off release process for Tika 1.17 in the next few days. Please let > us know if you have any requests. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11701) Upgrade to Tika 1.17 when available
[ https://issues.apache.org/jira/browse/SOLR-11701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16295483#comment-16295483 ] Tim Allison commented on SOLR-11701: Sounds good. _Thank you_! On the git conflict, y, that was caused by the recent addition of opennlp. I've updated the PR, but there are, of course, already new conflicts! :) Let me know if I can do anything to help with that. On the 401, I'm not sure why that was happening...I'll take a look. On the unused imports, ugh. Thank you. > Upgrade to Tika 1.17 when available > --- > > Key: SOLR-11701 > URL: https://issues.apache.org/jira/browse/SOLR-11701 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Tim Allison >Assignee: Erick Erickson > Attachments: SOLR-11701.patch > > > Kicking off release process for Tika 1.17 in the next few days. Please let > us know if you have any requests. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11701) Upgrade to Tika 1.17 when available
[ https://issues.apache.org/jira/browse/SOLR-11701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16295411#comment-16295411 ] Tim Allison commented on SOLR-11701: Back to keyboard. You're right in all of the above. When we bump slf4j from 1.7.7 to 1.7.24, its behavior changes to print out the full stacktrace instead of just the message. In org.slf4j.helpers.MessageFormatter in 1.7.7, the exception is counted as one of the members of {{argArray}}, and because of the following snippet, the {{throwableCandidate}} is nulled out in the returned {{FormattingTuple}} {noformat} if (L < argArray.length - 1) { return new FormattingTuple(sbuf.toString(), argArray, throwableCandidate); } else { return new FormattingTuple(sbuf.toString(), argArray, (Throwable)null); } {noformat} In 1.7.24, there's an added bit of logic before we get to that location that removes the exception from {{argArray}} so that it can't get swept into the message. {noformat} Object[] args = argArray; if (throwableCandidate != null) { args = trimmedCopy(argArray); } {noformat} I have in the back of my mind that there was a reason we upgraded slf4j in Tika. I'll look through our git history to see when/why and if we need to do it for the Solr integration. > Upgrade to Tika 1.17 when available > --- > > Key: SOLR-11701 > URL: https://issues.apache.org/jira/browse/SOLR-11701 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Tim Allison >Assignee: Erick Erickson > Attachments: SOLR-11701.patch > > > Kicking off release process for Tika 1.17 in the next few days. Please let > us know if you have any requests. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
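The behavioral change described above can be sketched with a toy re-creation. `formatOld`/`formatNew` below are hypothetical methods mirroring the quoted 1.7.7 and 1.7.24 snippets, not slf4j's actual `MessageFormatter`; the point is that for the common {{log.error("failed on {}", doc, e)}} pattern, the old condition nulls out the throwable while the new trimming logic preserves it:

```java
import java.util.Arrays;

public class FormatterDemo {

    // Simplified stand-in for slf4j's FormattingTuple.
    public static class Tuple {
        public final String message;
        public final Throwable throwable;
        Tuple(String m, Throwable t) { message = m; throwable = t; }
    }

    // 1.7.7-style: the trailing Throwable is counted as an ordinary member of
    // argArray; when no args remain past the placeholders, the throwable is
    // nulled out, so the logger prints just the message (no stack trace).
    public static Tuple formatOld(String pattern, Object[] args) {
        Throwable candidate = trailingThrowable(args);
        int consumed = Math.min(countPlaceholders(pattern), args.length);
        String msg = substitute(pattern, args, consumed);
        return consumed < args.length - 1   // mirrors: if (L < argArray.length - 1)
                ? new Tuple(msg, candidate)
                : new Tuple(msg, null);
    }

    // 1.7.24-style: the Throwable is trimmed from argArray up front, so it can
    // never be swept into the message and always survives in the tuple.
    public static Tuple formatNew(String pattern, Object[] args) {
        Throwable candidate = trailingThrowable(args);
        Object[] trimmed = candidate != null ? Arrays.copyOf(args, args.length - 1) : args;
        int consumed = Math.min(countPlaceholders(pattern), trimmed.length);
        return new Tuple(substitute(pattern, trimmed, consumed), candidate);
    }

    static Throwable trailingThrowable(Object[] args) {
        if (args.length > 0 && args[args.length - 1] instanceof Throwable) {
            return (Throwable) args[args.length - 1];
        }
        return null;
    }

    static int countPlaceholders(String p) {
        int n = 0;
        for (int i = p.indexOf("{}"); i >= 0; i = p.indexOf("{}", i + 2)) n++;
        return n;
    }

    static String substitute(String p, Object[] args, int consumed) {
        String out = p;
        for (int i = 0; i < consumed; i++) out = out.replaceFirst("\\{\\}", String.valueOf(args[i]));
        return out;
    }

    public static void main(String[] args) {
        Exception boom = new IllegalStateException("boom");
        Object[] logArgs = {"doc1", boom};
        // One placeholder, two args: old logic drops the exception, new keeps it.
        System.out.println(formatOld("failed on {}", logArgs).throwable);  // null
        System.out.println(formatNew("failed on {}", logArgs).throwable);  // the exception
    }
}
```

Under this sketch the message text is identical in both versions; only whether the exception survives (and hence whether the logger emits a stack trace) differs, which matches the observed 1.7.7 vs 1.7.24 output change.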
[jira] [Commented] (SOLR-11701) Upgrade to Tika 1.17 when available
[ https://issues.apache.org/jira/browse/SOLR-11701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294225#comment-16294225 ] Tim Allison commented on SOLR-11701: Ugh. I’m still without keyboard. Can you tell which dependency is now adding more stuff? Will take a look tomorrow. Thank you for making it easy for me to replicate. > Upgrade to Tika 1.17 when available > --- > > Key: SOLR-11701 > URL: https://issues.apache.org/jira/browse/SOLR-11701 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Tim Allison >Assignee: Erick Erickson > Attachments: SOLR-11701.patch > > > Kicking off release process for Tika 1.17 in the next few days. Please let > us know if you have any requests. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11701) Upgrade to Tika 1.17 when available
[ https://issues.apache.org/jira/browse/SOLR-11701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293930#comment-16293930 ] Tim Allison commented on SOLR-11701: Away from tools now. Will look on Monday. Thank you! > Upgrade to Tika 1.17 when available > --- > > Key: SOLR-11701 > URL: https://issues.apache.org/jira/browse/SOLR-11701 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Tim Allison >Assignee: Erick Erickson > Attachments: SOLR-11701.patch > > > Kicking off release process for Tika 1.17 in the next few days. Please let > us know if you have any requests. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11701) Upgrade to Tika 1.17 when available
[ https://issues.apache.org/jira/browse/SOLR-11701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293525#comment-16293525 ] Tim Allison commented on SOLR-11701: Y. Thank you! > Upgrade to Tika 1.17 when available > --- > > Key: SOLR-11701 > URL: https://issues.apache.org/jira/browse/SOLR-11701 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Tim Allison >Assignee: Erick Erickson > > Kicking off release process for Tika 1.17 in the next few days. Please let > us know if you have any requests. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11701) Upgrade to Tika 1.17 when available
[ https://issues.apache.org/jira/browse/SOLR-11701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293243#comment-16293243 ] Tim Allison commented on SOLR-11701: K. I turned off the warnings with [d25349d|https://github.com/apache/lucene-solr/pull/291/commits/d25349dba44f8774683863092104fad8ea05c75d], and I reran the integration tests. That _should_ be ready to go. > Upgrade to Tika 1.17 when available > --- > > Key: SOLR-11701 > URL: https://issues.apache.org/jira/browse/SOLR-11701 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Tim Allison >Assignee: Erick Erickson > > Kicking off release process for Tika 1.17 in the next few days. Please let > us know if you have any requests. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11701) Upgrade to Tika 1.17 when available
[ https://issues.apache.org/jira/browse/SOLR-11701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16292681#comment-16292681 ] Tim Allison commented on SOLR-11701: One more change... I'd like to turn off the missing jar warnings as the default in Solr. Update to PR coming soon, unless that should be a different issue. > Upgrade to Tika 1.17 when available > --- > > Key: SOLR-11701 > URL: https://issues.apache.org/jira/browse/SOLR-11701 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Tim Allison >Assignee: Erick Erickson > > Kicking off release process for Tika 1.17 in the next few days. Please let > us know if you have any requests. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11701) Upgrade to Tika 1.17 when available
[ https://issues.apache.org/jira/browse/SOLR-11701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16291955#comment-16291955 ] Tim Allison commented on SOLR-11701: Yes, and please. Thank you! > Upgrade to Tika 1.17 when available > --- > > Key: SOLR-11701 > URL: https://issues.apache.org/jira/browse/SOLR-11701 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Tim Allison >Assignee: Erick Erickson > > Kicking off release process for Tika 1.17 in the next few days. Please let > us know if you have any requests. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11622) Bundled mime4j library not sufficient for Tika requirement
[ https://issues.apache.org/jira/browse/SOLR-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16291937#comment-16291937 ] Tim Allison commented on SOLR-11622: Turns out I did because you had done most of the work! :) See https://github.com/apache/lucene-solr/pull/291 over on SOLR-11701. > Bundled mime4j library not sufficient for Tika requirement > -- > > Key: SOLR-11622 > URL: https://issues.apache.org/jira/browse/SOLR-11622 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: Build >Affects Versions: 7.1, 6.6.2 >Reporter: Karim Malhas >Assignee: Karthik Ramachandran >Priority: Minor > Labels: build > Attachments: SOLR-11622.patch, SOLR-11622.patch > > > The version 7.2 of Apache James Mime4j bundled with the Solr binary releases > does not match what is required by Apache Tika for parsing rfc2822 messages. > The master branch for james-mime4j seems to contain the missing Builder class > [https://github.com/apache/james-mime4j/blob/master/core/src/main/java/org/apache/james/mime4j/stream/MimeConfig.java > ] > This prevents import of rfc2822 formatted messages. 
For example like so: > {{./bin/post -c dovecot -type 'message/rfc822' 'testdata/email_01.txt' > }} > And results in the following stacktrace: > java.lang.NoClassDefFoundError: > org/apache/james/mime4j/stream/MimeConfig$Builder > at > org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:63) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) > at > org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228) > at > org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68) > at > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173) > at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477) > at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723) > at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529) > at > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361) > at > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305) > at > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691) > at > org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) > at > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) > at > org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548) > at > org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226) > at > org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > at > org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) > at > 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119) > at > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at > org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335) > at > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.eclipse.jetty.server.Server.handle(Server.java:534) > at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320) > at > org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > at > org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273) > at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95) > at > org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) >
[jira] [Commented] (SOLR-11701) Upgrade to Tika 1.17 when available
[ https://issues.apache.org/jira/browse/SOLR-11701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16291935#comment-16291935 ] Tim Allison commented on SOLR-11701: I merged [~kramachand...@commvault.com]'s mods and made a few updates for Tika 1.17. I ran an integration test against 643 files in Apache Tika's unit test docs, and I got the same # of documents indexed in Solr as tika-app.jar parsed without exceptions.
{noformat}
public static void main(String[] args) throws Exception {
  Path extracts = Paths.get("C:\\data\\tika_unit_tests_extracts");
  SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/fileupload_passt/").build();
  for (File f : extracts.toFile().listFiles()) {
    try (Reader r = Files.newBufferedReader(f.toPath(), StandardCharsets.UTF_8)) {
      List<Metadata> metadataList = JsonMetadataList.fromJson(r);
      String ex = metadataList.get(0).get(TikaCoreProperties.TIKA_META_EXCEPTION_PREFIX + "runtime");
      if (ex == null) {
        SolrQuery q = new SolrQuery("id: " + f.getName().replace(".json", ""));
        QueryResponse response = client.query(q);
        SolrDocumentList results = response.getResults();
        if (results.getNumFound() != 1) {
          System.err.println(f.getName() + " " + results.getNumFound());
        }
      }
    }
  }
}
{noformat}
I did the usual dance:
{noformat}
ant clean-jars jar-checksums
ant precommit
{noformat}
[~erickerickson], this _should_ be good to go. > Upgrade to Tika 1.17 when available > --- > > Key: SOLR-11701 > URL: https://issues.apache.org/jira/browse/SOLR-11701 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Tim Allison > > Kicking off release process for Tika 1.17 in the next few days. Please let > us know if you have any requests. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11622) Bundled mime4j library not sufficient for Tika requirement
[ https://issues.apache.org/jira/browse/SOLR-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16291483#comment-16291483 ] Tim Allison commented on SOLR-11622: [~kramachand...@commvault.com], if it is ok with you and if I have time, I'll try to submit a PR on SOLR-11701. If I don't have time, it will be all yours after you return. :) Sound good...or do you want the glory? For the last integration test I did, I put [these documents|https://github.com/apache/tika/tree/master/tika-parsers/src/test/resources/test-documents] in a directory and ran tika-app.jar against them. I then ran tika-eval.jar and counted the number of files without exceptions to get a ground truth count of how many files I'd expect to be in Solr. I then used DIH to import the same directory, with skip on error, and made sure there were the same # of documents in Solr. This uncovered several problems, which we'll fix in this issue or SOLR-11701. > Bundled mime4j library not sufficient for Tika requirement > -- > > Key: SOLR-11622 > URL: https://issues.apache.org/jira/browse/SOLR-11622 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: Build >Affects Versions: 7.1, 6.6.2 >Reporter: Karim Malhas >Assignee: Karthik Ramachandran >Priority: Minor > Labels: build > Attachments: SOLR-11622.patch, SOLR-11622.patch > > > The version 7.2 of Apache James Mime4j bundled with the Solr binary releases > does not match what is required by Apache Tika for parsing rfc2822 messages. > The master branch for james-mime4j seems to contain the missing Builder class > [https://github.com/apache/james-mime4j/blob/master/core/src/main/java/org/apache/james/mime4j/stream/MimeConfig.java > ] > This prevents import of rfc2822 formatted messages. 
For example like so: > {{./bin/post -c dovecot -type 'message/rfc822' 'testdata/email_01.txt' > }} > And results in the following stacktrace: > java.lang.NoClassDefFoundError: > org/apache/james/mime4j/stream/MimeConfig$Builder > at > org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:63) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) > at > org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228) > at > org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68) > at > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173) > at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477) > at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723) > at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529) > at > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361) > at > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305) > at > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691) > at > org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) > at > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) > at > org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548) > at > org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226) > at > org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > at > org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) > at > 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119) > at > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at > org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335) > at > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.eclipse.jetty.server.Server.handle(Server.java:534) > at org.eclipse.jetty.server.HttpChannel.handle(
[jira] [Commented] (SOLR-11622) Bundled mime4j library not sufficient for Tika requirement
[ https://issues.apache.org/jira/browse/SOLR-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16291138#comment-16291138 ] Tim Allison commented on SOLR-11622:

Sorry, right, yes, please and thank you. The question is whether Karthik wants to do a comprehensive Tika 1.17 upgrade PR or whether I should... either way, with you, [~erickerickson], as the reviewer+committer.

> Bundled mime4j library not sufficient for Tika requirement
> --
>
>                 Key: SOLR-11622
>                 URL: https://issues.apache.org/jira/browse/SOLR-11622
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public (Default Security Level. Issues are Public)
>          Components: Build
>    Affects Versions: 7.1, 6.6.2
>            Reporter: Karim Malhas
>            Assignee: Karthik Ramachandran
>            Priority: Minor
>              Labels: build
>         Attachments: SOLR-11622.patch, SOLR-11622.patch
>
> The version 7.2 of Apache James Mime4j bundled with the Solr binary releases
> does not match what is required by Apache Tika for parsing rfc2822 messages.
> The master branch for james-mime4j seems to contain the missing Builder class:
> https://github.com/apache/james-mime4j/blob/master/core/src/main/java/org/apache/james/mime4j/stream/MimeConfig.java
> This prevents import of rfc2822 formatted messages, for example:
> {{./bin/post -c dovecot -type 'message/rfc822' 'testdata/email_01.txt'}}
> which results in the following stack trace:
> java.lang.NoClassDefFoundError: org/apache/james/mime4j/stream/MimeConfig$Builder
>         at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:63)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
>         at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)
>         at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
>         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)
>         at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477)
>         at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723)
>         at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529)
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)
>         at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
>         at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
>         at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>         at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
>         at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
>         at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>         at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
>         at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
>         at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>         at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>         at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
>         at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
>         at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>         at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
>         at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>         at org.eclipse.jetty.server.Server.handle(Server.java:534)
>         at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
>         at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
>         at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
>         at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
>         at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>         at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
>         at org.eclipse.jetty.util.
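[Editorial note: the {{NoClassDefFoundError}} above is a classpath-version mismatch — Tika's {{RFC822Parser}} needs the {{MimeConfig$Builder}} inner class, which the mime4j 0.8.x line reportedly provides but the older bundled jar does not. A hypothetical diagnostic sketch, plain JDK only (the class name {{Mime4jCheck}} is illustrative):]

```java
// Hypothetical classpath diagnostic: reports whether, and from which jar,
// a fully-qualified class resolves. Useful for spotting which mime4j jar
// actually wins when Solr's lib directories are on the classpath.
public class Mime4jCheck {

    static String check(String fqcn) {
        try {
            Class<?> c = Class.forName(fqcn);
            java.security.CodeSource src = c.getProtectionDomain().getCodeSource();
            return src == null
                    ? fqcn + " loaded (bootstrap classpath)"
                    : fqcn + " loaded from " + src.getLocation();
        } catch (ClassNotFoundException e) {
            return fqcn + " NOT on classpath";
        }
    }

    public static void main(String[] args) {
        // Run with Solr's contrib/extraction/lib jars on the classpath to see
        // whether the bundled mime4j provides the class Tika expects.
        System.out.println(check("org.apache.james.mime4j.stream.MimeConfig$Builder"));
    }
}
```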
[jira] [Commented] (SOLR-11622) Bundled mime4j library not sufficient for Tika requirement
[ https://issues.apache.org/jira/browse/SOLR-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16291128#comment-16291128 ] Tim Allison commented on SOLR-11622:

Thank you, [~erickerickson]! Yes, SOLR-11701 plus [~kramachand...@commvault.com]'s fixes here could be unified into one PR that would upgrade us to Tika 1.17 and would fix numerous dependency problems that I found when I finally did an integration test with Tika's test files [above|https://issues.apache.org/jira/browse/SOLR-11622?focusedCommentId=16277347&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16277347]. This single PR would close out this issue, SOLR-11693, and SOLR-11701, _and_ clean up problems I haven't even opened issues for (msaccess, and ...). [~kramachand...@commvault.com], would you like to have a go at SOLR-11701, plagiarizing my notes, or should I plagiarize your work for SOLR-11701?
[jira] [Commented] (SOLR-11622) Bundled mime4j library not sufficient for Tika requirement
[ https://issues.apache.org/jira/browse/SOLR-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16291082#comment-16291082 ] Tim Allison commented on SOLR-11622:

I'm not a committer on Lucene/Solr, so I can't help. Sorry. Now that Tika 1.17 is out, it would be great to get it fully integrated, including your fixes (SOLR-11701)... especially because this would fix a nasty regression that prevents pptx files with tables from getting indexed (SOLR-11693). [~shalinmangar] or [~thetaphi], if [~kramachand...@commvault.com] or I put together a PR for SOLR-11701, would you be willing to review and commit? This time, I'll run DIH against Tika's unit test documents before making the PR...
[jira] [Commented] (SOLR-11622) Bundled mime4j library not sufficient for Tika requirement
[ https://issues.apache.org/jira/browse/SOLR-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16277359#comment-16277359 ] Tim Allison commented on SOLR-11622:

My {{ant precommit}} had the usual build failure with broken links... So, I think we're good. :)
[jira] [Commented] (SOLR-11622) Bundled mime4j library not sufficient for Tika requirement
[ https://issues.apache.org/jira/browse/SOLR-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16277347#comment-16277347 ] Tim Allison commented on SOLR-11622:

Smh... that we haven't run Solr against Tika's test files before/recently. That would have surfaced SOLR-11693. Unit tests would not have found it, but a full integration test would have. :( Speaking of which, with reference to [this|https://issues.apache.org/jira/browse/SOLR-11622?focusedCommentId=16274648&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16274648], I'm still getting the CTTable xsb error on our {{testPPT_various.pptx}}, and you can't just drop in POI 3.17 to replace POI 3.17-beta1, because there's a binary conflict on wmf files. That fix will require the upgrade to Tika 1.17, which should be on the way. I'm guessing that you aren't seeing it because of the luck of your classloader?
[jira] [Commented] (SOLR-11622) Bundled mime4j library not sufficient for Tika requirement
[ https://issues.apache.org/jira/browse/SOLR-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16277232#comment-16277232 ] Tim Allison commented on SOLR-11622:

Finished analysis. Will submit PR against your branch shortly. Working on {{ant precommit}} now.
[jira] [Comment Edited] (SOLR-11622) Bundled mime4j library not sufficient for Tika requirement
[ https://issues.apache.org/jira/browse/SOLR-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16277232#comment-16277232 ] Tim Allison edited comment on SOLR-11622 at 12/4/17 6:43 PM:

Finished analysis. Will submit PR against your branch shortly. Working on {{ant precommit}} now.

was (Author: talli...@mitre.org): Finished analysis. Will submit PR to against your branch shortly. Working on {{ant precommit}} now.
[jira] [Commented] (SOLR-11622) Bundled mime4j library not sufficient for Tika requirement
[ https://issues.apache.org/jira/browse/SOLR-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16276976#comment-16276976 ] Tim Allison commented on SOLR-11622:

Will do. I'm finding some other things that need to be fixed as well. I have no idea why neither I nor anyone else (apparently?) has run DIH on Tika's test files (at least recently?!)... We've got to change this in our processes.
[jira] [Updated] (SOLR-11721) Isolate most of Tika and dependencies into separate jvm
[ https://issues.apache.org/jira/browse/SOLR-11721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated SOLR-11721: --- Summary: Isolate most of Tika and dependencies into separate jvm (was: Isolate Tika and dependencies into separate jvm) > Isolate most of Tika and dependencies into separate jvm > --- > > Key: SOLR-11721 > URL: https://issues.apache.org/jira/browse/SOLR-11721 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Tim Allison > > Tika should not be run in the same jvm as Solr. Ever. > Upgrading Tika and hoping to avoid jar hell, while getting all of the > dependencies right manually is, um, error prone. See my recent failure: > SOLR-11622, for which I apologize profusely. > Running DIH against Tika's unit test documents has been eye-opening. It has > revealed some other version conflict/dependency failures that should have > been caught much earlier. > The fix is non-trivial, but we should work towards it. > I see two options: > 1. TIKA-2514 -- Our current ForkParser offers a model for a minimal fork > process + server option. The limitation currently is that all parsers and > dependencies must be serializable, which can be a problem for users adding > their own parsers with deps that might not be designed for serializability. > The proposal there is to rework the ForkParser to use a TIKA_HOME directory > for all dependencies. > 2. SOLR-7632 -- use tika-server, but make it seamless and as easy (and > secure!) to use as the current handlers. > Other thoughts, recommendations? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-7632) Change the ExtractingRequestHandler to use Tika-Server
[ https://issues.apache.org/jira/browse/SOLR-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16276888#comment-16276888 ] Tim Allison commented on SOLR-7632: --- bq. To carry out Erik Hatcher's recommendation...I don't know if we'd need CORS for this or not, but it might be neat to modify Tika's server to allow users to inject their own resources=endpoints via a config file and an extra jar. Within the Solr project, we'd just have to implement a resource that takes an input stream, runs Tika and then adds a SolrInputDocument. [~gostep] has proposed allowing users to configure a custom ContentHandler in tika-server. This could enable Solr to create its own content handler that tika-server could use to send the extracted text to Solr on endDocument(). > Change the ExtractingRequestHandler to use Tika-Server > -- > > Key: SOLR-7632 > URL: https://issues.apache.org/jira/browse/SOLR-7632 > Project: Solr > Issue Type: Improvement > Components: contrib - Solr Cell (Tika extraction) >Reporter: Chris A. Mattmann > Labels: gsoc2017, memex > > It's a pain to upgrade Tika's jars all the times when we release, and if Tika > fails it messes up the ExtractingRequestHandler (e.g., the document type > caused Tika to fail, etc). A more reliable way and also separated, and easier > to deploy version of the ExtractingRequestHandler would make a network call > to the Tika JAXRS server, and then call Tika on the Solr server side, get the > results and then index the information that way. I have a patch in the works > from the DARPA Memex project and I hope to post it soon. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-11721) Isolate Tika and dependencies into separate jvm
Tim Allison created SOLR-11721: -- Summary: Isolate Tika and dependencies into separate jvm Key: SOLR-11721 URL: https://issues.apache.org/jira/browse/SOLR-11721 Project: Solr Issue Type: Improvement Security Level: Public (Default Security Level. Issues are Public) Reporter: Tim Allison Tika should not be run in the same jvm as Solr. Ever. Upgrading Tika and hoping to avoid jar hell, while getting all of the dependencies right manually is, um, error prone. See my recent failure: SOLR-11622, for which I apologize profusely. Running DIH against Tika's unit test documents has been eye-opening. It has revealed some other version conflict/dependency failures that should have been caught much earlier. The fix is non-trivial, but we should work towards it. I see two options: 1. TIKA-2514 -- Our current ForkParser offers a model for a minimal fork process + server option. The limitation currently is that all parsers and dependencies must be serializable, which can be a problem for users adding their own parsers with deps that might not be designed for serializability. The proposal there is to rework the ForkParser to use a TIKA_HOME directory for all dependencies. 2. SOLR-7632 -- use tika-server, but make it seamless and as easy (and secure!) to use as the current handlers. Other thoughts, recommendations? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
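[Editor's sketch] The watchdog half of option 1's isolation idea can be sketched with plain JDK classes. This is illustrative only: the class and method names below are invented here, and Tika's real ForkParser goes further by running the parser in a child JVM, which also survives OOMs and hard crashes rather than just hangs.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative only: abandon a runaway "parse" task after a timeout so one
// bad document cannot block indexing forever. A real fix needs a separate
// process, as proposed in TIKA-2514 / SOLR-7632.
public class ParseWatchdogSketch {

    static String parseWithTimeout(Callable<String> parseTask, long timeoutMs)
            throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<String> result = pool.submit(parseTask);
            try {
                return result.get(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                result.cancel(true);   // interrupt the stuck task
                return null;           // caller treats null as "parse failed"
            }
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // A well-behaved task completes normally.
        System.out.println(parseWithTimeout(() -> "extracted text", 1_000));
        // A hung task is abandoned instead of blocking the caller.
        System.out.println(parseWithTimeout(() -> {
            Thread.sleep(60_000);
            return "never reached";
        }, 200));
    }
}
```

Interrupting the worker thread only helps cooperative tasks and does nothing against OOMs or JVM crashes, which is exactly why the child-process options above are the robust ones.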
[jira] [Comment Edited] (SOLR-11622) Bundled mime4j library not sufficient for Tika requirement
[ https://issues.apache.org/jira/browse/SOLR-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16274994#comment-16274994 ] Tim Allison edited comment on SOLR-11622 at 12/1/17 9:17 PM: - There's still a clash with jdom triggered by rss files and rometools {noformat} Exception in thread "Thread-21" java.lang.NoClassDefFoundError: org/jdom2/input/JDOMParseException at com.rometools.rome.io.SyndFeedInput.<init>(SyndFeedInput.java:63) at com.rometools.rome.io.SyndFeedInput.<init>(SyndFeedInput.java:51) {noformat} I'm confirming that should be bumped to 2.0.4. was (Author: talli...@mitre.org): There's still a clash with jdom triggered by rss files and rometools {noformat] Exception in thread "Thread-21" java.lang.NoClassDefFoundError: org/jdom2/input/JDOMParseException at com.rometools.rome.io.SyndFeedInput.<init>(SyndFeedInput.java:63) at com.rometools.rome.io.SyndFeedInput.<init>(SyndFeedInput.java:51) {noformat} I'm confirming that should be bumped to 2.0.4. > Bundled mime4j library not sufficient for Tika requirement > -- > > Key: SOLR-11622 > URL: https://issues.apache.org/jira/browse/SOLR-11622 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: Build >Affects Versions: 7.1, 6.6.2 >Reporter: Karim Malhas >Assignee: Karthik Ramachandran >Priority: Minor > Labels: build > Attachments: SOLR-11622.patch > > > The version 7.2 of Apache James Mime4j bundled with the Solr binary releases > does not match what is required by Apache Tika for parsing rfc2822 messages. > The master branch for james-mime4j seems to contain the missing Builder class > [https://github.com/apache/james-mime4j/blob/master/core/src/main/java/org/apache/james/mime4j/stream/MimeConfig.java > ] > This prevents import of rfc2822 formatted messages. 
For example like so: > {{./bin/post -c dovecot -type 'message/rfc822' 'testdata/email_01.txt' > }} > And results in the following stacktrace: > java.lang.NoClassDefFoundError: > org/apache/james/mime4j/stream/MimeConfig$Builder > at > org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:63) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) > at > org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228) > at > org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68) > at > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173) > at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477) > at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723) > at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529) > at > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361) > at > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305) > at > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691) > at > org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) > at > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) > at > org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548) > at > org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226) > at > org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > at > org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) > at > 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119) > at > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at > org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335) > at > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.eclipse.jetty.server.Server.handle(Server.java:534) >
[jira] [Commented] (SOLR-11622) Bundled mime4j library not sufficient for Tika requirement
[ https://issues.apache.org/jira/browse/SOLR-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16274994#comment-16274994 ] Tim Allison commented on SOLR-11622: There's still a clash with jdom triggered by rss files and rometools {noformat} Exception in thread "Thread-21" java.lang.NoClassDefFoundError: org/jdom2/input/JDOMParseException at com.rometools.rome.io.SyndFeedInput.<init>(SyndFeedInput.java:63) at com.rometools.rome.io.SyndFeedInput.<init>(SyndFeedInput.java:51) {noformat} I'm confirming that should be bumped to 2.0.4. > Bundled mime4j library not sufficient for Tika requirement > -- > > Key: SOLR-11622 > URL: https://issues.apache.org/jira/browse/SOLR-11622 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: Build >Affects Versions: 7.1, 6.6.2 >Reporter: Karim Malhas >Assignee: Karthik Ramachandran >Priority: Minor > Labels: build > Attachments: SOLR-11622.patch > > > The version 7.2 of Apache James Mime4j bundled with the Solr binary releases > does not match what is required by Apache Tika for parsing rfc2822 messages. > The master branch for james-mime4j seems to contain the missing Builder class > [https://github.com/apache/james-mime4j/blob/master/core/src/main/java/org/apache/james/mime4j/stream/MimeConfig.java > ] > This prevents import of rfc2822 formatted messages. 
[jira] [Commented] (SOLR-11622) Bundled mime4j library not sufficient for Tika requirement
[ https://issues.apache.org/jira/browse/SOLR-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16274578#comment-16274578 ] Tim Allison commented on SOLR-11622: Taking a look now. I want to run all of Tika's unit test docs through it to make sure I didn't botch anything else... You saw the POI bug in SOLR-11693? > Bundled mime4j library not sufficient for Tika requirement > -- > > Key: SOLR-11622 > URL: https://issues.apache.org/jira/browse/SOLR-11622 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: Build >Affects Versions: 7.1, 6.6.2 >Reporter: Karim Malhas >Assignee: Karthik Ramachandran >Priority: Minor > Labels: build > Attachments: SOLR-11622.patch > > > The version 7.2 of Apache James Mime4j bundled with the Solr binary releases > does not match what is required by Apache Tika for parsing rfc2822 messages. > The master branch for james-mime4j seems to contain the missing Builder class > [https://github.com/apache/james-mime4j/blob/master/core/src/main/java/org/apache/james/mime4j/stream/MimeConfig.java > ] > This prevents import of rfc2822 formatted messages. 
[jira] [Updated] (SOLR-11701) Upgrade to Tika 1.17 when available
[ https://issues.apache.org/jira/browse/SOLR-11701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated SOLR-11701: --- Description: Kicking off release process for Tika 1.17 in the next few days. Please let us know if you have any requests. > Upgrade to Tika 1.17 when available > --- > > Key: SOLR-11701 > URL: https://issues.apache.org/jira/browse/SOLR-11701 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Tim Allison > > Kicking off release process for Tika 1.17 in the next few days. Please let > us know if you have any requests. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-11693) Class loading problem for Tika/POI for some PPTX
[ https://issues.apache.org/jira/browse/SOLR-11693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated SOLR-11693: --- Description: [~advokat] reported TIKA-2497. I can reproduce this issue with a Solr instance in both 6.6.2 and 7.1.0. I can't reproduce it when I run the triggering file within Solr's unit tests or with straight Tika. Would anyone with more knowledge of classloading within Solr be able to help? See TIKA-2497 for triggering file and conf files. ...turns out this is a bug in POI 3.16 and 3.17-beta1 was: [~advokat] reported TIKA-2497. I can reproduce this issue with a Solr instance in both 6.6.2 and 7.1.0. I can't reproduce it when I run the triggering file within Solr's unit tests or with straight Tika. I can see CTTable as a class where it belongs in contrib/extract/lib/poi-ooxml-schemas-3.17-beta1.jar. Would anyone with more knowledge of classloading within Solr be able to help? See TIKA-2497 for triggering file and conf files. Stacktrace: {noformat} <int name="status">500</int><int name="QTime">204</int> <str name="error-class">org.apache.solr.common.SolrException</str> <str name="root-error-class">java.lang.IllegalStateException</str> <str name="msg">org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@3225ac62</str> <str name="trace">org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@3225ac62 at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:234) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173) at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477) at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723) at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361) at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) at org.eclipse.jetty.server.Server.handle(Server.java:534) at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320) at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273) at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95) at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) at java.lang.Thread.run(Unknown Source) Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@3225ac62 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228) ... 34 more Caused by: java.lang.IllegalStateException: Schemas (*.xsb) for CTTable can't be loaded - usually this happ
[jira] [Commented] (SOLR-11693) Class loading problem for Tika/POI for some PPTX
[ https://issues.apache.org/jira/browse/SOLR-11693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270677#comment-16270677 ] Tim Allison commented on SOLR-11693: [~yegor.kozlov] noted on the POI dev list that this is now fixed in POI 3.17. > Class loading problem for Tika/POI for some PPTX > > > Key: SOLR-11693 > URL: https://issues.apache.org/jira/browse/SOLR-11693 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: contrib - DataImportHandler >Affects Versions: 7.1 >Reporter: Tim Allison >Priority: Minor > > [~advokat] reported TIKA-2497. I can reproduce this issue with a Solr > instance in both 6.6.2 and 7.1.0. > I can't reproduce it when I run the triggering file within Solr's unit tests > or with straight Tika. I can see CTTable as a class where it belongs in > contrib/extract/lib/poi-ooxml-schemas-3.17-beta1.jar. > Would anyone with more knowledge of classloading within Solr be able to help? > See TIKA-2497 for triggering file and conf files. 
[jira] [Created] (SOLR-11701) Upgrade to Tika 1.17 when available
Tim Allison created SOLR-11701: -- Summary: Upgrade to Tika 1.17 when available Key: SOLR-11701 URL: https://issues.apache.org/jira/browse/SOLR-11701 Project: Solr Issue Type: Improvement Security Level: Public (Default Security Level. Issues are Public) Reporter: Tim Allison -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11622) Bundled mime4j library not sufficient for Tika requirement
[ https://issues.apache.org/jira/browse/SOLR-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16269524#comment-16269524 ] Tim Allison commented on SOLR-11622: Y. This was my mistake/omission in SOLR-10335. Ugh. > Bundled mime4j library not sufficient for Tika requirement > -- > > Key: SOLR-11622 > URL: https://issues.apache.org/jira/browse/SOLR-11622 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: Build >Affects Versions: 7.1, 6.6.2 >Reporter: Karim Malhas >Assignee: Karthik Ramachandran >Priority: Minor > Labels: build > Attachments: SOLR-11622.patch > > > The version 7.2 of Apache James Mime4j bundled with the Solr binary releases > does not match what is required by Apache Tika for parsing rfc2822 messages. > The master branch for james-mime4j seems to contain the missing Builder class > [https://github.com/apache/james-mime4j/blob/master/core/src/main/java/org/apache/james/mime4j/stream/MimeConfig.java > ] > This prevents import of rfc2822 formatted messages. 
For example like so: > {{./bin/post -c dovecot -type 'message/rfc822' 'testdata/email_01.txt' > }} > And results in the following stacktrace: > java.lang.NoClassDefFoundError: > org/apache/james/mime4j/stream/MimeConfig$Builder > at > org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:63) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) > at > org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228) > at > org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68) > at > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173) > at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477) > at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723) > at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529) > at > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361) > at > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305) > at > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691) > at > org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) > at > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) > at > org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548) > at > org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226) > at > org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > at > org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) > at > 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119) > at > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at > org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335) > at > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.eclipse.jetty.server.Server.handle(Server.java:534) > at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320) > at > org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > at > org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273) > at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95) > at > org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduce
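The NoClassDefFoundError above is the classic symptom of a stale jar: the outer MimeConfig class resolves, but the nested Builder type added in a newer mime4j release does not. A small, hypothetical diagnostic (ClasspathCheck and isLoadable are illustrative names, not Solr or Tika API) can confirm which classes the running classpath actually provides:

```java
// Checks whether a class can be loaded from the current classpath.
// Useful for diagnosing NoClassDefFoundError caused by a stale jar:
// the outer class may resolve while a newer nested type does not.
public class ClasspathCheck {
    static boolean isLoadable(String className) {
        try {
            // initialize=false: we only care about resolution, not static init
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | NoClassDefFoundError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A JDK class is always loadable.
        System.out.println(isLoadable("java.lang.String"));
        // The nested Builder class that mime4j 7.2 lacks:
        System.out.println(isLoadable("org.apache.james.mime4j.stream.MimeConfig$Builder"));
    }
}
```

Running this inside the Solr JVM (or against the contrib/extract/lib jars) would show whether upgrading the bundled mime4j resolves the mismatch.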
[jira] [Updated] (SOLR-11693) Class loading problem for Tika/POI for some PPTX
[ https://issues.apache.org/jira/browse/SOLR-11693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated SOLR-11693: --- Affects Version/s: 7.1 > Class loading problem for Tika/POI for some PPTX > > > Key: SOLR-11693 > URL: https://issues.apache.org/jira/browse/SOLR-11693 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: contrib - DataImportHandler >Affects Versions: 7.1 >Reporter: Tim Allison >Priority: Minor > > [~advokat] reported TIKA-2497. I can reproduce this issue with a Solr > instance in both 6.6.2 and 7.1.0. > I can't reproduce it when I run the triggering file within Solr's unit tests > or with straight Tika. I can see CTTable as a class where it belongs in > contrib/extract/lib/poi-ooxml-schemas-3.17-beta1.jar. > Would anyone with more knowledge of classloading within Solr be able to help? > See TIKA-2497 for triggering file and conf files. > Stacktrace: > {noformat} > > 500 name="QTime">204 name="error-class">org.apache.solr.common.SolrException name="root-error-class">java.lang.IllegalStateException name="msg">org.apache.tika.exception.TikaException: Unexpected > RuntimeException from > org.apache.tika.parser.microsoft.ooxml.OOXMLParser@3225ac62 name="trace">org.apache.solr.common.SolrException: > org.apache.tika.exception.TikaException: Unexpected RuntimeException from > org.apache.tika.parser.microsoft.ooxml.OOXMLParser@3225ac62 > at > org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:234) > at > org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68) > at > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173) > at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477) > at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723) > at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529) > at > 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361) > at > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305) > at > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691) > at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) > at > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) > at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548) > at > org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226) > at > org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > at > org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) > at > org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119) > at > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at > org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335) > at > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.eclipse.jetty.server.Server.handle(Server.java:534) > at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320) > at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > at > org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273) > at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95) > at > org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at > org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) > at > org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) > at java.lang.Thread.run(Unknown Source) > Caused by: org.apache.tika.exception.TikaException: Unexpected > RuntimeException from > org.apache.tika.parser.microsoft.ooxml.OOXMLParser@3225ac62 > at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282) > at org.apache.tika.parser.CompositeParse
[jira] [Updated] (SOLR-11693) Class loading problem for Tika/POI for some PPTX
[ https://issues.apache.org/jira/browse/SOLR-11693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated SOLR-11693: --- Priority: Minor (was: Major) > Class loading problem for Tika/POI for some PPTX > > > Key: SOLR-11693 > URL: https://issues.apache.org/jira/browse/SOLR-11693 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: contrib - DataImportHandler >Affects Versions: 7.1 >Reporter: Tim Allison >Priority: Minor > > [~advokat] reported TIKA-2497. I can reproduce this issue with a Solr > instance in both 6.6.2 and 7.1.0. > I can't reproduce it when I run the triggering file within Solr's unit tests > or with straight Tika. I can see CTTable as a class where it belongs in > contrib/extract/lib/poi-ooxml-schemas-3.17-beta1.jar. > Would anyone with more knowledge of classloading within Solr be able to help? > See TIKA-2497 for triggering file and conf files. > Stacktrace: > {noformat} > > 500 name="QTime">204 name="error-class">org.apache.solr.common.SolrException name="root-error-class">java.lang.IllegalStateException name="msg">org.apache.tika.exception.TikaException: Unexpected > RuntimeException from > org.apache.tika.parser.microsoft.ooxml.OOXMLParser@3225ac62 name="trace">org.apache.solr.common.SolrException: > org.apache.tika.exception.TikaException: Unexpected RuntimeException from > org.apache.tika.parser.microsoft.ooxml.OOXMLParser@3225ac62 > at > org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:234) > at > org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68) > at > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173) > at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477) > at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723) > at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529) > at > 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361) > at > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305) > at > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691) > at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) > at > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) > at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548) > at > org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226) > at > org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > at > org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) > at > org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119) > at > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at > org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335) > at > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.eclipse.jetty.server.Server.handle(Server.java:534) > at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320) > at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > at > org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273) > at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95) > at > org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at > org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) > at > org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) > at java.lang.Thread.run(Unknown Source) > Caused by: org.apache.tika.exception.TikaException: Unexpected > RuntimeException from > org.apache.tika.parser.microsoft.ooxml.OOXMLParser@3225ac62 > at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282) > at org.apache.tika.parser.Composi
[jira] [Created] (SOLR-11693) Class loading problem for Tika/POI for some PPTX
Tim Allison created SOLR-11693: -- Summary: Class loading problem for Tika/POI for some PPTX Key: SOLR-11693 URL: https://issues.apache.org/jira/browse/SOLR-11693 Project: Solr Issue Type: Bug Security Level: Public (Default Security Level. Issues are Public) Components: contrib - DataImportHandler Reporter: Tim Allison [~advokat] reported TIKA-2497. I can reproduce this issue with a Solr instance in both 6.6.2 and 7.1.0. I can't reproduce it when I run the triggering file within Solr's unit tests or with straight Tika. I can see CTTable as a class where it belongs in contrib/extract/lib/poi-ooxml-schemas-3.17-beta1.jar. Would anyone with more knowledge of classloading within Solr be able to help? See TIKA-2497 for triggering file and conf files. Stacktrace: {noformat} 500204org.apache.solr.common.SolrExceptionjava.lang.IllegalStateExceptionorg.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@3225ac62org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@3225ac62 at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:234) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173) at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477) at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723) at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) at org.eclipse.jetty.server.Server.handle(Server.java:534) at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320) at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273) at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95) at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) at 
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) at java.lang.Thread.run(Unknown Source) Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@3225ac62 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228) ... 34 more Caused by: java.lang.IllegalStateException: Schemas (*.xsb) for CTTable can't be loaded - usually this happens when OSGI loading is used and the thread context classloader has no reference to the xmlbeans classes - use POIXMLTypeLoader.setClassLoader() to set th
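The IllegalStateException at the end of the trace says XMLBeans resolves the *.xsb schema classes through the thread context classloader, which under Solr's contrib classloading does not see poi-ooxml-schemas. A minimal sketch of the usual workaround pattern follows; parse() is a hypothetical stand-in for the Tika/POI call, and POI also offers POIXMLTypeLoader.setClassLoader() as mentioned in the message:

```java
// Sketch: temporarily point the thread context classloader at the
// loader that can see the contrib/extract jars, then restore it.
public class ContextClassLoaderSwap {
    // Hypothetical stand-in for the Tika parse call; it just reports
    // whether the context classloader matches this class's own loader,
    // which is what XMLBeans needs to find the schema classes.
    static String parse() {
        return Thread.currentThread().getContextClassLoader()
                == ContextClassLoaderSwap.class.getClassLoader()
                ? "parsed" : "wrong loader";
    }

    public static String parseWithContribLoader() {
        Thread t = Thread.currentThread();
        ClassLoader previous = t.getContextClassLoader();
        t.setContextClassLoader(ContextClassLoaderSwap.class.getClassLoader());
        try {
            return parse();
        } finally {
            t.setContextClassLoader(previous); // always restore the old loader
        }
    }
}
```

The try/finally restore matters in a servlet container, where the request thread is pooled and reused with its original context classloader.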
[jira] [Commented] (SOLR-8981) Upgrade to Tika 1.13 when it is available
[ https://issues.apache.org/jira/browse/SOLR-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212527#comment-16212527 ] Tim Allison commented on SOLR-8981: --- +1 Thank you, [~thetaphi]! > Upgrade to Tika 1.13 when it is available > - > > Key: SOLR-8981 > URL: https://issues.apache.org/jira/browse/SOLR-8981 > Project: Solr > Issue Type: Improvement > Components: contrib - Solr Cell (Tika extraction) >Reporter: Tim Allison >Assignee: Uwe Schindler > Fix For: 5.5.5, 6.2, 7.0 > > > Tika 1.13 should be out within a month. This includes PDFBox 2.0.0 and a > number of other upgrades and improvements. > If there are any showstoppers in 1.13 from Solr's side or requests before we > roll 1.13, let us know.
[jira] [Commented] (SOLR-10335) Upgrade to Tika 1.16 when available
[ https://issues.apache.org/jira/browse/SOLR-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16204121#comment-16204121 ] Tim Allison commented on SOLR-10335: Thank you, again! > Upgrade to Tika 1.16 when available > --- > > Key: SOLR-10335 > URL: https://issues.apache.org/jira/browse/SOLR-10335 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Tim Allison >Assignee: Shalin Shekhar Mangar >Priority: Critical > Fix For: 7.1, 6.6.2 > > > Once POI 3.16-beta3 is out (early/mid April?), we'll push for a release of > Tika 1.15. > Please let us know if you have any requests.
[jira] [Commented] (SOLR-11450) ComplexPhraseQParserPlugin not running charfilter for some multiterm queries in 6.x
[ https://issues.apache.org/jira/browse/SOLR-11450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16203840#comment-16203840 ] Tim Allison commented on SOLR-11450: bq. I'm not familiar enough with Solr query parsers Yes, I've been away from this for too long and got the first couple of answers to [~bjarkebm] wrong on the user list because of the difference between Lucene and Solr. It is good to be back. Thank you! > ComplexPhraseQParserPlugin not running charfilter for some multiterm queries > in 6.x > > > Key: SOLR-11450 > URL: https://issues.apache.org/jira/browse/SOLR-11450 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 6.6.1 >Reporter: Tim Allison >Priority: Minor > Labels: patch-with-test > Attachments: SOLR-11450-unit-test.patch, SOLR-11450.patch > > > On the user list, [~bjarkebm] reported that the charfilter is not being > applied in PrefixQueries in the ComplexPhraseQParserPlugin in 6.x. Bjarke > fixed my proposed unit tests to prove this failure. All appears to work in > 7.x and trunk. If there are plans to release a 6.6.2, let's fold this in.
[jira] [Commented] (SOLR-10335) Upgrade to Tika 1.16 when available
[ https://issues.apache.org/jira/browse/SOLR-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16203804#comment-16203804 ] Tim Allison commented on SOLR-10335: [~shalinmangar], should I submit another PR for the 6_x and 6.6.2 branch or will you take it from here? THANK YOU!!! > Upgrade to Tika 1.16 when available > --- > > Key: SOLR-10335 > URL: https://issues.apache.org/jira/browse/SOLR-10335 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Tim Allison >Assignee: Shalin Shekhar Mangar >Priority: Critical > > Once POI 3.16-beta3 is out (early/mid April?), we'll push for a release of > Tika 1.15. > Please let us know if you have any requests.
[jira] [Comment Edited] (SOLR-10335) Upgrade to Tika 1.16 when available
[ https://issues.apache.org/jira/browse/SOLR-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16203804#comment-16203804 ] Tim Allison edited comment on SOLR-10335 at 10/13/17 4:32 PM: -- [~shalinmangar], should I submit another PR for the 6_x and 6.6.2 branches or will you take it from here? THANK YOU!!! was (Author: talli...@mitre.org): [~shalinmangar], should I submit another PR for the 6_x and 6.6.2 branch or will you take it from here? THANK YOU!!! > Upgrade to Tika 1.16 when available > --- > > Key: SOLR-10335 > URL: https://issues.apache.org/jira/browse/SOLR-10335 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Tim Allison >Assignee: Shalin Shekhar Mangar >Priority: Critical > > Once POI 3.16-beta3 is out (early/mid April?), we'll push for a release of > Tika 1.15. > Please let us know if you have any requests.
[jira] [Comment Edited] (SOLR-11450) ComplexPhraseQParserPlugin not running charfilter for some multiterm queries in 6.x
[ https://issues.apache.org/jira/browse/SOLR-11450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16203778#comment-16203778 ] Tim Allison edited comment on SOLR-11450 at 10/13/17 4:30 PM: -- Ha. Right. Solr does do its own thing. {{FieldTypePluginLoader}} generates a multiterm analyzer in the TextField by subsetting the TokenizerChain's components that are MultitermAware and/or swapping in a KeywordAnalyzer --[here|https://github.com/apache/lucene-solr/blob/branch_6x/solr/core/src/java/org/apache/solr/schema/FieldTypePluginLoader.java#L182] ...almost like {{Analyzer.normalize()}} in 7.x :) Then {{SolrQueryParserBase}} has an {{analyzeIfMultiTermText}} [here|https://github.com/apache/lucene-solr/blob/branch_6x/solr/core/src/java/org/apache/solr/parser/SolrQueryParserBase.java#L883], which in turn calls {{TextField}}'s {{analyzeMultiTerm}} with {{TextField}}'s multitermanalyzer that was built back in the {{FieldTypePluginLoader}} above. So, in Solr 6.x, the basic QueryParser relies on the SolrQueryParserBase and all is good. However, the CPQP doesn't extend the SolrQueryParserBase. Two things make this feel like a bug and not a feature in Solr 6.x: 1) multiterm analysis works for the classic query parser but not fully for the CPQP in Solr 6.x 2) multiterm analysis works for CPQP for some multiterms (wildcard/reverse wildcard) and range, but not in the other multiterms: prefix, regex and fuzzy. was (Author: talli...@mitre.org): Ha. Right. Solr does do its own thing. 
{{FieldTypePluginLoader}} generates a multiterm analyzer in the TextField by subsetting the TokenizerChain's components that are MultitermAware and/or swapping in a KeywordAnalyzer --[here|https://github.com/apache/lucene-solr/blob/branch_6x/solr/core/src/java/org/apache/solr/schema/FieldTypePluginLoader.java#L182] ...just like {{Analyzer.normalize()}} in 7.x :) Then {{SolrQueryParserBase}} has an {{analyzeIfMultiTermText}} [here|https://github.com/apache/lucene-solr/blob/branch_6x/solr/core/src/java/org/apache/solr/parser/SolrQueryParserBase.java#L883], which in turn calls {{TextField}}'s {{analyzeMultiTerm}} with {{TextField}}'s multitermanalyzer that was built back in the {{FieldTypePluginLoader}} above. So, in Solr 6.x, the basic QueryParser relies on the SolrQueryParserBase and all is good. However, the CPQP doesn't extend the SolrQueryParserBase. Two things make this feel like a bug and not a feature in Solr 6.x: 1) multiterm analysis works for the classic query parser but not fully for the CPQP in Solr 6.x 2) multiterm analysis works for CPQP for some multiterms (wildcard/reverse wildcard) and range, but not in the other multiterms: prefix, regex and fuzzy. > ComplexPhraseQParserPlugin not running charfilter for some multiterm queries > in 6.x > > > Key: SOLR-11450 > URL: https://issues.apache.org/jira/browse/SOLR-11450 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 6.6.1 >Reporter: Tim Allison >Priority: Minor > Labels: patch-with-test > Attachments: SOLR-11450-unit-test.patch, SOLR-11450.patch > > > On the user list, [~bjarkebm] reported that the charfilter is not being > applied in PrefixQueries in the ComplexPhraseQParserPlugin in 6.x. Bjarke > fixed my proposed unit tests to prove this failure. All appears to work in > 7.x and trunk. If there are plans to release a 6.6.2, let's fold this in. 
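The multiterm-analyzer construction described in the comment above — walk the analysis chain and keep only the components that declare themselves safe for multiterm queries — can be sketched in miniature. MultiTermAware and Component here are illustrative stand-ins, not Solr's actual MultiTermAwareComponent API:

```java
import java.util.ArrayList;
import java.util.List;

// Miniature model of how FieldTypePluginLoader builds the multiterm
// analyzer: keep only chain components marked as multiterm-aware and
// drop the rest (e.g. stemmers). Names are illustrative, not Solr's.
public class MultiTermSubset {
    interface Component { String name(); }
    interface MultiTermAware extends Component { }

    static List<String> multiTermChain(List<Component> chain) {
        List<String> kept = new ArrayList<>();
        for (Component c : chain) {
            if (c instanceof MultiTermAware) {
                kept.add(c.name()); // survives into the multiterm analyzer
            }
            // components that are not multiterm-aware are silently dropped
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Component> chain = List.of(
            (MultiTermAware) () -> "lowercase",    // aware: kept
            (Component) () -> "stemmer",           // not aware: dropped
            (MultiTermAware) () -> "asciifolding"  // aware: kept
        );
        System.out.println(multiTermChain(chain)); // [lowercase, asciifolding]
    }
}
```

The bug described for 6.x is then easy to state in these terms: the classic query parser routes prefix/regex/fuzzy terms through the subsetted chain, while the ComplexPhraseQParserPlugin bypasses it for those term types.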
[jira] [Comment Edited] (SOLR-11450) ComplexPhraseQParserPlugin not running charfilter for some multiterm queries in 6.x
[ https://issues.apache.org/jira/browse/SOLR-11450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16203778#comment-16203778 ] Tim Allison edited comment on SOLR-11450 at 10/13/17 4:20 PM: -- Ha. Right. Solr does do its own thing. {{FieldTypePluginLoader}} generates a multiterm analyzer in the TextField by subsetting the TokenizerChain's components that are MultitermAware and/or swapping in a KeywordAnalyzer --[here|https://github.com/apache/lucene-solr/blob/branch_6x/solr/core/src/java/org/apache/solr/schema/FieldTypePluginLoader.java#L182] ...just like {{Analyzer.normalize()}} in 7.x :) Then {{SolrQueryParserBase}} has an {{analyzeIfMultiTermText}} [here|https://github.com/apache/lucene-solr/blob/branch_6x/solr/core/src/java/org/apache/solr/parser/SolrQueryParserBase.java#L883], which in turn calls {{TextField}}'s {{analyzeMultiTerm}} with {{TextField}}'s multitermanalyzer that was built back in the {{FieldTypePluginLoader}} above. So, in Solr 6.x, the basic QueryParser relies on the SolrQueryParserBase and all is good. However, the CPQP doesn't extend the SolrQueryParserBase. Two things make this feel like a bug and not a feature in Solr 6.x: 1) multiterm analysis works for the classic query parser but not fully for the CPQP in Solr 6.x 2) multiterm analysis works for CPQP for some multiterms (wildcard/reverse wildcard) and range, but not in the other multiterms: prefix, regex and fuzzy. was (Author: talli...@mitre.org): Ha. Right. Solr does do its own thing. 
{{FieldTypePluginLoader}} generates a multiterm analyzer in the TextField by subsetting the TokenizerChain's components that are MultitermAware and/or swapping in a KeywordAnalyzer --[here|https://github.com/apache/lucene-solr/blob/branch_6x/solr/core/src/java/org/apache/solr/schema/FieldTypePluginLoader.java#L182] ...just like {{Analyzer.normalize()}} in 7.x :) Then {{SolrQueryParserBase}} has an {{analyzeIfMultiTermText}} [here|https://github.com/apache/lucene-solr/blob/branch_6x/solr/core/src/java/org/apache/solr/parser/SolrQueryParserBase.java#L883], which in turn calls {{TextField}}'s {{analyzeMultiTerm}} with {{TextField}}'s multitermanalyzer that was built back in the {{FieldTypePluginLoader}} above. So, in Solr 6.x, the basic QueryParser relies on the SolrQueryParserBase and all is good. However, the CPQP doesn't extend the SolrQueryParserBase. Two things make this feel like a bug and not a feature in Solr 6.x: 1) multiterm analysis works for the classic query parser in Solr 6.x 2) multiterm analysis works for CPQP for some multiterms (wildcard/reverse wildcard) and range, but not in the other multiterms: prefix, regex and fuzzy. > ComplexPhraseQParserPlugin not running charfilter for some multiterm queries > in 6.x > > > Key: SOLR-11450 > URL: https://issues.apache.org/jira/browse/SOLR-11450 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 6.6.1 >Reporter: Tim Allison >Priority: Minor > Labels: patch-with-test > Attachments: SOLR-11450-unit-test.patch, SOLR-11450.patch > > > On the user list, [~bjarkebm] reported that the charfilter is not being > applied in PrefixQueries in the ComplexPhraseQParserPlugin in 6.x. Bjarke > fixed my proposed unit tests to prove this failure. All appears to work in > 7.x and trunk. If there are plans to release a 6.6.2, let's fold this in. 
[jira] [Comment Edited] (SOLR-11450) ComplexPhraseQParserPlugin not running charfilter for some multiterm queries in 6.x
[ https://issues.apache.org/jira/browse/SOLR-11450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16203778#comment-16203778 ] Tim Allison edited comment on SOLR-11450 at 10/13/17 4:19 PM: -- Ha. Right. Solr does do its own thing. {{FieldTypePluginLoader}} generates a multiterm analyzer in the TextField by subsetting the TokenizerChain's components that are MultitermAware and/or swapping in a KeywordAnalyzer --[here|https://github.com/apache/lucene-solr/blob/branch_6x/solr/core/src/java/org/apache/solr/schema/FieldTypePluginLoader.java#L182] ...just like {{Analyzer.normalize()}} in 7.x :) Then {{SolrQueryParserBase}} has an {{analyzeIfMultiTermText}} [here|https://github.com/apache/lucene-solr/blob/branch_6x/solr/core/src/java/org/apache/solr/parser/SolrQueryParserBase.java#L883], which in turn calls {{TextField}}'s {{analyzeMultiTerm}} with {{TextField}}'s multitermanalyzer that was built back in the {{FieldTypePluginLoader}} above. So, in Solr 6.x, the basic QueryParser relies on the SolrQueryParserBase and all is good. However, the CPQP doesn't extend the SolrQueryParserBase. Two things make this feel like a bug and not a feature in Solr 6.x: 1) multiterm analysis works for the classic query parser in Solr 6.x 2) multiterm analysis works for CPQP for some multiterms (wildcard/reverse wildcard) and range, but not in the other multiterms: prefix, regex and fuzzy. was (Author: talli...@mitre.org): Ha. Right. Solr does do its own thing. 
{{FieldTypePluginLoader}} generates a multiterm analyzer in the TextField by subsetting the TokenizerChain's components that are MultitermAware and swapping in a KeywordTokenizer --[here|http://example.com] [https://github.com/apache/lucene-solr/blob/branch_6x/solr/core/src/java/org/apache/solr/schema/FieldTypePluginLoader.java#L182] ...just CustomAnalyzer's {{normalize()}} in 7.x :) Then {{SolrQueryParserBase}} has an {{analyzeIfMultiTermText}} [here|https://github.com/apache/lucene-solr/blob/branch_6x/solr/core/src/java/org/apache/solr/parser/SolrQueryParserBase.java#L883], which in turn calls {{TextField}}'s {{analyzeMultiTerm}} with {{TextField}}'s multitermanalyzer that was built back in the {{FieldTypePluginLoader}} above. So, in Solr 6.x, the basic QueryParser relies on the SolrQueryParserBase and all is good. However, the CPQP doesn't extend the SolrQueryParserBase. Two things make this feel like a bug and not a feature in Solr 6.x: 1) multiterm analysis works for the classic query parser in Solr 6.x 2) multiterm analysis works for CPQP for some multiterms (wildcard/reverse wildcard) and range, but not in the other multiterms: prefix, regex and fuzzy. > ComplexPhraseQParserPlugin not running charfilter for some multiterm queries > in 6.x > > > Key: SOLR-11450 > URL: https://issues.apache.org/jira/browse/SOLR-11450 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 6.6.1 >Reporter: Tim Allison >Priority: Minor > Labels: patch-with-test > Attachments: SOLR-11450-unit-test.patch, SOLR-11450.patch > > > On the user list, [~bjarkebm] reported that the charfilter is not being > applied in PrefixQueries in the ComplexPhraseQParserPlugin in 6.x. Bjarke > fixed my proposed unit tests to prove this failure. All appears to work in > 7.x and trunk. If there are plans to release a 6.6.2, let's fold this in. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
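The subsetting mechanism described in the comment above can be sketched without Lucene. In this sketch, {{Step}} and {{MultiTermAware}} are stand-in types of my own invention (not Lucene's {{TokenFilterFactory}} or {{MultiTermAwareComponent}}), but the shape matches what {{FieldTypePluginLoader}} does: keep only the multiterm-aware components of the query-time chain, and run wildcard/prefix terms through that reduced chain.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Lucene-free sketch of 6.x multiterm analysis: subset the chain to its
// "multiterm aware" components, then normalize a wildcard/prefix term with
// only those. Step and MultiTermAware are illustrative stand-ins.
public class MultiTermSubsetDemo {
    interface Step extends UnaryOperator<String> {}
    interface MultiTermAware {}  // marker: safe to apply to wildcard/prefix terms

    static class LowerCase implements Step, MultiTermAware {
        public String apply(String s) { return s.toLowerCase(); }
    }
    static class FoldAe implements Step, MultiTermAware {  // toy ASCII folding
        public String apply(String s) { return s.replace("\u00E6", "ae"); }
    }
    static class Stemmer implements Step {  // NOT multiterm aware: would mangle "cr\u00E6z*"
        public String apply(String s) { return s.replaceAll("(ing|s)$", ""); }
    }

    static final List<Step> CHAIN =
            List.of(new LowerCase(), new FoldAe(), new Stemmer());

    // Keep only the MultiTermAware components, as the plugin loader does
    // when it builds the TextField's multiterm analyzer.
    static List<Step> multiTermChain(List<Step> chain) {
        List<Step> subset = new ArrayList<>();
        for (Step s : chain) {
            if (s instanceof MultiTermAware) {
                subset.add(s);
            }
        }
        return subset;
    }

    static String normalizeMultiTerm(String term) {
        String result = term;
        for (Step s : multiTermChain(CHAIN)) {
            result = s.apply(result);
        }
        return result;
    }

    public static void main(String[] args) {
        // Prefix of a wildcard query like "Cr\u00C6z*": lowercasing and
        // folding apply, stemming is skipped.
        System.out.println(normalizeMultiTerm("Cr\u00C6z"));  // craez
    }
}
```

The bug reported in this issue is that CPQP bypasses this reduced chain entirely for prefix, regex and fuzzy terms, so the charfilter/folding step above never runs for them.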
[jira] [Commented] (SOLR-11450) ComplexPhraseQParserPlugin not running charfilter for some multiterm queries in 6.x
[ https://issues.apache.org/jira/browse/SOLR-11450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16203778#comment-16203778 ] Tim Allison commented on SOLR-11450: Ha. Right. Solr does do its own thing. {{FieldTypePluginLoader}} generates a multiterm analyzer in the TextField by subsetting the TokenizerChain's components that are MultitermAware and swapping in a KeywordTokenizer --[here|https://github.com/apache/lucene-solr/blob/branch_6x/solr/core/src/java/org/apache/solr/schema/FieldTypePluginLoader.java#L182] ...just like CustomAnalyzer's {{normalize()}} in 7.x :) Then {{SolrQueryParserBase}} has an {{analyzeIfMultiTermText}} [here|https://github.com/apache/lucene-solr/blob/branch_6x/solr/core/src/java/org/apache/solr/parser/SolrQueryParserBase.java#L883], which in turn calls {{TextField}}'s {{analyzeMultiTerm}} with {{TextField}}'s multiterm analyzer that was built back in the {{FieldTypePluginLoader}} above. So, in Solr 6.x, the basic QueryParser relies on the SolrQueryParserBase and all is good. However, the CPQP doesn't extend the SolrQueryParserBase. Two things make this feel like a bug and not a feature in Solr 6.x: 1) multiterm analysis works for the classic query parser in Solr 6.x 2) multiterm analysis works for CPQP for some multiterms (wildcard/reverse wildcard) and range, but not in the other multiterms: prefix, regex and fuzzy. > ComplexPhraseQParserPlugin not running charfilter for some multiterm queries > in 6.x > > > Key: SOLR-11450 > URL: https://issues.apache.org/jira/browse/SOLR-11450 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 6.6.1 >Reporter: Tim Allison >Priority: Minor > Labels: patch-with-test > Attachments: SOLR-11450-unit-test.patch, SOLR-11450.patch > > > On the user list, [~bjarkebm] reported that the charfilter is not being > applied in PrefixQueries in the ComplexPhraseQParserPlugin in 6.x. 
Bjarke > fixed my proposed unit tests to prove this failure. All appears to work in > 7.x and trunk. If there are plans to release a 6.6.2, let's fold this in. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11450) ComplexPhraseQParserPlugin not running charfilter for some multiterm queries in 6.x
[ https://issues.apache.org/jira/browse/SOLR-11450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16203573#comment-16203573 ] Tim Allison commented on SOLR-11450: [~jpountz], thank you for your response! Y, the changes in 7.x are fantastic. Am I misunderstanding 6.x, though? This test passes, which suggests that normalization was working correctly for the classic queryparser in 6.x, but not the cpqp. Or am I misunderstanding? If your point is that this would be a breaking change for some users of cpqp and it therefore doesn't belong in a bugfix release, I'm willing to accept that. {noformat} @Test public void testCharFilter() { assertU(adoc("iso-latin1", "craezy traen", "id", "1")); assertU(commit()); assertU(optimize()); assertQ(req("q", "iso-latin1:cr\u00E6zy") , "//result[@numFound='1']" , "//doc[./str[@name='id']='1']" ); assertQ(req("q", "iso-latin1:tr\u00E6n") , "//result[@numFound='1']" , "//doc[./str[@name='id']='1']" ); assertQ(req("q", "iso-latin1:c\u00E6zy~1") , "//result[@numFound='1']" , "//doc[./str[@name='id']='1']" ); assertQ(req("q", "iso-latin1:cr\u00E6z*") , "//result[@numFound='1']" , "//doc[./str[@name='id']='1']" ); assertQ(req("q", "iso-latin1:*\u00E6zy") , "//result[@numFound='1']" , "//doc[./str[@name='id']='1']" ); assertQ(req("q", "iso-latin1:cr\u00E6*y") , "//result[@numFound='1']" , "//doc[./str[@name='id']='1']" ); assertQ(req("q", "iso-latin1:/cr\u00E6[a-z]y/") , "//result[@numFound='1']" , "//doc[./str[@name='id']='1']" ); assertQ(req("q", "iso-latin1:[cr\u00E6zx TO cr\u00E6zz]") , "//result[@numFound='1']" , "//doc[./str[@name='id']='1']" ); } {noformat} > ComplexPhraseQParserPlugin not running charfilter for some multiterm queries > in 6.x > > > Key: SOLR-11450 > URL: https://issues.apache.org/jira/browse/SOLR-11450 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. 
Issues are Public) >Affects Versions: 6.6.1 >Reporter: Tim Allison >Priority: Minor > Labels: patch-with-test > Attachments: SOLR-11450-unit-test.patch, SOLR-11450.patch > > > On the user list, [~bjarkebm] reported that the charfilter is not being > applied in PrefixQueries in the ComplexPhraseQParserPlugin in 6.x. Bjarke > fixed my proposed unit tests to prove this failure. All appears to work in > 7.x and trunk. If there are plans to release a 6.6.2, let's fold this in. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11450) ComplexPhraseQParserPlugin not running charfilter for some multiterm queries in 6.x
[ https://issues.apache.org/jira/browse/SOLR-11450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16203541#comment-16203541 ] Tim Allison commented on SOLR-11450: [~mkhludnev] or any other committer willing to review and push for 6.6.2? > ComplexPhraseQParserPlugin not running charfilter for some multiterm queries > in 6.x > > > Key: SOLR-11450 > URL: https://issues.apache.org/jira/browse/SOLR-11450 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 6.6.1 >Reporter: Tim Allison >Priority: Minor > Labels: patch-with-test > Attachments: SOLR-11450-unit-test.patch, SOLR-11450.patch > > > On the user list, [~bjarkebm] reported that the charfilter is not being > applied in PrefixQueries in the ComplexPhraseQParserPlugin in 6.x. Bjarke > fixed my proposed unit tests to prove this failure. All appears to work in > 7.x and trunk. If there are plans to release a 6.6.2, let's fold this in. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-11450) ComplexPhraseQParserPlugin not running charfilter for some multiterm queries in 6.x
[ https://issues.apache.org/jira/browse/SOLR-11450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated SOLR-11450: --- Labels: patch-with-test (was: ) > ComplexPhraseQParserPlugin not running charfilter for some multiterm queries > in 6.x > > > Key: SOLR-11450 > URL: https://issues.apache.org/jira/browse/SOLR-11450 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 6.6.1 >Reporter: Tim Allison >Priority: Minor > Labels: patch-with-test > Attachments: SOLR-11450-unit-test.patch, SOLR-11450.patch > > > On the user list, [~bjarkebm] reported that the charfilter is not being > applied in PrefixQueries in the ComplexPhraseQParserPlugin in 6.x. Bjarke > fixed my proposed unit tests to prove this failure. All appears to work in > 7.x and trunk. If there are plans to release a 6.6.2, let's fold this in. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-10335) Upgrade to Tika 1.16 when available
[ https://issues.apache.org/jira/browse/SOLR-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16203529#comment-16203529 ] Tim Allison commented on SOLR-10335: Thank you, [~shalinmangar]! Is it worth backporting to 6.6.2? > Upgrade to Tika 1.16 when available > --- > > Key: SOLR-10335 > URL: https://issues.apache.org/jira/browse/SOLR-10335 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Tim Allison >Assignee: Shalin Shekhar Mangar >Priority: Minor > > Once POI 3.16-beta3 is out (early/mid April?), we'll push for a release of > Tika 1.15. > Please let us know if you have any requests. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11450) ComplexPhraseQParserPlugin not running charfilter for some multiterm queries in 6.x
[ https://issues.apache.org/jira/browse/SOLR-11450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16202585#comment-16202585 ] Tim Allison commented on SOLR-11450: To get the directionality right...sorry. The issue I opened is a duplicate of LUCENE-7687...my error. > ComplexPhraseQParserPlugin not running charfilter for some multiterm queries > in 6.x > > > Key: SOLR-11450 > URL: https://issues.apache.org/jira/browse/SOLR-11450 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 6.6.1 >Reporter: Tim Allison >Priority: Minor > Attachments: SOLR-11450-unit-test.patch, SOLR-11450.patch > > > On the user list, [~bjarkebm] reported that the charfilter is not being > applied in PrefixQueries in the ComplexPhraseQParserPlugin in 6.x. Bjarke > fixed my proposed unit tests to prove this failure. All appears to work in > 7.x and trunk. If there are plans to release a 6.6.2, let's fold this in. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11450) ComplexPhraseQParserPlugin not running charfilter for some multiterm queries in 6.x
[ https://issues.apache.org/jira/browse/SOLR-11450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16202573#comment-16202573 ] Tim Allison commented on SOLR-11450: [~mikemccand] [~jpountz], any chance you'd be willing to review and push this into 6.6.2? > ComplexPhraseQParserPlugin not running charfilter for some multiterm queries > in 6.x > > > Key: SOLR-11450 > URL: https://issues.apache.org/jira/browse/SOLR-11450 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: 6.6.1 >Reporter: Tim Allison >Priority: Minor > Attachments: SOLR-11450-unit-test.patch, SOLR-11450.patch > > > On the user list, [~bjarkebm] reported that the charfilter is not being > applied in PrefixQueries in the ComplexPhraseQParserPlugin in 6.x. Bjarke > fixed my proposed unit tests to prove this failure. All appears to work in > 7.x and trunk. If there are plans to release a 6.6.2, let's fold this in. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7687) ComplexPhraseQueryParser with AsciiFoldingFilterFactory (SOLR)
[ https://issues.apache.org/jira/browse/LUCENE-7687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16202561#comment-16202561 ] Tim Allison commented on LUCENE-7687: - There's a patch available on SOLR-11450. This seems to have been fixed in 7.x. > ComplexPhraseQueryParser with AsciiFoldingFilterFactory (SOLR) > -- > > Key: LUCENE-7687 > URL: https://issues.apache.org/jira/browse/LUCENE-7687 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 6.4.1 > Environment: solr-6.4.1 (yes, solr, but I don't know where the bug > exactly is) >Reporter: Jochen Barth > > I modified generic *_txt-Field type to use AsciiFoldingFilterFactory on query > & index. > When quering with > \{!complexphrase}text_txt:"König*" -- there are 0 results > \{!complexphrase}text_txt:"Konig*" -- there are >0 results > \{!complexphrase}text_txt:"König" -- there are >0 results (but less than the > line above) > and without \{!complexphrase} everything works o.k. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11462) TokenizerChain's normalize() doesn't work
[ https://issues.apache.org/jira/browse/SOLR-11462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16199244#comment-16199244 ] Tim Allison commented on SOLR-11462: Could also submit PR for getting rid of TokenizerChain in favor of CustomAnalyzer. :) > TokenizerChain's normalize() doesn't work > - > > Key: SOLR-11462 > URL: https://issues.apache.org/jira/browse/SOLR-11462 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: master (8.0) >Reporter: Tim Allison >Priority: Trivial > > TokenizerChain's {{normalize()}} is not currently used so this doesn't > currently have any negative effects on search. However, there is a bug, and > we should fix it. > If applied to a TokenizerChain with {{filters.length > 1}}, only the last > would apply. > > {noformat} > @Override > protected TokenStream normalize(String fieldName, TokenStream in) { > TokenStream result = in; > for (TokenFilterFactory filter : filters) { > if (filter instanceof MultiTermAwareComponent) { > filter = (TokenFilterFactory) ((MultiTermAwareComponent) > filter).getMultiTermComponent(); > result = filter.create(in); > } > } > return result; > } > {noformat} > The fix is trivial: > {noformat} > -result = filter.create(in); > +result = filter.create(result); > {noformat} > If you'd like to swap out {{TextField#analyzeMultiTerm()}} with, say: > {noformat} > public static BytesRef analyzeMultiTerm(String field, String part, Analyzer > analyzerIn) { > if (part == null || analyzerIn == null) return null; > return analyzerIn.normalize(field, part); > } > {noformat} > I'm happy to submit a PR with unit tests. Let me know. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-5317) Concordance/Key Word In Context (KWIC) capability
[ https://issues.apache.org/jira/browse/LUCENE-5317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16199160#comment-16199160 ] Tim Allison edited comment on LUCENE-5317 at 10/10/17 6:40 PM: --- A prototype ASL 2.0 application that demonstrates the utility of the concordance is available: https://github.com/mitre/rhapsode was (Author: talli...@mitre.org): an prototype ASF 2.0 application that demonstrates the utility of the concordance is available: https://github.com/mitre/rhapsode > Concordance/Key Word In Context (KWIC) capability > - > > Key: LUCENE-5317 > URL: https://issues.apache.org/jira/browse/LUCENE-5317 > Project: Lucene - Core > Issue Type: New Feature > Components: core/search >Affects Versions: 4.5 >Reporter: Tim Allison >Assignee: Tommaso Teofili > Labels: patch > Attachments: LUCENE-5317.patch, LUCENE-5317.patch, > concordance_v1.patch.gz, lucene5317v1.patch, lucene5317v2.patch > > > This patch enables a Lucene-powered concordance search capability. > Concordances are extremely useful for linguists, lawyers and other analysts > performing analytic search vs. traditional snippeting/document retrieval > tasks. By "analytic search," I mean that the user wants to browse every time > a term appears (or at least the topn) in a subset of documents and see the > words before and after. > Concordance technology is far simpler and less interesting than IR relevance > models/methods, but it can be extremely useful for some use cases. > Traditional concordance sort orders are available (sort on words before the > target, words after, target then words before and target then words after). > Under the hood, this is running SpanQuery's getSpans() and reanalyzing to > obtain character offsets. There is plenty of room for optimizations and > refactoring. > Many thanks to my colleague, Jason Robinson, for input on the design of this > patch. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5317) Concordance/Key Word In Context (KWIC) capability
[ https://issues.apache.org/jira/browse/LUCENE-5317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16199160#comment-16199160 ] Tim Allison commented on LUCENE-5317: - A prototype ASL 2.0 application that demonstrates the utility of the concordance is available: https://github.com/mitre/rhapsode > Concordance/Key Word In Context (KWIC) capability > - > > Key: LUCENE-5317 > URL: https://issues.apache.org/jira/browse/LUCENE-5317 > Project: Lucene - Core > Issue Type: New Feature > Components: core/search >Affects Versions: 4.5 >Reporter: Tim Allison >Assignee: Tommaso Teofili > Labels: patch > Attachments: LUCENE-5317.patch, LUCENE-5317.patch, > concordance_v1.patch.gz, lucene5317v1.patch, lucene5317v2.patch > > > This patch enables a Lucene-powered concordance search capability. > Concordances are extremely useful for linguists, lawyers and other analysts > performing analytic search vs. traditional snippeting/document retrieval > tasks. By "analytic search," I mean that the user wants to browse every time > a term appears (or at least the topn) in a subset of documents and see the > words before and after. > Concordance technology is far simpler and less interesting than IR relevance > models/methods, but it can be extremely useful for some use cases. > Traditional concordance sort orders are available (sort on words before the > target, words after, target then words before and target then words after). > Under the hood, this is running SpanQuery's getSpans() and reanalyzing to > obtain character offsets. There is plenty of room for optimizations and > refactoring. > Many thanks to my colleague, Jason Robinson, for input on the design of this > patch. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-11462) TokenizerChain's normalize() doesn't work
[ https://issues.apache.org/jira/browse/SOLR-11462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated SOLR-11462: --- Affects Version/s: master (8.0) > TokenizerChain's normalize() doesn't work > - > > Key: SOLR-11462 > URL: https://issues.apache.org/jira/browse/SOLR-11462 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: master (8.0) >Reporter: Tim Allison >Priority: Trivial > > TokenizerChain's {{normalize()}} is not currently used so this doesn't > currently have any negative effects on search. However, there is a bug, and > we should fix it. > If applied to a TokenizerChain with {{filters.length > 1}}, only the last > would apply. > > {noformat} > @Override > protected TokenStream normalize(String fieldName, TokenStream in) { > TokenStream result = in; > for (TokenFilterFactory filter : filters) { > if (filter instanceof MultiTermAwareComponent) { > filter = (TokenFilterFactory) ((MultiTermAwareComponent) > filter).getMultiTermComponent(); > result = filter.create(in); > } > } > return result; > } > {noformat} > The fix is trivial: > {noformat} > -result = filter.create(in); > +result = filter.create(result); > {noformat} > If you'd like to swap out {{TextField#analyzeMultiTerm()}} with, say: > {noformat} > public static BytesRef analyzeMultiTerm(String field, String part, Analyzer > analyzerIn) { > if (part == null || analyzerIn == null) return null; > return analyzerIn.normalize(field, part); > } > {noformat} > I'm happy to submit a PR with unit tests. Let me know. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-11462) TokenizerChain's normalize() doesn't work
Tim Allison created SOLR-11462: -- Summary: TokenizerChain's normalize() doesn't work Key: SOLR-11462 URL: https://issues.apache.org/jira/browse/SOLR-11462 Project: Solr Issue Type: Bug Security Level: Public (Default Security Level. Issues are Public) Reporter: Tim Allison Priority: Trivial TokenizerChain's {{normalize()}} is not currently used so this doesn't currently have any negative effects on search. However, there is a bug, and we should fix it. If applied to a TokenizerChain with {{filters.length > 1}}, only the last would apply. {noformat} @Override protected TokenStream normalize(String fieldName, TokenStream in) { TokenStream result = in; for (TokenFilterFactory filter : filters) { if (filter instanceof MultiTermAwareComponent) { filter = (TokenFilterFactory) ((MultiTermAwareComponent) filter).getMultiTermComponent(); result = filter.create(in); } } return result; } {noformat} The fix is trivial: {noformat} -result = filter.create(in); +result = filter.create(result); {noformat} If you'd like to swap out {{TextField#analyzeMultiTerm()}} with, say: {noformat} public static BytesRef analyzeMultiTerm(String field, String part, Analyzer analyzerIn) { if (part == null || analyzerIn == null) return null; return analyzerIn.normalize(field, part); } {noformat} I'm happy to submit a PR with unit tests. Let me know. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
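The {{normalize()}} loop bug quoted above, and its one-line fix, can be reproduced outside Lucene with plain string transforms standing in for {{TokenStream}}/{{TokenFilterFactory}} (a sketch with illustrative names, not Lucene APIs):

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Lucene-free sketch of the TokenizerChain.normalize() bug: each "filter"
// stands in for a TokenFilterFactory wrapping the previous TokenStream.
public class ChainBugDemo {
    // Two toy normalization steps: lowercasing, then ASCII-folding of '\u00E6'.
    static final List<UnaryOperator<String>> FILTERS = List.of(
            s -> s.toLowerCase(),
            s -> s.replace("\u00E6", "ae"));

    // Mirrors the buggy loop: every filter is applied to the ORIGINAL input,
    // so only the last filter's output survives the loop.
    static String normalizeBuggy(String in) {
        String result = in;
        for (UnaryOperator<String> f : FILTERS) {
            result = f.apply(in);   // bug: should be f.apply(result)
        }
        return result;
    }

    // The one-line fix: each filter wraps the previous result.
    static String normalizeFixed(String in) {
        String result = in;
        for (UnaryOperator<String> f : FILTERS) {
            result = f.apply(result);
        }
        return result;
    }

    public static void main(String[] args) {
        // Buggy: lowercasing is discarded; folding finds no lowercase '\u00E6'
        // in the raw input, so the term comes back unchanged.
        System.out.println(normalizeBuggy("Cr\u00C6zy"));  // Cr\u00C6zy
        // Fixed: lowercase first, then fold.
        System.out.println(normalizeFixed("Cr\u00C6zy"));  // craezy
    }
}
```

With more than one filter in the chain, the buggy loop silently drops every step but the last, which is exactly why it only bites when {{filters.length > 1}}.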