Re: GIT does not support empty directories
Seriously? We should hack our ant files around the bugs in every crappy source control system that comes out? Fix Git.

On Thu, Apr 15, 2010 at 10:55 PM, Smiley, David W. dsmi...@mitre.org wrote:
> I've run into this too. I don't think this needs to be documented, I think it needs to be *fixed* -- that is, the relevant ant tasks need to not assume these directories exist and create them if not.
> ~ David Smiley
>
> -----Original Message-----
> From: Lance Norskog [mailto:goks...@gmail.com]
> Sent: Wednesday, April 14, 2010 11:14 PM
> To: solr-dev
> Subject: GIT does not support empty directories
>
> There are some empty directories in the Solr source tree, both in 1.4 and the trunk:
>   example/work
>   example/webapp
>   example/logs
> Git does not support empty directories:
> https://git.wiki.kernel.org/index.php/GitFaq#Can_I_add_empty_directories.3F
> And so, when you check out from the Apache GIT repository, these empty directories do not appear and 'ant example' and 'ant run-example' fail.
> There is no 'how to use the solr git stuff' wiki page; that seems like the right place to document this. I'm not git-smart enough to write that page.
>
> --
> Lance Norskog
> goks...@gmail.com

--
Robert Muir
rcm...@gmail.com
Re: GIT does not support empty directories
I don't like the idea of complicating lucene/solr's build system any more than it already is, unless it's absolutely necessary. It's already too complicated. Instead of adding more hacks, what is actually broken (git) is what should be fixed, as the link states: "Currently the design of the git index (staging area) only permits *files* to be listed, and nobody competent enough to make the change to allow empty directories has cared enough about this situation to remedy it."

On Fri, Apr 16, 2010 at 11:14 AM, Smiley, David W. dsmi...@mitre.org wrote:
> Seriously. I sympathize with your point that git should support empty directories. But as a practical matter, it's easy to make the ant build tolerant of them.
> ~ David Smiley
>
> From: Robert Muir [rcm...@gmail.com]
> Sent: Friday, April 16, 2010 6:53 AM
> To: solr-dev@lucene.apache.org
> Subject: Re: GIT does not support empty directories
>
> Seriously? We should hack our ant files around the bugs in every crappy source control system that comes out? Fix Git.
>
> On Thu, Apr 15, 2010 at 10:55 PM, Smiley, David W. dsmi...@mitre.org wrote:
>> I've run into this too. I don't think this needs to be documented, I think it needs to be *fixed* -- that is, the relevant ant tasks need to not assume these directories exist and create them if not.
>> ~ David Smiley
>>
>> -----Original Message-----
>> From: Lance Norskog [mailto:goks...@gmail.com]
>> Sent: Wednesday, April 14, 2010 11:14 PM
>> To: solr-dev
>> Subject: GIT does not support empty directories
>>
>> There are some empty directories in the Solr source tree, both in 1.4 and the trunk:
>>   example/work
>>   example/webapp
>>   example/logs
>> Git does not support empty directories:
>> https://git.wiki.kernel.org/index.php/GitFaq#Can_I_add_empty_directories.3F
>> And so, when you check out from the Apache GIT repository, these empty directories do not appear and 'ant example' and 'ant run-example' fail.
>> There is no 'how to use the solr git stuff' wiki page; that seems like the right place to document this. I'm not git-smart enough to write that page.
>>
>> --
>> Lance Norskog
>> goks...@gmail.com
>
> --
> Robert Muir
> rcm...@gmail.com

--
Robert Muir
rcm...@gmail.com
Re: Eclipse project files...
On Mon, Apr 12, 2010 at 5:15 AM, Paolo Castagna castagna.li...@googlemail.com wrote:
> For Lucene, I needed two more jars from the Ant project:
> - ant-1.7.1.jar
> - ant-junit-1.7.1.jar

Paolo, I put these in the lib directory now, to hopefully make IDE configuration easier. By the way, thanks for your ideas here. I think it's worth our time to try to make Lucene/Solr as easy as possible for someone to bring up in their IDE, or we scare people away...

--
Robert Muir
rcm...@gmail.com
[jira] Assigned: (SOLR-1876) Convert all tokenstreams and tests to use CharTermAttribute
[ https://issues.apache.org/jira/browse/SOLR-1876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir reassigned SOLR-1876: - Assignee: Robert Muir Convert all tokenstreams and tests to use CharTermAttribute --- Key: SOLR-1876 URL: https://issues.apache.org/jira/browse/SOLR-1876 Project: Solr Issue Type: Task Components: Schema and Analysis Affects Versions: 3.1 Reporter: Robert Muir Assignee: Robert Muir Fix For: 3.1 Attachments: SOLR-1876.patch See the improvements in LUCENE-2302. TermAttribute has been deprecated for flexible indexing, as terms can really be anything, as long as they can be serialized to byte[]. For character-terms, a CharTermAttribute has been created, with a more friendly API. Additionally this attribute implements the CharSequence and Appendable interfaces. We should convert all Solr tokenstreams to use this new attribute. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Assigned: (SOLR-1874) optimize patternreplacefilter
[ https://issues.apache.org/jira/browse/SOLR-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir reassigned SOLR-1874: - Assignee: Robert Muir optimize patternreplacefilter - Key: SOLR-1874 URL: https://issues.apache.org/jira/browse/SOLR-1874 Project: Solr Issue Type: Improvement Components: Schema and Analysis Affects Versions: 3.1 Reporter: Robert Muir Assignee: Robert Muir Fix For: 3.1 Attachments: SOLR-1874.patch We can optimize PatternReplaceFilter: * don't need to create Strings since CharTermAttribute implements CharSequence, just match directly against it. * reuse the matcher, since CharTermAttribute is reused, too. * don't create Strings/waste time in replaceAll/replaceFirst if the term doesn't match the regex at all... check with find() first. There is more that could be done to make it faster for terms that do match, but this is simple and a start. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Resolved: (SOLR-1874) optimize patternreplacefilter
[ https://issues.apache.org/jira/browse/SOLR-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved SOLR-1874. --- Resolution: Fixed Committed revision 932752. optimize patternreplacefilter - Key: SOLR-1874 URL: https://issues.apache.org/jira/browse/SOLR-1874 Project: Solr Issue Type: Improvement Components: Schema and Analysis Affects Versions: 3.1 Reporter: Robert Muir Assignee: Robert Muir Fix For: 3.1 Attachments: SOLR-1874.patch We can optimize PatternReplaceFilter: * don't need to create Strings since CharTermAttribute implements CharSequence, just match directly against it. * reuse the matcher, since CharTermAttribute is reused, too. * don't create Strings/waste time in replaceAll/replaceFirst if the term doesn't match the regex at all... check with find() first. There is more that could be done to make it faster for terms that do match, but this is simple and a start. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
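The three optimizations listed in SOLR-1874 can be illustrated with plain java.util.regex, independent of Lucene; this is a hedged sketch, not Solr's actual PatternReplaceFilter source (which is not shown in this thread), and the class and method names are invented. It reuses a single Matcher via reset(CharSequence) on each term (CharTermAttribute implements CharSequence, so no String copy is needed) and checks find() before paying for replaceAll:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternReplaceSketch {
    // Compile the pattern once and reuse one Matcher for every term,
    // mirroring the "reuse the matcher" point in the issue.
    // (A real filter would hold one matcher per TokenStream instance;
    // a static matcher like this is not thread-safe.)
    private static final Pattern DOTS = Pattern.compile("\\.+");
    private static final Matcher MATCHER = DOTS.matcher("");

    static String replaceAll(CharSequence term, String replacement) {
        MATCHER.reset(term);          // match the CharSequence directly, no new String
        if (!MATCHER.find()) {
            return term.toString();   // fast path: term doesn't match the regex at all
        }
        // Matcher.replaceAll resets the matcher internally before scanning.
        return MATCHER.replaceAll(replacement);
    }

    public static void main(String[] args) {
        System.out.println(replaceAll("identi.ca", "-"));  // identi-ca
        System.out.println(replaceAll("identica", "-"));   // identica
    }
}
```

The find()-first check is the same trick the issue describes: terms that don't match never allocate a replacement String at all.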
[jira] Created: (SOLR-1876) Convert all tokenstreams and tests to use CharTermAttribute
Convert all tokenstreams and tests to use CharTermAttribute --- Key: SOLR-1876 URL: https://issues.apache.org/jira/browse/SOLR-1876 Project: Solr Issue Type: Task Components: Schema and Analysis Affects Versions: 3.1 Reporter: Robert Muir Fix For: 3.1 See the improvements in LUCENE-2302. TermAttribute has been deprecated for flexible indexing, as terms can really be anything, as long as they can be serialized to byte[]. For character-terms, a CharTermAttribute has been created, with a more friendly API. Additionally this attribute implements the CharSequence and Appendable interfaces. We should convert all Solr tokenstreams to use this new attribute. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (SOLR-1876) Convert all tokenstreams and tests to use CharTermAttribute
[ https://issues.apache.org/jira/browse/SOLR-1876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1876: -- Attachment: SOLR-1876.patch This patch does the following: * Converts all tokenstreams to use CharTermAttribute * Makes all non-final concrete TokenStreams and Analyzers final (see LUCENE-2389) * enables both lucene and solr assertions when running solr core and contrib tests (previously disabled!) All tests pass, and also pass with the additional assertions if you apply LUCENE-2389 Convert all tokenstreams and tests to use CharTermAttribute --- Key: SOLR-1876 URL: https://issues.apache.org/jira/browse/SOLR-1876 Project: Solr Issue Type: Task Components: Schema and Analysis Affects Versions: 3.1 Reporter: Robert Muir Fix For: 3.1 Attachments: SOLR-1876.patch See the improvements in LUCENE-2302. TermAttribute has been deprecated for flexible indexing, as terms can really be anything, as long as they can be serialized to byte[]. For character-terms, a CharTermAttribute has been created, with a more friendly API. Additionally this attribute implements the CharSequence and Appendable interfaces. We should convert all Solr tokenstreams to use this new attribute. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Created: (SOLR-1874) optimize patternreplacefilter
optimize patternreplacefilter - Key: SOLR-1874 URL: https://issues.apache.org/jira/browse/SOLR-1874 Project: Solr Issue Type: Improvement Components: Schema and Analysis Affects Versions: 3.1 Reporter: Robert Muir Fix For: 3.1 Attachments: SOLR-1874.patch We can optimize PatternReplaceFilter: * don't need to create Strings since CharTermAttribute implements CharSequence, just match directly against it. * reuse the matcher, since CharTermAttribute is reused, too. * don't create Strings/waste time in replaceAll/replaceFirst if the term doesn't match the regex at all... check with find() first. There is more that could be done to make it faster for terms that do match, but this is simple and a start. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1874) optimize patternreplacefilter
[ https://issues.apache.org/jira/browse/SOLR-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1874: -- Attachment: SOLR-1874.patch optimize patternreplacefilter - Key: SOLR-1874 URL: https://issues.apache.org/jira/browse/SOLR-1874 Project: Solr Issue Type: Improvement Components: Schema and Analysis Affects Versions: 3.1 Reporter: Robert Muir Fix For: 3.1 Attachments: SOLR-1874.patch We can optimize PatternReplaceFilter: * don't need to create Strings since CharTermAttribute implements CharSequence, just match directly against it. * reuse the matcher, since CharTermAttribute is reused, too. * don't create Strings/waste time in replaceAll/replaceFirst if the term doesn't match the regex at all... check with find() first. There is more that could be done to make it faster for terms that do match, but this is simple and a start. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1869) RemoveDuplicatesTokenFilter doesn't have expected behaviour
[ https://issues.apache.org/jira/browse/SOLR-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854983#action_12854983 ]

Robert Muir commented on SOLR-1869:
-----------------------------------

bq. this all started because the highlighter was highlighting a term at the same offsets twice,

Perhaps we should fix this directly in DefaultSolrHighlighter? It already has this TokenStream-sorting filter that's intended to do the following:

{code}
/** Orders Tokens in a window first by their startOffset ascending.
 *  endOffset is currently ignored.
 *  This is meant to work around fickleness in the highlighter only. It
 *  can mess up token positions and should not be used for indexing or querying.
 */
{code}

Maybe the deduplication logic should occur here after it sorts on startOffset?

> RemoveDuplicatesTokenFilter doesn't have expected behaviour
> -----------------------------------------------------------
>
> Key: SOLR-1869
> URL: https://issues.apache.org/jira/browse/SOLR-1869
> Project: Solr
> Issue Type: New Feature
> Components: Schema and Analysis
> Reporter: Joe Calderon
> Priority: Minor
> Attachments: RemoveDupOffsetTokenFilter.java, RemoveDupOffsetTokenFilterFactory.java, SOLR-1869.patch
>
> The RemoveDuplicatesTokenFilter seems broken, as it initializes its map and attributes at the class level and not within its constructor.
> In addition, I would think the expected behaviour would be to remove identical terms with the same offset positions; instead it looks like it removes duplicates based on position increment, which won't work when using it after something like the edgengram filter. When I posted this to the mailing list, even Erik Hatcher seemed to think that's what this filter was supposed to do...
> Attaching a patch that has the expected behaviour and initializes variables in the constructor.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1869) RemoveDuplicatesTokenFilter doesn't have expected behaviour
[ https://issues.apache.org/jira/browse/SOLR-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854676#action_12854676 ]

Robert Muir commented on SOLR-1869:
-----------------------------------

Joe, the initialization is the same. I simply prefer to do this right where the attribute is declared, rather than doing it in the ctor (it's the same in Java!). So this is no problem.

As far as the behavior, the filter is currently correct:

{noformat}
A TokenFilter which filters out Tokens at the same position and Term text as the previous token in the stream.
{noformat}

If you want to instead create a filter that removes duplicates across an entire field, this is really a completely different filter, but it sounds like a useful completely different filter! Can you instead create a patch for a separate filter with a different name? I think you can start with this patch, but there are a number of issues with it:
* the map/set is never cleared, so it won't work across reusable tokenstreams. The map/set should be cleared in reset()
* I would use CharArraySet instead of this map, like the current RemoveDuplicatesTokenFilter

> RemoveDuplicatesTokenFilter doesn't have expected behaviour
> -----------------------------------------------------------
>
> Key: SOLR-1869
> URL: https://issues.apache.org/jira/browse/SOLR-1869
> Project: Solr
> Issue Type: Bug
> Components: Schema and Analysis
> Reporter: Joe Calderon
> Priority: Minor
> Attachments: SOLR-1869.patch
>
> The RemoveDuplicatesTokenFilter seems broken, as it initializes its map and attributes at the class level and not within its constructor.
> In addition, I would think the expected behaviour would be to remove identical terms with the same offset positions; instead it looks like it removes duplicates based on position increment, which won't work when using it after something like the edgengram filter. When I posted this to the mailing list, even Erik Hatcher seemed to think that's what this filter was supposed to do...
> Attaching a patch that has the expected behaviour and initializes variables in the constructor.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
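The two review points above (state must be cleared in reset(), dedupe by term text plus offset) can be sketched in plain Java. This is an illustrative stand-in, not the real Lucene TokenFilter API; the class name FieldDedup and its methods are invented:

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of "remove duplicate (term, startOffset) pairs across a
// whole field" -- not a real Lucene TokenFilter.
public class FieldDedup {
    private final Set<String> seen = new HashSet<>();

    // Plays the role of TokenFilter.reset(): without clearing here, the
    // instance would silently drop tokens when the stream is reused.
    public void reset() {
        seen.clear();
    }

    // Returns true the first time a (term, startOffset) pair is seen,
    // false for every duplicate after that.
    public boolean accept(String term, int startOffset) {
        return seen.add(startOffset + ":" + term);
    }
}
```

The reset() method is exactly the bug called out in the review: a reused stream must start with an empty set, or tokens from the previous document get treated as duplicates.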
[jira] Updated: (SOLR-1869) RemoveDuplicatesTokenFilter doesn't have expected behaviour
[ https://issues.apache.org/jira/browse/SOLR-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1869: -- Issue Type: New Feature (was: Bug) RemoveDuplicatesTokenFilter doest have expected behaviour - Key: SOLR-1869 URL: https://issues.apache.org/jira/browse/SOLR-1869 Project: Solr Issue Type: New Feature Components: Schema and Analysis Reporter: Joe Calderon Priority: Minor Attachments: SOLR-1869.patch the RemoveDuplicatesTokenFilter seems broken as it initializes its map and attributes at the class level and not within its constructor in addition i would think the expected behaviour would be to remove identical terms with the same offset positions, instead it looks like it removes duplicates based on position increment which wont work when using it after something like the edgengram filter. when i posted this to the mailing list even erik hatcher seemed to think thats what this filter was supposed to do... attaching a patch that has the expected behaviour and initializes variables in constructor -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1869) RemoveDuplicatesTokenFilter doesn't have expected behaviour
[ https://issues.apache.org/jira/browse/SOLR-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854712#action_12854712 ] Robert Muir commented on SOLR-1869: --- bq. The CharArrayMap is more performant in lookup, but you are right, we may need posincr. we don't need it for the current implementation, as we clear() the chararrayset when we encounter a term of posincr 0. so the set is only a set of seen terms at some position. RemoveDuplicatesTokenFilter doest have expected behaviour - Key: SOLR-1869 URL: https://issues.apache.org/jira/browse/SOLR-1869 Project: Solr Issue Type: New Feature Components: Schema and Analysis Reporter: Joe Calderon Priority: Minor Attachments: SOLR-1869.patch the RemoveDuplicatesTokenFilter seems broken as it initializes its map and attributes at the class level and not within its constructor in addition i would think the expected behaviour would be to remove identical terms with the same offset positions, instead it looks like it removes duplicates based on position increment which wont work when using it after something like the edgengram filter. when i posted this to the mailing list even erik hatcher seemed to think thats what this filter was supposed to do... attaching a patch that has the expected behaviour and initializes variables in constructor -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (SOLR-1865) ignore byte-order markers in SolrResourceLoader
ignore byte-order markers in SolrResourceLoader --- Key: SOLR-1865 URL: https://issues.apache.org/jira/browse/SOLR-1865 Project: Solr Issue Type: Improvement Reporter: Robert Muir Priority: Minor Fix For: 3.1 Attachments: SOLR-1865.patch If you create say a stopwords list with windows notepad or other editors and save as UTF-8, some of these editors will insert a byte-order marker (zero-width no-break space) as the first character of the file. http://www.lucidimagination.com/search/document/5101871231fc95af/is_this_a_bug_of_the_ressourceloader -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1865) ignore byte-order markers in SolrResourceLoader
[ https://issues.apache.org/jira/browse/SOLR-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1865: -- Attachment: SOLR-1865.patch attached is a patch to ignore BOM's at the beginning of files loaded with getLines() ignore byte-order markers in SolrResourceLoader --- Key: SOLR-1865 URL: https://issues.apache.org/jira/browse/SOLR-1865 Project: Solr Issue Type: Improvement Reporter: Robert Muir Priority: Minor Fix For: 3.1 Attachments: SOLR-1865.patch If you create say a stopwords list with windows notepad or other editors and save as UTF-8, some of these editors will insert a byte-order marker (zero-width no-break space) as the first character of the file. http://www.lucidimagination.com/search/document/5101871231fc95af/is_this_a_bug_of_the_ressourceloader -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
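The fix can be sketched with stdlib Java alone. The helper below is hypothetical and not the actual SolrResourceLoader.getLines() code, but it shows the relevant check: drop a leading U+FEFF (what a UTF-8 BOM decodes to) from the first line of the file:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class BomAwareReader {
    // U+FEFF: zero-width no-break space, i.e. a decoded UTF-8 BOM.
    private static final char BOM = '\uFEFF';

    // Hypothetical getLines()-style helper: strips a BOM from the start of
    // the first line, so a Notepad-saved stopwords file behaves normally.
    static List<String> getLines(BufferedReader reader) throws IOException {
        List<String> lines = new ArrayList<>();
        String line;
        boolean first = true;
        while ((line = reader.readLine()) != null) {
            if (first && !line.isEmpty() && line.charAt(0) == BOM) {
                line = line.substring(1);  // drop the BOM, keep the word
            }
            first = false;
            lines.add(line);
        }
        return lines;
    }
}
```

Without the check, the first stopword in the file would silently become "\uFEFFa" and never match anything at analysis time.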
[jira] Commented: (SOLR-1860) improve stopwords list handling
[ https://issues.apache.org/jira/browse/SOLR-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12853684#action_12853684 ]

Robert Muir commented on SOLR-1860:
-----------------------------------

bq. Either we can setup a simple export and conversion to the format Solr currently supports now, and if/when someone updates StopFilterFactory to support the new format, then we can stop converting when we export

Well, this isn't that big of a deal either way. In Lucene we have a helper class called WordListLoader that supports loading this format from an InputStream. One idea to consider: we could try merging some of what SolrResourceLoader does with this WordListLoader, then it's all tested and in one place. It appears there might be some duplication of effort here... e.g. how long till a Lucene user complains about UTF-8 BOM markers in their stoplists :) We can still use ant to keep the files in sync automatically from the Lucene copies.

> improve stopwords list handling
> -------------------------------
>
> Key: SOLR-1860
> URL: https://issues.apache.org/jira/browse/SOLR-1860
> Project: Solr
> Issue Type: Improvement
> Components: Schema and Analysis
> Affects Versions: 3.1
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
>
> Currently Solr makes it easy to use english stopwords for StopFilter or CommonGramsFilter. Recently in lucene, we added stopwords lists (mostly, but not all, from snowball) to all the language analyzers. So it would be nice if a user can easily specify that they want to use a french stopword list, and use it for StopFilter or CommonGrams. The ones from snowball are, however, formatted in a different manner than the others (although in Lucene we have parsers to deal with this). Additionally, we abstract this from Lucene users by adding a static getDefaultStopSet to all analyzers.
> There are two approaches; the first one I think I prefer the most, but I'm not sure it matters as long as we have good examples (maybe a foreign language example schema?)
> 1. The user would specify something like: <filter class="solr.StopFilterFactory" fromAnalyzer="org.apache.lucene.analysis.FrenchAnalyzer" .../> This would just grab the CharArraySet from the FrenchAnalyzer's getDefaultStopSet method; who cares where it comes from or how it's loaded.
> 2. We add support for snowball-formatted stopwords lists, and the user could specify something like: <filter class="solr.StopFilterFactory" words="org/apache/lucene/analysis/snowball/french_stop.txt" format="snowball" ... /> The disadvantage to this is they have to know where the list is, what format it's in, etc. For example: snowball doesn't provide Romanian or Turkish stopword lists to go along with their stemmers, so we had to add our own.
> Let me know what you guys think, and I will create a patch.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
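For option 2, the snowball stoplist format (several whitespace-separated words per line, with "|" starting a comment) that the thread says Lucene already has parsers for can be sketched like this. The parser below is a simplified illustration, not Lucene's actual loader, and the class name is invented:

```java
import java.util.ArrayList;
import java.util.List;

public class SnowballStopwords {
    // Simplified take on the snowball stoplist format: "|" starts a
    // comment, and one line may carry several whitespace-separated words.
    static List<String> parse(List<String> fileLines) {
        List<String> words = new ArrayList<>();
        for (String line : fileLines) {
            int bar = line.indexOf('|');
            if (bar >= 0) {
                line = line.substring(0, bar);   // strip the trailing comment
            }
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        }
        return words;
    }
}
```

This is why a plain "one word per line" loader can't read the snowball files directly, and why a format="snowball" switch (or a conversion step at export time) is needed.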
[jira] Commented: (SOLR-1852) enablePositionIncrements=true can cause searches to fail when they are parsed as phrase queries
[ https://issues.apache.org/jira/browse/SOLR-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12852811#action_12852811 ]

Robert Muir commented on SOLR-1852:
-----------------------------------

Committed the test to trunk: revision 930262.

> enablePositionIncrements=true can cause searches to fail when they are parsed as phrase queries
> ------------------------------------------------------------------------------------------------
>
> Key: SOLR-1852
> URL: https://issues.apache.org/jira/browse/SOLR-1852
> Project: Solr
> Issue Type: Bug
> Affects Versions: 1.4
> Reporter: Peter Wolanin
> Assignee: Robert Muir
> Attachments: SOLR-1852.patch, SOLR-1852_testcase.patch
>
> Symptom: searching for a string like a domain name containing a '.', the Solr 1.4 analyzer tells me that I will get a match, but when I enter the search either in the client or directly in Solr, the search fails.
> test string: Identi.ca
> queries that fail: IdentiCa, Identi.ca, Identi-ca
> query that matches: Identi ca
> schema in use is: http://drupalcode.org/viewvc/drupal/contributions/modules/apachesolr/schema.xml?revision=1.1.2.1.2.34content-type=text%2Fplainview=copathrev=DRUPAL-6--1
> Screen shots:
> analysis: http://img.skitch.com/20100327-nt1uc1ctykgny28n8bgu99h923.png
> dismax search: http://img.skitch.com/20100327-byiduuiry78caka7q5smsw7fp.png
> dismax search: http://img.skitch.com/20100327-gckm8uhjx3t7px31ygfqc2ugdq.png
> standard search: http://img.skitch.com/20100327-usqyqju1d12ymcpb2cfbtdwyh.png
> Whether or not the bug appears is determined by the surrounding text: the block "would be great to have support for Identi.ca on the follow" fails to match Identi.ca, but putting the content on its own or in another sentence, "Support Identi.ca", the search matches.
> Testing suggests the word "for" is the problem, and it looks like the bug occurs when a stop word precedes a word that is split up using the word delimiter filter. Setting enablePositionIncrements=false in the stop filter and reindexing causes the searches to match.
> According to Mark Miller in #solr, this bug appears to be fixed already in Solr trunk, either due to the upgraded lucene or changes to the WordDelimiterFactory.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1860) improve stopwords list handling
[ https://issues.apache.org/jira/browse/SOLR-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12852978#action_12852978 ]

Robert Muir commented on SOLR-1860:
-----------------------------------

A third idea from Hoss Man: We should make it easy to edit these lists like english. So an idea is to create an intl/ folder or similar under the example, with stopwords_fr.txt, stopwords_de.txt, etc. Additionally we could have a schema-intl.xml with example types 'text_fr', 'text_de', etc. setup for various languages. I like this idea best.

> improve stopwords list handling
> -------------------------------
>
> Key: SOLR-1860
> URL: https://issues.apache.org/jira/browse/SOLR-1860
> Project: Solr
> Issue Type: Improvement
> Components: Schema and Analysis
> Affects Versions: 3.1
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
>
> Currently Solr makes it easy to use english stopwords for StopFilter or CommonGramsFilter. Recently in lucene, we added stopwords lists (mostly, but not all, from snowball) to all the language analyzers. So it would be nice if a user can easily specify that they want to use a french stopword list, and use it for StopFilter or CommonGrams. The ones from snowball are, however, formatted in a different manner than the others (although in Lucene we have parsers to deal with this). Additionally, we abstract this from Lucene users by adding a static getDefaultStopSet to all analyzers.
> There are two approaches; the first one I think I prefer the most, but I'm not sure it matters as long as we have good examples (maybe a foreign language example schema?)
> 1. The user would specify something like: <filter class="solr.StopFilterFactory" fromAnalyzer="org.apache.lucene.analysis.FrenchAnalyzer" .../> This would just grab the CharArraySet from the FrenchAnalyzer's getDefaultStopSet method; who cares where it comes from or how it's loaded.
> 2. We add support for snowball-formatted stopwords lists, and the user could specify something like: <filter class="solr.StopFilterFactory" words="org/apache/lucene/analysis/snowball/french_stop.txt" format="snowball" ... /> The disadvantage to this is they have to know where the list is, what format it's in, etc. For example: snowball doesn't provide Romanian or Turkish stopword lists to go along with their stemmers, so we had to add our own.
> Let me know what you guys think, and I will create a patch.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1859) speed up indexing for example schema
[ https://issues.apache.org/jira/browse/SOLR-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12852375#action_12852375 ]

Robert Muir commented on SOLR-1859:
-----------------------------------

Any objections? If not I would like to commit later today. Thanks!

> speed up indexing for example schema
> -------------------------------------
>
> Key: SOLR-1859
> URL: https://issues.apache.org/jira/browse/SOLR-1859
> Project: Solr
> Issue Type: Task
> Components: Schema and Analysis
> Reporter: Robert Muir
> Assignee: Robert Muir
> Fix For: 3.1
> Attachments: SOLR-1859.patch
>
> The example schema should use the lucene core PorterStemmer (coded in Java by Martin Porter) instead of the Snowball one that is auto-generated code. Although we have sped up the Snowball stemmer, its still pretty slow and the example should be fast.
> Below is the output of ant test -Dtestcase=TestIndexingPerformance -Dargs=-server -Diter=10
> These results are consistent with large document indexing times that I have seen on large english collections with Lucene, we double indexing speed.
> {noformat}
> solr1.5branch:
> iter=10 time=5841 throughput=17120
> iter=10 time=5839 throughput=17126
> iter=10 time=6017 throughput=16619
>
> trunk (unpatched):
> iter=10 time=4132 throughput=24201
> iter=10 time=4142 throughput=24142
> iter=10 time=4151 throughput=24090
>
> trunk (patched):
> iter=10 time=2998 throughput=33355
> iter=10 time=3021 throughput=33101
> iter=10 time=3006 throughput=33266
> {noformat}

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Resolved: (SOLR-1859) speed up indexing for example schema
[ https://issues.apache.org/jira/browse/SOLR-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved SOLR-1859. --- Resolution: Fixed Committed revision 930050. speed up indexing for example schema Key: SOLR-1859 URL: https://issues.apache.org/jira/browse/SOLR-1859 Project: Solr Issue Type: Task Components: Schema and Analysis Reporter: Robert Muir Assignee: Robert Muir Fix For: 3.1 Attachments: SOLR-1859.patch The example schema should use the lucene core PorterStemmer (coded in Java by Martin Porter) instead of the Snowball one that is auto-generated code. Although we have sped up the Snowball stemmer, its still pretty slow and the example should be fast. Below is the output of ant test -Dtestcase=TestIndexingPerformance -Dargs=-server -Diter=10 These results are consistent with large document indexing times that I have seen on large english collections with Lucene, we double indexing speed. {noformat} solr1.5branch: iter=10 time=5841 throughput=17120 iter=10 time=5839 throughput=17126 iter=10 time=6017 throughput=16619 trunk (unpatched): iter=10 time=4132 throughput=24201 iter=10 time=4142 throughput=24142 iter=10 time=4151 throughput=24090 trunk (patched) iter=10 time=2998 throughput=33355 iter=10 time=3021 throughput=33101 iter=10 time=3006 throughput=33266 {noformat} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (SOLR-1860) improve stopwords list handling
improve stopwords list handling --- Key: SOLR-1860 URL: https://issues.apache.org/jira/browse/SOLR-1860 Project: Solr Issue Type: Improvement Components: Schema and Analysis Affects Versions: 3.1 Reporter: Robert Muir Assignee: Robert Muir Priority: Minor Currently Solr makes it easy to use english stopwords for StopFilter or CommonGramsFilter. Recently in lucene, we added stopwords lists (mostly, but not all from snowball) to all the language analyzers. So it would be nice if a user can easily specify that they want to use a french stopword list, and use it for StopFilter or CommonGrams. The ones from snowball, are however formatted in a different manner than the others (although in Lucene we have parsers to deal with this). Additionally, we abstract this from Lucene users by adding a static getDefaultStopSet to all analyzers. There are two approaches, the first one I think I prefer the most, but I'm not sure it matters as long as we have good examples (maybe a foreign language example schema?) 1. The user would specify something like: filter class=solr.StopFilterFactory fromAnalyzer=org.apache.lucene.analysis.FrenchAnalyzer .../ This would just grab the CharArraySet from the FrenchAnalyzer's getDefaultStopSet method, who cares where it comes from or how its loaded. 2. We add support for snowball-formatted stopwords lists, and the user could something like: filter class=solr.StopFilterFactory words=org/apache/lucene/analysis/snowball/french_stop.txt format=snowball ... / The disadvantage to this is they have to know where the list is, what format its in, etc. For example: snowball doesn't provide Romanian or Turkish stopword lists to go along with their stemmers, so we had to add our own. Let me know what you guys think, and I will create a patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (SOLR-1740) ShingleFilterFactory improvements
[ https://issues.apache.org/jira/browse/SOLR-1740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir reassigned SOLR-1740: - Assignee: Robert Muir ShingleFilterFactory improvements - Key: SOLR-1740 URL: https://issues.apache.org/jira/browse/SOLR-1740 Project: Solr Issue Type: Improvement Components: Schema and Analysis Affects Versions: 1.5 Reporter: Steven Rowe Assignee: Robert Muir Priority: Minor Attachments: SOLR-1740.patch ShingleFilterFactory should allow specification of minimum shingle size (in addition to maximum shingle size), as well as the separator to use between tokens. These are implemented at LUCENE-2218. The attached patch allows ShingleFilterFactory to accept configuration of these items, and includes tests against the new functionality in TestShingleFilterFactory. Solr will have to upgrade to lucene-analyzers-3.1-dev.jar before the attached patch will apply. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1740) ShingleFilterFactory improvements
[ https://issues.apache.org/jira/browse/SOLR-1740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852686#action_12852686 ] Robert Muir commented on SOLR-1740: --- Now that we are on Lucene 3.1, it seems like it would be useful to add these new capabilities to the factory? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
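The semantics of the two new knobs are easy to state outside Lucene. A sketch (plain Python, not the ShingleFilter implementation, and ignoring its unigram-output option) of emitting all shingles between a minimum and maximum size with a configurable token separator:

```python
def shingles(tokens, min_size=2, max_size=2, sep="_"):
    """Emit every contiguous run of n tokens, min_size <= n <= max_size,
    joining the tokens of each shingle with `sep`."""
    out = []
    for i in range(len(tokens)):
        for n in range(min_size, max_size + 1):
            if i + n <= len(tokens):
                out.append(sep.join(tokens[i:i + n]))
    return out

print(shingles(["please", "divide", "this"], min_size=2, max_size=3, sep="_"))
# → ['please_divide', 'please_divide_this', 'divide_this']
```

Before this issue, only the maximum size and the fixed `_` separator were configurable through the factory; the patch exposes the other two parameters.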
[jira] Commented: (SOLR-1312) BufferedTokenStream should use new Lucene 2.9 TokenStream API
[ https://issues.apache.org/jira/browse/SOLR-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12852687#action_12852687 ] Robert Muir commented on SOLR-1312: --- Hello, I recommend we cancel this issue. No Solr tokenstreams extend this BufferedTokenStream API anymore, as it is bound to Token and does not support reuse. Currently this class is marked deprecated in trunk, with a backwards compatibility layer. If we think that an API like this is useful, we should make a new BufferedTokenStream-like API that uses AttributeSource instead of Token, but this API would not support reuse and would not be very performant, as it would have to use cloneAttributes() and copyTo() instead of captureState() and restoreState() BufferedTokenStream should use new Lucene 2.9 TokenStream API - Key: SOLR-1312 URL: https://issues.apache.org/jira/browse/SOLR-1312 Project: Solr Issue Type: Improvement Components: Schema and Analysis Affects Versions: 1.4 Reporter: Tom Burton-West Priority: Minor Since Solr 1.4 will be using Lucene 2.9, the Solr TokenFilters should probably be updated to use the Lucene 2.9 TokenStream API. This issue is to put BufferedTokenStream on the list of Filters that need updating. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1740) ShingleFilterFactory improvements
[ https://issues.apache.org/jira/browse/SOLR-1740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1740: -- Attachment: SOLR-1740.patch Steven's patch, synced to trunk. I plan to commit shortly, thanks for the configuration tests Steven. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1740) ShingleFilterFactory improvements
[ https://issues.apache.org/jira/browse/SOLR-1740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1740: -- Affects Version/s: (was: 1.5) 3.1 Fix Version/s: 3.1 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (SOLR-1740) ShingleFilterFactory improvements
[ https://issues.apache.org/jira/browse/SOLR-1740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved SOLR-1740. --- Resolution: Fixed Committed revision 930163. Thanks Steven! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (SOLR-1857) cleanup and sync analysis with lucene trunk
cleanup and sync analysis with lucene trunk --- Key: SOLR-1857 URL: https://issues.apache.org/jira/browse/SOLR-1857 Project: Solr Issue Type: Task Components: Schema and Analysis Affects Versions: 3.1 Reporter: Robert Muir Fix For: 3.1 Solr works on the lucene trunk, but uses a lot of deprecated APIs. Additionally two factories are missing, the Keyword and StemmerOverride filters. The code can be improved with 3.x's generics support, removing casts, etc. Finally there is some code duplication with lucene, and some cleanup (such as deprecating factories for stuff thats deprecated in trunk). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1857) cleanup and sync analysis with lucene trunk
[ https://issues.apache.org/jira/browse/SOLR-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1857: -- Attachment: SOLR-1857.patch attached is a regrettably large patch to sync us up, and clean things up a bit. this removes all use of deprecated lucene APIs, except via things that are now deprecated in Solr itself. All tests pass. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1857) cleanup and sync analysis with lucene trunk
[ https://issues.apache.org/jira/browse/SOLR-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852079#action_12852079 ] Robert Muir commented on SOLR-1857: --- if no one objects, I would like to commit in a day or two. If anyone wants to review, thats great... i know its large... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (SOLR-1857) cleanup and sync analysis with lucene trunk
[ https://issues.apache.org/jira/browse/SOLR-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir reassigned SOLR-1857: - Assignee: Robert Muir -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1857) cleanup and sync analysis with lucene trunk
[ https://issues.apache.org/jira/browse/SOLR-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852213#action_12852213 ] Robert Muir commented on SOLR-1857: --- bq. I just did a 5 min review, not line-by-line, but seems fine in general. Thanks for the review Yonik, I'll move forward then and commit soon... I'll open an issue next for the default schema speedups... looking forward to this :) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (SOLR-1852) enablePositionIncrements=true can cause searches to fail when they are parsed as phrase queries
[ https://issues.apache.org/jira/browse/SOLR-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir reassigned SOLR-1852: - Assignee: Robert Muir enablePositionIncrements=true can cause searches to fail when they are parsed as phrase queries - Key: SOLR-1852 URL: https://issues.apache.org/jira/browse/SOLR-1852 Project: Solr Issue Type: Bug Affects Versions: 1.4 Reporter: Peter Wolanin Assignee: Robert Muir Attachments: SOLR-1852.patch, SOLR-1852_testcase.patch Symptom: searching for a string like a domain name containing a '.', the Solr 1.4 analyzer tells me that I will get a match, but when I enter the search either in the client or directly in Solr, the search fails. test string: Identi.ca queries that fail: IdentiCa, Identi.ca, Identi-ca query that matches: Identi ca schema in use is: http://drupalcode.org/viewvc/drupal/contributions/modules/apachesolr/schema.xml?revision=1.1.2.1.2.34&content-type=text%2Fplain&view=co&pathrev=DRUPAL-6--1 Screen shots: analysis: http://img.skitch.com/20100327-nt1uc1ctykgny28n8bgu99h923.png dismax search: http://img.skitch.com/20100327-byiduuiry78caka7q5smsw7fp.png dismax search: http://img.skitch.com/20100327-gckm8uhjx3t7px31ygfqc2ugdq.png standard search: http://img.skitch.com/20100327-usqyqju1d12ymcpb2cfbtdwyh.png Whether or not the bug appears is determined by the surrounding text: "would be great to have support for Identi.ca" on the follow block fails to match Identi.ca, but putting the content on its own or in another sentence, "Support Identi.ca", the search matches. Testing suggests the word "for" is the problem, and it looks like the bug occurs when a stop word precedes a word that is split up using the word delimiter filter. Setting enablePositionIncrements=false in the stop filter and reindexing causes the searches to match. 
According to Mark Miller in #solr, this bug appears to be fixed already in Solr trunk, either due to the upgraded lucene or changes to the WordDelimiterFactory -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1852) enablePositionIncrements=true can cause searches to fail when they are parsed as phrase queries
[ https://issues.apache.org/jira/browse/SOLR-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852216#action_12852216 ] Robert Muir commented on SOLR-1852: --- I'm afraid of WDF, but I don't think I am the only one, and I think it would be good to fix this bug. If no one objects, I'd like to commit these patches (testcase and backport the trunk filter) to the 1.5 branch in a few days. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1852) enablePositionIncrements=true can cause searches to fail when they are parsed as phrase queries
[ https://issues.apache.org/jira/browse/SOLR-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852234#action_12852234 ] Robert Muir commented on SOLR-1852: --- Peter it is... but admittedly it has not been in trunk for very long, and WDF is pretty complex. It's a bit scary to backport a rewrite of it for this reason, but at the same time, we've got this bug and the other config bugs found in SOLR-1706, so I think its the right thing to do... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (SOLR-1859) speed up indexing for example schema
speed up indexing for example schema Key: SOLR-1859 URL: https://issues.apache.org/jira/browse/SOLR-1859 Project: Solr Issue Type: Task Components: Schema and Analysis Reporter: Robert Muir Assignee: Robert Muir Fix For: 3.1 The example schema should use the lucene core PorterStemmer (coded in Java by Martin Porter) instead of the Snowball one that is auto-generated code. Although we have sped up the Snowball stemmer, its still pretty slow and the example should be fast. Below is the output of ant test -Dtestcase=TestIndexingPerformance -Dargs=-server -Diter=10 These results are consistent with large document indexing times that I have seen on large english collections with Lucene, we double indexing speed. {noformat} solr1.5branch: iter=10 time=5841 throughput=17120 iter=10 time=5839 throughput=17126 iter=10 time=6017 throughput=16619 trunk (unpatched): iter=10 time=4132 throughput=24201 iter=10 time=4142 throughput=24142 iter=10 time=4151 throughput=24090 trunk (patched) iter=10 time=2998 throughput=33355 iter=10 time=3021 throughput=33101 iter=10 time=3006 throughput=33266 {noformat} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1859) speed up indexing for example schema
[ https://issues.apache.org/jira/browse/SOLR-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1859: -- Attachment: SOLR-1859.patch attached is a patch. I fixed every instance for general types like text in every schema file i could find, including test ones, and commented-out instances, too. All tests pass. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
protwords.txt support in stemmers
Hello Solr devs, One thing we did recently in lucene that I would like to expose in Solr, is add support for protected words to all stemmers. So the way this works is that a TokenStream attribute 'KeywordAttribute' is set, and all the stemfilters know to ignore tokens with this boolean value set. We also added two neat tokenfilters that make this easy to use:
* KeywordMarkerFilter: a tokenfilter, that given a set of input words, marks them as keywords with this attribute so any later stemmer ignores them.
* StemmerOverrideFilter: a tokenfilter, that given a map of input words->stems, stems them with the dictionary, and marks them as keywords so any later stemmer ignores them.
We have two choices:
* we could treat this stuff as impl details, and add protwords.txt support to all stemming factories. we could just wrap the filter with a keywordmarkerfilter internally.
* we could deprecate the explicit protwords.txt in the few factories that support it, and instead create a factory for KeywordMarkerFilter.
* we could do something else, e.g. both.
So, to illustrate, by adding a factory for the KeywordMarkerFilter, a user could do: <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/> <filter class="solr.SomeStemmer"/> and get the same effect, instead of having to add support for protwords.txt to every single stem factory. I don't really have a personal preference as to how we do it, but it would be cool to have a plan so we can add these factories and clean a few things up. In any event I think we should add a factory for the StemmerOverrideFilter, so someone can have a text file with exceptions, the dutch handling for fiets comes to mind. Thanks -- Robert Muir rcm...@gmail.com
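The mechanism described above is simple to model. A toy sketch (Python; the boolean flag stands in for Lucene's KeywordAttribute, and the one-rule stemmer is invented purely for illustration) showing why a marker stage plus a flag-respecting stemmer gives protwords.txt behavior without touching every stem factory:

```python
def keyword_marker(tokens, protected):
    """Mark tokens found in `protected` as keywords.
    Stand-in for Lucene's KeywordMarkerFilter setting KeywordAttribute."""
    return [(tok, kw or tok in protected) for tok, kw in tokens]

def toy_stemmer(tokens):
    """A deliberately naive 's'-stripping stemmer that, like the real
    stem filters, leaves keyword-flagged tokens untouched."""
    out = []
    for tok, kw in tokens:
        if not kw and tok.endswith("s"):
            tok = tok[:-1]
        out.append((tok, kw))
    return out

stream = [(t, False) for t in ["fiets", "cats"]]
stream = keyword_marker(stream, protected={"fiets"})
print(toy_stemmer(stream))
# → [('fiets', True), ('cat', False)]
```

Here "fiets" (the Dutch example from the email) survives stemming because the marker ran first; any stemmer placed later in the chain gets the same protection for free.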
Re: protwords.txt support in stemmers
On Tue, Mar 30, 2010 at 8:33 AM, Yonik Seeley yo...@lucidimagination.com wrote: It would also be nice to make the token categories generated by tokenizers into tags (like StandardTokenizer's ACRONYM, etc). A tokenizer that detected many of the properties could significantly speed up analysis because tokens would not have to be re-analyzed to see if they contain mixed case, numbers, hyphens, etc (i.e. the fast path for WDF would be checking a bit per token). I like this idea, but it does seem a little bit dangerous. e.g. the tokenizer could set one of these values, but if some tokenfilter down the stream doesn't properly use it, you could introduce bugs (by assuming a word has no numbers when in fact it now does, due to say, a PatternReplaceFilter). So I think we would simply end up adding a lot of these redundant checks back, e.g. you would have to re-analyze the term after any regex replacement from PatternReplaceFilter to properly set these flags... and it might introduce a lot of subtle bugs. -- Robert Muir rcm...@gmail.com
Re: protwords.txt support in stemmers
On Tue, Mar 30, 2010 at 8:33 AM, Yonik Seeley yo...@lucidimagination.com wrote: On Tue, Mar 30, 2010 at 8:06 AM, Robert Muir rcm...@gmail.com wrote: We have two choices:
* we could treat this stuff as impl details, and add protwords.txt support to all stemming factories. we could just wrap the filter with a keywordmarkerfilter internally.
* we could deprecate the explicit protwords.txt in the few factories that support it, and instead create a factory for KeywordMarkerFilter.
* we could do something else, e.g. both.
So, to illustrate, by adding a factory for the KeywordMarkerFilter, a user could do: <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/> <filter class="solr.SomeStemmer"/> and get the same effect, instead of having to add support for protwords.txt to every single stem factory. Yep, this decomposition seems more powerful. Sort of related: for a long time I've had the idea of allowing the expression of more complex filter chains that can conditionally execute some parts based on tags set by other parts. This is straightforward to just hand-code in Java of course, but trickier to do well in a declarative setting: <filter class="solr.Tagger" tag="protect" words="protwords.txt"/> <filter class="solr.SomeStemmer" skipTags="protect"/> The idea was to also make this fast by allocating a bit per tag (assuming we somehow knew all of the possible ones in a particular filter chain) and using a bitfield (long) to set and test. I was planning on using Token.flags before the new analysis attribute stuff came into being. It would also be nice to make the token categories generated by tokenizers into tags (like StandardTokenizer's ACRONYM, etc). A tokenizer that detected many of the properties could significantly speed up analysis because tokens would not have to be re-analyzed to see if they contain mixed case, numbers, hyphens, etc (i.e. the fast path for WDF would be checking a bit per token). 
Anyway, probably something for another day, but I wanted to throw it out there. -Yonik http://www.lucidimagination.com Sorta unrelated too, but on the same topic of performance, I'd really like to improve the indexing speed with the example schema, and thats my hidden motivation here. I think we've already significantly improved WDF and SnowballPorter performance in trunk, but if we add this support we could at least consider switching to the much much faster PorterStemmer in the Lucene core for the example schema, as it would then support protected words via this mechanism. Do you have a preferred way to benchmark type text for example? Ideally in the future the lucene benchmark package could support benchmarking Solr schema definitions... but we aren't there yet! -- Robert Muir rcm...@gmail.com
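The bit-per-tag idea Yonik describes reduces to ordinary integer bit operations. A small sketch (names invented for illustration), assuming the set of tag names in a chain is known up front so each can be assigned one bit of the flags word:

```python
# One bit per tag name, as in the proposed bitfield-in-a-long design.
TAGS = {name: 1 << i for i, name in enumerate(["protect", "acronym", "has_digit"])}

def set_tag(flags, name):
    """Return flags with the named tag's bit turned on."""
    return flags | TAGS[name]

def has_tag(flags, name):
    """Test the named tag's bit -- the 'fast path' check per token."""
    return bool(flags & TAGS[name])

flags = set_tag(0, "protect")
print(has_tag(flags, "protect"), has_tag(flags, "acronym"))
# → True False
```

The appeal is that a downstream filter's skip decision costs a single AND and compare per token, rather than re-scanning the term's characters.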
Re: protwords.txt support in stemmers
On Tue, Mar 30, 2010 at 10:32 AM, Yonik Seeley yo...@lucidimagination.com wrote: Unfortunately not... it's normally something ad hoc like uploading a big CSV file, etc. There's also the very simplistic TestIndexingPerformance. ant test -Dtestcase=TestIndexingPerformance -Dargs=-server -Diter=10; grep throughput build/test-results/*TestIndexingPerformance* Cool, as a quick stab at this, I ran this 3 times on solr 1.5, solr trunk, and solr trunk with the proposed mod. The results are consistent with what I have seen indexing large docs with just lucene, too.
solr1.5branch: iter=10 time=5841 throughput=17120 iter=10 time=5839 throughput=17126 iter=10 time=6017 throughput=16619
trunk: iter=10 time=4132 throughput=24201 iter=10 time=4142 throughput=24142 iter=10 time=4151 throughput=24090
trunk, swap Snowball Porter with Core Lucene Porter: iter=10 time=2978 throughput=33579 iter=10 time=2973 throughput=33636 iter=10 time=2925 throughput=34188
-- Robert Muir rcm...@gmail.com
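Averaging the three runs per configuration (numbers taken from the message above) confirms the rough doubling over the 1.5 branch:

```python
# throughput figures from the TestIndexingPerformance runs above
runs = {
    "solr_1_5": [17120, 17126, 16619],
    "trunk": [24201, 24142, 24090],
    "trunk_core_porter": [33579, 33636, 34188],
}
avg = {name: sum(v) / len(v) for name, v in runs.items()}
speedup = avg["trunk_core_porter"] / avg["solr_1_5"]
print({name: round(a) for name, a in avg.items()})
print(round(speedup, 2))  # ~2x over the 1.5 branch
```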
[jira] Updated: (SOLR-1852) enablePositionIncrements=true can cause searches to fail when they are parsed as phrase queries
[ https://issues.apache.org/jira/browse/SOLR-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1852: -- Attachment: SOLR-1852_testcase.patch attached is a testcase demonstrating the bug. The problem is that if you have, for example, "the lucene.solr", where "the" is a stopword, the Solr 1.4 WordDelimiter bumps the position increment of *both* the lucene and solr tokens:
* lucene (posInc=2)
* solr (posInc=2)
* lucenesolr (posInc=0)
Instead it should look like:
* lucene (posInc=2)
* solr (posInc=1)
* lucenesolr (posInc=0)
In my opinion the behavior of trunk is correct, and this is a bug. But I don't know how to fix just Solr 1.4's WDF in a better way than dropping in the entire rewritten WDF... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
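The consequence for phrase queries is clearest after converting the increments back to absolute positions. A sketch (plain Python, standing in for the token streams described in the comment above):

```python
def positions(stream):
    """Convert (term, positionIncrement) pairs to (term, absolute position)."""
    pos, out = 0, []
    for term, inc in stream:
        pos += inc
        out.append((term, pos))
    return out

# Indexed text "the lucene.solr", with "the" removed by the stop filter:
buggy = positions([("lucene", 2), ("solr", 2), ("lucenesolr", 0)])
correct = positions([("lucene", 2), ("solr", 1), ("lucenesolr", 0)])
print(buggy)    # solr lands at position 4: a gap after lucene@2, so the
                # phrase query "lucene solr" (expecting adjacency) misses
print(correct)  # solr at position 3, adjacent to lucene@2: phrase matches
```

This is why the symptom only appears when a stopword precedes the delimited word: without the stopword there is no extra increment to mis-apply.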
[jira] Resolved: (SOLR-1710) convert worddelimiterfilter to new tokenstream API
[ https://issues.apache.org/jira/browse/SOLR-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved SOLR-1710. --- Resolution: Fixed Fix Version/s: 3.1 Assignee: Mark Miller This was resolved in revision 922957. convert worddelimiterfilter to new tokenstream API -- Key: SOLR-1710 URL: https://issues.apache.org/jira/browse/SOLR-1710 Project: Solr Issue Type: Improvement Components: Schema and Analysis Reporter: Robert Muir Assignee: Mark Miller Fix For: 3.1 Attachments: SOLR-1710-readable.patch, SOLR-1710-readable.patch, SOLR-1710.patch, SOLR-1710.patch This one was a doozy, attached is a patch to convert it to the new tokenstream API. Some of the logic was split into WordDelimiterIterator (exposes a BreakIterator-like api for iterating subwords) the filter is much more efficient now, no cloning. before applying the patch, copy the existing WordDelimiterFilter to OriginalWordDelimiterFilter the patch includes a testcase (TestWordDelimiterBWComp) which generates random strings from various subword combinations. For each random string, it compares output against the existing WordDelimiterFilter for all 512 combinations of boolean parameters. NOTE: due to bugs found (SOLR-1706), this currently only tests 256 of these combinations. The bugs discovered in SOLR-1706 are fixed here. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (SOLR-1657) convert the rest of solr to use the new tokenstream API
[ https://issues.apache.org/jira/browse/SOLR-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved SOLR-1657. --- Resolution: Fixed Fix Version/s: 3.1 Assignee: Mark Miller This was resolved in revision 922957. convert the rest of solr to use the new tokenstream API --- Key: SOLR-1657 URL: https://issues.apache.org/jira/browse/SOLR-1657 Project: Solr Issue Type: Task Reporter: Robert Muir Assignee: Mark Miller Fix For: 3.1 Attachments: SOLR-1657.patch, SOLR-1657.patch, SOLR-1657.patch, SOLR-1657.patch, SOLR-1657_part2.patch, SOLR-1657_synonyms_ugly_slightly_less_slow.patch, SOLR-1657_synonyms_ugly_slow.patch org.apache.solr.analysis: -BufferedTokenStream- - -CommonGramsFilter- - -CommonGramsQueryFilter- - -RemoveDuplicatesTokenFilter- -CapitalizationFilterFactory- -HyphenatedWordsFilter- -LengthFilter (deprecated, remove)- SynonymFilter SynonymFilterFactory -WordDelimiterFilter- -org.apache.solr.handler:- -AnalysisRequestHandler- -AnalysisRequestHandlerBase- -org.apache.solr.handler.component:- -QueryElevationComponent- -SpellCheckComponent- -org.apache.solr.highlight:- -DefaultSolrHighlighter- -org.apache.solr.spelling:- -SpellingQueryConverter- -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (SOLR-1706) wrong tokens output from WordDelimiterFilter depending upon options
[ https://issues.apache.org/jira/browse/SOLR-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved SOLR-1706. --- Resolution: Fixed Fix Version/s: 3.1 Assignee: Mark Miller This was resolved in revision 922957. wrong tokens output from WordDelimiterFilter depending upon options --- Key: SOLR-1706 URL: https://issues.apache.org/jira/browse/SOLR-1706 Project: Solr Issue Type: Bug Components: Schema and Analysis Affects Versions: 1.4 Reporter: Robert Muir Assignee: Mark Miller Fix For: 3.1 Below you can see that when I have requested to only output numeric concatenations (not words), some words are still sometimes output, ignoring the options I have provided, and even then, in a very inconsistent way.
{code}
assertWdf("Super-Duper-XL500-42-AutoCoder's", 0,0,0,1,0,0,0,0,1, null,
  new String[] { "42", "AutoCoder" },
  new int[] { 18, 21 },
  new int[] { 20, 30 },
  new int[] { 1, 1 });
assertWdf("Super-Duper-XL500-42-AutoCoder's-56", 0,0,0,1,0,0,0,0,1, null,
  new String[] { "42", "AutoCoder", "56" },
  new int[] { 18, 21, 33 },
  new int[] { 20, 30, 35 },
  new int[] { 1, 1, 1 });
assertWdf("Super-Duper-XL500-AB-AutoCoder's", 0,0,0,1,0,0,0,0,1, null,
  new String[] { },
  new int[] { },
  new int[] { },
  new int[] { });
assertWdf("Super-Duper-XL500-42-AutoCoder's-BC", 0,0,0,1,0,0,0,0,1, null,
  new String[] { "42" },
  new int[] { 18 },
  new int[] { 20 },
  new int[] { 1 });
{code}
where assertWdf is
{code}
void assertWdf(String text, int generateWordParts, int generateNumberParts,
    int catenateWords, int catenateNumbers, int catenateAll,
    int splitOnCaseChange, int preserveOriginal, int splitOnNumerics,
    int stemEnglishPossessive, CharArraySet protWords, String expected[],
    int startOffsets[], int endOffsets[], String types[], int posIncs[])
    throws IOException {
  TokenStream ts = new WhitespaceTokenizer(new StringReader(text));
  WordDelimiterFilter wdf = new WordDelimiterFilter(ts, generateWordParts,
      generateNumberParts, catenateWords, catenateNumbers, catenateAll,
      splitOnCaseChange, preserveOriginal, splitOnNumerics,
      stemEnglishPossessive, protWords);
  assertTokenStreamContents(wdf, expected, startOffsets, endOffsets, types, posIncs);
}
{code}
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (SOLR-1820) Remove custom greek/russian charsets encoding
[ https://issues.apache.org/jira/browse/SOLR-1820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved SOLR-1820. --- Resolution: Fixed Fix Version/s: 3.1 Assignee: Robert Muir This was resolved in revision 922964. Remove custom greek/russian charsets encoding - Key: SOLR-1820 URL: https://issues.apache.org/jira/browse/SOLR-1820 Project: Solr Issue Type: Task Components: Schema and Analysis Reporter: Robert Muir Assignee: Robert Muir Priority: Minor Fix For: 3.1 Attachments: SOLR-1820.patch In Solr 1.4, we deprecated support for 'custom encodings embedded inside unicode'. This is where the analyzer in Lucene itself did encoding conversions; it's better to just let analyzers be analyzers, and leave encoding conversion to Java. In order to move to Lucene 3.x, we need to remove this deprecated support, and instead issue an error in the factories if you try to do this (instead of a warning). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1852) enablePositionIncrements=true can cause searches to fail when they are parsed as phrase queries
[ https://issues.apache.org/jira/browse/SOLR-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850612#action_12850612 ] Robert Muir commented on SOLR-1852: --- bq. The changes in the patch originate at SOLR-1706 and SOLR-1657, however I don't think it's actually the same bug as SOLR-1706 intended to fix, since in the admin analyzer interface the generated tokens look correct. Yeah, I don't like the situation at all, as it's not obvious to me at a glance how the trunk impl fixes your problem, but at the same time how this changed behavior slipped past the random tests on SOLR-1710. enablePositionIncrements=true can cause searches to fail when they are parsed as phrase queries - Key: SOLR-1852 URL: https://issues.apache.org/jira/browse/SOLR-1852 Project: Solr Issue Type: Bug Affects Versions: 1.4 Reporter: Peter Wolanin Attachments: SOLR-1852.patch Symptom: searching for a string like a domain name containing a '.', the Solr 1.4 analyzer tells me that I will get a match, but when I enter the search either in the client or directly in Solr, the search fails. test string: Identi.ca queries that fail: "IdentiCa", "Identi.ca", "Identi-ca" query that matches: "Identi ca" schema in use is: http://drupalcode.org/viewvc/drupal/contributions/modules/apachesolr/schema.xml?revision=1.1.2.1.2.34&content-type=text%2Fplain&view=co&pathrev=DRUPAL-6--1 Screen shots: analysis: http://img.skitch.com/20100327-nt1uc1ctykgny28n8bgu99h923.png dismax search: http://img.skitch.com/20100327-byiduuiry78caka7q5smsw7fp.png dismax search: http://img.skitch.com/20100327-gckm8uhjx3t7px31ygfqc2ugdq.png standard search: http://img.skitch.com/20100327-usqyqju1d12ymcpb2cfbtdwyh.png Whether or not the bug appears is determined by the surrounding text: the block "would be great to have support for Identi.ca" fails to match Identi.ca, but putting the content on its own or in another sentence, "Support Identi.ca", the search matches. 
Testing suggests the word "for" is the problem, and it looks like the bug occurs when a stop word precedes a word that is split up using the word delimiter filter. Setting enablePositionIncrements=false in the stop filter and reindexing causes the searches to match. According to Mark Miller in #solr, this bug appears to be fixed already in Solr trunk, either due to the upgraded lucene or changes to the WordDelimiterFactory -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1852) enablePositionIncrements=true can cause searches to fail when they are parsed as phrase queries
[ https://issues.apache.org/jira/browse/SOLR-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850613#action_12850613 ] Robert Muir commented on SOLR-1852: --- OK, so your bug relates somehow to how the accumulated position increment gap is handled. This is how your stopword fits into the situation: somehow the new code is handling it better for your case, but perhaps it's wrong. There are quite a few tests in TestWordDelimiter, which it passes, but I'll spend some time tonight verifying its correctness before we declare success... enablePositionIncrements=true can cause searches to fail when they are parsed as phrase queries - Key: SOLR-1852 URL: https://issues.apache.org/jira/browse/SOLR-1852 Project: Solr Issue Type: Bug Affects Versions: 1.4 Reporter: Peter Wolanin Attachments: SOLR-1852.patch Symptom: searching for a string like a domain name containing a '.', the Solr 1.4 analyzer tells me that I will get a match, but when I enter the search either in the client or directly in Solr, the search fails. test string: Identi.ca queries that fail: "IdentiCa", "Identi.ca", "Identi-ca" query that matches: "Identi ca" schema in use is: http://drupalcode.org/viewvc/drupal/contributions/modules/apachesolr/schema.xml?revision=1.1.2.1.2.34&content-type=text%2Fplain&view=co&pathrev=DRUPAL-6--1 Screen shots: analysis: http://img.skitch.com/20100327-nt1uc1ctykgny28n8bgu99h923.png dismax search: http://img.skitch.com/20100327-byiduuiry78caka7q5smsw7fp.png dismax search: http://img.skitch.com/20100327-gckm8uhjx3t7px31ygfqc2ugdq.png standard search: http://img.skitch.com/20100327-usqyqju1d12ymcpb2cfbtdwyh.png Whether or not the bug appears is determined by the surrounding text: the block "would be great to have support for Identi.ca" fails to match Identi.ca, but putting the content on its own or in another sentence, "Support Identi.ca", the search matches. 
Testing suggests the word "for" is the problem, and it looks like the bug occurs when a stop word precedes a word that is split up using the word delimiter filter. Setting enablePositionIncrements=false in the stop filter and reindexing causes the searches to match. According to Mark Miller in #solr, this bug appears to be fixed already in Solr trunk, either due to the upgraded lucene or changes to the WordDelimiterFactory -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: svn commit: r928069 - in /lucene/dev/trunk: lucene/ lucene/backwards/src/test/org/apache/lucene/util/ lucene/contrib/benchmark/src/test/org/apache/lucene/benchmark/ lucene/contrib/benchmark/src/
();
+      } catch (LockReleaseFailedException e) {
+        // well lets pretend its released anyway
+      }
+    }
     } catch (IOException e) {
       throw new RuntimeException("unable to write results", e);
     } finally {
@@ -227,3 +254,4 @@ public class SolrJUnitResultFormatter im
     sb.append(StringUtils.LINE_SEP);
   }
 }
+

Modified: lucene/dev/trunk/solr/build.xml
URL: http://svn.apache.org/viewvc/lucene/dev/trunk/solr/build.xml?rev=928069&r1=928068&r2=928069&view=diff
==============================================================================
--- lucene/dev/trunk/solr/build.xml (original)
+++ lucene/dev/trunk/solr/build.xml Fri Mar 26 21:55:57 2010
@@ -349,6 +349,7 @@
       <pathelement location="${dest}/tests"/>
       <!-- include the solrj classpath and jetty files included in example -->
       <path refid="compile.classpath.solrj" />
+      <pathelement location="${common-solr.dir}/../lucene/build/classes/test" />
       <!-- include some lucene test code -->
       <pathelement path="${java.class.path}"/>
     </path>

Modified: lucene/dev/trunk/solr/common-build.xml
URL: http://svn.apache.org/viewvc/lucene/dev/trunk/solr/common-build.xml?rev=928069&r1=928068&r2=928069&view=diff
==============================================================================
--- lucene/dev/trunk/solr/common-build.xml (original)
+++ lucene/dev/trunk/solr/common-build.xml Fri Mar 26 21:55:57 2010
@@ -103,7 +103,7 @@
   <property name="junit.output.dir" location="${common-solr.dir}/${dest}/test-results"/>
   <property name="junit.reports" location="${common-solr.dir}/${dest}/test-results/reports"/>
   <property name="junit.formatter" value="plain"/>
-  <property name="junit.details.formatter" value="org.apache.solr.SolrJUnitResultFormatter"/>
+  <property name="junit.details.formatter" value="org.apache.lucene.util.LuceneJUnitResultFormatter"/>

   <!-- Maven properties -->
   <property name="maven.build.dir" value="${basedir}/build/maven"/>
-- Robert Muir rcm...@gmail.com
build.xml and lucene test code
I noticed that for whatever reason, Solr's build.xml doesn't detect if Lucene's test code is out of date. (I am fooling around with LUCENE-1709, where we will try to do the same parallel test execution for Lucene as in Solr, and was moving the special formatter to Lucene when I noticed this.) Don't have any ideas how to fix it, but just wanted to mention it so it's not forgotten. Worst case, if/when we resolve LUCENE-1709, you will have to run ant clean first... but I am sure there is some better ant trickery to detect this situation, maybe just another task dependency. -- Robert Muir rcm...@gmail.com
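One possible bit of Ant trickery along those lines, as a sketch only (the target names and paths here are made up for illustration, not taken from Solr's actual build files): an <uptodate> check that forces a rebuild of Lucene's test classes when their sources are newer than the compiled output.

```xml
<!-- Hypothetical sketch: make Solr's test compile depend on Lucene's test
     classes being current, instead of silently running against stale ones.
     Directory names are illustrative. -->
<target name="check-lucene-tests">
  <uptodate property="lucene.tests.current">
    <srcfiles dir="../lucene/src/test" includes="**/*.java"/>
    <globmapper from="*.java" to="../../build/classes/test/*.class"/>
  </uptodate>
</target>

<target name="compile-lucene-tests" depends="check-lucene-tests"
        unless="lucene.tests.current">
  <!-- delegate to lucene's own build to recompile its tests -->
  <ant dir="../lucene" target="compile-test" inheritall="false"/>
</target>
```

Solr's test-compile target could then simply depend on compile-lucene-tests, so 'ant clean' would no longer be needed after changing Lucene test code.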
[jira] Commented: (SOLR-1835) speed up and improve tests
[ https://issues.apache.org/jira/browse/SOLR-1835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12848586#action_12848586 ] Robert Muir commented on SOLR-1835: --- committed revision 926470 to newtrunk. if you have problems, please just revert and I will help debug them. for future speedups, we should try to move ant logic to common-build.xml and re-use it for contribs. this way, DIH tests etc will run in parallel, too. speed up and improve tests -- Key: SOLR-1835 URL: https://issues.apache.org/jira/browse/SOLR-1835 Project: Solr Issue Type: Improvement Reporter: Yonik Seeley Fix For: 3.1 Attachments: SOLR-1835.patch, SOLR-1835_parallel.patch, SOLR-1835_parallel.patch, SOLR-1835_parallel.patch, SOLR-1835_parallel.patch General test improvements. We should use @BeforeClass where possible to avoid per test method overhead, and reuse lucene test utils where possible. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1835) speed up and improve tests
[ https://issues.apache.org/jira/browse/SOLR-1835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1835: -- Attachment: SOLR-1835_parallel.patch attached is a patch to parallelize the tests... improvements can be done, and contrib too (e.g. DIH) but this drops my test time to 4:42 on the first try. speed up and improve tests -- Key: SOLR-1835 URL: https://issues.apache.org/jira/browse/SOLR-1835 Project: Solr Issue Type: Improvement Reporter: Yonik Seeley Fix For: 3.1 Attachments: SOLR-1835.patch, SOLR-1835_parallel.patch General test improvements. We should use @BeforeClass where possible to avoid per test method overhead, and reuse lucene test utils where possible. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1835) speed up and improve tests
[ https://issues.apache.org/jira/browse/SOLR-1835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1835: -- Attachment: SOLR-1835_parallel.patch updated patch: * doesn't do parallel for the -Dtestcase= case, but does for all, -Dtestpackage, -Dtestpackageroot, etc. * you can make the condition for whether to do parallel or not more complex, e.g. nightlies could go sequentially. speed up and improve tests -- Key: SOLR-1835 URL: https://issues.apache.org/jira/browse/SOLR-1835 Project: Solr Issue Type: Improvement Reporter: Yonik Seeley Fix For: 3.1 Attachments: SOLR-1835.patch, SOLR-1835_parallel.patch, SOLR-1835_parallel.patch General test improvements. We should use @BeforeClass where possible to avoid per test method overhead, and reuse lucene test utils where possible. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
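Such a condition could be wired up in Ant roughly like this (a hypothetical sketch; the property and target names are invented, not from the patch): sequential for single-testcase runs and nightlies, parallel otherwise.

```xml
<!-- Hypothetical sketch: decide sequential vs parallel test runs.
     Property names here are illustrative only. -->
<condition property="tests.sequential">
  <or>
    <!-- a single -Dtestcase= run gains nothing from forking in parallel -->
    <isset property="testcase"/>
    <!-- nightlies could go sequentially, e.g. -Dtests.nightly=true -->
    <istrue value="${tests.nightly}"/>
  </or>
</condition>

<target name="test-parallel" unless="tests.sequential">
  <!-- fork several JUnit batches at once; "junit-batch" is a made-up target -->
  <parallel threadsPerProcessor="1">
    <antcall target="junit-batch"/>
  </parallel>
</target>

<target name="test-sequential" if="tests.sequential">
  <antcall target="junit-batch"/>
</target>
```

The same condition could later grow more clauses (OS, CPU count, CI vs developer machine) without touching the targets themselves.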
[jira] Updated: (SOLR-1835) speed up and improve tests
[ https://issues.apache.org/jira/browse/SOLR-1835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1835: -- Attachment: SOLR-1835_parallel.patch attached is a new patch: * the output from multiple threads is no longer interleaved * you need to put ant.jar and ant-junit.jar in example/lib for this patch to work. These need to be Ant 1.7.1 (Lucene needs this version anyway, I think) speed up and improve tests -- Key: SOLR-1835 URL: https://issues.apache.org/jira/browse/SOLR-1835 Project: Solr Issue Type: Improvement Reporter: Yonik Seeley Fix For: 3.1 Attachments: SOLR-1835.patch, SOLR-1835_parallel.patch, SOLR-1835_parallel.patch, SOLR-1835_parallel.patch General test improvements. We should use @BeforeClass where possible to avoid per test method overhead, and reuse lucene test utils where possible. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1835) speed up and improve tests
[ https://issues.apache.org/jira/browse/SOLR-1835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1835: -- Attachment: SOLR-1835_parallel.patch There was a stray slash in the previous version; this caused some people to mistakenly believe they have a faster computer than me. speed up and improve tests -- Key: SOLR-1835 URL: https://issues.apache.org/jira/browse/SOLR-1835 Project: Solr Issue Type: Improvement Reporter: Yonik Seeley Fix For: 3.1 Attachments: SOLR-1835.patch, SOLR-1835_parallel.patch, SOLR-1835_parallel.patch, SOLR-1835_parallel.patch, SOLR-1835_parallel.patch General test improvements. We should use @BeforeClass where possible to avoid per test method overhead, and reuse lucene test utils where possible. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: rough outline of where Solr's going
On Thu, Mar 18, 2010 at 11:33 AM, Michael McCandless luc...@mikemccandless.com wrote: On version numbering... my inclination would be to let Solr and Lucene use their own version numbers (don't sync them up). I know it'd simplify our lives to have the same version across the board, but these numbers are really for our users, telling them when big changes were made, back compat broken, etc. I think that trumps dev convenience. Be sure to consider the deprecations removal: it's not possible for Solr to move to Lucene's trunk without this. Here are two examples of necessary deprecation removals in the branch so that Solr can use Lucene's trunk: https://issues.apache.org/jira/browse/SOLR-1820 http://www.lucidimagination.com/search/document/f07da8e4d69f5bfe/removal_of_deprecated_htmlstrip_tokenizer_factories It seems to be the consensus that people want a major version number change when this is done. So this is an example where the version numbers of Solr really do relate to Lucene, if we want them to share the same trunk. -- Robert Muir rcm...@gmail.com
Re: rough outline of where Solr's going
On Thu, Mar 18, 2010 at 1:12 PM, Michael McCandless luc...@mikemccandless.com wrote: Ahh, OK. Meaning Solr will have to remove deprecated support, which means Solr's next released version would be a major release? Ie 2.0? It's more complex than this. Solr depends on some lucene contrib modules, which apparently have no backwards-compatibility policy. I don't think we want to have to suddenly treat all these contrib modules like core lucene with regards to backwards compat; some of them haven't reached that level of maturity yet. On the other hand, exposing contrib's functionality via Solr is a great way to get more real users and devs giving feedback and improvements to help them mature. But we need to work on how to handle some of this: I suppose spatial is the worst case (don't really know), where Solr has a dependency on a Lucene contrib specifically labelled as experimental. -- Robert Muir rcm...@gmail.com
Re: How do I contribute bug fixes
On Thu, Mar 18, 2010 at 6:49 PM, Sanjoy Ghosh san...@yahoo.com wrote: Hello, Can I submit bug fixes? If so, what is the procedure? Thanks, Sanjoy Hello, Please take a look at this link: http://wiki.apache.org/solr/HowToContribute -- Robert Muir rcm...@gmail.com
Re: lucene and solr trunk
On Wed, Mar 17, 2010 at 9:09 AM, Michael McCandless luc...@mikemccandless.com wrote: Git, Maven, Hg, etc., all sound great for the future, but let's focus now on the baby step (how to re-org svn), today, so we can land the Solr upgrade work now being done on a branch... I agree. Another thing anyone can do to help if they have a spare few minutes, is to review the technical work done in the branch and provide feedback. The big JIRA issue is located at https://issues.apache.org/jira/browse/SOLR-1659 and other issues are linked to it. -- Robert Muir rcm...@gmail.com
Re: lucene and solr trunk
On Wed, Mar 17, 2010 at 12:40 PM, Mark Miller markrmil...@gmail.com wrote: Okay, so this looks good to me (a few others seemed to like it - though Lucene-Dev was somehow dropped earlier) - lets try this out on the branch? (then we can get rid of that horrible branch name ;) ) Anyone on the current branch object to having to do a quick svn switch? +1 -- Robert Muir rcm...@gmail.com
Re: rough outline of where Solr's going
On Wed, Mar 17, 2010 at 8:15 PM, Chris Hostetter hossman_luc...@fucit.org wrote: My key point being: Version numbers should communicate the significance in change to the *user* of the product, and the users of Solr are different than the users of Lucene-Java, so even if the releases happen in lock step, that doesn't mean the version numbers should be in lock step. As you stated modules were important to think about for svn location, it would only make sense that they are important to think about for release numbering, too. So let's say we spin off a lucene-analyzers module: it should be 3.1, too, as it's already a module to some degree, and having a lucene-analyzers-1.0.jar would be downright misleading. So from this perspective of modules, with solr being a module alongside lucene, 3.1 makes a lot of sense, and it also makes sense to try to release things together if possible so that users aren't confused. -- Robert Muir rcm...@gmail.com
Re: lucene and solr trunk
On Tue, Mar 16, 2010 at 3:43 AM, Simon Willnauer simon.willna...@googlemail.com wrote: One more thing which I wonder about even more is that this whole merging happens so quickly for reasons I don't see right now. I don't want to keep anybody from making progress but it appears like a rush to me. By the way, of the serious changes we applied to the branch, most of them had been sitting in JIRA for over 3 months not doing much: SOLR-1659. If you follow the linked issues, you can see all the stuff that got put in the branch... the branch was helpful for me, as I could help Mark with the ton of little things, like TokenStreams embedded inside JSP files :) As it's just a branch, if you want to go look at those patches (especially anything I did) and provide technical feedback, that would be great! But I think it's a mistake to say things are rushed when the work has been done for months. -- Robert Muir rcm...@gmail.com
[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0
[ https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845301#action_12845301 ] Robert Muir commented on SOLR-1804: --- I wonder if you guys have any insight why the results of this test may have changed from 16 to 15 between Lucene 3.0 and Lucene 3.1-dev: http://svn.apache.org/viewvc?view=revisionrevision=923048 It did not change between Lucene 2.9 and Lucene 3.0, so I'm concerned about why the results would change between 3.0 and 3.1-dev. One possible explanation would be if Carrot2 used Version.LUCENE_CURRENT somewhere in its code. Any ideas? Upgrade Carrot2 to 3.2.0 Key: SOLR-1804 URL: https://issues.apache.org/jira/browse/SOLR-1804 Project: Solr Issue Type: Improvement Components: contrib - Clustering Reporter: Grant Ingersoll Assignee: Grant Ingersoll http://project.carrot2.org/release-3.2.0-notes.html Carrot2 is now LGPL free, which means we should be able to bundle the binary! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0
[ https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845451#action_12845451 ] Robert Muir commented on SOLR-1804: --- Hi Stanislaw: Correct, I did not upgrade anything else, just lucene. I'm sorry it's not exactly related to this issue (although if we need to upgrade carrot2 to be compatible with Lucene 3.x, then that's OK). My concern is more that we did something in Lucene between 3.0 and now that caused the results to be different... though again this could be explained if somewhere in its code Carrot2 uses some Lucene analysis component, but doesn't hardwire Version to LUCENE_29. If all else fails I can try to seek out the svn rev # of Lucene that causes this change, by brute force binary search :) Upgrade Carrot2 to 3.2.0 Key: SOLR-1804 URL: https://issues.apache.org/jira/browse/SOLR-1804 Project: Solr Issue Type: Improvement Components: contrib - Clustering Reporter: Grant Ingersoll Assignee: Grant Ingersoll http://project.carrot2.org/release-3.2.0-notes.html Carrot2 is now LGPL free, which means we should be able to bundle the binary! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0
[ https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845455#action_12845455 ] Robert Muir commented on SOLR-1804: --- Grant, I am concerned about a possible BW break in Lucene trunk, that is all. I think it's strange that the 3.0 and 3.1 jars give different results. Can you tell me if the clusters are reasonable? Here is the output. {noformat} junit.framework.AssertionFailedError: number of clusters: [ {labels=[Data Mining Applications], docs=[5, 13, 25, 12, 27],clusters=[]}, {labels=[Databases],docs=[15, 21, 7, 17, 11],clusters=[]}, {labels=[Knowledge Discovery],docs=[6, 18, 15, 17, 10],clusters=[]}, {labels=[Statistical Data Mining],docs=[28, 24, 2, 14],clusters=[]}, {labels=[Data Mining Solutions],docs=[5, 22, 8],clusters=[]}, {labels=[Data Mining Techniques],docs=[12, 2, 14],clusters=[]}, {labels=[Known as Data Mining],docs=[23, 17, 19],clusters=[]}, {labels=[Text Mining],docs=[6, 9, 29],clusters=[]}, {labels=[Dedicated],docs=[10, 11],clusters=[]}, {labels=[Extraction of Hidden Predictive],docs=[3, 11],clusters=[]}, {labels=[Information from Large],docs=[3, 7],clusters=[]}, {labels=[Neural Networks],docs=[12, 1],clusters=[]}, {labels=[Open],docs=[15, 20],clusters=[]}, {labels=[Research],docs=[26, 8],clusters=[]}, {labels=[Other Topics],docs=[16],clusters=[]} ] expected:16 but was:15 {noformat} Upgrade Carrot2 to 3.2.0 Key: SOLR-1804 URL: https://issues.apache.org/jira/browse/SOLR-1804 Project: Solr Issue Type: Improvement Components: contrib - Clustering Reporter: Grant Ingersoll Assignee: Grant Ingersoll http://project.carrot2.org/release-3.2.0-notes.html Carrot2 is now LGPL free, which means we should be able to bundle the binary! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0
[ https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845474#action_12845474 ] Robert Muir commented on SOLR-1804: --- Thanks for the confirmation the clusters are ok. Well, this is embarrassing, it turns out it is a backwards break, though documented, and the culprit is yours truly. This is the reason it gets different results: {noformat} * LUCENE-2286: Enabled DefaultSimilarity.setDiscountOverlaps by default. This means that terms with a position increment gap of zero do not affect the norms calculation by default. (Robert Muir) {noformat} I'll change the test to expect 15 clusters with Lucene 3.1, thanks :) Upgrade Carrot2 to 3.2.0 Key: SOLR-1804 URL: https://issues.apache.org/jira/browse/SOLR-1804 Project: Solr Issue Type: Improvement Components: contrib - Clustering Reporter: Grant Ingersoll Assignee: Grant Ingersoll http://project.carrot2.org/release-3.2.0-notes.html Carrot2 is now LGPL free, which means we should be able to bundle the binary! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
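The effect of the LUCENE-2286 change can be sketched numerically (illustrative only; `NormDemo` is a made-up class, and the formula follows DefaultSimilarity-style 1/sqrt(length) length normalization): once overlap tokens, those with posInc=0, stop counting toward field length, the same indexed field gets a different norm, which is enough to shift scores and downstream results such as the clustering above.

```java
// Sketch of the documented behavior change: with discountOverlaps enabled
// (the Lucene 3.1 default), tokens with position increment 0 no longer
// count toward field length in the norm computation.
public class NormDemo {
    static float lengthNorm(int numTerms, int numOverlap, boolean discountOverlaps) {
        // field "length" used for normalization
        int len = discountOverlaps ? numTerms - numOverlap : numTerms;
        return (float) (1.0 / Math.sqrt(len));
    }

    public static void main(String[] args) {
        // a 10-token field with 2 overlap tokens (e.g. from WordDelimiterFilter)
        System.out.println(lengthNorm(10, 2, false)); // 3.0 default: 1/sqrt(10)
        System.out.println(lengthNorm(10, 2, true));  // 3.1 default: 1/sqrt(8)
    }
}
```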
removal of deprecated HtmlStrip*Tokenizer factories
Hello, Is there any concern with removing the deprecated HtmlStrip*Tokenizer factories? These can be done with CharFilter instead and they have some problems with lucene's trunk. If no one objects, I'd like to remove these in the branch. Otherwise, Uwe tells me there is some way to make them work if need be. Thanks! -- Robert Muir rcm...@gmail.com
Re: removal of deprecated HtmlStrip*Tokenizer factories
On Mon, Mar 15, 2010 at 5:30 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: Is there a way we can fix LUCENE-2098 too? I think this is good to fix, yet removing the deprecations is unrelated to this slowdown. The deprecated functionality (HtmlStrip*Tokenizer) is implemented in terms of the slower CharFilter, so it's not any faster; getting rid of it won't slow anyone down. That being said, I think we should still try to improve the performance of this stuff, I agree. -- Robert Muir rcm...@gmail.com
Re: removal of deprecated HtmlStrip*Tokenizer factories
On Mon, Mar 15, 2010 at 7:18 PM, Chris Hostetter hossman_luc...@fucit.org wrote: In the case of these factories: can't we eliminate the Html*Tokenizers themselves, but make the *factories* return the necessary *Tokenizer wrapped in an HtmlStripCharFilter? The factories would not be able to reuse the Tokenizer if you did this, because when you call reset(Reader) on it, the Reader would not be wrapped. -- Robert Muir rcm...@gmail.com
Re: removal of deprecated HtmlStrip*Tokenizer factories
On Mon, Mar 15, 2010 at 7:25 PM, Chris Hostetter hossman_luc...@fucit.org wrote: Hmmm... I'm not sure I understand how any declared CharFilter/Tokenizer combo will be able to deal with this any better, but I'll take your word for it. You can see this behavior in SolrAnalyzer's reusableTokenStream method; it reuses the Tokenizer but wraps the readers with charStream() [overridden by TokenizerChain to wrap the Reader with your CharFilter chain]:

  @Override
  public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
    // if (true) return tokenStream(fieldName, reader);
    TokenStreamInfo tsi = (TokenStreamInfo)getPreviousTokenStream();
    if (tsi != null) {
      tsi.getTokenizer().reset(charStream(reader));  // <-- right here

Kill it then, and we'll just have to start making a list in the Upgrading section of CHANGES.txt noting the recommended upgrade path for this (and many, many things to come, I imagine) cool, I'll add some additional verbiage to the CHANGES in the branch. -- Robert Muir rcm...@gmail.com
Re: lucene and solr trunk
On Mon, Mar 15, 2010 at 11:43 PM, Mark Miller markrmil...@gmail.com wrote: Solr moves to Lucene's trunk: /java/trunk, /java/trunk/sol +1. With the goal of merged dev, merged tests, this looks the best to me. Simple to do patches that span both, simple to setup Solr to use Lucene trunk rather than jars. Short paths. Simple. I like it. +1 -- Robert Muir rcm...@gmail.com
Re: lucene and solr trunk
On Tue, Mar 16, 2010 at 12:01 AM, Chris Hostetter hossman_luc...@fucit.org wrote: 4) should it be possible for people to check out Lucene-Java w/o checking out Solr? (i suspect a whole lot of people who only care about the core library are going to really adamantly not want to have to check out all of Solr just to work on the core) This wouldn't really be merged development now would it? When I run 'ant test' I want the Solr tests to run, too. If one breaks because of a change, I want to look at the source and know why. -- Robert Muir rcm...@gmail.com
Re: lucene and solr trunk
On Tue, Mar 16, 2010 at 12:39 AM, Chris Hostetter hossman_luc...@fucit.org wrote: And as a committer, you should be concerned about things like this ... that doesn't mean every user of Lucene-Java who wants to build from source or apply their own local patches is going to feel the same way. Yep, those users probably already hate our backwards tests and the contrib tests too. -- Robert Muir rcm...@gmail.com
[jira] Updated: (SOLR-1657) convert the rest of solr to use the new tokenstream API
[ https://issues.apache.org/jira/browse/SOLR-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1657: -- Attachment: SOLR-1657_synonyms_ugly_slow.patch A very very ugly, very slow, but simple and conservative conversion of SynonymFilter to the new TokenStream API. convert the rest of solr to use the new tokenstream API --- Key: SOLR-1657 URL: https://issues.apache.org/jira/browse/SOLR-1657 Project: Solr Issue Type: Task Reporter: Robert Muir Attachments: SOLR-1657.patch, SOLR-1657.patch, SOLR-1657.patch, SOLR-1657.patch, SOLR-1657_part2.patch, SOLR-1657_synonyms_ugly_slow.patch org.apache.solr.analysis: -BufferedTokenStream- - -CommonGramsFilter- - -CommonGramsQueryFilter- - -RemoveDuplicatesTokenFilter- -CapitalizationFilterFactory- -HyphenatedWordsFilter- -LengthFilter (deprecated, remove)- SynonymFilter SynonymFilterFactory -WordDelimiterFilter- -org.apache.solr.handler:- -AnalysisRequestHandler- -AnalysisRequestHandlerBase- -org.apache.solr.handler.component:- -QueryElevationComponent- -SpellCheckComponent- -org.apache.solr.highlight:- -DefaultSolrHighlighter- -org.apache.solr.spelling:- -SpellingQueryConverter- -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1657) convert the rest of solr to use the new tokenstream API
[ https://issues.apache.org/jira/browse/SOLR-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1657: -- Attachment: SOLR-1657_synonyms_ugly_slightly_less_slow.patch attached is a less slow version of the above. it preserves the fast path from the previous code. convert the rest of solr to use the new tokenstream API --- Key: SOLR-1657 URL: https://issues.apache.org/jira/browse/SOLR-1657 Project: Solr Issue Type: Task Reporter: Robert Muir Attachments: SOLR-1657.patch, SOLR-1657.patch, SOLR-1657.patch, SOLR-1657.patch, SOLR-1657_part2.patch, SOLR-1657_synonyms_ugly_slightly_less_slow.patch, SOLR-1657_synonyms_ugly_slow.patch org.apache.solr.analysis: -BufferedTokenStream- - -CommonGramsFilter- - -CommonGramsQueryFilter- - -RemoveDuplicatesTokenFilter- -CapitalizationFilterFactory- -HyphenatedWordsFilter- -LengthFilter (deprecated, remove)- SynonymFilter SynonymFilterFactory -WordDelimiterFilter- -org.apache.solr.handler:- -AnalysisRequestHandler- -AnalysisRequestHandlerBase- -org.apache.solr.handler.component:- -QueryElevationComponent- -SpellCheckComponent- -org.apache.solr.highlight:- -DefaultSolrHighlighter- -org.apache.solr.spelling:- -SpellingQueryConverter- -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (SOLR-1820) Remove custom greek/russian charsets encoding
Remove custom greek/russian charsets encoding - Key: SOLR-1820 URL: https://issues.apache.org/jira/browse/SOLR-1820 Project: Solr Issue Type: Task Components: Schema and Analysis Reporter: Robert Muir Priority: Minor In Solr 1.4, we deprecated support for 'custom encodings embedded inside unicode'. This is where the analyzer in lucene itself did encoding conversions, its better to just let analyzers be analyzers, and leave encoding conversion to Java. In order to move to Lucene 3.x, we need to remove this deprecated support, and instead issue an error in the factories if you try to do this (instead of a warning). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1820) Remove custom greek/russian charsets encoding
[ https://issues.apache.org/jira/browse/SOLR-1820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1820: -- Attachment: SOLR-1820.patch Attached is a patch that removes the deprecated bits. If you try to specify the charset param, instead of a warning you get an error. Remove custom greek/russian charsets encoding - Key: SOLR-1820 URL: https://issues.apache.org/jira/browse/SOLR-1820 Project: Solr Issue Type: Task Components: Schema and Analysis Reporter: Robert Muir Priority: Minor Attachments: SOLR-1820.patch In Solr 1.4, we deprecated support for 'custom encodings embedded inside unicode'. This is where the analyzer in lucene itself did encoding conversions; it's better to just let analyzers be analyzers, and leave encoding conversion to Java. In order to move to Lucene 3.x, we need to remove this deprecated support, and instead issue an error in the factories if you try to do this (instead of a warning). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
TestEvaluatorBag
Hey guys, I am seeing a test failure for TestEvaluatorBag... I wonder if you have any ideas; I thought it might be my locale, but I changed it and still hit it consistently. Thanks! -- Robert Muir rcm...@gmail.com
Re: TestEvaluatorBag
I think this is a platform/timezone-dependent problem, which is why switching my locale didn't work. The test started failing because today in the US we switched to Daylight Saving Time, and somehow the test only fails for people in those timezones. On Sun, Mar 14, 2010 at 4:46 PM, Robert Muir rcm...@gmail.com wrote: Hey guys, I am seeing a test failure for TestEvaluatorBag... I wonder if you guys have any ideas, thought it might be my locale, but I changed it and I still hit it consistently. Thanks! -- Robert Muir rcm...@gmail.com -- Robert Muir rcm...@gmail.com
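The DST diagnosis above can be demonstrated with only the JDK. The epoch instant below is a hypothetical one chosen near the date in the failing test, not taken from the test itself: the same absolute instant formats differently depending on the chosen timezone, so a test that formats "now" against the platform default zone is inherently non-deterministic. Pinning an explicit TimeZone is one common fix for this class of failure (whether the attached SOLR-1821 patch does exactly that is not shown in this thread).

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class TimeZoneDemo {

    // Format one absolute instant using an explicit zone rather than the
    // platform default -- relying on the default is what makes such tests flaky.
    public static String format(long epochMillis, String zoneId) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm", Locale.ROOT);
        fmt.setTimeZone(TimeZone.getTimeZone(zoneId));
        return fmt.format(new Date(epochMillis));
    }

    public static void main(String[] args) {
        // Hypothetical instant: 2010-03-12 22:15 UTC, two days before the
        // 2010 US switch to daylight saving time on March 14.
        long t = 1268432100000L;
        System.out.println(format(t, "UTC"));              // 2010-03-12 22:15
        System.out.println(format(t, "America/New_York")); // 2010-03-12 17:15 (still EST)
    }
}
```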
[jira] Commented: (SOLR-1821) Failing testGetDateFormatEvaluator in TestEvaluatorBag
[ https://issues.apache.org/jira/browse/SOLR-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845144#action_12845144 ] Robert Muir commented on SOLR-1821: --- Nice, fixes the issue. Can you commit this? It would help us in our current work to ensure we are not breaking tests. Failing testGetDateFormatEvaluator in TestEvaluatorBag -- Key: SOLR-1821 URL: https://issues.apache.org/jira/browse/SOLR-1821 Project: Solr Issue Type: Bug Components: contrib - DataImportHandler Affects Versions: 1.5 Reporter: Chris Male Attachments: SOLR-1821.patch On some TimeZones (such as EDT currently), TestEvaluatorBag.testGetDateFormatEvaluator fails with the following error: {code:xml} org.junit.ComparisonFailure: Expected :2010-03-12 17:15 Actual :2010-03-12 18:15 at org.junit.Assert.assertEquals(Assert.java:96) at org.junit.Assert.assertEquals(Assert.java:116) at org.apache.solr.handler.dataimport.TestEvaluatorBag.testGetDateFormatEvaluator(TestEvaluatorBag.java:127) {code} This seems due to the reliance on the System ticks in order to create the Date to compare against. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (SOLR-1821) Failing testGetDateFormatEvaluator in TestEvaluatorBag
[ https://issues.apache.org/jira/browse/SOLR-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir reassigned SOLR-1821: - Assignee: Robert Muir Failing testGetDateFormatEvaluator in TestEvaluatorBag -- Key: SOLR-1821 URL: https://issues.apache.org/jira/browse/SOLR-1821 Project: Solr Issue Type: Bug Components: contrib - DataImportHandler Affects Versions: 1.5 Reporter: Chris Male Assignee: Robert Muir Attachments: SOLR-1821.patch On some TimeZones (such as EDT currently), TestEvaluatorBag.testGetDateFormatEvaluator fails with the following error: {code:xml} org.junit.ComparisonFailure: Expected :2010-03-12 17:15 Actual :2010-03-12 18:15 at org.junit.Assert.assertEquals(Assert.java:96) at org.junit.Assert.assertEquals(Assert.java:116) at org.apache.solr.handler.dataimport.TestEvaluatorBag.testGetDateFormatEvaluator(TestEvaluatorBag.java:127) {code} This seems due to the reliance on the System ticks in order to create the Date to compare against. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (SOLR-1821) Failing testGetDateFormatEvaluator in TestEvaluatorBag
[ https://issues.apache.org/jira/browse/SOLR-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved SOLR-1821. --- Resolution: Fixed Fix Version/s: 1.5 Committed revision 922991. Thanks Chris! Failing testGetDateFormatEvaluator in TestEvaluatorBag -- Key: SOLR-1821 URL: https://issues.apache.org/jira/browse/SOLR-1821 Project: Solr Issue Type: Bug Components: contrib - DataImportHandler Affects Versions: 1.5 Reporter: Chris Male Assignee: Robert Muir Fix For: 1.5 Attachments: SOLR-1821.patch On some TimeZones (such as EDT currently), TestEvaluatorBag.testGetDateFormatEvaluator fails with the following error: {code:xml} org.junit.ComparisonFailure: Expected :2010-03-12 17:15 Actual :2010-03-12 18:15 at org.junit.Assert.assertEquals(Assert.java:96) at org.junit.Assert.assertEquals(Assert.java:116) at org.apache.solr.handler.dataimport.TestEvaluatorBag.testGetDateFormatEvaluator(TestEvaluatorBag.java:127) {code} This seems due to the reliance on the System ticks in order to create the Date to compare against. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1657) convert the rest of solr to use the new tokenstream API
[ https://issues.apache.org/jira/browse/SOLR-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1657: -- Attachment: SOLR-1657_part2.patch Here's a separate patch (_part2.patch) for all the remaining tokenstreams. The only one remaining now is SynonymFilter. For several areas in this patch, I didn't properly change any APIs to fully support the new Attributes-based API, I just got them off deprecated methods, still working with Token, and left TODOs. I figure it would be better to hash this out later on separate issues, where we modify those APIs to really take advantage of an Attributes-based API. convert the rest of solr to use the new tokenstream API --- Key: SOLR-1657 URL: https://issues.apache.org/jira/browse/SOLR-1657 Project: Solr Issue Type: Task Reporter: Robert Muir Attachments: SOLR-1657.patch, SOLR-1657.patch, SOLR-1657.patch, SOLR-1657.patch, SOLR-1657_part2.patch org.apache.solr.analysis: -BufferedTokenStream- - -CommonGramsFilter- - -CommonGramsQueryFilter- - -RemoveDuplicatesTokenFilter- -CapitalizationFilterFactory- -HyphenatedWordsFilter- -LengthFilter (deprecated, remove)- SynonymFilter SynonymFilterFactory -WordDelimiterFilter- org.apache.solr.handler: AnalysisRequestHandler AnalysisRequestHandlerBase org.apache.solr.handler.component: QueryElevationComponent SpellCheckComponent org.apache.solr.highlight: DefaultSolrHighlighter org.apache.solr.spelling: SpellingQueryConverter -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1657) convert the rest of solr to use the new tokenstream API
[ https://issues.apache.org/jira/browse/SOLR-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1657: -- Description: org.apache.solr.analysis: -BufferedTokenStream- - -CommonGramsFilter- - -CommonGramsQueryFilter- - -RemoveDuplicatesTokenFilter- -CapitalizationFilterFactory- -HyphenatedWordsFilter- -LengthFilter (deprecated, remove)- SynonymFilter SynonymFilterFactory -WordDelimiterFilter- -org.apache.solr.handler:- -AnalysisRequestHandler- -AnalysisRequestHandlerBase- -org.apache.solr.handler.component:- -QueryElevationComponent- -SpellCheckComponent- -org.apache.solr.highlight:- -DefaultSolrHighlighter- -org.apache.solr.spelling:- -SpellingQueryConverter- was: org.apache.solr.analysis: -BufferedTokenStream- - -CommonGramsFilter- - -CommonGramsQueryFilter- - -RemoveDuplicatesTokenFilter- -CapitalizationFilterFactory- -HyphenatedWordsFilter- -LengthFilter (deprecated, remove)- SynonymFilter SynonymFilterFactory -WordDelimiterFilter- org.apache.solr.handler: AnalysisRequestHandler AnalysisRequestHandlerBase org.apache.solr.handler.component: QueryElevationComponent SpellCheckComponent org.apache.solr.highlight: DefaultSolrHighlighter org.apache.solr.spelling: SpellingQueryConverter convert the rest of solr to use the new tokenstream API --- Key: SOLR-1657 URL: https://issues.apache.org/jira/browse/SOLR-1657 Project: Solr Issue Type: Task Reporter: Robert Muir Attachments: SOLR-1657.patch, SOLR-1657.patch, SOLR-1657.patch, SOLR-1657.patch, SOLR-1657_part2.patch org.apache.solr.analysis: -BufferedTokenStream- - -CommonGramsFilter- - -CommonGramsQueryFilter- - -RemoveDuplicatesTokenFilter- -CapitalizationFilterFactory- -HyphenatedWordsFilter- -LengthFilter (deprecated, remove)- SynonymFilter SynonymFilterFactory -WordDelimiterFilter- -org.apache.solr.handler:- -AnalysisRequestHandler- -AnalysisRequestHandlerBase- -org.apache.solr.handler.component:- -QueryElevationComponent- -SpellCheckComponent- 
-org.apache.solr.highlight:- -DefaultSolrHighlighter- -org.apache.solr.spelling:- -SpellingQueryConverter- -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (SOLR-1813) Support Arabic PDF extraction
Support Arabic PDF extraction - Key: SOLR-1813 URL: https://issues.apache.org/jira/browse/SOLR-1813 Project: Solr Issue Type: Improvement Components: contrib - Solr Cell (Tika extraction) Affects Versions: 1.4 Reporter: Robert Muir Extraction of Arabic text from PDF files is supported by tika/pdfbox, but we don't have the optional dependency to do it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1813) Support Arabic PDF extraction
[ https://issues.apache.org/jira/browse/SOLR-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1813: -- Attachment: SOLR-1813.patch Attached is a patch with a testcase. I can shrink the icu4j jar file if this is needed. I will attach the test pdf separately. Support Arabic PDF extraction - Key: SOLR-1813 URL: https://issues.apache.org/jira/browse/SOLR-1813 Project: Solr Issue Type: Improvement Components: contrib - Solr Cell (Tika extraction) Affects Versions: 1.4 Reporter: Robert Muir Attachments: arabic.pdf, SOLR-1813.patch Extraction of Arabic text from PDF files is supported by tika/pdfbox, but we don't have the optional dependency to do it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1813) Support Arabic PDF extraction
[ https://issues.apache.org/jira/browse/SOLR-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1813: -- Attachment: arabic.pdf the pdf file for contrib/extraction/src/test/resources/arabic.pdf Support Arabic PDF extraction - Key: SOLR-1813 URL: https://issues.apache.org/jira/browse/SOLR-1813 Project: Solr Issue Type: Improvement Components: contrib - Solr Cell (Tika extraction) Affects Versions: 1.4 Reporter: Robert Muir Attachments: arabic.pdf, SOLR-1813.patch Extraction of Arabic text from PDF files is supported by tika/pdfbox, but we don't have the optional dependency to do it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1813) Support Arabic PDF extraction
[ https://issues.apache.org/jira/browse/SOLR-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1813: -- Attachment: icu4j-4_2_1.jar the icu4j jar file that goes in contrib/extraction/lib Support Arabic PDF extraction - Key: SOLR-1813 URL: https://issues.apache.org/jira/browse/SOLR-1813 Project: Solr Issue Type: Improvement Components: contrib - Solr Cell (Tika extraction) Affects Versions: 1.4 Reporter: Robert Muir Attachments: arabic.pdf, icu4j-4_2_1.jar, SOLR-1813.patch Extraction of Arabic text from PDF files is supported by tika/pdfbox, but we don't have the optional dependency to do it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Solr Performance and Scalability
Tom, this is really completely unrelated, but given that you have such huge documents and I see you have exceeded term count limits in lucene, I can't help but wonder if you have ever considered Andrzej's index pruning patch? (It is simply a tool you can run on your index.) Depending upon requirements, it seems like it might be a good fit. http://issues.apache.org/jira/browse/LUCENE-1812 On Thu, Feb 11, 2010 at 3:11 PM, Tom Burton-West tburtonw...@gmail.com wrote: The HathiTrust Large Search indexes the OCR from 5 million volumes, with an average of 200-300 pages per volume. So the total number of pages indexed would be over 1 billion. However, we are not using pages as Solr documents, we are using the entire book, so we only have 5 million rather than 1 billion Solr documents. We also are not storing the OCRed text. Since the total size of the index for 5 million volumes is over 2 terabytes, we split the index into 10 shards, each indexing about 1/2 million documents. Given all that, our indexes are about 250-300GB for each 500,000 books. About 85% of that is the *prx position index. Unless you have enough memory on the OS to get a significant amount of the index into the disk OS cache, disk I/O is the big bottleneck, especially for phrase queries with common words. See http://www.hathitrust.org/blogs/large-scale-search for more details. Have you considered storing the OCR separately rather than in the Solr index, or does your use case require storing the OCR in the index? Tom Burton-West Digital Library Production Service University of Michigan Wick2804 wrote: We are thinking of creating a Lucene Solr project to store 50 million full text OCRed A4 pages. Is there anyone out there who could provide some kind of guidance on the size of index we are likely to generate, and are there any gotchas in the standard analysis engines for load and query that will cause us issues. Do large indexes cause memory issues on servers?
Any help or advice greatly appreciated. -- View this message in context: http://old.nabble.com/Solr-Performance-and-Scalability-tp27552013p27553353.html Sent from the Solr - Dev mailing list archive at Nabble.com. -- Robert Muir rcm...@gmail.com
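As a rough sanity check on the figures Tom quotes (5 million volumes over 10 shards, roughly 250-300GB of index per 500,000 books, about 85% of it position data), the arithmetic is straightforward. The 275GB midpoint below is an assumed illustrative value, not a number from the thread:

```java
public class IndexSizing {

    // Documents per shard when the corpus is split evenly.
    public static long docsPerShard(long totalDocs, int shards) {
        return totalDocs / shards;
    }

    // Size of the position (.prx) data, given shard index size and its fraction.
    public static double positionIndexGb(double shardIndexGb, double prxFraction) {
        return shardIndexGb * prxFraction;
    }

    public static void main(String[] args) {
        // 5 million volumes over 10 shards -> 500,000 docs each.
        System.out.println(docsPerShard(5_000_000L, 10));
        // Assumed 275GB midpoint of the quoted 250-300GB range, 85% positions:
        // well over 200GB of position data per shard, which is why phrase
        // queries are so disk-I/O bound unless the OS cache holds most of it.
        System.out.println(positionIndexGb(275.0, 0.85));
    }
}
```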
Re: Timeline of upgrade to Lucene 3.0.
Hi Dawid, here is the jira issue where you can track the status of getting off deprecated 2.9 APIs: https://issues.apache.org/jira/browse/SOLR-1659 On Fri, Feb 5, 2010 at 5:40 AM, Dawid Weiss dawid.we...@gmail.com wrote: Hi there, Is there any upgrade path to Lucene 3.0 in the plans? I ask because the head of Carrot2 is using Lucene 3.0 (and there are certain incompatible API signatures, as it turns out). An upgrade to the next planned Carrot2 3.2.0 release would bring the benefit of not having to download external JARs (we replaced all the LGPL code and persuaded simple-xml author to switch to the Apache license). Corresponding Carrot2 issue for this is here: http://issues.carrot2.org/browse/CARROT-623 Dawid -- Robert Muir rcm...@gmail.com
[jira] Created: (SOLR-1760) convert synonymsfilter to new tokenstream API
convert synonymsfilter to new tokenstream API - Key: SOLR-1760 URL: https://issues.apache.org/jira/browse/SOLR-1760 Project: Solr Issue Type: Task Components: Schema and Analysis Reporter: Robert Muir This is the other non-trivial tokenstream to convert to the new API. I looked at this again today, and think I have a design where it will be nice and efficient. If you have ideas or are already looking at it, please comment!! I haven't started coding and we shouldn't duplicate any efforts. Here is my current design: * add a variable 'maximumContext' to SynonymMap. This is simply the maximum singleMatch.size(), i.e. the maximum number of tokens of lookahead that is ever needed. * save/restoreState/cloning can be minimized by using a stack (fixed array of maximumContext) of references to the SynonymMap submaps. This way we can backtrack efficiently for multiword matches without save/restoreState and with fewer comparisons. * two queues (can be fixed arrays of maximumContext) are still needed for placing state objects: the first is those that have been evaluated (always empty in the case of !preserveOriginal), and the second is those that haven't yet been evaluated but are queued due to lookahead. I plan on coding this up soon; if you have a better idea or have started work, please comment. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
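A much-simplified sketch of the lookahead idea in the design above. The real SynonymFilter deals with Attributes, captured states, and preserveOriginal; this hypothetical version only shows how the depth of the match trie bounds the lookahead to maximumContext tokens, so fixed-size buffers suffice:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SynonymSketch {

    // Trie node: one level per token of a multi-word synonym.
    static class Node {
        Map<String, Node> next = new HashMap<>();
        String replacement; // non-null if a synonym match ends at this node
    }

    final Node root = new Node();
    int maximumContext = 0; // longest multi-word match, in tokens

    public void add(String[] match, String replacement) {
        Node n = root;
        for (String w : match) n = n.next.computeIfAbsent(w, k -> new Node());
        n.replacement = replacement;
        maximumContext = Math.max(maximumContext, match.length);
    }

    // Greedy longest-match replacement over a token list; never looks
    // ahead more than maximumContext tokens from the current position.
    public List<String> process(List<String> input) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < input.size()) {
            Node n = root;
            int matchedLen = 0;
            String matched = null;
            for (int j = 0; j < maximumContext && i + j < input.size(); j++) {
                n = n.next.get(input.get(i + j));
                if (n == null) break;
                if (n.replacement != null) { matched = n.replacement; matchedLen = j + 1; }
            }
            if (matched != null) { out.add(matched); i += matchedLen; }
            else { out.add(input.get(i)); i++; }
        }
        return out;
    }
}
```

With the map {"a b" -> "ab"}, maximumContext is 2, so processing "x a b y" only ever buffers two tokens before deciding whether to emit "ab" or fall back to the single token.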
[jira] Updated: (SOLR-1657) convert the rest of solr to use the new tokenstream API
[ https://issues.apache.org/jira/browse/SOLR-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1657: -- Attachment: SOLR-1657.patch Chris's patch, except it also implements BufferedTokenStream. It's marked deprecated; its API cannot support custom attributes (so the six standard attributes are simply copied into Tokens and back), and it's unused in Solr with this patch. convert the rest of solr to use the new tokenstream API --- Key: SOLR-1657 URL: https://issues.apache.org/jira/browse/SOLR-1657 Project: Solr Issue Type: Task Reporter: Robert Muir Attachments: SOLR-1657.patch, SOLR-1657.patch, SOLR-1657.patch, SOLR-1657.patch org.apache.solr.analysis: BufferedTokenStream - -CommonGramsFilter- - -CommonGramsQueryFilter- - -RemoveDuplicatesTokenFilter- -CapitalizationFilterFactory- -HyphenatedWordsFilter- -LengthFilter (deprecated, remove)- SynonymFilter SynonymFilterFactory -WordDelimiterFilter- org.apache.solr.handler: AnalysisRequestHandler AnalysisRequestHandlerBase org.apache.solr.handler.component: QueryElevationComponent SpellCheckComponent org.apache.solr.highlight: DefaultSolrHighlighter org.apache.solr.search: FieldQParserPlugin org.apache.solr.spelling: SpellingQueryConverter -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1657) convert the rest of solr to use the new tokenstream API
[ https://issues.apache.org/jira/browse/SOLR-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-1657: -- Description: org.apache.solr.analysis: -BufferedTokenStream- - -CommonGramsFilter- - -CommonGramsQueryFilter- - -RemoveDuplicatesTokenFilter- -CapitalizationFilterFactory- -HyphenatedWordsFilter- -LengthFilter (deprecated, remove)- SynonymFilter SynonymFilterFactory -WordDelimiterFilter- org.apache.solr.handler: AnalysisRequestHandler AnalysisRequestHandlerBase org.apache.solr.handler.component: QueryElevationComponent SpellCheckComponent org.apache.solr.highlight: DefaultSolrHighlighter org.apache.solr.spelling: SpellingQueryConverter was: org.apache.solr.analysis: BufferedTokenStream - -CommonGramsFilter- - -CommonGramsQueryFilter- - -RemoveDuplicatesTokenFilter- -CapitalizationFilterFactory- -HyphenatedWordsFilter- -LengthFilter (deprecated, remove)- SynonymFilter SynonymFilterFactory -WordDelimiterFilter- org.apache.solr.handler: AnalysisRequestHandler AnalysisRequestHandlerBase org.apache.solr.handler.component: QueryElevationComponent SpellCheckComponent org.apache.solr.highlight: DefaultSolrHighlighter org.apache.solr.search: FieldQParserPlugin org.apache.solr.spelling: SpellingQueryConverter convert the rest of solr to use the new tokenstream API --- Key: SOLR-1657 URL: https://issues.apache.org/jira/browse/SOLR-1657 Project: Solr Issue Type: Task Reporter: Robert Muir Attachments: SOLR-1657.patch, SOLR-1657.patch, SOLR-1657.patch, SOLR-1657.patch org.apache.solr.analysis: -BufferedTokenStream- - -CommonGramsFilter- - -CommonGramsQueryFilter- - -RemoveDuplicatesTokenFilter- -CapitalizationFilterFactory- -HyphenatedWordsFilter- -LengthFilter (deprecated, remove)- SynonymFilter SynonymFilterFactory -WordDelimiterFilter- org.apache.solr.handler: AnalysisRequestHandler AnalysisRequestHandlerBase org.apache.solr.handler.component: QueryElevationComponent SpellCheckComponent org.apache.solr.highlight: 
DefaultSolrHighlighter org.apache.solr.spelling: SpellingQueryConverter -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1670) synonymfilter/map repeat bug
[ https://issues.apache.org/jira/browse/SOLR-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12829092#action_12829092 ] Robert Muir commented on SOLR-1670: --- bq. Order of overlapping tokens is unimportant in every TokenFilter used in Solr that I know about. Order-sensitivity is the exception, no? I guess all along my problem is that tokenstreams are ordered by definition. If this order does not matter, a test that uses actual queries would make more sense. The problem was that the test constructions previously used by this filter were used in other places where they really shouldn't have been, and the laxness hid real bugs (such as this very issue itself!!!). This is all I am trying to avoid. There is nothing wrong with Steven's patch/test construction; I am just trying to err on the side of caution. synonymfilter/map repeat bug Key: SOLR-1670 URL: https://issues.apache.org/jira/browse/SOLR-1670 Project: Solr Issue Type: Bug Components: Schema and Analysis Affects Versions: 1.4 Reporter: Robert Muir Assignee: Yonik Seeley Attachments: SOLR-1670.patch, SOLR-1670.patch, SOLR-1670_test.patch as part of converting tests for SOLR-1657, I ran into a problem with synonymfilter: the test for 'repeats' has a flaw, it uses this assertTokEqual construct which does not really validate that two lists of tokens are equal, it just stops at the shortest one. {code} // repeats map.add(strings(a b), tokens(ab), orig, merge); map.add(strings(a b), tokens(ab), orig, merge); assertTokEqual(getTokList(map,a b,false), tokens(ab)); /* in reality the result from getTokList is ab ab ab! */ {code} when converted to assertTokenStreamContents this problem surfaced. attached is an additional assertion to the existing testcase. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
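The assertTokEqual flaw this issue describes can be illustrated with a hypothetical reconstruction (these methods are stand-ins, not the actual Solr test helpers): a lenient comparison that stops at the shorter list reports the buggy "ab ab ab" output as equal to the expected single "ab", while a strict comparison surfaces the repeat bug.

```java
import java.util.Arrays;
import java.util.List;

public class TokAssert {

    // Lenient check (the flawed style): compares only up to the shorter list,
    // so extra trailing tokens are never noticed.
    public static boolean tokEqualLenient(List<String> actual, List<String> expected) {
        int n = Math.min(actual.size(), expected.size());
        for (int i = 0; i < n; i++) {
            if (!actual.get(i).equals(expected.get(i))) return false;
        }
        return true;
    }

    // Strict check (assertTokenStreamContents-style): lengths must match too.
    public static boolean tokEqualStrict(List<String> actual, List<String> expected) {
        return actual.equals(expected);
    }

    public static void main(String[] args) {
        List<String> actual = Arrays.asList("ab", "ab", "ab"); // the repeat bug's real output
        List<String> expected = Arrays.asList("ab");
        System.out.println(tokEqualLenient(actual, expected)); // true  -- bug hidden
        System.out.println(tokEqualStrict(actual, expected));  // false -- bug surfaced
    }
}
```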
[jira] Commented: (SOLR-1670) synonymfilter/map repeat bug
[ https://issues.apache.org/jira/browse/SOLR-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12829097#action_12829097 ] Robert Muir commented on SOLR-1670: --- bq. Not at the semantic level (for overlapping tokens). Another way to look at it is that a tokenstream is just a sequence of tokens, and posInc is just another attribute. Your description of the semantics makes sense in terms of how it is used by the indexer, but the order of these tokens can matter: if someone uses a custom TokenFilter it might matter for some custom attributes, and it might matter for a different consumer; it's different behavior. I have made an effort to preserve all the behavior of all these tokenstreams when converting to the new API. I really don't want to break anything. synonymfilter/map repeat bug Key: SOLR-1670 URL: https://issues.apache.org/jira/browse/SOLR-1670 Project: Solr Issue Type: Bug Components: Schema and Analysis Affects Versions: 1.4 Reporter: Robert Muir Assignee: Yonik Seeley Attachments: SOLR-1670.patch, SOLR-1670.patch, SOLR-1670_test.patch as part of converting tests for SOLR-1657, I ran into a problem with synonymfilter: the test for 'repeats' has a flaw, it uses this assertTokEqual construct which does not really validate that two lists of tokens are equal, it just stops at the shortest one. {code} // repeats map.add(strings(a b), tokens(ab), orig, merge); map.add(strings(a b), tokens(ab), orig, merge); assertTokEqual(getTokList(map,a b,false), tokens(ab)); /* in reality the result from getTokList is ab ab ab! */ {code} when converted to assertTokenStreamContents this problem surfaced. attached is an additional assertion to the existing testcase. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1670) synonymfilter/map repeat bug
[ https://issues.apache.org/jira/browse/SOLR-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12828204#action_12828204 ] Robert Muir commented on SOLR-1670: --- bq. I left in place the existing test method, which requires the specified order. Is it possible to only expose the 'unsorted' one to the synonyms test (such as in the synonyms test file itself, rather than the base token stream test case)? I can't think of another situation where it would make sense; it is more likely to be abused instead. synonymfilter/map repeat bug Key: SOLR-1670 URL: https://issues.apache.org/jira/browse/SOLR-1670 Project: Solr Issue Type: Bug Components: Schema and Analysis Affects Versions: 1.4 Reporter: Robert Muir Assignee: Yonik Seeley Attachments: SOLR-1670.patch, SOLR-1670.patch, SOLR-1670_test.patch as part of converting tests for SOLR-1657, I ran into a problem with synonymfilter: the test for 'repeats' has a flaw, it uses this assertTokEqual construct which does not really validate that two lists of tokens are equal, it just stops at the shortest one. {code} // repeats map.add(strings(a b), tokens(ab), orig, merge); map.add(strings(a b), tokens(ab), orig, merge); assertTokEqual(getTokList(map,a b,false), tokens(ab)); /* in reality the result from getTokList is ab ab ab! */ {code} when converted to assertTokenStreamContents this problem surfaced. attached is an additional assertion to the existing testcase. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1670) synonymfilter/map repeat bug
[ https://issues.apache.org/jira/browse/SOLR-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12806833#action_12806833 ] Robert Muir commented on SOLR-1670: --- Steven, I don't have a problem with your patch (I do not wish to be in the way of anyone trying to work on SynonymFilter), but I want to explain some of where I was coming from. The main reason I got myself into this mess was to try to add WordNet support to Solr. However, this is currently not possible without duplicating a lot of code. We need to be really careful about allowing any order; it does matter in some situations. For example, Lucene's synonymfilter (with WordNet support) has an option to limit the number of expansions (so it's like a top-N synonym expansion). Solr doesn't currently have this, so it's N/A for now, but it's just an example where the order suddenly becomes important. Only slightly related: we added some improvements to this assertion in Lucene recently and found a lot of bugs (better checking for clearAttribute() and end()). At some point I would like to port these test improvements over to Solr, too. synonymfilter/map repeat bug Key: SOLR-1670 URL: https://issues.apache.org/jira/browse/SOLR-1670 Project: Solr Issue Type: Bug Components: Schema and Analysis Affects Versions: 1.4 Reporter: Robert Muir Assignee: Yonik Seeley Attachments: SOLR-1670.patch, SOLR-1670.patch, SOLR-1670_test.patch as part of converting tests for SOLR-1657, I ran into a problem with synonymfilter: the test for 'repeats' has a flaw, it uses this assertTokEqual construct which does not really validate that two lists of tokens are equal, it just stops at the shortest one. {code} // repeats map.add(strings(a b), tokens(ab), orig, merge); map.add(strings(a b), tokens(ab), orig, merge); assertTokEqual(getTokList(map,a b,false), tokens(ab)); /* in reality the result from getTokList is ab ab ab! */ {code} when converted to assertTokenStreamContents this problem surfaced.
Attached is an additional assertion to the existing testcase.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
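The flaw described above can be sketched without any Lucene dependency. The two helper methods below are hypothetical stand-ins (assertTokEqual, getTokList, and the Solr test harness are not reproduced here); they only illustrate why a comparison that stops at the shorter list lets the "ab ab ab" result pass against an expected "ab":

```java
import java.util.Arrays;
import java.util.List;

public class TokenAssertSketch {
    // Flawed check in the spirit of assertTokEqual: it compares only up to
    // the shorter list, so ["ab", "ab", "ab"] "equals" ["ab"].
    static boolean lenientTokEqual(List<String> actual, List<String> expected) {
        int n = Math.min(actual.size(), expected.size());
        for (int i = 0; i < n; i++) {
            if (!actual.get(i).equals(expected.get(i))) return false;
        }
        return true; // trailing tokens are never examined
    }

    // Strict check in the spirit of assertTokenStreamContents:
    // lengths must match, then every token must match.
    static boolean strictTokEqual(List<String> actual, List<String> expected) {
        return actual.equals(expected);
    }

    public static void main(String[] args) {
        // The repeated map.add() calls produce "ab ab ab", not "ab".
        List<String> actual = Arrays.asList("ab", "ab", "ab");
        List<String> expected = Arrays.asList("ab");

        System.out.println(lenientTokEqual(actual, expected)); // true  (bug hidden)
        System.out.println(strictTokEqual(actual, expected));  // false (bug surfaces)
    }
}
```

This is why converting the test surfaced the problem: the strict comparison checks the full token list, including its length.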
[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12805187#action_12805187 ] Robert Muir commented on SOLR-1677: ---

bq. 2) Perhaps you should read the StopFilter example i already posted in my last comment... https://issues.apache.org/jira/browse/LUCENE-2094?focusedCommentId=12783932page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12783932

As far as this one goes, I specifically commented before on this not being 'hidden' by Version (with Solr users in mind) but instead being its own option that every user should consider, regardless of defaults. For the StopFilter posInc, the user should think it through; it's pretty strange, as I mention in my comment, that a definite article like 'the' gets a posInc bump in one language but not another, simply because it happens to be separated by a space.

I guess I couldn't care less what the default is; if you care about such things you shouldn't be using the defaults, and should instead specify this yourself in the schema, where Version has no effect. I can't really defend the whole StopFilter posInc thing, as again I think it doesn't make a whole lot of sense. Maybe it works well for English; I guess I won't argue about it.

Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
---
Key: SOLR-1677
URL: https://issues.apache.org/jira/browse/SOLR-1677
Project: Solr
Issue Type: Sub-task
Components: Schema and Analysis
Reporter: Uwe Schindler
Attachments: SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch

Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards compatibility with old indexes created using older versions of Lucene. The most important example is StandardTokenizer, which changed its behaviour with posIncr and incorrect host token types in 2.4 and also in 2.9.
In Lucene 3.0 this matchVersion ctor parameter is mandatory, and in 3.1, with much more Unicode support, almost every Tokenizer/TokenFilter needs this Version parameter. In 2.9, the deprecated old ctors without Version take LUCENE_24 as the default to mimic the old behaviour, e.g. in StandardTokenizer.

This patch adds basic support for the Lucene Version property to the base factories. Subclasses can then use the decoded luceneMatchVersion enum (in 3.0) / parameter (in 2.9) for constructing TokenStreams. The code currently contains a helper map to decode the version strings, but in 3.0 it can be replaced by Version.valueOf(String), as Version is a subclass of Java5 enums. The default value is Version.LUCENE_24 (as this is the default for the no-version ctors in Lucene).

This patch also removes unneeded conversions to CharArraySet from StopFilterFactory (now done by Lucene since 2.9). The generics are also fixed to match Lucene 3.0.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
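The posInc behavior debated in the comment above can be made concrete with a toy stop filter. This is a sketch, not Lucene's actual StopFilter code: the Tok record and the stop() helper are hypothetical, and only model the rule that a dropped stopword adds to the next token's position increment when the option is enabled:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PosIncSketch {
    record Tok(String term, int posInc) {}

    // Toy stop filter: drops stopwords; when enablePositionIncrements is
    // true, each dropped word adds 1 to the next emitted token's position
    // increment, leaving a "hole" where the stopword was.
    static List<Tok> stop(List<String> terms, Set<String> stopwords,
                          boolean enablePositionIncrements) {
        List<Tok> out = new ArrayList<>();
        int pending = 1;
        for (String t : terms) {
            if (stopwords.contains(t)) {
                if (enablePositionIncrements) pending++;
                continue;
            }
            out.add(new Tok(t, pending));
            pending = 1;
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stops = new HashSet<>(Arrays.asList("the"));
        // In English, 'the' is a separate whitespace-delimited word, so the
        // token after it gets posInc=2; in a language where the article is
        // attached to the noun, no 'the' token exists and no bump happens.
        System.out.println(stop(Arrays.asList("fox", "jumped", "over", "the", "dog"), stops, true));
    }
}
```

Here "dog" is emitted with posInc=2 while every other token has posInc=1, which is exactly the space-dependent asymmetry the comment calls strange.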
[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12802979#action_12802979 ] Robert Muir commented on SOLR-1677: ---

bq. The point I was trying to make is that the types of bug fixes we make in Lucene are not mathematical absolutes - we're not fixing bugs where 1+1=3.

You are wrong; they are absolutes. And here are the JIRA issues for stemming bugs, since you didn't take my hint to go and actually read them.

LUCENE-2055: I used the snowball tests against these stemmers, which claim to implement the 'snowball algorithm', and they fail. This is an absolute, and the fix is to instead use snowball.

LUCENE-2203: I used the snowball tests against these stemmers and they failed. Here is Martin Porter's confirmation that these are bugs: http://article.gmane.org/gmane.comp.search.snowball/1139

Perhaps you should come up with a better example than stemming, as you don't know what you are talking about.

Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
---
Key: SOLR-1677
URL: https://issues.apache.org/jira/browse/SOLR-1677
Project: Solr
Issue Type: Sub-task
Components: Schema and Analysis
Reporter: Uwe Schindler
Attachments: SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch

Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards compatibility with old indexes created using older versions of Lucene. The most important example is StandardTokenizer, which changed its behaviour with posIncr and incorrect host token types in 2.4 and also in 2.9. In Lucene 3.0 this matchVersion ctor parameter is mandatory, and in 3.1, with much more Unicode support, almost every Tokenizer/TokenFilter needs this Version parameter. In 2.9, the deprecated old ctors without Version take LUCENE_24 as the default to mimic the old behaviour, e.g. in StandardTokenizer. This patch adds basic support for the Lucene Version property to the base factories.
Subclasses can then use the decoded luceneMatchVersion enum (in 3.0) / parameter (in 2.9) for constructing TokenStreams. The code currently contains a helper map to decode the version strings, but in 3.0 it can be replaced by Version.valueOf(String), as Version is a subclass of Java5 enums. The default value is Version.LUCENE_24 (as this is the default for the no-version ctors in Lucene). This patch also removes unneeded conversions to CharArraySet from StopFilterFactory (now done by Lucene since 2.9). The generics are also fixed to match Lucene 3.0.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
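The "helper map to decode the version strings" described in the issue, and the Version.valueOf(String) replacement it anticipates, can be sketched as follows. The Version enum here is a toy stand-in for o.a.lucene.util.Version, and parseMatchVersion is a hypothetical factory helper; only the default-to-LUCENE_24 behavior and the map-vs-valueOf choice are taken from the description:

```java
import java.util.HashMap;
import java.util.Map;

public class VersionDecodeSketch {
    // Toy stand-in for o.a.lucene.util.Version (the real constant set differs).
    enum Version { LUCENE_24, LUCENE_29, LUCENE_30 }

    // The helper map from the description: version string -> Version constant.
    // Once Version is a real Java5 enum (Lucene 3.0), this map can be
    // replaced wholesale by Version.valueOf(name).
    static final Map<String, Version> DECODE = new HashMap<>();
    static {
        for (Version v : Version.values()) {
            DECODE.put(v.name(), v);
        }
    }

    // Hypothetical factory helper: decode a luceneMatchVersion init arg,
    // defaulting to LUCENE_24 like the no-version ctors in Lucene 2.9.
    static Version parseMatchVersion(String arg) {
        if (arg == null) return Version.LUCENE_24;
        Version v = DECODE.get(arg);
        if (v == null) {
            throw new IllegalArgumentException("Unknown luceneMatchVersion: " + arg);
        }
        return v;
    }

    public static void main(String[] args) {
        System.out.println(parseMatchVersion(null));        // LUCENE_24 (default)
        System.out.println(parseMatchVersion("LUCENE_30")); // LUCENE_30
        System.out.println(Version.valueOf("LUCENE_29"));   // LUCENE_29 (enum path)
    }
}
```

The enum-based valueOf path and the map produce the same results for known names; the map is only needed while Version is not yet an enum.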