Re: case sensitivity
On 4/26/07, Michael Kimsal wrote:
> We're (and by 'we' I mean my esteemed colleague!) working on patching a few of these items to be in the solrconfig.xml file and should likely have some patches submitted next week. It's being done on 'company time' and I'm not sure about the exact policy/procedure for this sort of thing here (or indeed, if there is one at all).

That's fine, as long as your company has agreed to contribute back the patch (under the Apache license). Apache enjoys a lot of business support (being business friendly) and a *lot* of contributions are done on company time. Anything really big would probably need a CLA, but patches only require clicking the "Grant license to ASF" button in JIRA.

-Yonik
Re: case sensitivity
Can you point me to the process for submitting these small patches? I'm looking at the JIRA site but don't see much of anything there outlining a process for submitting patches.

Sorry to be so basic about this, but I'm trying to follow correct procedures on both sides of the aisle, so to speak.

--
Michael Kimsal
http://webdevradio.com
Re: case sensitivity
Once the patch attached to the issue is committed to SVN, it will be in the next release. You'll get your patch committed faster if it's clear, well written, and well explained, if it comes with a unit test (when it's a code change), and so on.

Otis
Simpy -- http://www.simpy.com/ - Tag - Search - Share

----- Original Message -----
From: Michael Kimsal
To: solr-user@lucene.apache.org
Sent: Friday, April 27, 2007 1:47:06 PM
Subject: Re: case sensitivity

What's the procedure then for something to get included in the next release? Thanks again all!

On 4/27/07, Michael Kimsal wrote:
> So I just create my own 'issue' first? OK. Thanks.
>
> On 4/27/07, Ryan McKinley wrote:
> > Michael Kimsal wrote:
> > > Can you point me to the process for submitting these small patches? I'm looking at the jira site but don't see much of anything there outlining a process for submitting patches. Sorry to be so basic about this, but I'm trying to follow correct procedures on both sides of the aisle, so to speak.
> >
> > Check: http://wiki.apache.org/solr/HowToContribute
> >
> > Essentially you will create a new issue on JIRA, then upload a svn diff to that issue.
> >
> > holler if you have any troubles
> > ryan

--
Michael Kimsal
http://webdevradio.com
Re: case sensitivity
On 4/26/07, Erik Hatcher wrote:
> I think we should open up as many of the switches as we can to QueryParser, allowing users to tinker with them if they want, setting the defaults to the most common reasonable settings we can agree upon.

I think we should also try to handle what we can automatically. Always lowercasing (or never lowercasing) isn't elegant, as the right thing to do depends on the field. I always had it in my head that the QueryParser should figure it out. Actually, for good performance, the fieldType should figure it out just once.

The presence of a LowerCaseFilter could be one signal to lowercase prefix strings, or one could actually run a test token through analysis and check whether it comes out lowercased.

Numeric fields are a sticking point... prefix queries and wildcard queries aren't even possible there. Of course, even stemming is problematic with wildcard queries.

-Yonik
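For anyone following along, here is a rough sketch of the "run a test token through analysis, once per field type" idea. It is a toy model in Python, not Solr code: the FieldType class, its filters list, and prepare_prefix are all hypothetical names made up for illustration, and the probe token "AbC" is an arbitrary choice.

```python
# Toy model only: FieldType here is hypothetical, not Solr's class.

class FieldType:
    def __init__(self, name, filters):
        self.name = name
        self.filters = filters  # functions str -> str, applied in order
        # Decide once, at construction time, by pushing a probe token
        # through the analysis chain and checking how it comes out.
        self.lowercases = self.analyze("AbC") == "abc"

    def analyze(self, token):
        """Run one token through every filter in the chain."""
        for apply_filter in self.filters:
            token = apply_filter(token)
        return token

    def prepare_prefix(self, prefix):
        """Lowercase a prefix string only if this field's analysis lowercases."""
        return prefix.lower() if self.lowercases else prefix


text_type = FieldType("text", [str.strip, str.lower])
string_type = FieldType("string", [])

print(text_type.prepare_prefix("Fox"))    # prefix is lowercased for this field
print(string_type.prepare_prefix("Fox"))  # prefix is left untouched
```

The caveat from the message still applies: in a real chain, a stemmer or synonym filter could alter the probe token itself, so an exact comparison like the one above would need a probe that those filters leave alone.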
Re: case sensitivity
On 4/26/07, Michael Kimsal wrote:
> My colleague, after some digging, found in SolrQueryParser (around line 62)
>
>     setLowercaseExpandedTerms(false);
>
> The default for Lucene is true. Was this intentional? Or an oversight?

Way back before Solr was open-sourced, and Chris was the only user, I thought he needed case-sensitive prefix and wildcard queries (hence I set it to false). I think I may have been mistaken about that need, but by that time, I didn't know if anyone depended on it, so I never changed it back.

A default of false is actually more powerful too. You can do prefix queries on fields that have a LowerCaseFilter in their analyzer, and also on fields that don't. If it's set to true, you can't reliably do prefix queries on fields that don't have a LowerCaseFilter.

-Yonik
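A small toy model may make the "false is more powerful" point concrete. This is plain Python, not Lucene: prefix_query and its lowercase_expanded_terms argument are hypothetical stand-ins for the real setting, and the two term sets are made-up examples of what a lowercased field and a non-lowercased field might contain.

```python
# Toy model only: prefix_query and its flag are hypothetical stand-ins.

def prefix_query(indexed_terms, prefix, lowercase_expanded_terms):
    """Return the indexed terms that start with the prefix.

    The prefix is never analyzed here; the flag is the only thing
    that can change its case before the lookup.
    """
    if lowercase_expanded_terms:
        prefix = prefix.lower()
    return sorted(term for term in indexed_terms if term.startswith(prefix))


lowercased_field = {"windows95", "window"}  # analyzer includes a lowercase filter
verbatim_field = {"Windows95", "window"}    # no lowercase filter

# Flag off: the caller chooses the case, so both fields are reachable.
print(prefix_query(lowercased_field, "windows", False))
print(prefix_query(verbatim_field, "Windows", False))

# Flag on: fine for the lowercased field, but the mixed-case term
# in the verbatim field can no longer be reached by any prefix.
print(prefix_query(lowercased_field, "Windows", True))
print(prefix_query(verbatim_field, "Windows", True))
```

With the flag off, the burden moves to the client, which has to know which fields are lowercased and send its prefixes accordingly.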
Re: case sensitivity
In our experience, setting a LowerCaseFilter in the query analyzer did not work; we had to call setLowercaseExpandedTerms(true) to get wildcard queries to be case-insensitive.

Here's our analyzer definition from our Solr schema:

    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>

If calling setLowercaseExpandedTerms(true) is *not* in fact necessary for case-insensitive wildcard queries, could you please provide an example of a Solr schema that can achieve this?

Thanks!

- mps
Re: case sensitivity
On 4/27/07, Michael Pelz Sherman wrote:
> In our experience, setting a LowerCaseFilter in the query analyzer did not work; we had to call setLowercaseExpandedTerms(true) to get wildcard queries to be case-insensitive.

Correct, because in that case the QueryParser does not invoke analysis (because it's a partial word, not a whole word).

> If calling setLowercaseExpandedTerms(true) is *not* in fact necessary for case-insensitive wildcard queries, could you please provide an example of a Solr schema that can achieve this?

I didn't say that :-) I'm saying setLowercaseExpandedTerms(true) is not sufficient for wildcard queries in general. If the term is indexed as Windows95, then a prefix query of Windows* won't find anything if setLowercaseExpandedTerms(true) is set.

-Yonik
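To illustrate why the LowerCaseFilter in the query analyzer never sees a wildcard term, here is a toy sketch of the routing described above. It is Python, not the actual QueryParser: analyze and rewrite_clause are hypothetical names, and the analyzer is reduced to the one step that matters here (lowercasing).

```python
# Toy model only: a hypothetical stand-in for how one query clause is handled.

def analyze(term):
    """Stand-in for a query analyzer whose only relevant step is lowercasing."""
    return term.lower()


def rewrite_clause(term, lowercase_expanded_terms):
    """Return the term text that would actually be looked up in the index."""
    is_partial = "*" in term or "?" in term
    if not is_partial:
        return analyze(term)   # whole word: goes through analysis
    if lowercase_expanded_terms:
        return term.lower()    # partial word: only the flag touches it
    return term                # partial word: passed through untouched


for flag in (False, True):
    print(flag, [rewrite_clause(t, flag) for t in ("Fox", "Fox*", "F?x")])
```

In this model a whole-word query for Fox is lowercased either way, while Fox* keeps its capital F unless the flag is on, which matches the behavior reported earlier in the thread.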
Re: case sensitivity
On Apr 26, 2007, at 5:43 PM, Michael Kimsal wrote:
> I've looked through the mailing lists and can't find much of anything regarding case sensitivity. It seems SOLR is case sensitive by default - I'm using the default settings with a very basic schema - just text fields.

It all depends on the analysis you have set up for the fields. If you're indexing string-type fields in the default example schema, there is effectively no analysis, so searches must be exact matches, case and all.

> Is there any way to tell the query parser to be case insensitive during a query? Or do I have to reindex all my data again with lowercase values?

Terms are indexed in a case-sensitive manner, so if you need case insensitivity you need to lowercase on the way in and on querying.

Erik
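As a rough illustration of "lowercase on the way in and on querying", here is a toy inverted index in Python. None of this is Solr or Lucene API: string_analyzer, lowercase_analyzer, build_index, and search are hypothetical helpers, and the two documents are made up.

```python
# Toy model only: these helpers are hypothetical, not Solr or Lucene APIs.

def string_analyzer(text):
    """Like a string-type field: the whole value is one term, unchanged."""
    return [text]


def lowercase_analyzer(text):
    """Like a text field: split on whitespace, then lowercase each token."""
    return [token.lower() for token in text.split()]


def build_index(docs, analyzer):
    """Map each indexed term to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyzer(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query, analyzer):
    """Analyze the query the same way and return docs matching every term."""
    hits = [index.get(term, set()) for term in analyzer(query)]
    return set.intersection(*hits) if hits else set()


docs = {1: "Quick Brown Fox", 2: "lazy dog"}
text_index = build_index(docs, lowercase_analyzer)
string_index = build_index(docs, string_analyzer)

print(search(text_index, "FOX", lowercase_analyzer))             # matches doc 1
print(search(string_index, "quick brown fox", string_analyzer))  # no match
print(search(string_index, "Quick Brown Fox", string_analyzer))  # exact match only
```

The point of the sketch is that the same analysis has to run on both sides: lowercasing only at query time against an index built without it would miss any term that was stored with capitals.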
Re: case sensitivity
I was just writing a followup. I'm using the default text field type:

    <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldtype>

That looks to me like it's got LowerCaseFilterFactory in both the query analyzer and the index analyzer. I'm still digging into this, but are there any other things to look for that anyone can point me to? (Thanks Erik!)

--
Michael Kimsal
http://webdevradio.com
Re: case sensitivity
    type:changelog AND ( ( (listing:Fox) or (listing:Fox*) or (listing:*Fox) ) )

and

    type:changelog AND ( ( (listing:fox) or (listing:fox*) or (listing:*fox) ) )

Is this to do with the wildcards?

Actually, I've just answered my own question.

    type:changelog AND ( ( (listing:fox) ) )

and

    type:changelog AND ( ( (listing:Fox) ) )

give the same results. But once the "or listing:fox* or listing:*fox" clauses are added, the search is always case-sensitive.

However, http://wiki.apache.org/lucene-java/LuceneFAQ#head-133cf44dd3dff3680c96c1316a663e881eeac35a seems to say that wildcard searches are not case-sensitive.

Unless someone can point out a way around this, it seems I'll need to manually reindex and lowercase everything on the way in, then reformat my search queries to be lowercase as well.

--
Michael Kimsal
http://webdevradio.com
Re: case sensitivity
My colleague, after some digging, found in SolrQueryParser (around line 62)

    setLowercaseExpandedTerms(false);

The default for Lucene is true. Was this intentional? Or an oversight?

Perhaps it's not related to my problem, but it seems that it might be.

Thanks in advance!

--
Michael Kimsal
http://webdevradio.com
Re: case sensitivity
On Apr 26, 2007, at 6:03 PM, Michael Kimsal wrote:
> My colleague, after some digging, found in SolrQueryParser (around line 62)
>
>     setLowercaseExpandedTerms(false);
>
> The default for Lucene is true. Was this intentional? Or an oversight?

I was just about to respond that this is likely the issue with your non-totally-lowercased wildcard terms.

I don't consider it an oversight; rather, this whole business of analysis and wildcards is something that varies from project to project in how it should be handled. If you have, for example, a string field and want to do prefix queries on it (trailing asterisk), you wouldn't want the term to be lowercased.

I think we should open up as many of the switches as we can to QueryParser, allowing users to tinker with them if they want, setting the defaults to the most common reasonable settings we can agree upon.

Erik
Re: case sensitivity
We're (and by 'we' I mean my esteemed colleague!) working on patching a few of these items to be in the solrconfig.xml file and should likely have some patches submitted next week. It's being done on 'company time' and I'm not sure about the exact policy/procedure for this sort of thing here (or indeed, if there is one at all).

--
Michael Kimsal
http://webdevradio.com