Re: case sensitivity

2007-04-27 Thread Yonik Seeley

On 4/26/07, Michael Kimsal [EMAIL PROTECTED] wrote:

We're (and by 'we' I mean my esteemed colleague!) working on patches that
make a few of these items configurable in the solrconfig.xml file, and we
should have some patches to submit next week.  It's being done on 'company
time' and I'm not sure about the exact policy/procedure for this sort of
thing here (or indeed, if there is one at all).


That's fine, as long as your company has agreed to contribute back the
patch (under the Apache license).  Apache enjoys a lot of business
support (being business friendly) and a *lot* of contributions are made
on company time.

Anything really big would probably need a CLA, but patches only
require clicking the "grant license to ASF" button in JIRA.

-Yonik


Re: case sensitivity

2007-04-27 Thread Michael Kimsal

Can you point me to the process for submitting these small patches?  I'm
looking at the JIRA site but don't see anything there outlining a
process for submitting patches.  Sorry to be so basic about this, but I'm
trying to follow correct procedures on both sides of the aisle, so to speak.







--
Michael Kimsal
http://webdevradio.com


Re: case sensitivity

2007-04-27 Thread Otis Gospodnetic
Once the patch attached to the issue is committed to SVN, it will be in the
next release.  Your patch gets committed faster if it's clear, well written,
and well explained, and if it comes with a unit test when it's a code change.

Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

----- Original Message -----
From: Michael Kimsal [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Friday, April 27, 2007 1:47:06 PM
Subject: Re: case sensitivity

What's the procedure then for something to get included in the next
release?

Thanks again all!

On 4/27/07, Michael Kimsal [EMAIL PROTECTED] wrote:

 So I just create my own 'issue' first?  OK.  Thanks.

 On 4/27/07, Ryan McKinley [EMAIL PROTECTED] wrote:
 
  Michael Kimsal wrote:
   Can you point me to the process for submitting these small
  patches?  I'm
   looking at the jira site but don't see much of anything there
  outlining a
   process for submitting patches.  Sorry to be so basic about this, but
  I'm
   trying to follow correct procedures on both sides of the aisle, so to
   speak.
  
 
  Check: http://wiki.apache.org/solr/HowToContribute
 
  Essentially you will create a new issue on JIRA, then upload an svn diff
  to that issue.
 
  holler if you have any troubles
 
  ryan
 
 


 --
 Michael Kimsal
 http://webdevradio.com




-- 
Michael Kimsal
http://webdevradio.com





Re: case sensitivity

2007-04-27 Thread Yonik Seeley

On 4/26/07, Erik Hatcher [EMAIL PROTECTED] wrote:

I think we should open up as many of the switches as we can to
QueryParser, allowing users to tinker with them if they want, setting
the defaults to the most common reasonable settings we can agree upon.


I think we should also try to handle what we can automatically.
Always lowercasing or never lowercasing isn't elegant, as the right
thing to do depends on the field.

I always had it in my head that the QueryParser should figure it out.
Actually, for good performance, the fieldType should figure it out just once.
The presence of a LowerCaseFilter could be one signal to lowercase
prefix strings, or one could actually run a test token through analysis
and test if it comes out lowercased.
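
The "run a test token through analysis" idea could be sketched roughly as follows. This is only an illustration, not Solr code: the class LowercaseProbe, the method lowercasesTerms, and the use of a UnaryOperator<String> to stand in for a field type's analysis chain are all assumptions made for the example.

```java
import java.util.Locale;
import java.util.function.UnaryOperator;

public class LowercaseProbe {
    // Hypothetical mixed-case probe token fed through the analysis chain.
    private static final String PROBE = "AbCdE";

    /**
     * Returns true if the given analysis function (a stand-in for a field
     * type's analyzer) emits the probe token without any uppercase letters.
     * An empty or null result (for example, the token was dropped) counts
     * as "unknown" and yields false.
     */
    public static boolean lowercasesTerms(UnaryOperator<String> analysis) {
        String out = analysis.apply(PROBE);
        if (out == null || out.isEmpty()) {
            return false;
        }
        return out.equals(out.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // A chain that lowercases, and one that leaves terms untouched.
        UnaryOperator<String> lowercasing = s -> s.toLowerCase(Locale.ROOT);
        UnaryOperator<String> identity = s -> s;

        System.out.println(lowercasesTerms(lowercasing)); // true
        System.out.println(lowercasesTerms(identity));    // false
    }
}
```

Since the answer depends only on the field type, the result could be computed once per field type and cached, rather than recomputed for each query.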

Numeric fields are a sticking point... prefix queries and wildcard
queries aren't even possible there.  Of course, even stemming is
problematic with wildcard queries.

-Yonik


Re: case sensitivity

2007-04-27 Thread Yonik Seeley

On 4/26/07, Michael Kimsal [EMAIL PROTECTED] wrote:

My colleague, after some digging, found in SolrQueryParser

(around line 62)
setLowercaseExpandedTerms(false);

The default for Lucene is true.  Was this intentional?  Or an oversight?


Way back before Solr was open-sourced, and Chris was the only
user, I thought he needed case-sensitive prefix and wildcard
queries (hence I set it to false).  I think I may have been
mistaken about that need, but by that time, I didn't know if anyone
depended on it, so I never changed it back.

A default of false is actually more powerful too.  You can do prefix
queries on fields that have a LowerCaseFilter in their analyzer, and
also on fields that don't.  If it's set to true, you can't reliably do
prefix queries on fields that don't have a LowerCaseFilter.

-Yonik


Re: case sensitivity

2007-04-27 Thread Michael Pelz Sherman
In our experience, setting a LowercaseFilter in the query did not work; we had 
to call setLowercaseExpandedTerms(true) to get wildcard queries to be 
case-insensitive.
   
  Here's our analyzer definition from our solr schema:
   
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
   
  If calling setLowercaseExpandedTerms(true) is *not* in fact necessary for 
case-insensitive wildcard queries, could you please provide an example of a 
solr schema that can achieve this?
   
  Thanks!
  - mps
  



Re: case sensitivity

2007-04-27 Thread Yonik Seeley

On 4/27/07, Michael Pelz Sherman [EMAIL PROTECTED] wrote:

In our experience, setting a LowercaseFilter in the query did not work; we had 
to call setLowercaseExpandedTerms(true) to get wildcard queries to be 
case-insensitive.


Correct, because in that case the QueryParser does not invoke analysis
(because it's a partial word, not a whole word).


  If calling setLowercaseExpandedTerms(true) is *not* in fact necessary for 
case-insensitive wildcard queries, could you please provide an example of a 
solr schema that can achieve this?


I didn't say that :-)

I'm saying setLowercaseExpandedTerms(true) is not sufficient for
wildcard queries in general.  If the term is indexed as Windows95,
then a prefix query of Windows* won't find anything when
setLowercaseExpandedTerms(true) is in effect, because the prefix is
lowercased to windows* while the indexed term keeps its capital W.
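
A small self-contained sketch of that behavior, under stated assumptions: it does not call Lucene at all, it models a field's index as a plain list of terms, and the names PrefixCaseDemo and prefixMatches are made up for the illustration. The boolean parameter mimics what the lowercase-expanded-terms switch does to the prefix before matching.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class PrefixCaseDemo {
    /**
     * Returns the indexed terms that start with the prefix. When
     * lowercaseExpandedTerms is true, the prefix is lowercased first;
     * the indexed terms themselves are never modified.
     */
    public static List<String> prefixMatches(List<String> indexedTerms,
                                             String prefix,
                                             boolean lowercaseExpandedTerms) {
        String effective = lowercaseExpandedTerms
                ? prefix.toLowerCase(Locale.ROOT)
                : prefix;
        List<String> hits = new ArrayList<>();
        for (String term : indexedTerms) {
            if (term.startsWith(effective)) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Field without a lowercase filter: terms keep their original case.
        List<String> caseSensitiveField = Arrays.asList("Windows95", "Linux");
        // Field with a lowercase filter: terms were lowercased at index time.
        List<String> lowercasedField = Arrays.asList("windows95", "linux");

        System.out.println(prefixMatches(caseSensitiveField, "Windows", true));  // []
        System.out.println(prefixMatches(caseSensitiveField, "Windows", false)); // [Windows95]
        System.out.println(prefixMatches(lowercasedField, "Windows", true));     // [windows95]
        System.out.println(prefixMatches(lowercasedField, "windows", false));    // [windows95]
    }
}
```

With the switch off, both kinds of field can be queried as long as the user types the prefix in the case the field was indexed in; with it on, the case-preserving field becomes unreachable for any term containing an uppercase letter.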

-Yonik







Re: case sensitivity

2007-04-26 Thread Erik Hatcher


On Apr 26, 2007, at 5:43 PM, Michael Kimsal wrote:

I've looked through the mailing lists and can't find much of anything
regarding case sensitivity.  It seems SOLR is case sensitive by default -
I'm using the default settings with a very basic schema - just text fields.


It all depends on the analysis you have set up for the fields.  If
you're indexing string-type fields in the default example schema,
there is effectively no analysis, so searches must be exact matches,
case and all.


Is there any way to tell the query parser to be case insensitive during a
query?  Or do I have to reindex all my data again with lowercase values?


Terms are indexed in a case-sensitive manner, so if you need case  
insensitivity you need to lowercase on the way in and on querying.


Erik




Re: case sensitivity

2007-04-26 Thread Michael Kimsal

I was just writing a followup.

I'm using the default text field type

   <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
     <analyzer type="index">
       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
       <!-- in this example, we will only use synonyms at query time
       <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
       -->
       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
     </analyzer>
     <analyzer type="query">
       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
       <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
     </analyzer>
   </fieldtype>


That looks to me like it's got LowerCaseFilterFactory in the query analyzer
and the index analyzer.

I'm still digging in to this, but are there any other things to look for
anyone can point me to?  (Thanks Erik!)










--
Michael Kimsal
http://webdevradio.com


Re: case sensitivity

2007-04-26 Thread Michael Kimsal

type:changelog AND ( ( (listing:Fox) or (listing:Fox*) or (listing:*Fox) ) )
and
type:changelog AND ( ( (listing:fox) or (listing:fox*) or (listing:*fox) ) )

Is this to do with the wildcards?

Actually, I've just answered my own question.

type:changelog AND ( ( (listing:fox) ) )
and
type:changelog AND ( ( (listing:Fox) ) )

give the same results.

But adding in the "or listing:fox* or listing:*fox" clauses is always
case-sensitive.  However,
http://wiki.apache.org/lucene-java/LuceneFAQ#head-133cf44dd3dff3680c96c1316a663e881eeac35a
seems to say that wildcard searches are not case-sensitive.

Unless someone can point out a way around this, it seems I'll need to
manually reindex and lower-case everything on the way in, then reformat my
search queries to be lower-case as well.
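
One way the client-side half of that workaround might look is sketched below. It is a rough illustration only: QueryLowercaser and lowercaseTerms are hypothetical names, and the regular expression assumes simple field:term clauses like the ones above. It does not handle quoted phrases, range queries, or escaped characters, and it lowercases the value of every field, which is only appropriate if every queried field is lowercased at index time.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QueryLowercaser {
    // A field name, a colon, then a term running up to whitespace or a parenthesis.
    private static final Pattern FIELD_TERM = Pattern.compile("(\\w+):([^\\s()]+)");

    /**
     * Lowercases only the term part of each field:term clause, leaving
     * field names and boolean operators such as AND untouched.
     */
    public static String lowercaseTerms(String query) {
        Matcher m = FIELD_TERM.matcher(query);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String replacement = m.group(1) + ":" + m.group(2).toLowerCase(Locale.ROOT);
            // quoteReplacement keeps any '$' or '\' in the term literal.
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String query = "type:changelog AND (listing:Fox*)";
        System.out.println(lowercaseTerms(query));
        // prints: type:changelog AND (listing:fox*)
    }
}
```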








--
Michael Kimsal
http://webdevradio.com


Re: case sensitivity

2007-04-26 Thread Michael Kimsal

My colleague, after some digging, found in SolrQueryParser

(around line 62)
setLowercaseExpandedTerms(false);

The default for Lucene is true.  Was this intentional?  Or an oversight?

Perhaps it's not related to my problem, but it seems that it might be.

Thanks in advance!





--
Michael Kimsal
http://webdevradio.com


Re: case sensitivity

2007-04-26 Thread Erik Hatcher


On Apr 26, 2007, at 6:03 PM, Michael Kimsal wrote:

My colleague, after some digging, found in SolrQueryParser

(around line 62)
setLowercaseExpandedTerms(false);

The default for Lucene is true.  Was this intentional?  Or an oversight?


I was just about to respond that this is likely the issue with your  
non-totally-lowercased wildcard terms.


I don't consider it an oversight; rather, this whole analysis
business and wildcards are things whose handling varies from project
to project.  If you have, for example, a string field and want to do
prefix queries on it (trailing asterisk), you wouldn't want the term
to be lowercased.


I think we should open up as many of the switches as we can to  
QueryParser, allowing users to tinker with them if they want, setting  
the defaults to the most common reasonable settings we can agree upon.


Erik



Re: case sensitivity

2007-04-26 Thread Michael Kimsal

We're (and by 'we' I mean my esteemed colleague!) working on patches that
make a few of these items configurable in the solrconfig.xml file, and we
should have some patches to submit next week.  It's being done on 'company
time' and I'm not sure about the exact policy/procedure for this sort of
thing here (or indeed, if there is one at all).







--
Michael Kimsal
http://webdevradio.com