RE: finding exact case insensitive matches on single and multiword values

2010-12-05 Thread PeterKerk

Alright guys, thanks!

I went with storing everything in lowercase in the DB. I also lowercase all
Solr queries, and apply the proper casing logic only on the client, for UX
purposes.
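
A minimal sketch of that split, assuming the client is free to pick its own display rule. The function names are hypothetical, and collapsing repeated whitespace is an extra assumption that the thread does not mention:

```python
def to_solr_value(city: str) -> str:
    """Normalize user input the same way the stored values were normalized.

    Assumption: values are stored lowercase with single spaces between words.
    """
    return " ".join(city.split()).lower()


def to_display_value(city: str) -> str:
    """Hypothetical display rule for the UI: capitalize each word."""
    return " ".join(word.capitalize() for word in city.split())


print(to_solr_value("DEn  HaAG"))    # den haag
print(to_display_value("den haag"))  # Den Haag
```

A per-word capitalize rule is only a placeholder; real city names may need a lookup table for their proper casing.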

Thanks for suggestions!
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/finding-exact-case-insensitive-matches-on-single-and-multiword-values-tp2012207p2022834.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: finding exact case insensitive matches on single and multiword values

2010-12-04 Thread PeterKerk

Geert-Jan and Erick, thanks!

What I tried first was making it work with the string type, which works
perfectly for all lowercase values!

What I do not understand is how and why I have to make the casing work at
the client, since the casing differs in the database. Right now in the
database I have values for city:
Den Haag
Den HAAG
den haag
den haag

using fq=city:(den\ haag) gives me 2 results.

So it seems to me that because of the string type this casing issue cannot
be resolved as long as I'm using this fieldtype?


Then to the solution of tweaking the fieldtype for me to work.
I have this right now:

<fieldType name="myField" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

But I find it difficult to test what the result of the filters is, since,
as Erick already mentioned, the result can look correct when it really
isn't...
Is there some tool where I can add and remove filters to quickly see
what the output will be (without having to reload schema.xml and do a
reimport)?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/finding-exact-case-insensitive-matches-on-single-and-multiword-values-tp2012207p2017851.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: finding exact case insensitive matches on single and multiword values

2010-12-04 Thread Ahmet Arslan
> Then to the solution of tweaking the fieldtype for me to work.
> I have this right now:
>
> <fieldType name="myField" class="solr.TextField" sortMissingLast="true" omitNorms="true">
>   <analyzer>
>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>


Additionally you can add TrimFilterFactory to your analyzer chain. 

And instead of escaping white spaces you can use RawQParserPlugin.
fq={!raw f=city}den haag
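
A hedged sketch of sending that raw filter query from a client. The host and handler path are taken from the URLs elsewhere in this thread; the helper name is hypothetical. As far as I know the raw parser does not run the field's analyzer on the value, so the sketch lowercases it on the client to match what LowerCaseFilterFactory would have indexed:

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def raw_filter_query(field: str, value: str) -> str:
    """Build a {!raw} filter query; the value is sent as one unanalyzed term."""
    # Assumption: the indexed terms are lowercase, so lowercase here as well.
    return "{!raw f=%s}%s" % (field, value.lower())


params = [
    ("q", "*:*"),
    ("fq", raw_filter_query("city", "Den Haag")),
    ("rows", "25"),
]
# urlencode percent-encodes the braces, '!' and '=' and turns the space into '+'.
url = "http://localhost:8983/solr/db/select/?" + urlencode(params)
print(url)

# Round trip: this is the fq value the server receives after URL decoding.
decoded_fq = parse_qs(urlsplit(url).query)["fq"][0]
print(decoded_fq)  # {!raw f=city}den haag
```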







RE: finding exact case insensitive matches on single and multiword values

2010-12-04 Thread Jonathan Rochkind
ALL solr queries are case-sensitive.  

The trick is in the analyzers.  If you downcase everything at index time before 
you put it in the index, and downcase all queries at query time too -- then you 
have case-insensitive query.   Not because the Solr search algorithms are case 
insensitive, but because you've normalized all values to be all lowercase at 
both index and query time, so things will match. 

You can only do this kind of normalization through analyzers on a Solr text 
field, not a Solr string field. It's what the Solr text type is for. 
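
A toy illustration of that point in plain Python (not Solr code). The two "analyzers" are hypothetical stand-ins: one keeps the value verbatim like a string field, the other lowercases the whole value like a keyword tokenizer followed by a lowercase filter. The city values are the four from Peter's message:

```python
from collections import defaultdict


def string_analyzer(value):
    """String-field stand-in: the value is indexed verbatim."""
    return value


def lowercase_keyword_analyzer(value):
    """Text-field stand-in: one token for the whole value, lowercased."""
    return value.lower()


def build_index(values, analyzer):
    """Map each analyzed term to the ids of the documents containing it."""
    index = defaultdict(list)
    for doc_id, value in enumerate(values):
        index[analyzer(value)].append(doc_id)
    return index


def search(index, query, analyzer):
    """Term lookup is always exact; only the analyzer changes what is compared."""
    return index.get(analyzer(query), [])


cities = ["Den Haag", "Den HAAG", "den haag", "den haag"]

exact = build_index(cities, string_analyzer)
normalized = build_index(cities, lowercase_keyword_analyzer)

print(len(search(exact, "den haag", string_analyzer)))                  # 2
print(len(search(normalized, "DEn HaAG", lowercase_keyword_analyzer)))  # 4
```

The lookup itself never ignores case; the second index finds all four records only because both sides were lowercased before comparison.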

This wiki page, and this question in particular, will be helpful to you:
http://wiki.apache.org/solr/SolrRelevancyCookbook#Relevancy_and_Case_Matching



finding exact case insensitive matches on single and multiword values

2010-12-03 Thread PeterKerk


Users call this URL on my site:
/?search=1&city=den+haag
or even /?search=1&city=Den+Haag (casing of the city name can be anything)


Under the hood I call Solr:
http://localhost:8983/solr/db/select/?indent=on&facet=true&fq=city:den+haag&q=*:*&start=0&rows=25&fl=id,title,friendlyurl,city&facet.field=city


but this returns 0 results, even though I KNOW there are exactly 54 records
that have an exact match on den haag (in this case even with lower casing
in DB).

citynames are stored with various casings in DB, so when searching with
solr, the search must ignore casing.


my schema.xml

<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
<field name="city" type="string" indexed="true" stored="true"/>


To check what was going on, I opened my analysis.jsp, 

for field name I provide: city
for Field value (Index)  I provide: den haag
When I analyze this I get:
den haag

So that seems correct to me. Why is it that no results are returned?

My requirements summarized:
- I want to search independent of case on cityname:
when a user searches on "DEn HaAG" he will get the records that have value
"Den Haag", but also records that have "den haag" etc.
- citynames may consist of multiple words but only an exact match is valid,
so when a user searches for "den", he will not find "den haag" records. And
when searching on "den haag" it will only return matches on that and not other
cities like "den bosch".

How can I achieve this?

I think I need a new fieldtype  in my schema.xml, but am not sure which
tokenizers and analyzers I need, here's what I tried:

<fieldType name="exactmatch" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
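
A rough sketch of why a whitespace tokenizer works against the exact-match requirement. This is plain Python, not Solr: the two functions are hypothetical stand-ins for a whitespace tokenizer and a keyword tokenizer (both followed by lowercasing), the synonym, stopword and accent filters are ignored, and a match is assumed to mean "any query term equals any indexed term":

```python
def whitespace_terms(value):
    """Whitespace-tokenizer stand-in: one lowercase term per word."""
    return [term.lower() for term in value.split()]


def keyword_terms(value):
    """Keyword-tokenizer stand-in: the whole value is a single lowercase term."""
    return [value.lower()]


def matches(query, stored, analyze):
    """Assumed matching rule: any analyzed query term equals any indexed term."""
    return bool(set(analyze(query)) & set(analyze(stored)))


# Word-level terms let partial and overlapping city names match.
print(matches("den", "Den Haag", whitespace_terms))        # True
print(matches("den haag", "den bosch", whitespace_terms))  # True

# A single term per value only matches the complete name, in any casing.
print(matches("den", "Den Haag", keyword_terms))           # False
print(matches("den haag", "den bosch", keyword_terms))     # False
print(matches("DEn HaAG", "Den Haag", keyword_terms))      # True
```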


Help is really appreciated!
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/finding-exact-case-insensitive-matches-on-single-and-multiword-values-tp2012207p2012207.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: finding exact case insensitive matches on single and multiword values

2010-12-03 Thread Erick Erickson
The root of your problem, I think, is fq=city:den+haag which parses into
city:den +defaultfield:haag

Try parens, i.e. city:(den haag).
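
To see what the query parser actually receives, here is a small sketch using only Python's standard library. The query string is a shortened version of the one in the original post, and the whitespace split is only a crude stand-in for the real Lucene query parser:

```python
from urllib.parse import parse_qs

# Shortened version of the request from the original post.
query_string = "q=*:*&fq=city:den+haag&rows=25"

params = parse_qs(query_string)
fq = params["fq"][0]
print(fq)  # city:den haag   (the '+' is URL-decoded to a space)

# Crude stand-in for the parser: clauses are separated by whitespace,
# so only the first clause still carries the "city:" prefix.
clauses = fq.split()
print(clauses)  # ['city:den', 'haag']
```

The second clause has no field prefix, so it goes to whatever the default search field is.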

Attaching &debugQuery=on is often a way to see things like this quickly.

Also, if you haven't seen the analysis page from the admin page, it's really
valuable for figuring out the effects of analyzers. You can probably do
something like:

<fieldType name="myField" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

to get what you want.

Best
Erick




Re: finding exact case insensitive matches on single and multiword values

2010-12-03 Thread PeterKerk


You are right, this is what I see when I append the debug query (very very
useful btw!!!) in the old situation:
<arr name="parsed_filter_queries">
  <str>city:den title:haag</str>
  <str>PhraseQuery(themes:"hotel en restaur")</str>
</arr>



I then changed the schema.xml to:

<fieldType name="myField" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="city" type="myField" indexed="true" stored="true"/> <!-- used to be string -->


I then tried adding parentheses:
http://localhost:8983/solr/db/select/?indent=on&facet=true&fq=city:(den+haag)&q=*:*&start=0&rows=25&fl=id,title,friendlyurl,city&facet.field=city
also tried (without +):
http://localhost:8983/solr/db/select/?indent=on&facet=true&fq=city:(den haag)&q=*:*&start=0&rows=25&fl=id,title,friendlyurl,city&facet.field=city

Then I get:

<arr name="parsed_filter_queries">
  <str>city:den city:haag</str>
</arr>

And still 0 results.

But as you can see, the query is split up into 2 separate words; I don't
think that is what I need?


-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/finding-exact-case-insensitive-matches-on-single-and-multiword-values-tp2012207p2012509.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: finding exact case insensitive matches on single and multiword values

2010-12-03 Thread Geert-Jan Brits
When you went from StrField to TextField in your config you enabled
tokenizing (which I believe splits on spaces by default),
which is why you see separate 'words' / terms in the debugQuery explanation.

I believe you want to keep your old StrField config and try quoting:

fq=city:"den+haag" or fq=city:"den haag"

Concerning the lower-casing: wouldn't it be easiest to do that at the
client? (I'm not sure at the moment how to do lowercasing with a StrField.)

Geert-jan





Re: finding exact case insensitive matches on single and multiword values

2010-12-03 Thread Erick Erickson
Arrrgh, Geert-Jan is right, that's the 15th time at least this has tripped
me up.

I'm pretty sure that text will work if you escape the space, e.g.
city:(den\ haag). The debug output is a little confusing since it has a line
like
city:den haag

which almost looks wrong... but it worked out OK on a couple of queries I
tried.

Geert-Jan is also right in that filters aren't applied to string types,
so there are two possibilities: either handle the casing on the client
side as he suggests and use string, or make the text type work.
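
A small sketch of the client-side half of either option, assuming the field is called city as in the thread. The helper name is hypothetical. It lowercases the input (needed for the string option, harmless for the text option) and escapes only spaces; other query-syntax special characters are not handled here:

```python
def city_filter(city: str) -> str:
    """Build an fq value such as city:(den\\ haag) from raw user input."""
    normalized = city.strip().lower()
    # Backslash-escape each space so the query parser keeps the name as one term.
    escaped = normalized.replace(" ", "\\ ")
    return "city:(" + escaped + ")"


print(city_filter("Den Haag"))   # city:(den\ haag)
print(city_filter("Amsterdam"))  # city:(amsterdam)
```

The resulting value still has to be URL-encoded before it goes into the request.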


Sorry for the confusion
Erick
