RE: finding exact case insensitive matches on single and multiword values
Alright guys, thanks! I went for storing everything lowercase in DB. I also lowercased all solr queries and only on client for UX purposes I applied the proper casing logic. Thanks for suggestions! -- View this message in context: http://lucene.472066.n3.nabble.com/finding-exact-case-insensitive-matches-on-single-and-multiword-values-tp2012207p2022834.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: finding exact case insensitive matches on single and multiword values
Geert-Jan and Erick, thanks!

What I tried first is making it work with the string type, and that works perfectly for all lowercase values! What I do not understand is how and why I have to make the casing work at the client, since the casing differs in the database. Right now in the database I have these values for city:

  Den Haag
  Den HAAG
  den haag
  den haag

Using fq=city:(den\ haag) gives me 2 results. So it seems to me that because of the string type this casing issue cannot be resolved as long as I'm using this fieldtype?

Then to the solution of tweaking the fieldtype to make it work for me. I have this right now:

  <fieldType name="myField" class="solr.TextField" sortMissingLast="true" omitNorms="true">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

But I find it difficult to test what the result of the filters is, and as Erick already mentioned, the result looks correct but really isn't... Is there some tool where I can add and remove filters to quickly see what the output will be (without having to reload schema.xml and reimport)?
Re: finding exact case insensitive matches on single and multiword values
> Then to the solution of tweaking the fieldtype for me to work. I have this right now:
>
>   <fieldType name="myField" class="solr.TextField" sortMissingLast="true" omitNorms="true">
>     <analyzer>
>       <tokenizer class="solr.KeywordTokenizerFactory"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>     </analyzer>
>   </fieldType>

Additionally, you can add TrimFilterFactory to your analyzer chain. And instead of escaping white space you can use RawQParserPlugin:

  fq={!raw f=city}den haag
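A minimal sketch of how a client might build the raw-parser filter suggested above. The raw query parser creates a term query without running any analysis, so the LowerCaseFilterFactory in the field type is not applied to the query text; this sketch therefore assumes the client trims and lowercases the value itself. The function name `raw_city_filter_url` and the base URL are illustrative assumptions, not part of the thread.

```python
from urllib.parse import urlencode


def raw_city_filter_url(base_url: str, city: str) -> str:
    """Build a select URL that filters on an exact city value via {!raw}.

    The raw parser skips query-time analysis, so the value is trimmed and
    lowercased here to match terms indexed through a lowercasing analyzer.
    """
    fq = "{!raw f=city}" + city.strip().lower()
    params = {"q": "*:*", "fq": fq, "rows": 25}
    # urlencode percent-encodes the braces and turns the space into '+'.
    return base_url + "?" + urlencode(params)


# Hypothetical base URL, modelled on the one used in this thread.
url = raw_city_filter_url("http://localhost:8983/solr/db/select/", " DEn HaAG ")
print(url)
```

Because the whole value travels as one term, no backslash escaping of the space is needed in this variant.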
RE: finding exact case insensitive matches on single and multiword values
ALL Solr queries are case-sensitive. The trick is in the analyzers. If you downcase everything at index time before you put it in the index, and downcase all queries at query time too, then you have case-insensitive querying. Not because the Solr search algorithms are case insensitive, but because you've normalized all values to lowercase at both index and query time, so things will match.

You can only do this kind of normalization through analyzers on a Solr text field, not a Solr string field. It's what the Solr text type is for.

This wiki page, and this question in particular, will be helpful to you: http://wiki.apache.org/solr/SolrRelevancyCookbook#Relevancy_and_Case_Matching

From: PeterKerk [vettepa...@hotmail.com]
Sent: Saturday, December 04, 2010 6:24 AM
To: solr-user@lucene.apache.org
Subject: Re: finding exact case insensitive matches on single and multiword values

> [...] So it seems to me that because of the string type this casing issue cannot be resolved as long as I'm using this fieldtype? [...]
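The normalization idea described above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: `normalize` stands in for a KeywordTokenizer followed by trim and lowercase filters (the whole value stays one token), and the dictionary index is a toy, not how Solr stores terms. All names here are hypothetical.

```python
def normalize(value: str) -> str:
    """Treat the whole value as a single token, trimmed and lowercased."""
    return value.strip().lower()


def build_index(values):
    """Map each normalized value to the ids of the documents containing it."""
    index = {}
    for doc_id, value in enumerate(values):
        index.setdefault(normalize(value), []).append(doc_id)
    return index


def search(index, query):
    """Apply the same normalization at query time, then do an exact lookup."""
    return index.get(normalize(query), [])


cities = ["Den Haag", "Den HAAG", "den haag", "den bosch"]
index = build_index(cities)

print(search(index, "DEn HaAG"))  # all three "den haag" variants
print(search(index, "den"))       # no partial matches
```

The lookup itself stays case-sensitive; matching works only because both sides pass through the same `normalize` step.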
finding exact case insensitive matches on single and multiword values
Users call this URL on my site:

  /?search=1&city=den+haag

or even

  /?search=1&city=Den+Haag

(the casing of the cityname can be anything)

Under the hood I call Solr:

  http://localhost:8983/solr/db/select/?indent=on&facet=true&fq=city:den+haag&q=*:*&start=0&rows=25&fl=id,title,friendlyurl,city&facet.field=city

but this returns 0 results, even though I KNOW there are exactly 54 records that have an exact match on den haag (in this case even with lower casing in the DB). Citynames are stored with various casings in the DB, so when searching with Solr, the search must ignore casing.

My schema.xml:

  <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true" />
  <field name="city" type="string" indexed="true" stored="true"/>

To check what was going on, I opened my analysis.jsp. For field name I provide: city. For Field value (Index) I provide: den haag. When I analyze this I get: den haag. So that seems correct to me. Why is it that no results are returned?

My requirements summarized:

- I want to search independent of case on cityname: when a user searches on DEn HaAG he will get the records that have the value Den Haag, but also records that have den haag etc.
- Citynames may consist of multiple words but only an exact match is valid, so when a user searches for den, he will not find den haag records. And a search on den haag will only return matches on that and not on other cities like den bosch.

How can I achieve this? I think I need a new fieldtype in my schema.xml, but I am not sure which tokenizers and analyzers I need. Here's what I tried:

  <fieldType name="exactmatch" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt" />
      <filter class="solr.ISOLatin1AccentFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

Help is really appreciated!
Re: finding exact case insensitive matches on single and multiword values
The root of your problem, I think, is fq=city:den+haag, which parses into city:den +defaultfield:haag. Try parens, i.e. city:(den haag). Attaching debugQuery=on is often a way to see things like this quickly.

Also, if you haven't seen the analysis page from the admin page, it's really valuable for figuring out the effects of analyzers.

You can probably do something like:

  <fieldType name="myField" class="solr.TextField" sortMissingLast="true" omitNorms="true">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

to get what you want.

Best
Erick

On Fri, Dec 3, 2010 at 10:46 AM, PeterKerk vettepa...@hotmail.com wrote:

> [...] but this returns 0 results, even though I KNOW there are exactly 54 records that have an exact match on den haag [...]
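As a rough sketch of the debugQuery tip above: the helper below builds a select URL with debugQuery=on, and a second helper pulls the parsed filter queries out of a JSON response. The function names are hypothetical, the use of wt=json is an assumption about how the client reads the response, and `sample_response` is a hand-written stand-in shaped like the debug output quoted later in the thread, not real Solr output.

```python
from urllib.parse import urlencode


def debug_select_url(base_url: str, fq: str) -> str:
    """Build a select URL that asks Solr to include debug information."""
    params = {"q": "*:*", "fq": fq, "rows": 0, "wt": "json", "debugQuery": "on"}
    return base_url + "?" + urlencode(params)


def parsed_filter_queries(response: dict) -> list:
    """Return the parsed filter queries from a decoded JSON response, if any."""
    return response.get("debug", {}).get("parsed_filter_queries", [])


url = debug_select_url("http://localhost:8983/solr/db/select/", "city:den haag")
print(url)

# Hand-written stand-in for a decoded response (assumption, not real output).
sample_response = {"debug": {"parsed_filter_queries": ["city:den title:haag"]}}
print(parsed_filter_queries(sample_response))
```

Seeing a second clause on a different field in that list is the quick signal that the multiword value was split before it reached the city field.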
Re: finding exact case insensitive matches on single and multiword values
You are right, this is what I see when I append the debug query (very very useful btw!!!) in the old situation:

  <arr name="parsed_filter_queries">
    <str>city:den title:haag</str>
    <str>PhraseQuery(themes:"hotel en restaur")</str>
  </arr>

I then changed the schema.xml to:

  <fieldType name="myField" class="solr.TextField" sortMissingLast="true" omitNorms="true">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <field name="city" type="myField" indexed="true" stored="true"/> <!-- used to be string -->

I then tried adding parentheses:

  http://localhost:8983/solr/db/select/?indent=on&facet=true&fq=city:(den+haag)&q=*:*&start=0&rows=25&fl=id,title,friendlyurl,city&facet.field=city

also tried (without +):

  http://localhost:8983/solr/db/select/?indent=on&facet=true&fq=city:(den haag)&q=*:*&start=0&rows=25&fl=id,title,friendlyurl,city&facet.field=city

Then I get:

  <arr name="parsed_filter_queries">
    <str>city:den city:haag</str>
  </arr>

And still 0 results. But as you can see the query is split up into 2 separate words, and I don't think that is what I need?
Re: finding exact case insensitive matches on single and multiword values
When you went from strField to TextField in your config you enabled tokenizing (which I believe splits on spaces by default), which is why you see separate 'words' / terms in the debugQuery explanation.

I believe you want to keep your old strField config and try quoting:

  fq=city:"den+haag" or fq=city:"den haag"

Concerning the lower-casing: wouldn't it be easiest to do that at the client? (I'm not sure at the moment how to do lowercasing with a strField.)

Geert-jan

2010/12/3 PeterKerk vettepa...@hotmail.com

> [...] But as you can see the query is split up into 2 separate words, I dont think that is what I need?
Re: finding exact case insensitive matches on single and multiword values
Arrrgh, Geert-Jan is right, that's the 15th time at least this has tripped me up.

I'm pretty sure that text will work if you escape the space, e.g. city:(den\ haag). The debug output is a little confusing since it has a line like city:den haag which almost looks wrong... but it worked out OK on a couple of queries I tried.

Geert-Jan is also right in that filters aren't applied to string types, so there are two possibilities: either handle the casing on the client side as he suggests and use string, or make the text type work.

Sorry for the confusion
Erick

On Fri, Dec 3, 2010 at 11:54 AM, Geert-Jan Brits gbr...@gmail.com wrote:

> [...] I believe you want to keep your old strField config and try quoting [...] Concerning the lower-casing: wouldn't it be easiest to do that at the client? [...]
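The escaping approach mentioned above could be done on the client along these lines. This is a sketch, not a definitive implementation: `escape_term` and `city_fq` are hypothetical helpers, and the character set is an assumption based on the usual Lucene query-syntax specials plus whitespace. It does not lowercase the value, on the assumption that the text field's query analyzer (KeywordTokenizer plus LowerCaseFilter) handles casing.

```python
import re

# Characters that commonly have special meaning in Lucene query syntax,
# plus whitespace, so a multiword value stays a single term.
_SPECIAL = re.compile(r'([+\-&|!(){}\[\]^"~*?:\\/\s])')


def escape_term(value: str) -> str:
    """Prefix every special character or whitespace with a backslash."""
    return _SPECIAL.sub(r"\\\1", value)


def city_fq(city: str) -> str:
    """Build a filter query that targets the city field with one escaped term."""
    return "city:" + escape_term(city.strip())


print(city_fq("den haag"))
print(city_fq("Den Haag"))
```

The resulting string still needs URL encoding before it is placed in the request, for example with `urllib.parse.urlencode`.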