RE: Solr, SQL Server's LIKE
Great suggestion! Thanks for keeping it simple for a complete Solr newbie. I'm going to go try this right now.

Thanks!

Devon Baumgarten

-----Original Message-----
From: Shawn Heisey [mailto:s...@elyograg.org]
Sent: Monday, January 02, 2012 12:30 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr, SQL Server's LIKE

On 12/29/2011 3:51 PM, Devon Baumgarten wrote:
> N-Grams get me pretty great results in general, but I don't want the results
> for this particular search to be fuzzy. How can I prevent the fuzzy matches
> from appearing?
>
> Ex: If I search "Albatross" I want "Albert" to be excluded completely, rather
> than having a low score.

To achieve this while using the ngram filter, just do the ngram analysis on the index side, but not on the query side. If you do this, you'll likely need a maxGramSize larger than would normally be required (which will make the index larger), and you might need to use the LengthFilter too.

Thanks,
Shawn
Re: Solr, SQL Server's LIKE
On 12/29/2011 3:51 PM, Devon Baumgarten wrote:
> N-Grams get me pretty great results in general, but I don't want the results
> for this particular search to be fuzzy. How can I prevent the fuzzy matches
> from appearing?
>
> Ex: If I search "Albatross" I want "Albert" to be excluded completely, rather
> than having a low score.

To achieve this while using the ngram filter, just do the ngram analysis on the index side, but not on the query side. If you do this, you'll likely need a maxGramSize larger than would normally be required (which will make the index larger), and you might need to use the LengthFilter too.

Thanks,
Shawn
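A minimal sketch of why this works, in plain Python rather than Solr configuration. The function names, the in-memory dictionary index, and the gram sizes are all assumptions made for illustration, and the LengthFilter is not modeled. When only the indexed side is expanded into n-grams, the whole query term has to exist as an indexed gram, so "Albert" can no longer match "Albatross" through a shared gram such as "alb". This is also why maxGramSize has to be at least as long as the longest query you expect.

```python
def ngrams(token, min_gram, max_gram):
    """Return every substring of token whose length is in [min_gram, max_gram]."""
    return {
        token[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(token) - n + 1)
    }


def build_index(names, min_gram=2, max_gram=15):
    """Index-side analysis (hypothetical): lowercase, then expand into n-grams."""
    index = {}
    for name in names:
        for gram in ngrams(name.lower(), min_gram, max_gram):
            index.setdefault(gram, set()).add(name)
    return index


def search_plain(index, query):
    """Query side WITHOUT n-grams: the whole lowercased query must be an indexed gram."""
    return index.get(query.lower(), set())


def search_ngrammed(index, query, min_gram=2, max_gram=15):
    """Query side WITH n-grams: any shared gram yields a (possibly weak) match."""
    hits = set()
    for gram in ngrams(query.lower(), min_gram, max_gram):
        hits |= index.get(gram, set())
    return hits


names = ["Albatross", "Albert", "Albatross Road"]
index = build_index(names)

exact_style_hits = search_plain(index, "Albatross")    # "Albert" is excluded
fuzzy_style_hits = search_ngrammed(index, "Albatross")  # "Albert" sneaks in via "alb"

# With a small max_gram the 9-character query is never indexed as a gram.
too_small_hits = search_plain(build_index(names, 2, 4), "Albatross")
```

In this model the index grows with max_gram because every name contributes more grams, which matches the warning above about index size.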
Re: Solr, SQL Server's LIKE
Thanks, Erick! That sounds great. I really do have to upgrade.

Chantal

On Sun, 2012-01-01 at 16:42 +0100, Erick Erickson wrote:
> Chantal:
>
> bq: The problem with the wildcard searches is that the input is not
> analyzed.
>
> As of 3.6/4.0, this is no longer entirely true. Some analysis is
> performed for wildcard searches by default and you can
> specify most anything you want if you really need to see:
> https://issues.apache.org/jira/browse/SOLR-2438
> and
> http://wiki.apache.org/solr/MultitermQueryAnalysis
>
> Best
> Erick
Re: Solr, SQL Server's LIKE
Chantal:

bq: The problem with the wildcard searches is that the input is not analyzed.

As of 3.6/4.0, this is no longer entirely true. Some analysis is performed for wildcard searches by default and you can specify most anything you want if you really need to see:
https://issues.apache.org/jira/browse/SOLR-2438
and
http://wiki.apache.org/solr/MultitermQueryAnalysis

Best
Erick

On Fri, Dec 30, 2011 at 4:33 PM, Devon Baumgarten wrote:
> Hoss,
>
> Thanks. You've answered my question. To clarify, what I should have asked for
> instead of 'exact' was 'not fuzzy'. For some reason it didn't occur to me
> that I didn't need n-grams to use the wildcard. You asking for me to clarify
> what I meant made me realize that the n-grams are the source of all my
> current problems. :)
>
> Thanks!
>
> Devon Baumgarten
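To illustrate the point about wildcard input and analysis, here is a toy model in Python. It is not Solr's implementation: `analyze`, `prefix_query`, and the `analyze_multiterm` flag are hypothetical names, and the only analysis step modeled is lowercasing. If terms are lowercased at index time, a raw wildcard term such as "Alb*" finds nothing unless the partial term is lowercased as well.

```python
def analyze(text):
    """Stand-in for an index-time analyzer: split on whitespace, then lowercase."""
    return [tok.lower() for tok in text.split()]


def prefix_query(indexed_terms, raw_query, analyze_multiterm):
    """Match indexed terms against a trailing-wildcard query such as 'Alb*'."""
    prefix = raw_query.rstrip("*")
    if analyze_multiterm:
        # Lowercasing is safe to apply to a partial term; stemming would not be.
        prefix = prefix.lower()
    return sorted(t for t in indexed_terms if t.startswith(prefix))


terms = set(analyze("Albatross Road"))

unanalyzed_hits = prefix_query(terms, "Alb*", analyze_multiterm=False)
analyzed_hits = prefix_query(terms, "Alb*", analyze_multiterm=True)
```

The linked wiki page describes which analysis components Solr actually applies to such terms; the sketch only shows why some analysis is needed at all.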
RE: Solr, SQL Server's LIKE
Hoss,

Thanks. You've answered my question. To clarify, what I should have asked for instead of 'exact' was 'not fuzzy'. For some reason it didn't occur to me that I didn't need n-grams to use the wildcard. Your asking me to clarify what I meant made me realize that the n-grams are the source of all my current problems. :)

Thanks!

Devon Baumgarten

-----Original Message-----
From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
Sent: Thursday, December 29, 2011 7:00 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr, SQL Server's LIKE

: Thanks. I know I'll be able to utilize some of Solr's free text
: searching capabilities in other search types in this project. The
: product manager wants this particular search to exactly mimic LIKE%.
	...
: Ex: If I search "Albatross" I want "Albert" to be excluded completely,
: rather than having a low score.

Please be specific about the types of queries you want, i.e. we need more than one example of the type of input you want to provide, the type of matches you want to see for that input, and the type of matches you do not want to get back.

In your first message you said you need to match company titles "pretty exactly", but then you seem to contradict yourself by saying that SQL's LIKE command fits the bill, even though the SQL LIKE command exists specifically for inexact matches on field values.

Based on your one example above of Albatross, you don't need anything special: don't use ngrams, don't use stemming, don't use fuzzy anything. Just search for "Albatross" and it will match "Albatross" but not "Albert". If you want "Albatross" to match "Albatross Road", use some basic tokenization.

If all you really care about is prefix searching (which seems suggested by your "LIKE%" comment above, which I'm guessing is shorthand for something similar to "LIKE 'ABC%'"), so that queries like "abc" and "abcd" both match "abcdef" and "abcd" but neither of them matches a value where that text only appears later in the string (for example "xabcd"), then just use prefix queries (i.e. "abcd*"). They should be plenty efficient for your purposes. You only need to worry about ngrams when you want to efficiently match in the middle of a string (i.e. "TITLE LIKE %ABC%").

-Hoss
RE: Solr, SQL Server's LIKE
The problem with the wildcard searches is that the input is not analyzed. For English, this might not be such a problem (except if you expect case-insensitive search). But then again, you don't get that with LIKE either when the column uses a case-sensitive collation. Ngrams bring that and more.

What I think is often forgotten when comparing LIKE and Solr search is this: Solr's analyzers allow not only for case-insensitive search but also for other analysis such as removing diacritics, and this is also applied when sorting (you have to create a separate index in the DB, as well, if you want that).

Say you have the following names:

'Van Hinden'
'van Hinden'
'Música'
'Musil'

like 'mu%' - no hits
like 'Mu%' - 1 hit
like 'van%' - 1 hit
like 'hin%' - no hits

With Solr's whitespace or standard tokenizer, ngrams, and a diacritics and lowercase filter (no wildcard search):

'mu'/'Mu' - 2 hits, sorted ignoring case and diacritics
'van' - 2 hits
'hin' - 2 hits

(This is written down from experience. I haven't checked those examples explicitly.)

Cheers,
Chantal
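The hit counts in this example can be reproduced with a small Python model. It makes several assumptions: `like_prefix` models a case- and accent-sensitive LIKE 'prefix%' (SQL Server's actual behavior depends on the collation in use), `fold` stands in for lowercase and diacritics filters, and matching on token prefixes stands in for the ngram analysis. None of these names come from Solr.

```python
import unicodedata

# "\u00fa" is a precomposed u-with-acute, so "Música" does not start with "Mu".
NAMES = ["Van Hinden", "van Hinden", "M\u00fasica", "Musil"]


def like_prefix(names, prefix):
    """Model of a case- and accent-sensitive LIKE 'prefix%' on the whole value."""
    return [n for n in names if n.startswith(prefix)]


def fold(text):
    """Lowercase and strip diacritics (NFD-decompose, drop combining marks)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def analyzed_search(names, query):
    """Tokenize on whitespace, fold case and diacritics, match token prefixes."""
    q = fold(query)
    hits = [n for n in names if any(tok.startswith(q) for tok in fold(n).split())]
    return sorted(hits, key=fold)  # sort order also ignores case and diacritics


like_counts = {p: len(like_prefix(NAMES, p)) for p in ("mu", "Mu", "van", "hin")}
solr_counts = {q: len(analyzed_search(NAMES, q)) for q in ("mu", "Mu", "van", "hin")}
```

Because each name is tokenized before matching, 'hin' reaches the second word of both 'Hinden' entries, which a prefix-only LIKE on the whole value cannot do.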
RE: Solr, SQL Server's LIKE
: Thanks. I know I'll be able to utilize some of Solr's free text
: searching capabilities in other search types in this project. The
: product manager wants this particular search to exactly mimic LIKE%.
	...
: Ex: If I search "Albatross" I want "Albert" to be excluded completely,
: rather than having a low score.

Please be specific about the types of queries you want, i.e. we need more than one example of the type of input you want to provide, the type of matches you want to see for that input, and the type of matches you do not want to get back.

In your first message you said you need to match company titles "pretty exactly", but then you seem to contradict yourself by saying that SQL's LIKE command fits the bill, even though the SQL LIKE command exists specifically for inexact matches on field values.

Based on your one example above of Albatross, you don't need anything special: don't use ngrams, don't use stemming, don't use fuzzy anything. Just search for "Albatross" and it will match "Albatross" but not "Albert". If you want "Albatross" to match "Albatross Road", use some basic tokenization.

If all you really care about is prefix searching (which seems suggested by your "LIKE%" comment above, which I'm guessing is shorthand for something similar to "LIKE 'ABC%'"), so that queries like "abc" and "abcd" both match "abcdef" and "abcd" but neither of them matches a value where that text only appears later in the string (for example "xabcd"), then just use prefix queries (i.e. "abcd*"). They should be plenty efficient for your purposes. You only need to worry about ngrams when you want to efficiently match in the middle of a string (i.e. "TITLE LIKE %ABC%").

-Hoss
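A rough sketch of why prefix queries are cheap while infix matching needs something like ngrams. It assumes the index keeps its terms in sorted order, and `prefix_matches` is a hypothetical helper rather than a Lucene or Solr call. All terms that share a prefix sit next to each other in a sorted list, so one binary search finds the start of the run. A term that merely contains the text, such as "xabcd", sorts elsewhere and is never visited.

```python
import bisect


def prefix_matches(sorted_terms, prefix):
    """Return all terms starting with prefix from an already sorted term list."""
    start = bisect.bisect_left(sorted_terms, prefix)
    end = start
    # Terms sharing the prefix are contiguous, so stop at the first mismatch.
    while end < len(sorted_terms) and sorted_terms[end].startswith(prefix):
        end += 1
    return sorted_terms[start:end]


terms = sorted(["abcd", "abcdef", "xabcd", "albatross", "albert"])

abc_hits = prefix_matches(terms, "abc")
abcd_hits = prefix_matches(terms, "abcd")
albatross_hits = prefix_matches(terms, "albatross")
```

Answering a pattern like %ABC% this way would require scanning every term, which is the case where indexing ngrams pays off.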
Re: Solr, SQL Server's LIKE
Hi Devon,

Have you considered using a permuterm index? It's workable, but depending on your requirements (the size of the fields you want to index this way), it may bloat your index. I've written about it here:
http://sujitpal.blogspot.com/2011/10/lucene-wildcard-query-and-permuterm.html

Another alternative, which I've implemented, is a custom mechanism that retrieves a list of matching unique ids from a database table using a SQL LIKE, then passes this list as a filter to the main query. It's hacky, but I was building a custom handler anyway, so it was quite simple to add in.

-sujit

On Thu, 2011-12-29 at 11:38 -0600, Devon Baumgarten wrote:
> I have been tinkering with Solr for a few weeks, and I am convinced that it
> could be very helpful in many of my upcoming projects. I am trying to decide
> whether Solr is appropriate for this one, and I haven't had luck looking for
> answers on Google.
>
> I need to search a list of names of companies and individuals pretty exactly.
> T-SQL's LIKE operator does this with decent performance, but I have a feeling
> there is a way to configure Solr to do this better. I've tried using an edge
> N-gram tokenizer, but it feels like it might be more complicated than
> necessary. What would you suggest?
>
> I know this sounds kind of 'Golden Hammer,' but there has been talk of other,
> more complicated (magic) searches that I don't think SQL Server can handle,
> since its tokens (as far as I know) can't be smaller than one word.
>
> Thanks,
>
> Devon Baumgarten
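For readers unfamiliar with the permuterm idea, here is a minimal Python sketch of the general technique. It is not the code from the linked post: the "$" end marker, the dictionary index, and the function names are assumptions, and only patterns with exactly one "*" are handled. Every rotation of term + "$" is indexed, and a pattern X*Y is rotated to Y$X so that the wildcard ends up at the end and the lookup becomes a prefix match.

```python
END = "$"  # end-of-term marker; assumed never to occur inside an indexed term


def rotations(term):
    """All rotations of term + END, e.g. 'ab' -> ['ab$', 'b$a', '$ab']."""
    marked = term + END
    return [marked[i:] + marked[:i] for i in range(len(marked))]


def build_permuterm(terms):
    """Map every rotation back to the original term(s) that produced it."""
    index = {}
    for term in terms:
        for rot in rotations(term):
            index.setdefault(rot, set()).add(term)
    return index


def wildcard_search(index, pattern):
    """Answer a pattern containing exactly one '*' with a prefix match on rotations."""
    if pattern.count("*") != 1:
        raise ValueError("this sketch handles exactly one '*' per pattern")
    head, _, tail = pattern.partition("*")
    key = tail + END + head  # rotate so the wildcard falls at the end
    # A real index would do a sorted prefix lookup instead of this linear scan.
    return {t for rot, ts in index.items() if rot.startswith(key) for t in ts}


index = build_permuterm(["albatross", "albert", "ross"])

prefix_hits = wildcard_search(index, "alb*")
suffix_hits = wildcard_search(index, "*ross")
middle_hits = wildcard_search(index, "al*ss")
```

A term of length n produces n + 1 index entries, which is where the index bloat mentioned above comes from.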
RE: Solr, SQL Server's LIKE
Erick,

Thanks. I know I'll be able to utilize some of Solr's free text searching capabilities in other search types in this project. The product manager wants this particular search to exactly mimic LIKE%.

N-Grams get me pretty great results in general, but I don't want the results for this particular search to be fuzzy. How can I prevent the fuzzy matches from appearing?

Ex: If I search "Albatross" I want "Albert" to be excluded completely, rather than having a low score.

Devon Baumgarten

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Thursday, December 29, 2011 3:44 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr, SQL Server's LIKE

SQL's "like" is usually handled with ngrams if you want *stuff* kinds of searches. Wildcards are "interesting" in Solr.

Things Solr handles that aren't easy in SQL: phrases, phrases with slop, stemming, synonyms. And, especially, some kind of relevance ranking.

But Solr does NOT do the things SQL is best at, things like joins etc. Each has its sweet spot, and trying to make one do all the functions of the other is fraught with places to go wrong.

Not a lot of help, but free text searching is what Solr is all about, so if your problem maps into that space, it's a great tool!

Best
Erick
Re: Solr, SQL Server's LIKE
SQL's "like" is usually handled with ngrams if you want *stuff* kinds of searches. Wildcards are "interesting" in Solr.

Things Solr handles that aren't easy in SQL: phrases, phrases with slop, stemming, synonyms. And, especially, some kind of relevance ranking.

But Solr does NOT do the things SQL is best at, things like joins etc. Each has its sweet spot, and trying to make one do all the functions of the other is fraught with places to go wrong.

Not a lot of help, but free text searching is what Solr is all about, so if your problem maps into that space, it's a great tool!

Best
Erick

On Thu, Dec 29, 2011 at 1:06 PM, Shashi Kant wrote:
> For a simple, hackish (albeit inefficient) approach, look up wildcard searches,
>
> e.g. foo*, *bar
Re: Solr, SQL Server's LIKE
For a simple, hackish (albeit inefficient) approach, look up wildcard searches,

e.g. foo*, *bar

On Thu, Dec 29, 2011 at 12:38 PM, Devon Baumgarten wrote:
> I have been tinkering with Solr for a few weeks, and I am convinced that it
> could be very helpful in many of my upcoming projects. I am trying to decide
> whether Solr is appropriate for this one, and I haven't had luck looking for
> answers on Google.
>
> I need to search a list of names of companies and individuals pretty exactly.
> T-SQL's LIKE operator does this with decent performance, but I have a feeling
> there is a way to configure Solr to do this better. I've tried using an edge
> N-gram tokenizer, but it feels like it might be more complicated than
> necessary. What would you suggest?
>
> I know this sounds kind of 'Golden Hammer,' but there has been talk of other,
> more complicated (magic) searches that I don't think SQL Server can handle,
> since its tokens (as far as I know) can't be smaller than one word.
>
> Thanks,
>
> Devon Baumgarten