
RE: Solr, SQL Server's LIKE

2012-01-04 Thread Devon Baumgarten

Great suggestion! Thanks for keeping it simple for a complete Solr newbie.

I'm going to go try this right now.

Thanks!
Devon Baumgarten

-----Original Message-----
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: Monday, January 02, 2012 12:30 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr, SQL Server's LIKE

On 12/29/2011 3:51 PM, Devon Baumgarten wrote:
> N-Grams get me pretty great results in general, but I don't want the results 
> for this particular search to be fuzzy. How can I prevent the fuzzy matches 
> from appearing?
>
> Ex: If I search "Albatross" I want "Albert" to be excluded completely, rather 
> than having a low score.

To achieve this while using the ngram filter, just do the ngram analysis 
on the index side, but not on the query side.  If you do this, you'll 
likely need a maxGramSize larger than would normally be required (which 
will make the index larger), and you might need to use the LengthFilter too.

Thanks,
Shawn

Re: Solr, SQL Server's LIKE

2012-01-02 Thread Shawn Heisey


On 12/29/2011 3:51 PM, Devon Baumgarten wrote:

N-Grams get me pretty great results in general, but I don't want the results 
for this particular search to be fuzzy. How can I prevent the fuzzy matches 
from appearing?

Ex: If I search "Albatross" I want "Albert" to be excluded completely, rather 
than having a low score.


To achieve this while using the ngram filter, just do the ngram analysis 
on the index side, but not on the query side.  If you do this, you'll 
likely need a maxGramSize larger than would normally be required (which 
will make the index larger), and you might need to use the LengthFilter too.
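
The effect of analyzing only on the index side can be illustrated outside Solr. The following is a minimal Python sketch, not Solr code: it grams the whole lowercased string rather than individual tokens, and the helper names and sample company names are made up for illustration. It shows why gramming the query produces fuzzy hits, and why the maximum gram size has to reach the length of the longest query once the query is left whole.

```python
def ngrams(term, min_size, max_size):
    """Return every substring of term with a length from min_size to max_size."""
    term = term.lower()
    return {
        term[i:i + n]
        for n in range(min_size, max_size + 1)
        for i in range(len(term) - n + 1)
    }


def build_index(names, min_size, max_size):
    """Index-side analysis: map each gram to the names that contain it."""
    index = {}
    for name in names:
        for gram in ngrams(name, min_size, max_size):
            index.setdefault(gram, set()).add(name)
    return index


def search_grams_both_sides(index, query, min_size, max_size):
    """The query is grammed as well, so any shared gram counts as a match."""
    hits = set()
    for gram in ngrams(query, min_size, max_size):
        hits |= index.get(gram, set())
    return hits


def search_index_side_only(index, query):
    """The query is only lowercased, so it must equal one indexed gram."""
    return index.get(query.lower(), set())


# Hypothetical sample data.
names = ["Albatross", "Albert", "Black Albatross Ltd"]

small = build_index(names, 2, 4)   # grams of 2 to 4 characters
large = build_index(names, 2, 9)   # grams long enough to hold "albatross"

fuzzy = search_grams_both_sides(small, "Albatross", 2, 4)
too_short = search_index_side_only(small, "Albatross")
exact = search_index_side_only(large, "Albatross")

print(sorted(fuzzy))
print(sorted(too_short))
print(sorted(exact))
```

With grams on both sides "Albert" is returned because it shares grams such as "al" and "alb" with the query. With the query left whole, the small index returns nothing because no indexed gram is nine characters long, while the larger index returns only the names that really contain "albatross".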


Thanks,
Shawn

Re: Solr, SQL Server's LIKE

2012-01-02 Thread Chantal Ackermann


Thanks, Erick! That sounds great. I really do have to upgrade.

Chantal


On Sun, 2012-01-01 at 16:42 +0100, Erick Erickson wrote:
> Chantal:
> 
> bq: The problem with the wildcard searches is that the input is not
> analyzed.
> 
> As of 3.6/4.0, this is no longer entirely true. Some analysis is
> performed for wildcard searches by default, and you can
> specify most anything you want if you really need to. See:
> https://issues.apache.org/jira/browse/SOLR-2438
> and
> http://wiki.apache.org/solr/MultitermQueryAnalysis
> 
> Best
> Erick

Re: Solr, SQL Server's LIKE

2012-01-01 Thread Erick Erickson

Chantal:

bq: The problem with the wildcard searches is that the input is not
analyzed.

As of 3.6/4.0, this is no longer entirely true. Some analysis is
performed for wildcard searches by default, and you can
specify most anything you want if you really need to. See:
https://issues.apache.org/jira/browse/SOLR-2438
and
http://wiki.apache.org/solr/MultitermQueryAnalysis
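
Why unanalyzed wildcard input is a problem can be shown with a small Python sketch. It is a simulation under simple assumptions, not Solr's implementation: the index side lowercases whitespace tokens, and a single lowercasing step stands in for the multiterm analysis described in the links above. The function names and sample values are hypothetical.

```python
from fnmatch import fnmatchcase


def index_terms(values):
    """Index-side analysis: whitespace tokenization followed by lowercasing."""
    return sorted({token.lower() for value in values for token in value.split()})


def wildcard_search(terms, pattern, analyze_query):
    """Match a wildcard pattern against the indexed terms, case-sensitively."""
    if analyze_query:
        # Stand-in for analyzing the wildcard term before matching.
        pattern = pattern.lower()
    return [term for term in terms if fnmatchcase(term, pattern)]


terms = index_terms(["Albatross Road", "Albert Hall"])

unanalyzed = wildcard_search(terms, "Alb*", analyze_query=False)
analyzed = wildcard_search(terms, "Alb*", analyze_query=True)

print(unanalyzed)
print(analyzed)
```

The indexed terms are all lowercase, so the raw pattern "Alb*" finds nothing, while the lowercased pattern finds both "albatross" and "albert".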

Best
Erick

On Fri, Dec 30, 2011 at 4:33 PM, Devon Baumgarten wrote:
> Hoss,
>
> Thanks. You've answered my question. To clarify, what I should have asked for 
> instead of 'exact' was 'not fuzzy'. For some reason it didn't occur to me 
> that I didn't need n-grams to use the wildcard. Your asking me to clarify 
> what I meant made me realize that the n-grams are the source of all my 
> current problems. :)
>
> Thanks!
>
> Devon Baumgarten
>
>
> -----Original Message-----
> From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
> Sent: Thursday, December 29, 2011 7:00 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Solr, SQL Server's LIKE
>
>
> : Thanks. I know I'll be able to utilize some of Solr's free text
> : searching capabilities in other search types in this project. The
> : product manager wants this particular search to exactly mimic LIKE%.
>        ...
> : Ex: If I search "Albatross" I want "Albert" to be excluded completely,
> : rather than having a low score.
>
> please be specific about the types of queries you want. ie: we need more
> than one example of the type of input you want to provide, the type of
> matches you want to see for that input, and the type of matches you don't
> want to get back.
>
> in your first message you said you need to match company titles "pretty
> exactly" but then seem to contradict yourself by saying that SQL's LIKE
> command fits the bill -- even though the SQL LIKE command exists
> specifically for inexact matches on field values.
>
> Based on your one example above of Albatross, you don't need anything
> special: don't use ngrams, don't use stemming, don't use fuzzy anything --
> just search for "Albatross" and it will match "Albatross" but not
> "Albert".  if you want "Albatross" to match "Albatross Road" use some
> basic tokenization.
>
> If all you really care about is prefix searching (which seems suggested by
> your "LIKE%" comment above, which i'm guessing is shorthand for something
> similar to "LIKE 'ABC%'"), so that queries like "abc" and "abcd" both
> match "abcdef" and "abcd" but neither of them match "xabcd",
> then just use prefix queries (ie: "abcd*") -- they should be plenty
> efficient for your purposes.  you only need to worry about ngrams when you
> want to efficiently match in the middle of a string. (ie: "TITLE LIKE
> '%ABC%'")
>
>
> -Hoss

RE: Solr, SQL Server's LIKE

2011-12-30 Thread Devon Baumgarten

Hoss,

Thanks. You've answered my question. To clarify, what I should have asked for 
instead of 'exact' was 'not fuzzy'. For some reason it didn't occur to me that 
I didn't need n-grams to use the wildcard. Your asking me to clarify what I 
meant made me realize that the n-grams are the source of all my current 
problems. :)

Thanks!

Devon Baumgarten


-----Original Message-----
From: Chris Hostetter [mailto:hossman_luc...@fucit.org] 
Sent: Thursday, December 29, 2011 7:00 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr, SQL Server's LIKE


: Thanks. I know I'll be able to utilize some of Solr's free text 
: searching capabilities in other search types in this project. The 
: product manager wants this particular search to exactly mimic LIKE%.
...
: Ex: If I search "Albatross" I want "Albert" to be excluded completely, 
: rather than having a low score.

please be specific about the types of queries you want. ie: we need more 
than one example of the type of input you want to provide, the type of 
matches you want to see for that input, and the type of matches you don't 
want to get back.

in your first message you said you need to match company titles "pretty 
exactly" but then seem to contradict yourself by saying that SQL's LIKE 
command fits the bill -- even though the SQL LIKE command exists 
specifically for inexact matches on field values.

Based on your one example above of Albatross, you don't need anything 
special: don't use ngrams, don't use stemming, don't use fuzzy anything -- 
just search for "Albatross" and it will match "Albatross" but not 
"Albert".  if you want "Albatross" to match "Albatross Road" use some 
basic tokenization.

If all you really care about is prefix searching (which seems suggested by 
your "LIKE%" comment above, which i'm guessing is shorthand for something 
similar to "LIKE 'ABC%'"), so that queries like "abc" and "abcd" both 
match "abcdef" and "abcd" but neither of them match "xabcd", 
then just use prefix queries (ie: "abcd*") -- they should be plenty 
efficient for your purposes.  you only need to worry about ngrams when you 
want to efficiently match in the middle of a string. (ie: "TITLE LIKE 
'%ABC%'")


-Hoss

RE: Solr, SQL Server's LIKE

2011-12-30 Thread Chantal Ackermann


The problem with the wildcard searches is that the input is not
analyzed. For English, this might not be such a problem (except if you
expect case-insensitive search). But then again, you don't get that with
like, either. Ngrams bring that and more.

What I think is often forgotten when comparing 'like' and Solr search
is:
Solr's analyzers allow not only for case-insensitive search but also for
other analysis such as removing diacritics, and this is also applied when
sorting (you have to create a separate index in the DB, as well, if you
want that).

Say you have the following names:
'Van Hinden'
'van Hinden'
'Música'
'Musil'

like 'mu%' - no hits
like 'Mu%' - 1 hit
like 'van%' - 1 hit
like 'hin%' - no hits

with Solr whitespace or standard tokenizer, ngrams and a diacritics and
lowercase filter (no wildcard search):
'mu'/'Mu' - 2 hits sorted ignoring case and diacritics
'van' - 2 hits
'hin' - 2 hits


(This is written down from experience. I haven't checked those examples
explicitly.)
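
The comparison above can be checked with a short Python sketch. It rests on two assumptions that are not part of the original message: the LIKE side behaves as a case- and accent-sensitive prefix match (which depends on the database collation and is not the default everywhere), and the Solr side is approximated by lowercasing, stripping diacritics, splitting on whitespace and matching the query anywhere inside a token, as n-grams would allow. The function names are hypothetical.

```python
import unicodedata

# "M\u00fasica" is "Música" written with a precomposed u-acute.
NAMES = ["Van Hinden", "van Hinden", "M\u00fasica", "Musil"]


def like_prefix(values, prefix):
    """Case- and accent-sensitive equivalent of LIKE 'prefix%'."""
    return [value for value in values if value.startswith(prefix)]


def normalize(text):
    """Lowercase the text and strip diacritics from it."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyzed_search(values, query):
    """Match when any whitespace token contains the normalized query."""
    needle = normalize(query)
    hits = [
        value for value in values
        if any(needle in token for token in normalize(value).split())
    ]
    # Sort on the normalized form so case and diacritics are ignored.
    return sorted(hits, key=normalize)


for prefix in ("mu", "Mu", "van", "hin"):
    print("like", prefix, like_prefix(NAMES, prefix))

for query in ("mu", "Mu", "van", "hin"):
    print("analyzed", query, analyzed_search(NAMES, query))
```

Under these assumptions the sketch reproduces the hit counts listed above: 0, 1, 1 and 0 for the four LIKE patterns, and 2 hits each for 'mu'/'Mu', 'van' and 'hin' on the analyzed side.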

Cheers,
Chantal



On Fri, 2011-12-30 at 02:00 +0100, Chris Hostetter wrote:
> : Thanks. I know I'll be able to utilize some of Solr's free text 
> : searching capabilities in other search types in this project. The 
> : product manager wants this particular search to exactly mimic LIKE%.
>   ...
> : Ex: If I search "Albatross" I want "Albert" to be excluded completely, 
> : rather than having a low score.
> 
> please be specific about the types of queries you want. ie: we need more 
> than one example of the type of input you want to provide, the type of 
> matches you want to see for that input, and the type of matches you don't 
> want to get back.
> 
> in your first message you said you need to match company titles "pretty 
> exactly" but then seem to contradict yourself by saying that SQL's LIKE 
> command fits the bill -- even though the SQL LIKE command exists 
> specifically for inexact matches on field values.
> 
> Based on your one example above of Albatross, you don't need anything 
> special: don't use ngrams, don't use stemming, don't use fuzzy anything -- 
> just search for "Albatross" and it will match "Albatross" but not 
> "Albert".  if you want "Albatross" to match "Albatross Road" use some 
> basic tokenization.
> 
> If all you really care about is prefix searching (which seems suggested by 
> your "LIKE%" comment above, which i'm guessing is shorthand for something 
> similar to "LIKE 'ABC%'"), so that queries like "abc" and "abcd" both 
> match "abcdef" and "abcd" but neither of them match "xabcd", 
> then just use prefix queries (ie: "abcd*") -- they should be plenty 
> efficient for your purposes.  you only need to worry about ngrams when you 
> want to efficiently match in the middle of a string. (ie: "TITLE LIKE 
> '%ABC%'")
> 
> 
> -Hoss

RE: Solr, SQL Server's LIKE

2011-12-29 Thread Chris Hostetter


: Thanks. I know I'll be able to utilize some of Solr's free text 
: searching capabilities in other search types in this project. The 
: product manager wants this particular search to exactly mimic LIKE%.
...
: Ex: If I search "Albatross" I want "Albert" to be excluded completely, 
: rather than having a low score.

please be specific about the types of queries you want. ie: we need more 
than one example of the type of input you want to provide, the type of 
matches you want to see for that input, and the type of matches you don't 
want to get back.

in your first message you said you need to match company titles "pretty 
exactly" but then seem to contradict yourself by saying that SQL's LIKE 
command fits the bill -- even though the SQL LIKE command exists 
specifically for inexact matches on field values.

Based on your one example above of Albatross, you don't need anything 
special: don't use ngrams, don't use stemming, don't use fuzzy anything -- 
just search for "Albatross" and it will match "Albatross" but not 
"Albert".  if you want "Albatross" to match "Albatross Road" use some 
basic tokenization.

If all you really care about is prefix searching (which seems suggested by 
your "LIKE%" comment above, which i'm guessing is shorthand for something 
similar to "LIKE 'ABC%'"), so that queries like "abc" and "abcd" both 
match "abcdef" and "abcd" but neither of them match "xabcd", 
then just use prefix queries (ie: "abcd*") -- they should be plenty 
efficient for your purposes.  you only need to worry about ngrams when you 
want to efficiently match in the middle of a string. (ie: "TITLE LIKE 
'%ABC%'")
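
One way to see why prefix queries are cheap and mid-string matches are not is to model the term dictionary as a sorted list. This is a simplified Python sketch under that assumption, not a description of Lucene internals; the function names and sample terms are invented for illustration.

```python
from bisect import bisect_left


def prefix_query(sorted_terms, prefix):
    """Find terms starting with prefix by seeking into the sorted list."""
    start = bisect_left(sorted_terms, prefix)
    hits = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can share the prefix
        hits.append(term)
    return hits


def infix_scan(sorted_terms, fragment):
    """A mid-string match has no such shortcut: every term is inspected."""
    return [term for term in sorted_terms if fragment in term]


# Hypothetical term dictionary.
terms = sorted(["abcd", "abcdef", "albatross", "albert", "xabcd"])

print(prefix_query(terms, "abc"))
print(prefix_query(terms, "abcd"))
print(prefix_query(terms, "albatross"))
print(infix_scan(terms, "bcd"))
```

The prefix lookups touch only the matching run of terms, and "albatross" never returns "albert". The infix scan has to visit the whole list, which is the cost that indexing n-grams is meant to avoid.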


-Hoss

Re: Solr, SQL Server's LIKE

2011-12-29 Thread Sujit Pal

Hi Devon,

Have you considered using a permuterm index? It's workable, but depending
on your requirements (size of fields that you want to create the index
on), it may bloat your index. I've written about it here:
http://sujitpal.blogspot.com/2011/10/lucene-wildcard-query-and-permuterm.html 
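
For readers unfamiliar with the idea, here is a minimal Python sketch of a permuterm index. It is a generic textbook-style illustration, not the implementation from the post linked above: every rotation of a term plus an end marker is stored, so a pattern with one wildcard can be rotated into a prefix lookup. It assumes terms never contain the '$' marker, and it scans a dict for brevity where a real index would seek into a sorted structure.

```python
def rotations(term):
    """All rotations of term + '$', where '$' marks the end of the term."""
    marked = term + "$"
    return [marked[i:] + marked[:i] for i in range(len(marked))]


def build_permuterm(terms):
    """Map every rotation back to its original term."""
    index = {}
    for term in terms:
        for rotation in rotations(term):
            index[rotation] = term
    return index


def wildcard_lookup(index, pattern):
    """Answer a pattern with exactly one '*' via a prefix match on rotations."""
    if pattern.count("*") != 1:
        raise ValueError("this sketch handles exactly one '*' per pattern")
    head, _, tail = pattern.partition("*")
    # X*Y is rotated to Y$X, which is then used as a prefix.
    prefix = tail + "$" + head
    return sorted({term for rot, term in index.items() if rot.startswith(prefix)})


# Hypothetical terms.
index = build_permuterm(["albatross", "albert", "ross"])

print(wildcard_lookup(index, "alb*"))
print(wildcard_lookup(index, "*ross"))
print(wildcard_lookup(index, "al*t"))
print(len(rotations("albert")))
```

A term of n characters produces n + 1 rotations, which is where the index bloat mentioned above comes from.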

Another alternative which I've implemented is a custom mechanism that
retrieves a list of matching unique ids from a database table using a
SQL LIKE, then passes this list as a filter to the main query. It's
hacky, but I was building a custom handler anyway, so it was quite
simple to add in.
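
The shape of that second approach might look roughly like the Python sketch below. Everything specific in it is an assumption made for illustration: the `companies` table, the `id` field name, the sample rows, and the use of an in-memory SQLite database in place of the real one. It only builds the filter string and does not call Solr. The fragment is passed as a bound parameter, but '%' and '_' inside it are not escaped.

```python
import sqlite3


def matching_ids(conn, fragment):
    """Fetch the ids whose name contains fragment, using a parameterized LIKE."""
    rows = conn.execute(
        "SELECT id FROM companies WHERE name LIKE ? ORDER BY id",
        ("%" + fragment + "%",),
    )
    return [row[0] for row in rows]


def build_filter_query(ids, field="id"):
    """Format ids as a filter clause such as id:(1 OR 3); None if there are no ids."""
    if not ids:
        return None
    return field + ":(" + " OR ".join(str(i) for i in ids) + ")"


# Hypothetical table and rows, held in memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO companies (id, name) VALUES (?, ?)",
    [(1, "Albatross Ltd"), (2, "Albert & Sons"), (3, "Black Albatross")],
)

ids = matching_ids(conn, "batross")
fq = build_filter_query(ids)
print(fq)
```

The resulting string would be sent as a filter alongside the main query. A LIKE that matches a very large number of rows produces a correspondingly long clause, so the approach is probably best suited to selective patterns.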

-sujit

On Thu, 2011-12-29 at 11:38 -0600, Devon Baumgarten wrote:
> I have been tinkering with Solr for a few weeks, and I am convinced that it 
> could be very helpful in many of my upcoming projects. I am trying to decide 
> whether Solr is appropriate for this one, and I haven't had luck looking for 
> answers on Google.
> 
> I need to search a list of names of companies and individuals pretty exactly. 
> T-SQL's LIKE operator does this with decent performance, but I have a feeling 
> there is a way to configure Solr to do this better. I've tried using an edge 
> N-gram tokenizer, but it feels like it might be more complicated than 
> necessary. What would you suggest?
> 
> I know this sounds kind of 'Golden Hammer,' but there has been talk of other, 
> more complicated (magic) searches that I don't think SQL Server can handle, 
> since its tokens (as far as I know) can't be smaller than one word.
> 
> Thanks,
> 
> Devon Baumgarten
>

RE: Solr, SQL Server's LIKE

2011-12-29 Thread Devon Baumgarten

Erick,

Thanks. I know I'll be able to utilize some of Solr's free text searching 
capabilities in other search types in this project. The product manager wants 
this particular search to exactly mimic LIKE%.

N-Grams get me pretty great results in general, but I don't want the results 
for this particular search to be fuzzy. How can I prevent the fuzzy matches 
from appearing?

Ex: If I search "Albatross" I want "Albert" to be excluded completely, rather 
than having a low score.

Devon Baumgarten

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Thursday, December 29, 2011 3:44 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr, SQL Server's LIKE

SQL's "like" is usually handled with ngrams if you want
*stuff* kinds of searches. Wildcards are "interesting"
in Solr.

Things Solr handles that aren't easy in SQL:
phrases, phrases with slop, stemming,
synonyms. And, especially, some kind
of relevance ranking.

But Solr does NOT do the things SQL is best at,
things like joins etc. Each has its sweet spot,
and trying to make one do all the functions of the
other is fraught with places to go wrong.

Not a lot of help, but free text searching is what Solr is
all about, so if your problem maps into that space,
it's a great tool!

Best
Erick

On Thu, Dec 29, 2011 at 1:06 PM, Shashi Kant  wrote:
> for a simple, hackish (albeit inefficient) approach look up wildcard searches
>
> e.g. foo*, *bar
>
>
>
> On Thu, Dec 29, 2011 at 12:38 PM, Devon Baumgarten wrote:
>> I have been tinkering with Solr for a few weeks, and I am convinced that it 
>> could be very helpful in many of my upcoming projects. I am trying to decide 
>> whether Solr is appropriate for this one, and I haven't had luck looking for 
>> answers on Google.
>>
>> I need to search a list of names of companies and individuals pretty 
>> exactly. T-SQL's LIKE operator does this with decent performance, but I have 
>> a feeling there is a way to configure Solr to do this better. I've tried 
>> using an edge N-gram tokenizer, but it feels like it might be more 
>> complicated than necessary. What would you suggest?
>>
>> I know this sounds kind of 'Golden Hammer,' but there has been talk of 
>> other, more complicated (magic) searches that I don't think SQL Server can 
>> handle, since its tokens (as far as I know) can't be smaller than one word.
>>
>> Thanks,
>>
>> Devon Baumgarten
>>

Re: Solr, SQL Server's LIKE

2011-12-29 Thread Erick Erickson

SQL's "like" is usually handled with ngrams if you want
*stuff* kinds of searches. Wildcards are "interesting"
in Solr.

Things Solr handles that aren't easy in SQL:
phrases, phrases with slop, stemming,
synonyms. And, especially, some kind
of relevance ranking.

But Solr does NOT do the things SQL is best at,
things like joins etc. Each has its sweet spot,
and trying to make one do all the functions of the
other is fraught with places to go wrong.

Not a lot of help, but free text searching is what Solr is
all about, so if your problem maps into that space,
it's a great tool!

Best
Erick

On Thu, Dec 29, 2011 at 1:06 PM, Shashi Kant  wrote:
> for a simple, hackish (albeit inefficient) approach look up wildcard searches
>
> e.g. foo*, *bar
>
>
>
> On Thu, Dec 29, 2011 at 12:38 PM, Devon Baumgarten wrote:
>> I have been tinkering with Solr for a few weeks, and I am convinced that it 
>> could be very helpful in many of my upcoming projects. I am trying to decide 
>> whether Solr is appropriate for this one, and I haven't had luck looking for 
>> answers on Google.
>>
>> I need to search a list of names of companies and individuals pretty 
>> exactly. T-SQL's LIKE operator does this with decent performance, but I have 
>> a feeling there is a way to configure Solr to do this better. I've tried 
>> using an edge N-gram tokenizer, but it feels like it might be more 
>> complicated than necessary. What would you suggest?
>>
>> I know this sounds kind of 'Golden Hammer,' but there has been talk of 
>> other, more complicated (magic) searches that I don't think SQL Server can 
>> handle, since its tokens (as far as I know) can't be smaller than one word.
>>
>> Thanks,
>>
>> Devon Baumgarten
>>

Re: Solr, SQL Server's LIKE

2011-12-29 Thread Shashi Kant

for a simple, hackish (albeit inefficient) approach look up wildcard searches

e.g. foo*, *bar
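
The two patterns differ in cost: a trailing wildcard such as foo* can seek into a sorted term list, while a leading wildcard such as *bar cannot. A common workaround is to also index each term reversed, so that *bar becomes a prefix lookup for "rab"; Solr's ReversedWildcardFilterFactory is built around the same idea. The Python sketch below only illustrates the principle with sorted lists. The sample terms and function names are hypothetical, and it is not how Solr stores or queries reversed terms.

```python
from bisect import bisect_left


def prefix_range(sorted_terms, prefix):
    """Return the contiguous run of sorted terms that start with prefix."""
    start = bisect_left(sorted_terms, prefix)
    end = start
    while end < len(sorted_terms) and sorted_terms[end].startswith(prefix):
        end += 1
    return sorted_terms[start:end]


# Hypothetical indexed terms, kept both forward and reversed.
terms = ["foobar", "football", "crowbar", "barn"]
forward = sorted(terms)
backward = sorted(term[::-1] for term in terms)


def trailing_wildcard(prefix):
    """foo* : a plain prefix lookup on the forward terms."""
    return prefix_range(forward, prefix)


def leading_wildcard(suffix):
    """*bar : a prefix lookup for the reversed suffix on the reversed terms."""
    hits = prefix_range(backward, suffix[::-1])
    return sorted(hit[::-1] for hit in hits)


print(trailing_wildcard("foo"))
print(leading_wildcard("bar"))
```

The price is a second copy of every term, which is the usual trade of index size for query speed.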



On Thu, Dec 29, 2011 at 12:38 PM, Devon Baumgarten wrote:
> I have been tinkering with Solr for a few weeks, and I am convinced that it 
> could be very helpful in many of my upcoming projects. I am trying to decide 
> whether Solr is appropriate for this one, and I haven't had luck looking for 
> answers on Google.
>
> I need to search a list of names of companies and individuals pretty exactly. 
> T-SQL's LIKE operator does this with decent performance, but I have a feeling 
> there is a way to configure Solr to do this better. I've tried using an edge 
> N-gram tokenizer, but it feels like it might be more complicated than 
> necessary. What would you suggest?
>
> I know this sounds kind of 'Golden Hammer,' but there has been talk of other, 
> more complicated (magic) searches that I don't think SQL Server can handle, 
> since its tokens (as far as I know) can't be smaller than one word.
>
> Thanks,
>
> Devon Baumgarten
>
