NGram query failing

2009-10-23 Thread Charlie Jackson
I have a requirement to be able to find hits within words in a free-form
id field. The field can have any type of alphanumeric data - it's as
likely it will be something like "123456" as it is to be "SUN-123-ABC".
I thought of using NGrams to accomplish the task, but I'm having a
problem. I set up a field like this

 









  



 

After indexing a field like this, the analysis page indicates my queries
should work. If I give it a sample field value of "ABC-123456-SUN" and a
query value of "45" it shows hits in several places, which is what I
expected.

 

However, when I actually query the field with something like "45" I get
no hits back. Looking at the debugQuery output, it looks like it's
taking my analyzed query text and putting it into a phrase query. So,
for a query of "45" it turns into a phrase query of :"4 5 45"
which then doesn't hit on anything in my index.

 

What am I missing to make this work?

 

- Charlie
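
One way to see why the phrase query described above finds nothing is to model it in a few lines of Python. This is only a sketch under stated assumptions, not Solr's actual code. It assumes the n-gram analysis emits all grams of length 1, then length 2, then length 3, each at the next consecutive position (which is consistent with the "4 5 45" seen in the debugQuery output). It also assumes a phrase query needs its terms at adjacent positions in the index. The helper names ngrams_by_length and phrase_matches are hypothetical.

```python
def ngrams_by_length(text, min_gram, max_gram):
    """Emit every gram of the smallest size first, then the next size, and so on.

    The emission order is an assumption made for this sketch.
    """
    grams = []
    for size in range(min_gram, max_gram + 1):
        for start in range(len(text) - size + 1):
            grams.append(text[start:start + size])
    return grams


def phrase_matches(index_tokens, query_tokens):
    """Return True if query_tokens occur as one contiguous run in index_tokens."""
    n = len(query_tokens)
    if n == 0:
        return False  # nothing to search for
    return any(index_tokens[i:i + n] == query_tokens
               for i in range(len(index_tokens) - n + 1))


index_tokens = ngrams_by_length("ABC-123456-SUN", 1, 3)
query_tokens = ngrams_by_length("45", 1, 3)

print(query_tokens)                                  # grams produced for "45"
print(all(t in index_tokens for t in query_tokens))  # is each gram in the index?
print(phrase_matches(index_tokens, query_tokens))    # are they adjacent there?
```

Under these assumptions every individual gram is present in the index, which would explain the hits on the analysis page, but the three grams never sit next to each other, so the phrase cannot match.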



RE: NGram query failing

2009-10-23 Thread Charlie Jackson

Well, I fixed my own problem in the end. For the record, this is the
schema I ended up going with:














I could have left it a trigram but went with a bigram because with this
setup, I can get queries to properly hit as long as the min/max gram
size is met. In other words, for any queries two or more characters
long, this works for me. Less than two characters and it fails. 

I don't know exactly why that is, but I'll take it anyway!

- Charlie
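
A rough illustration of why the bigram setup behaves this way, as a sketch and not Solr's implementation. It assumes both the index and query analyzers produce only 2-character grams (the message says "bigram", but the full analyzer settings are not shown), that grams get consecutive positions, and that the analyzed query is still run as a phrase. The helpers bigrams and phrase_matches are hypothetical.

```python
def bigrams(text):
    """All overlapping 2-character grams of text, in order of appearance."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def phrase_matches(index_tokens, query_tokens):
    """Return True if query_tokens occur as one contiguous run in index_tokens."""
    n = len(query_tokens)
    if n == 0:
        return False  # a query that produced no grams has nothing to match
    return any(index_tokens[i:i + n] == query_tokens
               for i in range(len(index_tokens) - n + 1))


index_tokens = bigrams("ABC-123456-SUN")

for query in ("45", "3456", "4"):
    query_tokens = bigrams(query)
    print(query, query_tokens, phrase_matches(index_tokens, query_tokens))
```

With a single gram size, the query's grams come out in the same order and adjacency as the grams in the index, so the phrase lines up. A one-character query yields no grams at all, which would account for the failure below two characters.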





Re: NGram query failing

2009-11-11 Thread Otis Gospodnetic
That's actually easy to explain/understand.
If the min n-gram size is 3, a query term with just 2 characters will never 
match any terms that originally had > 2 characters, because longer terms 
never get tokenized into tokens shorter than 3 characters.

Take the term: house
house => hou ous use

If your search term is "ho", it will never match the above, as there is no term 
"ho" in there.
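
The same point as a small Python sketch. This only mimics the idea of n-gram tokenization and is not Solr's code; the ngrams helper is a hypothetical name, and a min and max gram size of 3 is assumed to match the "house" example.

```python
def ngrams(term, min_gram, max_gram):
    """Return the set of all substrings of term with length min_gram..max_gram."""
    grams = set()
    for size in range(min_gram, max_gram + 1):
        for start in range(len(term) - size + 1):
            grams.add(term[start:start + size])
    return grams


indexed = ngrams("house", 3, 3)

print(sorted(indexed))   # ['hou', 'ous', 'use']
print("ho" in indexed)   # False: no 2-character term was ever indexed
print("hou" in indexed)  # True: a 3-character query can match
```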

Otis



----- Original Message -----
> From: Charlie Jackson 
> To: solr-user@lucene.apache.org
> Sent: Fri, October 23, 2009 4:32:33 PM
> Subject: RE: NGram query failing
> 
> 
> Well, I fixed my own problem in the end. For the record, this is the
> schema I ended up going with:
> 
> 
> 
> 
> 
> 
> minGramSize="2" />
> 
> 
> 
> 
> 
> minGramSize="2"/>
> 
> 
> 
> I could have left it a trigram but went with a bigram because with this
> setup, I can get queries to properly hit as long as the min/max gram
> size is met. In other words, for any queries two or more characters
> long, this works for me. Less than two characters and it fails. 
> 
> I don't know exactly why that is, but I'll take it anyway!
> 
> - Charlie
> 
> 
> -----Original Message-----
> From: Charlie Jackson [mailto:charlie.jack...@cision.com] 
> Sent: Friday, October 23, 2009 10:00 AM
> To: solr-user@lucene.apache.org
> Subject: NGram query failing
> 
> I have a requirement to be able to find hits within words in a free-form
> id field. The field can have any type of alphanumeric data - it's as
> likely it will be something like "123456" as it is to be "SUN-123-ABC".
> I thought of using NGrams to accomplish the task, but I'm having a
> problem. I set up a field like this
> 
> 
> 
> 
> positionIncrementGap="100">
> 
> 
> 
> 
> minGramSize="1" maxGramSize="3"/>
> 
> 
> 
>   
> 
> 
> 
> 
> 
> After indexing a field like this, the analysis page indicates my queries
> should work. If I give it a sample field value of "ABC-123456-SUN" and a
> query value of "45" it shows hits in several places, which is what I
> expected.
> 
> 
> 
> However, when I actually query the field with something like "45" I get
> no hits back. Looking at the debugQuery output, it looks like it's
> taking my analyzed query text and putting it into a phrase query. So,
> for a query of "45" it turns into a phrase query of :"4 5 45"
> which then doesn't hit on anything in my index.
> 
> 
> 
> What am I missing to make this work?
> 
> 
> 
> - Charlie