NGram query failing
I have a requirement to be able to find hits within words in a free-form id field. The field can have any type of alphanumeric data - it's as likely it will be something like "123456" as it is to be "SUN-123-ABC". I thought of using NGrams to accomplish the task, but I'm having a problem. I set up a field like this

After indexing a field like this, the analysis page indicates my queries should work. If I give it a sample field value of "ABC-123456-SUN" and a query value of "45" it shows hits in several places, which is what I expected.

However, when I actually query the field with something like "45" I get no hits back. Looking at the debugQuery output, it looks like it's taking my analyzed query text and putting it into a phrase query. So, for a query of "45" it turns into a phrase query of :"4 5 45" which then doesn't hit on anything in my index.

What am I missing to make this work?

- Charlie
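For reference, the query-side expansion described above can be reproduced with a small Python sketch. It assumes gram sizes 1 through 3 and assumes that all grams of one size are emitted before the next size; `char_ngrams` is a hypothetical helper written for illustration, not Solr's own filter, and it does not model token positions, which a phrase query also depends on.

```python
def char_ngrams(text, min_gram, max_gram):
    """Return the character n-grams of text, shortest gram sizes first."""
    grams = []
    for size in range(min_gram, max_gram + 1):
        for start in range(len(text) - size + 1):
            grams.append(text[start:start + size])
    return grams


# Assumed gram sizes 1..3. The query "45" has only two characters,
# so no 3-gram is produced.
query_terms = char_ngrams("45", 1, 3)
print(query_terms)  # ['4', '5', '45']

# Every query gram also occurs among the grams of the digit run "123456",
# which is why term-by-term analysis shows matches.
indexed_terms = set(char_ngrams("123456", 1, 3))
print(all(term in indexed_terms for term in query_terms))  # True
```

The individual terms all exist in the index, so the lack of hits would have to come from how the terms are combined at query time rather than from missing grams.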
RE: NGram query failing
Well, I fixed my own problem in the end. For the record, this is the schema I ended up going with:

I could have left it a trigram but went with a bigram because with this setup, I can get queries to properly hit as long as the min/max gram size is met. In other words, for any queries two or more characters long, this works for me. Less than two characters and it fails.

I don't know exactly why that is, but I'll take it anyway!

- Charlie
Re: NGram query failing
That's actually easy to explain. If the min n-gram size is 3, a query term with just 2 characters will never match any term that originally had more than 2 characters, because longer terms are never tokenized into grams shorter than 3 characters.

Take the term "house":

  house => hou ous use

If your search term is "ho", it will never match the above, as there is no term "ho" in there.

Otis

----- Original Message -----
> From: Charlie Jackson
> To: solr-user@lucene.apache.org
> Sent: Fri, October 23, 2009 4:32:33 PM
> Subject: RE: NGram query failing
>
> Well, I fixed my own problem in the end. For the record, this is the
> schema I ended up going with:
>
>     minGramSize="2" />
>
>     minGramSize="2"/>
>
> I could have left it a trigram but went with a bigram because with this
> setup, I can get queries to properly hit as long as the min/max gram
> size is met. In other words, for any queries two or more characters
> long, this works for me. Less than two characters and it fails.
>
> I don't know exactly why that is, but I'll take it anyway!
>
> - Charlie
>
> -----Original Message-----
> From: Charlie Jackson [mailto:charlie.jack...@cision.com]
> Sent: Friday, October 23, 2009 10:00 AM
> To: solr-user@lucene.apache.org
> Subject: NGram query failing
>
> I have a requirement to be able to find hits within words in a free-form
> id field. The field can have any type of alphanumeric data - it's as
> likely it will be something like "123456" as it is to be "SUN-123-ABC".
> I thought of using NGrams to accomplish the task, but I'm having a
> problem. I set up a field like this
>
>     positionIncrementGap="100">
>
>     minGramSize="1" maxGramSize="3"/>
>
> After indexing a field like this, the analysis page indicates my queries
> should work. If I give it a sample field value of "ABC-123456-SUN" and a
> query value of "45" it shows hits in several places, which is what I
> expected.
>
> However, when I actually query the field with something like "45" I get
> no hits back. Looking at the debugQuery output, it looks like it's
> taking my analyzed query text and putting it into a phrase query. So,
> for a query of "45" it turns into a phrase query of :"4 5 45"
> which then doesn't hit on anything in my index.
>
> What am I missing to make this work?
>
> - Charlie
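
The "house" example can be checked with a minimal Python sketch. The helper names `char_ngrams` and `term_is_indexed` are hypothetical, and the max gram size of 3 is an assumption made for illustration (the explanation above only fixes the min size); this is plain string slicing, not Solr's tokenizer.

```python
def char_ngrams(text, min_gram, max_gram):
    """Return all character n-grams of text with sizes min_gram..max_gram."""
    return [text[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(text) - n + 1)]


def term_is_indexed(query, indexed_word, min_gram, max_gram):
    """True if the raw query string equals one of the grams of indexed_word."""
    return query in char_ngrams(indexed_word, min_gram, max_gram)


# Min gram size 3: "house" is indexed only as 3-character terms.
print(char_ngrams("house", 3, 3))            # ['hou', 'ous', 'use']
print(term_is_indexed("ho", "house", 3, 3))  # False

# Min gram size 2 (max 3 assumed): "ho" is now one of the indexed terms.
print(term_is_indexed("ho", "house", 2, 3))  # True

# A query shorter than the min gram size yields no grams at all.
print(char_ngrams("4", 2, 3))                # []
```

The last line suggests why, with a min gram size of 2, a one-character query has nothing to match against.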