Re: leading and trailing wildcard query
Just for the record, this works like a charm:

  .../select?q=*potter*&qt=dismax

  <response>
    <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">93</int>
      <lst name="params">
        <str name="q">*potter*</str>
        <str name="qt">dismax</str>
      </lst>
    </lst>
    <result name="response" numFound="572" start="0" maxScore="5.3375173">
      ...
      <str name="title">L'année où on a découvert «Harry Potter» au cinéma</str>
      ...

  <requestHandler name="dismax" class="solr.DisMaxRequestHandler">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <float name="tie">0.01</float>
      <str name="qf">
        all_text_de^0.5 all_text_en^0.5 all_text_es^0.5 all_text_fr^0.5
        all_text_it^0.5 all_text_nl^0.5 all_text_nolang^0.5
        channel_name_tokens^1.0 role_tokens^1.0 participant_tokens^1.0
      </str>
      <str name="pf">
        title_de^2 title_en^2 title_es^2 title_fr^2 title_it^2 title_nl^2
        title_nolang^2 channel_name_tokens^2 role_tokens^2 participant_tokens^2
      </str>
      <str name="fl">*,score</str>
      <str name="mm">2&lt;-1 5&lt;80%</str>
      <int name="ps">100</int>
      <str name="q.alt">*:*</str>
    </lst>
  </requestHandler>

And the funny thing: ReversedWildcardFilterFactory is still commented out (I had forgotten that I never reactivated it). And NGram was never part of my schema.

Happy user of 1.4RC - I'm sure our milestones won't beat the SOLR 1.4 release date.

Cheers, Chantal
Re: leading and trailing wildcard query
No thoughts on this? Really!? I would hate to admit to my Oracle DBE that Solr can't be customized to do a common query that a relational database can do. :-(

On Wed, Nov 4, 2009 at 6:01 PM, A. Steven Anderson a.steven.ander...@gmail.com wrote:

> I've scoured the archives and JIRA, but the answer to my question is just not clear to me.
>
> With all the new Solr 1.4 features, is there any way to do a leading and trailing wildcard query on an *untokenized* field? e.g. q=myfield:*abc* would return a doc with myfield=xxxabcxxx
>
> Yes, I know how expensive such a query would be, but we have the user requirement, nonetheless.
>
> If not, any suggestions on how to implement a custom solution using Solr? Using an external data structure?

-- A. Steven Anderson
Re: leading and trailing wildcard query
The guilt trick is not the best thing to try on public mailing lists. :)

The first thing that popped to my mind is to use 2 fields, where the second one contains the desrever string of the first one. The second idea is to use n-grams (if it's OK to tokenize), more specifically edge n-grams.

Otis -- Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
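A minimal sketch of the two-field idea, in plain Python rather than Solr: the field names myfield and myfield_rev are made up for illustration, and a dict stands in for an indexed document. It only shows why a reversed copy turns a leading wildcard (*abc) into a cheap prefix match; it does not by itself cover a pattern with wildcards on both ends.

```python
def index_doc(value: str) -> dict:
    """Build a toy document holding the value and a reversed copy.

    Both field names are hypothetical, not taken from any real schema.
    """
    return {"myfield": value, "myfield_rev": value[::-1]}


def matches_leading_wildcard(doc: dict, suffix: str) -> bool:
    """Answer the query *suffix as a prefix test on the reversed field."""
    return doc["myfield_rev"].startswith(suffix[::-1])


print(matches_leading_wildcard(index_doc("xxxabc"), "abc"))  # True
print(matches_leading_wildcard(index_doc("abcxxx"), "abc"))  # False
```

In a real index the prefix test would be a prefix query against the sorted term dictionary, which is what makes the reversed field worthwhile.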
Re: leading and trailing wildcard query
> The guilt trick is not the best thing to try on public mailing lists. :)

Point taken, although not my intention. I guess I have been spoiled by quick replies and was starting to think it was a stupid question. Plus, I'm literally gonna get trash talk from my Oracle DBE if I can't make this work. ;-) We've basically relegated Oracle to handling ingest, from which we index into Solr and provide all search features. I'd hate to have to succumb to using Oracle to service this one special query.

> The first thing that popped to my mind is to use 2 fields, where the second one contains the desrever string of the first one.

Please elaborate. What do you mean by *desrever* string?

> The second idea is to use n-grams (if it's OK to tokenize), more specifically edge n-grams.

Well, that's the problem. The field may have non-Latin characters that may have neither whitespace nor punctuation.

-- A. Steven Anderson
RE: leading and trailing wildcard query
I've just set up something similar (many thanks to Avesh!):

  <fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="5" maxGramSize="25"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <fieldType name="doubleedgytext" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.NGramFilterFactory" minGramSize="5" maxGramSize="25"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  .
  .
  <field name="beginswith" type="edgytext" indexed="true" stored="false" multiValued="true"/>
  <field name="contains" type="doubleedgytext" indexed="true" stored="false" multiValued="true"/>
  .
  .
  <!-- Copy for BEGINSWITH search -->
  <copyField source="content" dest="beginswith"/>
  <copyField source="*_t" dest="beginswith"/>
  <copyField source="*_mt" dest="beginswith"/>

  <!-- Copy for CONTAINS search -->
  <copyField source="content" dest="contains"/>
  <copyField source="*_t" dest="contains"/>
  <copyField source="*_mt" dest="contains"/>

bern
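To make the difference between the two field types concrete, here is a rough Python approximation of what the two index-time analyzers would emit for one field value. It assumes the keyword tokenizer passes the whole value through as a single token and that the edge n-gram filter works from the front of the token; the function names are made up, and the order of the grams may differ from what Lucene produces.

```python
def edge_ngrams(text: str, min_gram: int, max_gram: int) -> list:
    """Prefixes of the lowercased value (roughly the 'edgytext' type)."""
    text = text.lower()
    longest = min(max_gram, len(text))
    return [text[:n] for n in range(min_gram, longest + 1)]


def ngrams(text: str, min_gram: int, max_gram: int) -> list:
    """All substrings of the lowercased value (roughly 'doubleedgytext')."""
    text = text.lower()
    longest = min(max_gram, len(text))
    grams = []
    for n in range(min_gram, longest + 1):
        for start in range(len(text) - n + 1):
            grams.append(text[start:start + n])
    return grams


print(edge_ngrams("Potter", 5, 25))  # ['potte', 'potter']
print(ngrams("Potter", 5, 25))       # ['potte', 'otter', 'potter']
```

Since the query analyzers have no n-gram filter, a query term is lowercased and then has to equal one of these indexed grams exactly, which is why no wildcard is needed at query time.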
Re: leading and trailing wildcard query
Doesn't it work to call SolrQueryParser.setAllowLeadingWildcard? It can be really slow, what an RDBMS person would call a full table scan.

There is an open bug to make that settable in a config file, but this is a pretty tiny change to the source. http://issues.apache.org/jira/browse/SOLR-218

wunder
Re: leading and trailing wildcard query
Thanks for the solution, but could you elaborate on how it would find something like *abc* in a field that contains abc?

Steve
Re: leading and trailing wildcard query
Because those are the semantics of the Solr/Lucene wildcard syntax: * stands for any number of any characters. Basically, it enumerates all the terms in the field across all the documents, assembles a list of those that contain the substring abc, and uses that list as one of the clauses of your search...

Best
Erick

On Thu, Nov 5, 2009 at 6:07 PM, A. Steven Anderson a.steven.ander...@gmail.com wrote:

> Thanks for the solution, but could you elaborate on how it would find something like *abc* in a field that contains abc.
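The enumeration described above can be imitated in a few lines of Python. This is only a sketch: a sorted list stands in for the field's term dictionary, and fnmatchcase from the standard library stands in for Lucene's wildcard matching (it also gives special meaning to [ and ], which Lucene's syntax does not).

```python
from fnmatch import fnmatchcase


def expand_wildcard(pattern: str, terms: list) -> list:
    """Scan every term and keep the ones matching the pattern.

    With a leading wildcard there is no prefix to seek to, so the
    whole term list has to be visited.
    """
    return [term for term in terms if fnmatchcase(term, pattern)]


# Toy term dictionary for one field.
terms = sorted({"xxxabcxxx", "abc", "abd", "zabc", "potter"})

# The matching terms become the alternatives of one big OR clause.
print(expand_wildcard("*abc*", terms))  # ['abc', 'xxxabcxxx', 'zabc']
```

The cost grows with the number of distinct terms in the field, not with the number of hits, which is where the "full table scan" comparison comes from.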
Re: leading and trailing wildcard query
> Doesn't it work to call SolrQueryParser.setAllowLeadingWildcard?

Good question. Anyone?

> It can be really slow, what an RDBMS person would call a full table scan.

Understood.

> There is an open bug to make that settable in a config file, but this is a pretty tiny change to the source. http://issues.apache.org/jira/browse/SOLR-218

Unfortunately, we can only use official releases (not even snapshots) since it's a government-related project.

-- A. Steven Anderson
RE: leading and trailing wildcard query
Hi Steve, a query such as *abc* would need the NGramFilterFactory, hence the doubleedgytext, and would be retrievable by a query such as contains:abc. Note that you can set the maximum and minimum size of the strings that get indexed.

bern
Re: leading and trailing wildcard query
On Nov 5, 2009, at 3:18 PM, A. Steven Anderson wrote:

> Unfortunately, we can only use official releases (not even snapshots) since it's a government-related project.

Ah. With that restriction, it is impossible. If it is OK to pay Lucid to make a one-line change, you might be able to do it. Otherwise, get ready to spend a lot of money for a search engine.

wunder
Re: leading and trailing wildcard query
> Hi Steve, a query such as *abc* would need the NGramFilterFactory, hence the doubleedgytext, and would be retrievable by a query such as contains:abc. Note that you can set the max and minimum size of strings that get indexed.

Excellent! Just to clarify though, NGramFilterFactory is a Solr 1.4 feature only, correct?

-- A. Steven Anderson
Re: leading and trailing wildcard query
Note that N-grams are limited to specific string lengths. I presume that you need to search for arbitrary strings, not just three-letter ones.

wunder
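A small Python illustration of that length limit, under the assumption that the query side does no n-gramming and the lowercased query must therefore equal one indexed gram exactly. The helper names are invented for this sketch; the gram sizes 5 and 25 are the ones from the posted schema.

```python
def index_terms(value: str, min_gram: int, max_gram: int) -> set:
    """All lowercase substrings whose length lies in [min_gram, max_gram]."""
    value = value.lower()
    longest = min(max_gram, len(value))
    return {
        value[start:start + n]
        for n in range(min_gram, longest + 1)
        for start in range(len(value) - n + 1)
    }


def contains_query(value: str, query: str, min_gram: int, max_gram: int) -> bool:
    """True if the whole lowercased query equals one of the indexed grams."""
    return query.lower() in index_terms(value, min_gram, max_gram)


# With minGramSize=5, a three-letter query has no gram to match.
print(contains_query("xxxabcxxx", "abc", 5, 25))    # False
# A five-letter substring does match.
print(contains_query("xxxabcxxx", "xabcx", 5, 25))  # True
# Lowering the minimum to 3 makes the short query findable.
print(contains_query("xxxabcxxx", "abc", 3, 25))    # True
```

Queries longer than the maximum gram size fail the same way, so the two sizes have to bracket whatever substring lengths users are expected to type, at the price of a larger index for a wider range.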
Re: leading and trailing wildcard query
> Ah. With that restriction, it is impossible. If it is OK to pay Lucid to make a one-line change, you might be able to do it. Otherwise, get ready to spend a lot of money for a search engine.

Well, now that Lucid is getting In-Q-Tel $$$, they will soon learn that official releases are all that matter, and 12-18 month release cycles are not acceptable. ;-)

-- A. Steven Anderson
Re: leading and trailing wildcard query
> Note that N-grams are limited to specific string lengths. I presume that you need to search for arbitrary strings, not just three-letter ones.

Understood, but that is a limitation that we can live with. Thanks!

-- A. Steven Anderson
RE: leading and trailing wildcard query
> Excellent! Just to clarify though, NGramFilterFactory is a Solr 1.4 feature only, correct?

Not sure which version it has been supported from, but we're on 1.3.

bern
Re: leading and trailing wildcard query
> Not sure what version it was supported from, but we're on 1.3.

Really!? Great answer! Thanks!

-- A. Steven Anderson
Re: leading and trailing wildcard query
A. Steven Anderson wrote:

> With all the new Solr 1.4 features, is there any way to do a leading and trailing wildcard query on an *untokenized* field? e.g. q=myfield:*abc* would return a doc with myfield=xxxabcxxx

You can use ReversedWildcardFilterFactory, which creates additional reversed tokens (in your case, a single additional token :) ) _and_ also triggers setAllowLeadingWildcard in the QueryParser. It won't help much with performance though, due to the trailing wildcard in your original query. Please see the discussion in SOLR-1321 (this will be available in 1.4, but it should be easy to patch 1.3 to use it).

If you really need to support such queries efficiently, you should implement full permuterm indexing, i.e. a token filter that rotates tokens and adds all rotations (with a special marker to mark the beginning of the word), and a query plugin that detects such query terms and rotates the query term appropriately.

-- Best regards, Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
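A toy Python sketch of the permuterm idea, not the token filter or query plugin themselves. It assumes "$" as the marker and that "$" never occurs in a term; all names are invented. Every rotation of term + "$" is indexed, and a wildcard pattern is rotated until it becomes a plain prefix, which a sorted key list can answer with a binary search.

```python
from bisect import bisect_left

MARK = "$"  # assumed marker character, must not occur inside any term


def rotations(term: str) -> list:
    """All rotations of term + MARK, e.g. 'ab' -> 'ab$', 'b$a', '$ab'."""
    s = term + MARK
    return [s[i:] + s[:i] for i in range(len(s))]


def build_index(terms: list) -> dict:
    """Map every rotation to the set of original terms producing it."""
    index = {}
    for term in terms:
        for rotation in rotations(term):
            index.setdefault(rotation, set()).add(term)
    return index


def to_prefix(pattern: str) -> str:
    """Rotate a wildcard pattern into a single prefix.

    Supports X*, *X, X*Y and *X*; anything else is rejected.
    """
    if len(pattern) >= 2 and pattern.startswith("*") and pattern.endswith("*"):
        inner = pattern[1:-1]
        if "*" in inner:
            raise ValueError("unsupported pattern")
        return inner  # *X* -> any rotation starting with X
    head, star, tail = pattern.partition("*")
    if not star or "*" in tail:
        raise ValueError("unsupported pattern")
    return tail + MARK + head  # X*Y -> Y$X


def search(index: dict, pattern: str) -> list:
    """Prefix lookup over the sorted rotations."""
    prefix = to_prefix(pattern)
    keys = sorted(index)  # a real index would keep this sorted already
    hits = set()
    i = bisect_left(keys, prefix)
    while i < len(keys) and keys[i].startswith(prefix):
        hits |= index[keys[i]]
        i += 1
    return sorted(hits)


index = build_index(["xxxabcxxx", "abc", "potter"])
print(search(index, "*abc*"))  # ['abc', 'xxxabcxxx']
print(search(index, "p*r"))    # ['potter']
```

The trade-off is index size: a term of length n contributes n + 1 rotations, in exchange for turning every supported wildcard shape into a prefix lookup.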
Re: leading and trailing wildcard query
> Please elaborate. What do you mean by *desrever* string?

Try reading in reverse ;).

Otis