Re: Excluding characters from a wildcard query
: I'm not sure if you can do prefix queries with the fq parameter. You will
: need to use the 'q' parameter for that.

fq supports anything q supports ... with the QParser and local params options it can be any syntax you want (as long as there is a QParser for it).

-Hoss
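To illustrate the local params point, here is a minimal sketch of what a prefix filter could look like as an fq value. It assumes a Solr version that ships the `prefix` query parser; the class and method names are made up for illustration, and the code only builds the parameter text, it does not talk to Solr.

```java
// Sketch only: builds the text of a filter query using local params.
// PrefixFilter and prefixFilter are hypothetical names, not part of Solr.
final class PrefixFilter {

    /**
     * Returns a filter query of the form {!prefix f=FIELD}PREFIX.
     * The text after the closing brace is meant to be taken as a raw
     * prefix, so no trailing * is appended here.
     */
    static String prefixFilter(String field, String prefix) {
        if (field == null || field.isEmpty()) {
            throw new IllegalArgumentException("field name is required");
        }
        return "{!prefix f=" + field + "}" + prefix;
    }

    public static void main(String[] args) {
        // Value that would be passed as the fq parameter.
        System.out.println(prefixFilter("vector", "Ox_"));
    }
}
```

The resulting string would be sent as the value of fq in the same way as any other filter query.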
Re: Excluding characters from a wildcard query - More Info
Is this difficult, or am I being ignored because it's too obvious to merit an answer?
Re: Excluding characters from a wildcard query
You have to escape all special characters, even [ as \[. Have a look here: http://lucene.apache.org/java/2_4_0/queryparsersyntax.html

Uwe

2009/7/1 Ben b...@autonomic.net

> I only just noticed that this is an exception being thrown by the lucene.queryParser. Should I be mailing on the lucene list, or is it ok here?
>
> I'm beginning to wonder if the fq can handle the type of character exclusion I'm trying in the RegExp. Escaping the string also doesn't work:
>
> Cannot parse 'vector:_\*[\^_\]\*_[\^_\]\*_[\^_\]\*': Encountered ] at line 1, column 15. Was expecting one of: TO ... RANGEIN_QUOTED ... RANGEIN_GOOP ...
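For reference, a small sketch of what escaping does, using the special-character list from the query parser syntax page linked above. QueryEscaper is a hypothetical stand-in written with the standard library only; Lucene's own QueryParser.escape should do the equivalent job. Note that escaping only turns each character into a literal, so an escaped [^_] is searched for as the text "[^_]" and does not gain any regex meaning.

```java
// Sketch only: backslash-escapes the characters that the Lucene query
// syntax treats as special. QueryEscaper is a hypothetical helper.
final class QueryEscaper {

    // Single characters that make up the documented special tokens:
    // + - && || ! ( ) { } [ ] ^ " ~ * ? : \
    private static final String SPECIAL = "\\+-!(){}[]^\"~*?:|&";

    static String escape(String raw) {
        StringBuilder out = new StringBuilder(raw.length() * 2);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                out.append('\\'); // prefix each special character
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("[^_]*_[^_]*"));
    }
}
```

Once escaped this way the * characters are literals too, so the query no longer contains any wildcard at all.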
Re: Excluding characters from a wildcard query
Yes, I had done that... however, I'm beginning to see now that what I am doing is called a wildcard query, which goes via Lucene's query parser. Lucene's query parser doesn't support the regexp idea of character exclusion, i.e. I'm not trying to match a literal [ character; I'm trying to express "match as many characters as possible which are not underscores" with [^_]*

Perhaps I'm going about my whole problem in an ineffective way, but I'm not sure how I can sensibly describe what I'm doing without it becoming a long document. The only other approach I can think of is to change what I'm indexing, but I'm not sure how to achieve that.

I've tried explaining it once, and obviously failed, so I'll try again. I'm given a string containing many vectors (where each dimension is separated by an underscore, and each vector is separated by a comma), e.g.

A1_B1_C1_D1,A2_B2_C2_D2,A3_B3_C3_D3

I want my facet query to tell me if, within one of the vectors in that string, there is a match for the dimensions I'm interested in. Of the four dimensions in this example, I may choose to fix an arbitrary number of them with values, and leave the rest as wildcards, e.g. I might look for a facet containing

Ox_*_*_*

so one of the vectors in the string must have its first dimension matching Ox, and I don't care about the rest.

***Is there a way to break down this string on the commas so that I can apply a normal wildcard query and SOLR applies it to each vector individually?***

That would solve all my problems, e.g. the string would be internally represented in lucene/solr as

A1_B1_C1_D1
A2_B2_C2_D2
A3_B3_C3_D3

and it would try to match the wildcard query on each in turn?

Thanks for your help, I'm deeply confused about this at the moment...

Ben
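The query shape described above (some dimensions fixed, the rest left open) can be sketched as a small pattern builder. VectorPattern is a hypothetical helper, not anything from Solr or Lucene. One caveat to keep in mind: the * wildcard matches any run of characters, underscores and commas included, so a pattern like this only behaves per vector if each vector is indexed as its own value.

```java
import java.util.StringJoiner;

// Sketch only: builds a wildcard pattern such as Ox_*_*_* from a list of
// dimensions. VectorPattern is a hypothetical name.
final class VectorPattern {

    /** A null or empty entry means "don't care" and becomes a * wildcard. */
    static String wildcard(String... dims) {
        StringJoiner joined = new StringJoiner("_");
        for (String dim : dims) {
            joined.add(dim == null || dim.isEmpty() ? "*" : dim);
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        // First dimension fixed, the other three left open.
        System.out.println(wildcard("Ox", null, null, null));
    }
}
```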
Re: Excluding characters from a wildcard query
You should split the string at the commas yourself and store the values in a multivalued field. Then wildcard searches like A1_* are not a problem. I don't know much about facets, but if they work on multivalued fields, that should be no problem at all.

Uwe
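A minimal sketch of the client-side split suggested here, using only the standard library. VectorSplitter is a hypothetical name, and trimming whitespace and skipping empty entries are assumptions about the input rather than anything stated in the thread.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: turns one comma-packed string into the individual values
// that would each be added to the multivalued field.
final class VectorSplitter {

    static List<String> split(String packed) {
        List<String> values = new ArrayList<>();
        for (String part : packed.split(",")) {
            String value = part.trim();
            if (!value.isEmpty()) { // ignore stray or doubled commas
                values.add(value);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(split("A1_B1_C1_D1,A2_B2_C2_D2,A3_B3_C3_D3"));
    }
}
```

Each element of the returned list would then be added as a separate value of the same field before the document is sent.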
Re: Excluding characters from a wildcard query
Is there a way in the schema to specify that the comma should be used to split the values up? E.g. can I specify my vector field as multivalued and also specify some sort of tokeniser to automatically split on commas?

Ben
Re: Excluding characters from a wildcard query
To get the desired effect I described, you have to do the split before you send the document to Solr. I'm not aware of an analyzer that can split one field value into several field values. The analyzers and tokenizers create tokens from field values in many different ways. As I see it, you have to do some preprocessing yourself.

Uwe
Re: Excluding characters from a wildcard query
I'm not quite sure I understand exactly what you mean. The string I'm processing could have many tens of thousands of values... I hope you aren't implying I'd need to split it into many tens of thousands of columns.

If you're saying what I think you're saying, you're saying that I should leave whitespaces between the individual parts of the string, pass the string into a multiValued field, and have SOLR internally treat each word as an individual entity?

Thanks for your help with this...

Ben
Re: Excluding characters from a wildcard query
2009/7/1 Ben b...@autonomic.net

> I'm not quite sure I understand exactly what you mean. The string I'm processing could have many tens of thousands of values... I hope you aren't implying I'd need to split it into many tens of thousands of columns.

No, that is not what I meant. It will be one field (column) with tens of thousands of values.

> If you're saying what I think you're saying, you're saying that I should leave whitespaces between the individual parts of the string, pass in the string into a multiValued field and have SOLR internally treat each word as an individual entity? Thanks for your help with this...

I said nothing about whitespaces. I don't know how you update your solr documents. Are you using XML or Solrj?

Uwe
Re: Excluding characters from a wildcard query
My brain was switched off. I'm using SOLRJ, which means I'll need to make multiple calls like:

addMultipleFields(solrDoc, vector, vectorvalue, 1.0f);

one for each value to be added to the multiValued field. Then, with luck, the simple wildcard query will be executed over each individual value when looking for matches, meaning the simple query syntax can be made adequate to do what's needed.

Many thanks Uwe.

B
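To make the "executed over each individual value" idea concrete, here is a small client-side simulation. It is only an illustration of the matching semantics, assuming each value is indexed as a single untokenized term. WildcardMatcher is a hypothetical helper built on java.util.regex and is not how Lucene implements wildcard queries.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Sketch only: simulates a wildcard (* = any run of characters,
// ? = exactly one character) being tested against each value in turn.
final class WildcardMatcher {

    static Pattern toPattern(String wildcard) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*' || c == '?') {
                if (literal.length() > 0) {
                    // Quote literal text so characters like [ stay literal.
                    regex.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                regex.append(c == '*' ? ".*" : ".");
            } else {
                literal.append(c);
            }
        }
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    /** True if at least one value matches the wildcard as a whole. */
    static boolean anyValueMatches(List<String> values, String wildcard) {
        Pattern pattern = toPattern(wildcard);
        for (String value : values) {
            if (pattern.matcher(value).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> split = Arrays.asList("A1_B1_C1_D1", "Ox_B2_C2_D2");
        List<String> packed = Arrays.asList("A1_B1_C1_D1,Ox_B2_C2_D2");
        System.out.println(anyValueMatches(split, "Ox_*_*_*"));
        System.out.println(anyValueMatches(packed, "Ox_*_*_*"));
    }
}
```

With separate values the second vector matches on its own; with the packed string the pattern has to match from the very first character, so a vector in the middle is never found by a plain prefix-style wildcard.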
Re: Excluding characters from a wildcard query
On Wed, Jul 1, 2009 at 5:07 PM, Ben b...@autonomic.net wrote:

> my brain was switched off. I'm using SOLRJ, which means I'll need to specify multiple : addMultipleFields(solrDoc, vector, vectorvalue, 1.0f); for each value to be added to the multiValuedField. Then, with luck, the simple wildcard query will be executed over each individual value when looking for matches, meaning the simple query syntax can made adequate to do what's needed.

I'm not sure if you can do prefix queries with the fq parameter. You will need to use the 'q' parameter for that.

You may also want to look at the regex query support in lucene (contrib package). I don't think that is supported out of the box in Solr yet.

--
Regards,
Shalin Shekhar Mangar.
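As a rough illustration of what a regex-based query would add over a plain wildcard, the sketch below builds the character-exclusion pattern from the original question with java.util.regex. This only demonstrates the semantics on the client side; it does not call the Lucene contrib regex query, whose syntax and behaviour may differ. DimensionRegex is a hypothetical name.

```java
import java.util.regex.Pattern;

// Sketch only: builds a pattern where each open dimension is [^_]*, so a
// wildcard position can never swallow an underscore.
final class DimensionRegex {

    /** A null entry means "any value without an underscore". */
    static Pattern forDimensions(String... dims) {
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < dims.length; i++) {
            if (i > 0) {
                regex.append('_');
            }
            regex.append(dims[i] == null ? "[^_]*" : Pattern.quote(dims[i]));
        }
        return Pattern.compile(regex.toString());
    }

    public static void main(String[] args) {
        Pattern fourDims = forDimensions("Ox", null, null, null);
        System.out.println(fourDims.matcher("Ox_B_C_D").matches());
        System.out.println(fourDims.matcher("Ox_B_C_D_E").matches());
    }
}
```

Unlike Ox_*_*_*, this form rejects a value with five dimensions, because no open position is allowed to contain an underscore.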
Excluding characters from a wildcard query
Passing in a regular expression like

[^_]*_[^_]*

(e.g. matching anything with an underscore in the string) using some code like:

...
parameters.add("fq", "vector:[^_]*_[^_]*");
...

seems to cause problems for SOLR, I assume because of the [ or ^ character.

Can somebody please advise how to handle character exclusion in such searches? Any help or pointers are much appreciated!

Thanks
Ben
Re: Excluding characters from a wildcard query - More Info
The exception SOLR raises is:

org.apache.lucene.queryParser.ParseException: Cannot parse 'vector:_*[^_]*_[^_]*_[^_]*': Encountered ] at line 1, column 12. Was expecting one of: TO ... RANGEIN_QUOTED ... RANGEIN_GOOP ...