Re: Excluding characters from a wildcard query

2009-07-02 Thread Chris Hostetter

: I'm not sure if you can do prefix queries with the fq parameter. You will
: need to use the 'q' parameter for that.

fq supports anything q supports ... with the QParser and local params 
options it can be any syntax you want (as long as there is a QParser for 
it)


-Hoss
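As a rough illustration of the point above, a filter query can name its own parser through local params. The sketch below only builds the parameter string. It assumes the deployed Solr version ships the `prefix` query parser and that the field is called `vector` as elsewhere in this thread; the helper name `prefixFilter` is made up for the example.

```java
// Hypothetical helper: builds an fq value that asks Solr to parse the
// filter with the "prefix" query parser instead of the default one.
final class PrefixFilterSketch {

    // field and prefix are inserted as-is; the prefix parser is assumed to
    // treat everything after the closing brace as a literal term prefix.
    static String prefixFilter(String field, String prefix) {
        return "{!prefix f=" + field + "}" + prefix;
    }

    public static void main(String[] args) {
        // Prints: {!prefix f=vector}Ox_
        System.out.println(prefixFilter("vector", "Ox_"));
    }
}
```

With SolrJ the resulting string would typically be passed as the fq parameter, for example through `SolrQuery.addFilterQuery(...)`.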



Re: Excluding characters from a wildcard query - More Info - Is this difficult, or am I being ignored because it's too obvious to merit an answer?

2009-07-01 Thread Ben


Ben wrote:

The exception SOLR raises is :

org.apache.lucene.queryParser.ParseException: Cannot parse 
'vector:_*[^_]*_[^_]*_[^_]*': Encountered ] at line 1, column 12.

Was expecting one of:
   TO ...
   RANGEIN_QUOTED ...
   RANGEIN_GOOP ...
 
Ben wrote:
Passing in a RegularExpression like [^_]*_[^_]* (e.g. matching 
anything with an underscore in the string) using some code like :


...
parameters.add("fq", "vector:[^_]*_[^_]*");
...

seems to cause problems for SOLR, I assume because of the [ or ^ 
character.


Can somebody please advise how to handle character exclusion in such 
searches?


Any help or pointers are much appreciated!

Thanks

Ben






Re: Excluding characters from a wildcard query

2009-07-01 Thread Uwe Klosa
You have to escape all special characters. Even [ to \[

Have a look here http://lucene.apache.org/java/2_4_0/queryparsersyntax.html

Uwe
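A minimal sketch of what "escape all special characters" means in practice. Lucene itself provides `QueryParser.escape(String)` for this; the stand-alone version below is only an approximation written for illustration, and the character list is taken from the query syntax page linked above (single `&` and `|` are escaped too, for simplicity).

```java
// Illustrative stand-in for Lucene's QueryParser.escape; not the library code.
final class EscapeSketch {

    // Characters treated as special by the query syntax.
    private static final String SPECIAL = "\\+-!(){}[]^\"~*?:&|";

    // Prefix every special character with a backslash.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints: \[\^_\]\*_\[\^_\]\*
        System.out.println(escape("[^_]*_[^_]*"));
    }
}
```

Note that escaping only makes the characters literal: the escaped query looks for an actual `[` and `^` in the indexed term, it does not turn `[^_]` into a character class.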

2009/7/1 Ben b...@autonomic.net

 I only just noticed that this is an exception being thrown by the
 lucene.queryParser. Should I be mailing on the lucene list, or is it ok
 here?

 I'm beginning to wonder if the fq can handle the type of character
 exclusion I'm trying in the RegExp.
 Escaping the string also doesn't work  :

 Cannot parse 'vector:_\*[\^_\]\*_[\^_\]\*_[\^_\]\*': Encountered ] at
 line 1, column 15.
 Was expecting one of:
   TO ...
   RANGEIN_QUOTED ...
   RANGEIN_GOOP ...








Re: Excluding characters from a wildcard query

2009-07-01 Thread Ben
Yes, I had done that... however, I'm beginning to see now that what I am 
doing is called a wildcard query, which goes through Lucene's query parser.
Lucene's query parser does not support the regexp idea of character 
exclusion ... i.e. I'm not trying to match [, I'm trying to express 
"match as many characters as possible that are not underscores" with [^_]*


Perhaps I'm going about my whole problem in an ineffective way, but I'm 
not sure how I can sensibly describe what I'm doing without it becoming 
a long document.


The only other approach I can think of is to change what I'm indexing 
but I'm not sure how to achieve that.

I've tried explaining it once, and obviously failed, so I'll try again.

I'm given a string containing many vectors (where each dimension is 
separated by an underscore, and each vector is separated by a comma) e.g.


A1_B1_C1_D1,A2_B2_C2_D2,A3_B3_C3_D3

I want my facet query to tell me if, within one of the vectors within 
that string, there is a match for dimensions I'm interested in. Of the 
four dimensions in this example, I may choose to fix an arbitrary number 
of them with values, and the rest with wildcards e.g. I might look for a 
facet containing Ox_*_*_* so one of the vectors in the string must have 
its first dimension matching Ox and I don't care about the rest.


***Is there a way to break down this string on the commas so that I can 
apply a normal wildcard query and SOLR applies it to each vector 
individually?*** That would solve all my problems :

e.g.
The string is internally represented in lucene/solr as
A1_B1_C1_D1
A2_B2_C2_D2
A3_B3_C3_D3

where it tries to match the wildcard query on each in turn?

Thanks for your help, I'm deeply confused about this at the moment...

Ben
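To make the question concrete, here is a small simulation in plain Java (no Lucene involved) of why a wildcard applied to the whole comma-joined string behaves differently from the same wildcard applied to each vector in turn. The class and method names are invented for the example, and `*` is simply translated to the regex `.*`, which is an approximation of wildcard matching against a single untokenized term.

```java
import java.util.regex.Pattern;

final class CrossVectorSketch {

    // Translate '*' (any run of characters) and '?' (any single character)
    // into a java.util.regex pattern; everything else is matched literally.
    static Pattern wildcardToRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                sb.append(".*");
            } else if (c == '?') {
                sb.append('.');
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString());
    }

    // True if the wildcard matches the entire value.
    static boolean matchesWhole(String wildcard, String value) {
        return wildcardToRegex(wildcard).matcher(value).matches();
    }

    // True if the wildcard matches at least one comma-separated vector.
    static boolean matchesAnyVector(String wildcard, String joined) {
        for (String vector : joined.split(",")) {
            if (matchesWhole(wildcard, vector)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String joined = "A1_B1_C1_D1,A2_B2_C2_D2,A3_B3_C3_D3";
        // A1 comes from the first vector, C2 from the second one.
        System.out.println(matchesWhole("A1_*_C2_*", joined));     // true
        System.out.println(matchesAnyVector("A1_*_C2_*", joined)); // false
        System.out.println(matchesAnyVector("A2_*_C2_*", joined)); // true
    }
}
```

Against the whole string, `*` happily runs across the comma and mixes dimensions from different vectors, which is the false match the character exclusion was meant to prevent.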


Re: Excluding characters from a wildcard query

2009-07-01 Thread Uwe Klosa
You should split the strings at the commas yourself and store the values in a
multivalued field. Then wildcard searches like A1_* are not a problem. I don't
know much about facets, but if they work on multivalued fields, that
should be no problem at all.

Uwe
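A minimal sketch of the client-side split suggested above, in plain Java. The class and method names are invented for the example; it assumes the field is declared with multiValued="true" in the schema and that vectors never contain a comma themselves.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitVectorsSketch {

    // Split the comma-joined string into one value per vector,
    // dropping surrounding whitespace and empty entries.
    static List<String> splitVectors(String joined) {
        List<String> values = new ArrayList<String>();
        for (String part : joined.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                values.add(trimmed);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        for (String v : splitVectors("A1_B1_C1_D1,A2_B2_C2_D2,A3_B3_C3_D3")) {
            // With SolrJ, each value would be added to the same field, e.g.
            // doc.addField("vector", v) on a SolrInputDocument (assumed setup).
            System.out.println(v);
        }
    }
}
```

Each value then becomes its own term in the field, so a wildcard is compared against one vector at a time.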




Re: Excluding characters from a wildcard query

2009-07-01 Thread Ben
Is there a way in the Schema to specify that the comma should be used to 
split the values up? 
e.g. Can I specify my vector field as multivalue and also specify some 
sort of tokeniser to automatically split on commas?


Ben






  




Re: Excluding characters from a wildcard query

2009-07-01 Thread Uwe Klosa
To get the desired effect I described, you have to do the split before you
send the document to Solr. I'm not aware of an analyzer that can split one
field value into several field values. The analyzers and tokenizers do
create tokens from field values in many different ways.

As I see it you have to do some preprocessing yourself.

Uwe











Re: Excluding characters from a wildcard query

2009-07-01 Thread Ben

I'm not quite sure I understand exactly what you mean.
The string I'm processing could have many tens of thousands of values... 
I hope you aren't implying I'd need to split it into many tens of 
thousands of columns.


If you're saying what I think you're saying, you're saying that I should 
leave whitespaces between the individual parts of the string, pass in 
the string into a multiValued field and have SOLR internally treat 
each word as an individual entity? 


Thanks for your help with this...

Ben







  



  




Re: Excluding characters from a wildcard query

2009-07-01 Thread Uwe Klosa
2009/7/1 Ben b...@autonomic.net

 I'm not quite sure I understand exactly what you mean.
 The string I'm processing could have many tens of thousands of values... I
 hope you aren't implying I'd need to split it into many tens of thousands of
 columns.


No, that is not what I meant. It will be one field (column) with tens of
thousands of values.




 If you're saying what I think you're saying, you're saying that I should
 leave whitespaces between the individual parts of the string, pass in the
 string into a multiValued field and have SOLR internally treat each word
 as an individual entity?
 Thanks for your help with this...


I said nothing about whitespaces. I don't know how you update your solr
documents. Are you using XML or Solrj?

Uwe


Re: Excluding characters from a wildcard query

2009-07-01 Thread Ben
My brain was switched off. I'm using SolrJ, which means I'll need to 
make multiple calls like:


addMultipleFields(solrDoc, vector, vectorvalue, 1.0f);

for each value to be added to the multiValuedField.

Then, with luck, the simple wildcard query will be executed over each 
individual value when looking for matches, meaning the simple query 
syntax can be made adequate to do what's needed.


Many thanks Uwe.

B
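Once each vector is its own value, the query side reduces to building a plain wildcard string from whichever dimensions are fixed. A rough sketch follows; the class and method names are hypothetical, and it assumes the dimension values contain no query-syntax special characters.

```java
final class VectorQuerySketch {

    // dims holds one entry per dimension; null means "don't care" and is
    // rendered as a '*' wildcard. Dimensions are joined with underscores.
    static String buildWildcard(String field, String[] dims) {
        StringBuilder sb = new StringBuilder(field).append(':');
        for (int i = 0; i < dims.length; i++) {
            if (i > 0) {
                sb.append('_');
            }
            sb.append(dims[i] == null ? "*" : dims[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints: vector:Ox_*_*_*
        System.out.println(buildWildcard("vector", new String[] {"Ox", null, null, null}));
    }
}
```

If every indexed value has exactly four dimensions, the three literal underscores in the pattern account for all the underscores in the value, so each `*` stays within one dimension and no character exclusion is needed. One caveat: a pattern that starts with `*` (first dimension left open) may be rejected, since Lucene's query parser does not allow leading wildcards by default.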


  




Re: Excluding characters from a wildcard query

2009-07-01 Thread Shalin Shekhar Mangar
On Wed, Jul 1, 2009 at 5:07 PM, Ben b...@autonomic.net wrote:

 my brain was switched off.  I'm using SOLRJ, which means I'll need to
 specify multiple :

 addMultipleFields(solrDoc, vector, vectorvalue, 1.0f);

 for each value to be added to the multiValuedField.

 Then, with luck, the simple wildcard query will be executed over each
 individual value when looking for matches, meaning the simple query syntax
 can made adequate to do what's needed.


I'm not sure if you can do prefix queries with the fq parameter. You will
need to use the 'q' parameter for that.

You may also want to look at the regex query support in lucene (contrib
package). I don't think that is supported out of the box in Solr yet.
-- 
Regards,
Shalin Shekhar Mangar.


Excluding characters from a wildcard query

2009-06-30 Thread Ben
Passing in a RegularExpression like [^_]*_[^_]* (e.g. matching 
anything with an underscore in the string) using some code like :


...
parameters.add("fq", "vector:[^_]*_[^_]*");
...

seems to cause problems for SOLR, I assume because of the [ or ^ character.

Can somebody please advise how to handle character exclusion in such 
searches?


Any help or pointers are much appreciated!

Thanks

Ben


Re: Excluding characters from a wildcard query - More Info

2009-06-30 Thread Ben

The exception SOLR raises is :

org.apache.lucene.queryParser.ParseException: Cannot parse 
'vector:_*[^_]*_[^_]*_[^_]*': Encountered ] at line 1, column 12.

Was expecting one of:
   TO ...
   RANGEIN_QUOTED ...
   RANGEIN_GOOP ...
  

