I'm not quite sure I understand exactly what you mean.
The string I'm processing could have many tens of thousands of values... I hope you aren't implying I'd need to split it into many tens of thousands of "columns".

If I understand you correctly, you're saying that I should leave whitespace between the individual parts of the string, pass the string into a "multiValued" field, and have SOLR internally treat each "word" as an individual entity?
Thanks for your help with this...

Ben

Uwe Klosa wrote:
To get the desired effect I described you have to do the split before you
send the document to solr. I'm not aware of an analyzer that can split one
field value into several field values. The analyzers and tokenizers do
create tokens from field values in many different ways.

As I see it you have to do some preprocessing yourself.
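
A minimal sketch of that kind of preprocessing, assuming the client builds each document as a plain dictionary before sending it to Solr. The field names "id" and "vector" and the helper names are hypothetical, and "vector" is assumed to be declared as multiValued in the schema:

```python
def split_vectors(raw):
    """Split a comma-separated vector string into individual values."""
    # Drop empty pieces so a trailing comma does not create an empty value.
    return [part.strip() for part in raw.split(",") if part.strip()]


def build_document(doc_id, raw):
    """Build a document whose "vector" field holds one value per vector."""
    # "id" and "vector" are assumed field names, not taken from a real schema.
    return {"id": doc_id, "vector": split_vectors(raw)}


doc = build_document("doc1", "A1_B1_C1_D1,A2_B2_C2_D2,A3_B3_C3_D3")
print(doc["vector"])
```

Each list element would then be sent as a separate value of the same field, so a wildcard query is evaluated against every vector on its own.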

Uwe

2009/7/1 Ben <b...@autonomic.net>

Is there a way in the Schema to specify that the comma should be used to
split the values up? e.g. Can I specify my "vector" field as multivalue and
also specify some sort of tokeniser to automatically split on commas?

Ben



Uwe Klosa wrote:

You should split the strings at the comma yourself and store the values in
a multivalued field. Then wildcard searches like A1_* are not a problem. I
don't know much about facets, but if they work on multivalued fields, that
should be no problem at all.

Uwe

2009/7/1 Ben <b...@autonomic.net>



Yes, I had done that... however, I'm beginning to see now that what I am
doing is called a "wildcard query", which goes via Lucene's query parser.
Lucene's query parser doesn't support the regexp idea of character
exclusion ... i.e. I'm not trying to match "[", I'm trying to express
"Match as many characters as possible, which are not underscores" with [^_]*

Perhaps I'm going about my whole problem in an ineffective way, but I'm
not sure how I can sensibly describe what I'm doing without it becoming a
long document.

The only other approach I can think of is to change what I'm indexing but
I'm not sure how to achieve that.
I've tried explaining it once, and obviously failed, so I'll try again.

I'm given a string containing many vectors (where each dimension is
separated by an underscore, and each vector is separated by a comma) e.g.

A1_B1_C1_D1,A2_B2_C2_D2,A3_B3_C3_D3

I want my facet query to tell me if, within one of the vectors within
that string, there is a match for dimensions I'm interested in. Of the four
dimensions in this example, I may choose to fix an arbitrary number of
them with values, and the rest with wildcards e.g. I might look for a facet
containing Ox_*_*_* so one of the vectors in the string must have its
first dimension matching "Ox" and I don't care about the rest.

***Is there a way to break down this string on the commas so that I can
apply a normal wildcard query and SOLR applies it to each vector
individually?***
That would solve all my problems :
e.g.
The string is internally represented in lucene/solr as
A1_B1_C1_D1
A2_B2_C2_D2
A3_B3_C3_D3

where it tries to match the wildcard query on each in turn?
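
To pin down the matching behaviour described above, here is a small sketch outside of Solr. It only illustrates the intended semantics and is not how Lucene evaluates wildcard queries; the function names are hypothetical. Each "*" in the pattern is treated as [^_]*, so a wildcard never spans more than one dimension:

```python
import re


def wildcard_to_regex(pattern):
    """Translate a pattern such as Ox_*_*_* into a compiled regex.

    Every "*" becomes [^_]* so it cannot cross an underscore.
    """
    literal_parts = pattern.split("*")
    return re.compile("[^_]*".join(re.escape(p) for p in literal_parts))


def any_vector_matches(raw, pattern):
    """Return True if at least one comma-separated vector matches fully."""
    regex = wildcard_to_regex(pattern)
    return any(regex.fullmatch(vector) for vector in raw.split(","))


data = "A1_B1_C1_D1,A2_B2_C2_D2,A3_B3_C3_D3"
print(any_vector_matches(data, "A2_*_*_*"))
print(any_vector_matches(data, "Ox_*_*_*"))
```

With the sample string, the first pattern matches the second vector, while the second pattern matches nothing because no vector starts with "Ox".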

Thanks for your help, I'm deeply confused about this at the moment...

Ben





