Re: Import Handler for tokenizing facet string into multi-valued solr.StrField..

2011-01-27 Thread Erick Erickson
Tokenization is fine with facets; that caution is about, say, faceting
on the tokenized body of a document, where you have potentially
a huge number of unique tokens.

But if there is a controlled number of distinct values, you shouldn't have
to do anything except index to a tokenized field. I'd remove stemming,
WordDelimiterFilterFactory, etc. though; in fact I'd probably just go with
WhitespaceTokenizer and, maybe, LowerCaseFilter.

But if you have a huge number of unique values, it doesn't matter whether
they are tokenized or strings; it'll still be a problem.

One note: when faceting for the first time on a newly-started Solr instance,
the caches are filled and the *first* query will be slower, so measure
subsequent queries.

Best
Erick
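
The measurement advice above can be sketched in Python as follows. This is only an illustration, not something from the thread: `run_query` is a hypothetical zero-argument callable standing in for whatever issues one faceted request, and `repeats` is an arbitrary choice. The first call is reported separately because it pays for filling the caches.

```python
import time


def time_queries(run_query, repeats=5):
    """Return (first_call_seconds, list_of_later_call_seconds).

    run_query is a hypothetical zero-argument callable that issues
    one faceted request and waits for the response.
    """
    timings = []
    for _ in range(repeats + 1):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    # The first call pays for cache warm-up, so report it separately.
    return timings[0], timings[1:]


if __name__ == "__main__":
    # Stand-in for a real request, so the sketch runs on its own.
    cold, warm = time_queries(lambda: sum(range(100000)), repeats=3)
    print("first call: %.6fs" % cold)
    print("later calls: %s" % ", ".join("%.6fs" % t for t in warm))
```

Comparing the later timings with each other, rather than with the first one, gives a fairer picture of steady-state facet performance.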

On Thu, Jan 27, 2011 at 9:09 AM, Dennis Schafroth den...@indexdata.com wrote:

 Hi,

 I'm pretty new to Solr coding, but am looking for hints about how (if it is
 not already done) to implement a PatternTokenizer that would index this into
 multi-valued fields of solr.StrField for faceting. Example:

 Water -- Irrigation ; Water -- Sewage

 should be tokenized into

 Water
 Irrigation
 Sewage

 in multi-valued non-tokenized fields, for performance reasons. I could do it
 from the outside, but I would like to use this as an opportunity to learn
 about Solr.

 It works as I want with the PatternTokenizerFactory when I am using
 solr.TextField, but not when I am using the non-tokenized solr.StrField. But
 according to what I have read, facet performance is better on non-tokenized
 fields. We need better performance on our faceted searches on these
 multi-valued fields (25 million documents, three multi-valued facets).

 I would also need a filter that filters out identical values, as the
 feeds have redundant data, as shown above.

 Can anyone point me in the right direction?

 cheers,
 :-Dennis


Re: Import Handler for tokenizing facet string into multi-valued solr.StrField..

2011-01-27 Thread Erik Hatcher
Beyond what Erick said, I'll add that it is often better to do this from the 
outside and send in multiple actual end-user displayable facet values.  When 
you send in a field like Water -- Irrigation ; Water -- Sewage, that is what 
will get stored (if you have it set to stored), but what you might rather want 
is each individual value stored, which can only be done by the indexer sending 
in multiple values, not through just tokenization.

Erik
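
A minimal Python sketch of doing the split "from the outside", as suggested above. Nothing here comes from the thread beyond the example string: the separator pattern is inferred from that one example, and the field names `id` and `subject` and the document id are hypothetical. How the resulting document gets posted depends on the Solr version and client, so that step is left out.

```python
import json
import re

# Assumed separators, taken from the example in this thread:
# ";" between entries and "--" between the levels of one entry.
SEPARATORS = re.compile(r"\s*(?:;|--)\s*")


def split_facet_values(raw):
    """Split a raw subject string into unique, display-ready values."""
    seen = set()
    values = []
    for part in SEPARATORS.split(raw):
        value = part.strip()
        # Skip empty fragments and values already collected.
        if value and value not in seen:
            seen.add(value)
            values.append(value)
    return values


def build_update_doc(doc_id, raw_subject):
    """Build one document with a multi-valued "subject" field.

    The field names "id" and "subject" are hypothetical.
    """
    return {"id": doc_id, "subject": split_facet_values(raw_subject)}


if __name__ == "__main__":
    doc = build_update_doc("rec-1", "Water -- Irrigation ; Water -- Sewage")
    print(json.dumps([doc]))
```

Because each value arrives as its own field value, a multi-valued solr.StrField can both store and facet on the individual strings, and the redundant "Water" is dropped before indexing rather than by an analysis filter.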


Re: Import Handler for tokenizing facet string into multi-valued solr.StrField..

2011-01-27 Thread Dennis Schafroth
Thanks for the hints!

Sorry about hijacking the thread "query range in multivalued date field";
I mistakenly replied to it.

cheers,
:-Dennis 

On 27/01/2011, at 16.48, Erik Hatcher wrote:

 Beyond what Erick said, I'll add that it is often better to do this from the 
 outside and send in multiple actual end-user displayable facet values.  When 
 you send in a field like Water -- Irrigation ; Water -- Sewage, that is 
 what will get stored (if you have it set to stored), but what you might 
 rather want is each individual value stored, which can only be done by the 
 indexer sending in multiple values, not through just tokenization.
 
   Erik
 


Re: Import Handler for tokenizing facet string into multi-valued solr.StrField..

2011-01-27 Thread Chris Hostetter

: Subject: Import Handler for tokenizing facet string into multi-valued
: solr.StrField.. 
: In-Reply-To: 1296123345064-2361292.p...@n3.nabble.com
: References: 1296123345064-2361292.p...@n3.nabble.com


-Hoss