Analyzer for indexing only, not for queries

2010-03-12 Thread Michael Kuhlmann
Hi all,

I have a field with some kind of category tree as a string. The format
is like this:
prefixfirstsecond#prefixotherfirstothersecond

So, the document is categorized in two categories, separated by '#', and
all categories start with the same prefix which I don't want to use.

For indexing, I have some fields for each category level, filled by
copyFields. For instance, the first level is defined using this type:

<fieldType name="text_first_cat" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.PatternTokenizerFactory"
               pattern="(?:#|^)\w*([\p{L}\d]+)" group="1"/>
  </analyzer>
</fieldType>

This works fine except for one thing: the analyzer is also used for
queries, not only for indexing. So a query for xfirst returns
results, but a query for just first finds nothing. I want the
latter query to match.

If I add a pseudo-analyzer that does nothing, like this:
  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory"
               pattern=".*" group="0"/>
  </analyzer>
then I get the result that I want. If I don't add a query analyzer at
all, the index analyzer is used for queries, which is strange and
not what I would expect.

I just want some
Take-the-query-as-it-is-and-do-nothing-with-it-Analyzer, as if I don't
specify some analyzer at all. However, if I simply add
  <analyzer type="query"/>
to it, I get a parser exception from Solr.

Is there a clean solution for this? And why is Solr ignoring the
analyzer type as long as there is only one analyzer defined per type?

Greetings,
Michael


Re: Analyzer for indexing only, not for queries

2010-03-12 Thread Erick Erickson
Well, what would you have SOLR do that makes sense if you
don't define a query analyzer? Very very strange things
happen if you use different analyzers for indexing
and querying. At least defaulting that way has a *chance* of
giving expected results...

Why not use, say, KeywordTokenizerFactory if you really
want the query analyzer to do nothing? Perhaps lowercasing
etc. See:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

HTH
Erick
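
A minimal sketch of how that suggestion might look in schema.xml, assuming the field type from the original question: the index analyzer keeps the PatternTokenizerFactory, while the query analyzer uses KeywordTokenizerFactory so the query string passes through as a single unchanged token. The attribute values come from the question; the angle brackets and quoting are reconstructed, so treat this as illustrative rather than the poster's exact configuration.

```xml
<fieldType name="text_first_cat" class="solr.TextField" positionIncrementGap="100">
  <!-- Index time: extract the category token with the pattern from the question -->
  <analyzer type="index">
    <tokenizer class="solr.PatternTokenizerFactory"
               pattern="(?:#|^)\w*([\p{L}\d]+)" group="1"/>
  </analyzer>
  <!-- Query time: emit the whole query string as one token, unmodified -->
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
```

If case-insensitive matching is wanted, a `<filter class="solr.LowerCaseFilterFactory"/>` element could be added after the tokenizer in both analyzers, so that indexed and queried tokens are normalized the same way.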


KeywordTokenizer for faceting; was: Re: Analyzer for indexing only, not for queries

2010-03-12 Thread Michael Kuhlmann
Hi Erick,

thank you very much for your help. What's confusing me is that another
of my fields does not have any analyzers defined at all, and it's
working fine without problems. So, it must be possible to define a field
type without specifying any analyzers. I don't understand why it
shouldn't be possible any more if either the index or the query analyzer
is specified and the other is not. Maybe it would be clearer if Solr
raised an exception in this case instead of using an analyzer that was
specified for the opposite type.

Anyway, I took your advice and used the KeywordTokenizerFactory instead.
Great! Now it does exactly what I want. Thanks again!

But may I ask another question? As with the categories, I have some
fields that are only used for faceting, so they're only queried by facet
results. No modification is needed, no lowercase, nothing. So the
KeywordTokenizerFactory is perfect for them.

Alas, when the value contains spaces, I'm still getting too many
results. I have a field defined like this:

<fieldType name="text_unchanged" class="solr.StrField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

(Using solr.TextField didn't change anything)

When querying for:
fq=label:Aces+of+London

I get the result:
 "facet_fields":{
   "label":[
     "Aces of London",31,
     "Feud London",2,
     "Fly London",2],
 },

I get the same result when taking Feud London as the facet value.

When inspecting the index with the schema browser, I can see that all
labels are indexed correctly as complete tokens, i.e. there's no token
London, but a token Aces of London. So the KeywordTokenizer seems to
work as expected, at least for indexing. It's only that the filter query
is not narrow enough.

Even the superb Solr book didn't help me here. Do you, or anyone else,
have a clue what I'm doing wrong here?

Greetings,
Michael
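
One possible explanation, offered as an assumption since the thread does not resolve it: Solr's standard query parser splits an unquoted query on whitespace before any field analyzer runs, so fq=label:Aces+of+London would be read as label:Aces plus the bare terms of and London searched against the default field. Quoting the value, as in fq=label:"Aces+of+London", hands the whole string to the field as one unit. On the schema side, a facet-only field is often declared with a plain string type and no analyzer at all, roughly as in the sketch below; the field attributes shown are assumptions, not taken from the poster's schema.

```xml
<!-- Un-analyzed string type: the whole field value is indexed as a single term -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

<!-- Hypothetical facet-only field using that type -->
<field name="label" type="string" indexed="true" stored="true"/>
```

With a type like this, the facet values returned by Solr can be fed back as quoted filter values without any analysis changing them.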


Re: KeywordTokenizer for faceting; was: Re: Analyzer for indexing only, not for queries

2010-03-12 Thread Erick Erickson
> What's confusing me is that another
> of my fields does not have any analyzers defined at all, and it's
> working fine without problems.

Field or fieldType?

> So, it must be possible to define field
> type without specifying any analyzers.

Truth to tell, I don't know off the top of my head
what happens if you define no analyzer for a fieldType.
I think it would be bad practice anyway; *I* want to *know*
what indexing and analysis operations are going on so
I can predict the results <G>. Someone want to chime in?

As for the second part, I'll have to defer (my boss actually wants
me to do work). But you'd get a better response if you posted
it as a separate thread. See:
http://people.apache.org/~hossman/#threadhijack

When starting a new discussion on a mailing list, please do not reply to
an existing message, instead start a fresh email.  Even if you change the
subject line of your email, other mail headers still track which thread
you replied to and your question is hidden in that thread and gets less
attention.   It makes following discussions in the mailing list archives
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking

Best, Erick


Re: KeywordTokenizer for faceting; was: Re: Analyzer for indexing only, not for queries

2010-03-12 Thread Michael Kuhlmann
Hi Erick,

On 03/12/10 17:09, Erick Erickson wrote:
>> What's confusing me is that another
>> of my fields does not have any analyzers defined at all, and it's
>> working fine without problems.
>
> Field or fieldType?

...one of my fields with a fieldtype that does not have any analyzer
defined at all, ... ;-)

>> So, it must be possible to define field
>> type without specifying any analyzers.
>
> Truth to tell, I don't know off the top of my head
> what happens if you define no analyzer for a fieldType.
> I think it would be bad practice anyway, *I* want to *know*
> what indexing and analyzing operations are going on so
> I can predict the results <G>. Someone want to chime in?

It looks like the whole string is used as a single token, just as the
KeywordTokenizerFactory does. You're right that it's always
better to explicitly specify what you want. But I didn't know before
what the KeywordTokenizerFactory does (I assumed that it would
tokenize the string into several keywords), and as long as I'm in the
development phase, the unspecified behaviour was quite okay for me.

Once again, thank you for your help!

Michael