Analyzer for indexing only, not for queries
Hi all,

I have a field that holds a kind of category tree as a string. The format is like this:

  prefixfirstsecond#prefixotherfirstothersecond

So the document belongs to two categories, separated by '#', and every category starts with the same prefix, which I don't want to use.

For indexing, I have one field per category level, filled by copyFields. For instance, the first level is defined using this type:

  <fieldType name="text_first_cat" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.PatternTokenizerFactory" pattern="(?:#|^)\w*([\p{L}\d]+)" group="1"/>
    </analyzer>
  </fieldType>

This works fine, except for one thing: the analyzer is also used for queries, not only for indexing. So a query for xfirst gets results, but a query for just first finds nothing. However, I want the latter to match.

If I add a pseudo-analyzer that does nothing, like this:

  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory" pattern=".*" group="0"/>
  </analyzer>

then I get the result I want. If I don't add a query analyzer at all, the index analyzer is used for queries, which is strange and not what I would expect.

I just want a take-the-query-as-it-is-and-do-nothing-with-it analyzer, as if I hadn't specified any analyzer at all. However, if I simply add

  <analyzer type="query"/>

to the type, I get a parser exception from Solr.

Is there a clean solution for this? And why does Solr ignore the analyzer type as long as only one analyzer is defined per type?

Greetings,
Michael
Re: Analyzer for indexing only, not for queries
Well, what would you have Solr do that makes sense if you don't define a query analyzer? Very, very strange things happen if you use different analyzers for indexing and querying. At least defaulting that way has a *chance* of giving expected results...

Why not use, say, KeywordTokenizerFactory if you really want the query analyzer to do nothing? Perhaps with lowercasing etc.

See: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

HTH
Erick
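For readers following along, the suggestion above can be sketched as the field type below. This is an illustration under stated assumptions, not Michael's actual schema: the index analyzer and its pattern are copied as they appear in the first message, and the only addition is a query analyzer built on solr.KeywordTokenizerFactory, which emits the analyzed text as a single token.

```xml
<!-- Sketch only: the index-time pattern is copied from the original post;
     the query analyzer is the assumed addition. -->
<fieldType name="text_first_cat" class="solr.TextField" positionIncrementGap="100">
  <!-- Used when documents are indexed: extracts the category part. -->
  <analyzer type="index">
    <tokenizer class="solr.PatternTokenizerFactory"
               pattern="(?:#|^)\w*([\p{L}\d]+)" group="1"/>
  </analyzer>
  <!-- Used for queries: leaves the query text as one token. -->
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
```

Declaring both analyzers explicitly avoids the fallback described in the thread, where a lone index analyzer is also applied to queries.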
KeywordTokenizer for faceting; was: Re: Analyzer for indexing only, not for queries
Hi Erick,

thank you very much for your help.

What's confusing me is that another of my fields does not have any analyzers defined at all, and it's working fine without problems. So it must be possible to define a field type without specifying any analyzers. I don't understand why that should no longer be possible once either the index or the query analyzer is specified and the other is not. Maybe it would be clearer if Solr raised an exception in this case instead of using an analyzer that was specified for the opposite type.

Anyway, I took your advice and used the KeywordTokenizerFactory instead. Great! Now it does exactly what I want. Thanks again!

But may I ask another question?

As with the categories, I have some fields that are only used for faceting, so they're only queried with values taken from facet results. No modification is needed, no lowercasing, nothing. So the KeywordTokenizerFactory is perfect for them.

Alas, when the value contains spaces, I'm still getting too many results. I have a field type defined like this:

  <fieldType name="text_unchanged" class="solr.StrField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
    </analyzer>
  </fieldType>

(Using solr.TextField didn't change anything.)

When querying with:

  fq=label:Aces+of+London

I get the result:

  "facet_fields":{
    "label":[
      "Aces of London",31,
      "Feud London",2,
      "Fly London",2],
  },

I get the same result when taking Feud London as the facet value.

When inspecting the index with the schema browser, I can see that all labels are indexed as complete tokens, i.e. there's no token London, but there is a token Aces of London. So the KeywordTokenizer seems to work as expected, at least for indexing. It's only that the facet query is not narrow enough.

Even the superb Solr book didn't help me here. Do you, or anyone else, have a clue what I'm doing wrong here?

Greetings,
Michael
Re: KeywordTokenizer for faceting; was: Re: Analyzer for indexing only, not for queries
> What's confusing me is that another of my fields does not have any
> analyzers defined at all, and it's working fine without problems.

Field or fieldType?

> So, it must be possible to define field type without specifying any
> analyzers.

Truth to tell, I don't know off the top of my head what happens if you define no analyzer for a fieldType. I think it would be bad practice anyway; *I* want to *know* what indexing and analyzing operations are going on so I can predict the results <G>. Someone want to chime in?

As for the second part, I'll have to defer (my boss actually wants me to do work). But you'd get a better response if you posted it as a separate thread. See:

http://people.apache.org/~hossman/#threadhijack

  When starting a new discussion on a mailing list, please do not reply to an existing message; instead start a fresh email. Even if you change the subject line of your email, other mail headers still track which thread you replied to, and your question is hidden in that thread and gets less attention. It makes following discussions in the mailing list archives particularly difficult.

  See also: http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking

Best,
Erick
Re: KeywordTokenizer for faceting; was: Re: Analyzer for indexing only, not for queries
Hi Erick,

On 03/12/10 17:09, Erick Erickson wrote:
>> What's confusing me is that another of my fields does not have any
>> analyzers defined at all, and it's working fine without problems.
>
> Field or fieldType?

...one of my fields with a fieldtype that does not have any analyzer defined at all, ... ;-)

>> So, it must be possible to define field type without specifying any
>> analyzers.
>
> Truth to tell, I don't know off the top of my head what happens if you
> define no analyzer for a fieldType. I think it would be bad practice
> anyway; *I* want to *know* what indexing and analyzing operations are
> going on so I can predict the results <G>. Someone want to chime in?

It looks like the whole string is used as a single token, just as the KeywordTokenizerFactory does. You're right that it's always better to specify explicitly what you want. But since I didn't know before what the KeywordTokenizerFactory does (I assumed it would tokenize the string into several keywords), and as long as I'm in the development phase, the unspecified behaviour was quite okay for me.

Once again, thank you for your help!
Michael
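As a hedged illustration of the analyzer-less case discussed here, a facet-only field could be declared roughly as below. This assumes the type in question is based on solr.StrField, which the thread does not confirm; the type name string_exact is hypothetical, and the field name label is taken from the facet example earlier in the thread.

```xml
<!-- Sketch only: "string_exact" is a hypothetical name. StrField indexes the
     whole value as a single term, so no analyzer element is declared. -->
<fieldType name="string_exact" class="solr.StrField"/>

<!-- Facet-only field using that type (attributes are illustrative). -->
<field name="label" type="string_exact" indexed="true" stored="true"/>
```

On the facet question that went unanswered in the thread, one likely cause (an assumption, not something the participants confirmed) is that the standard query parser splits unquoted input on whitespace before any analyzer runs, so label:Aces of London searches label for Aces only and sends of and London to the default field. Quoting the value, as in fq=label:"Aces of London", keeps it together as one term.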