Yeah, this is a head scratcher. But it _has_ to be that way for things
like edismax to work, where you mix and match fielded and un-fielded
terms. I.e. I can have a query like q=field1:whatever some more
stuff&qf=field2,field3,field4 where I want "whatever" to be evaluated
only against field1, but the remaining terms to be searched for in the
three other fields.
The deal is that how you want _individual terms_ handled at index time
may be different from how you want them handled at query time;
WordDelimiterFilterFactory and SynonymFilterFactory are prime examples
of this. Getting my head around why field analysis is completely
different from query _parsing_ took me a while. The fact that both are
called "query" is confusing, but I'm not sure what would be better, since
they're very closely related: both deal with queries, just at different
stages.
Missed the wildcards; you're right, you need to escape the whitespace. Or use
the prefix query parser. It'd look like:
q={!prefix f=proj_name_sort}CR610070 An
No escaping is necessary. If you add debug=query to a query using the
prefix query parser, you'll see that there's an implied trailing *. Do be
aware, though, that there is _no_ analysis done, so things like
lowercasing would have to be done by the app.
Neither one is more correct. In fact, I believe that the wildcard
query becomes a prefix query eventually, so it's strictly a matter of how you
want to deal with that in the app.
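Since the prefix parser does no analysis, the app has to normalize the value itself. A minimal sketch of that, assuming the string_sort field type quoted further down in this thread (lowercased and trimmed at index time); prefix_query is a made-up helper name, not part of any Solr client:

```python
from urllib.parse import urlencode


def prefix_query(field: str, prefix: str) -> str:
    """Build a {!prefix} query string, normalizing the value in the app.

    Assumption: the indexed tokens were lowercased and trimmed, so the
    prefix is lowercased and leading whitespace is dropped here. Inner and
    trailing spaces are kept because they are part of the prefix.
    """
    normalized = prefix.lstrip().lower()
    return "{!prefix f=" + field + "}" + normalized


# URL-encode the parameters; debug=query lets you inspect the parsed query.
params = urlencode({
    "q": prefix_query("proj_name_sort", "CR610070 An"),
    "debug": "query",
})
print(params)
```

If the field's analysis chain differs from the one assumed here, the normalization step would need to change to match it.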
Best,
Erick
On Wed, Mar 25, 2015 at 10:04 AM, Vadim Gorlovetsky vadim...@amdocs.com wrote:
Thanks for a quick response.
It is a bit confusing that a query-type analyzer configured to use
KeywordTokenizerFactory does not keep the query criteria un-tokenized.
I guess whitespace is the only special case, because it separates phrases in a
query and is handled before analysis.
Actually, I am handling the query the way you recommended:
double quotes for exact matching and escaped whitespace for values with
wildcards (double quotes do not work there, probably because the * wildcard is
treated as part of the criteria value).
Thanks
Vadim
-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Wednesday, March 25, 2015 6:34 PM
To: solr-user@lucene.apache.org
Subject: [MARKETING] Re: KeywordTokenizerFactory splits by whitespaces
This is a _very_ common thing we all had to learn: what you're seeing is the
result of the _query parser_, not the analysis chain. Anything like
proj_name_sort:term1 term2 gets split at the query parser level. Attaching
debug=query to the URL should show, down in the parsed query section,
something like:
proj_name_sort:term1 default_search_field:term2
To get the whole value through the query parser, enclose it in double quotes,
escape the space, and such. That'll get the terms _as a single token_ to the
analysis chain for that field, where the behavior will be what you expect.
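The two options above (double quotes, or escaping the space) could be applied on the client side roughly like this. This is only a sketch: the helper names are made up for illustration, and it handles whitespace, backslashes and double quotes only, not the other characters the query parser treats as special.

```python
import re


def escape_whitespace(value: str) -> str:
    """Put a backslash before every whitespace character so the query
    parser does not split the value into separate clauses."""
    return re.sub(r"\s", lambda m: "\\" + m.group(0), value)


def exact_match(field: str, value: str) -> str:
    """Exact match: wrap the value in double quotes, escaping any
    backslashes and double quotes it already contains."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'


def wildcard_prefix(field: str, value: str) -> str:
    """Wildcard match: escape whitespace and append a trailing *.
    Quotes are not used here because * inside quotes is not a wildcard."""
    return f"{field}:{escape_whitespace(value)}*"


print(exact_match("proj_name_sort", "CR610070 Test1"))
print(wildcard_prefix("proj_name_sort", "CR610070 Te"))
```

Checking the result with debug=query should show a single clause against proj_name_sort rather than a second clause against the default search field.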
Best,
Erick
On Wed, Mar 25, 2015 at 9:26 AM, Vadim Gorlovetsky vadim...@amdocs.com
wrote:
Hello,
solr.KeywordTokenizerFactory seems to split on whitespace, though according
to the Solr documentation it shouldn't do that.
For example, I have the following configuration for the fields proj_name
and proj_name_sort:
<field name="proj_name" type="sortable_text_general" indexed="true" stored="true"/>
<field name="proj_name_sort" type="string_sort" indexed="true" stored="false"/>
..
<copyField source="proj_name" dest="proj_name_sort" />
..
<fieldType name="string_sort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- KeywordTokenizer does no actual tokenizing, so the entire
         input string is preserved as a single token
    -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- The LowerCase TokenFilter does what you expect, which can be
         when you want your sorting to be case insensitive
    -->
    <filter class="solr.LowerCaseFilterFactory" />
    <!-- The TrimFilter removes any leading or trailing whitespace -->
    <filter class="solr.TrimFilterFactory" />
  </analyzer>
</fieldType>
There are 3 indexed documents having the respective field values:
proj_name:
Test1008
CR610070 Test1
CR610070 Another Test2
Searching on proj_name_sort gives me the following results:
Query | Expected | Real | Comments
proj_name_sort : CR610070 Test1 | CR610070 Test1 | CR610070 Test1 | Expectable as seems searching exact un-tokenized value
proj_name_sort : CR610070 Te | None | None | Expectable as seems searching exact un-tokenized value
proj_name_sort : CR610070 Te* | CR610070 Test1 | CR610070 Test1, Test1008, CR610070 Another Test2 | Seems splits on tokens by whitespace ?
proj_name_sort : CR610070 An* | CR610070 Another Test2 | CR610070 Another Test2 | Expectable as seems applying wild card on un-tokenized value
proj_name_sort : CR610070 Another Te* | CR610070 Another Test2 | CR610070 Test1, Test1008, CR610070 Another Test2 | Seems splits on tokens by whitespace ?
proj_name_sort : CR610070 Another Test1* | None | CR610070 Test1, Test1008, CR610070