Re: [MARKETING] Re: KeywordTokenizerFactory splits by whitespaces

2015-03-25 Thread Erick Erickson
Yeah, this is a head scratcher. But it _has_ to be that way for things
like edismax to work, where you mix and match fielded and un-fielded
terms. I.e. I can have a query like
q=field1:whatever some more stuff&qf=field2,field3,field4
where I want "whatever" to be evaluated only against field1, but the
remaining terms to be searched for in the three other fields.
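
To make the parameter split concrete, here is a minimal Python sketch that builds such a request. The function name is made up for illustration, the field names are just the placeholders from above, and qf is written space-separated, which is the form I'm sure of; treat it as a sketch, not the one right way to do it.

```python
from urllib.parse import urlencode, parse_qs

def edismax_query_string(q, qf_fields):
    """Build the URL query string for an edismax request (hypothetical helper)."""
    params = {
        "defType": "edismax",        # select the edismax query parser
        "q": q,                      # may mix fielded and un-fielded terms
        "qf": " ".join(qf_fields),   # fields searched for the un-fielded terms
    }
    # urlencode percent-encodes each value and joins the pairs with '&'
    return urlencode(params)

query_string = edismax_query_string(
    "field1:whatever some more stuff",
    ["field2", "field3", "field4"],
)
print(query_string)
```

The point is only that q and qf are separate request parameters: the parser sees the whole q string and decides, term by term, which field each clause goes to.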

The deal is that how you want _individual terms_ handled at index time
may be different than at query time; WordDelimiterFilterFactory and
SynonymFilterFactory are prime examples of this. Getting my head
around why field analysis is completely different from query _parsing_
took me a while. The fact that both are called "query" is confusing, but
I'm not sure what would be better: they're very closely related, and
both deal with queries, just at different stages.

Missed the wildcards; you're right, you need to escape the space. Or use
the prefix query parser. It'd look like:
q={!prefix f=proj_name_sort}CR610070 An

No escaping is necessary. If you add debug=query to a query using the
prefix parser, you'll see that there's an implied trailing *. Do be
aware, though, that there is _no_ analysis done, so things like
lowercasing would have to be done by the app.
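
As a rough illustration of "done by the app", here is a small Python sketch that lowercases the input itself before handing it to the prefix parser. The helper name is hypothetical, and it assumes the only index-time normalization that matters for the prefix is the LowerCaseFilterFactory from the string_sort type in this thread.

```python
from urllib.parse import urlencode

def prefix_query_params(field, prefix):
    """Build request params for Solr's prefix query parser (hypothetical helper)."""
    # The prefix parser runs no analysis, so mirror the index-time
    # lowercasing here. No escaping is applied to the value.
    normalized = prefix.lower()
    return {
        "q": "{!prefix f=" + field + "}" + normalized,
        "debug": "query",  # shows the parsed query in the response
    }

params = prefix_query_params("proj_name_sort", "CR610070 An")
print(params["q"])
print(urlencode(params))
```

If the field type did more at index time (folding accents, say), the app would have to mirror that as well.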

Neither one is more correct; in fact, I believe the wildcard
query becomes a prefix query eventually. It's strictly a matter of how you
want to deal with that in the app.

Best,
Erick

On Wed, Mar 25, 2015 at 10:04 AM, Vadim Gorlovetsky vadim...@amdocs.com wrote:
 Thanks for the quick response.

 It is a bit confusing that a query-type analyzer configured to use
 KeywordTokenizerFactory does not keep the query criteria un-tokenized.
 I guess whitespace is the one special case, because it separates clauses in
 a query and the query parser runs before analysis.

 Actually, I am already handling queries the way you recommend:
 double quotes for exact matching and escaped whitespace for values with
 wildcards (double quotes do not work there, probably because the * wildcard
 is then treated as part of the criteria value).

 Thanks
 Vadim

 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Wednesday, March 25, 2015 6:34 PM
 To: solr-user@lucene.apache.org
 Subject: [MARKETING] Re: KeywordTokenizerFactory splits by whitespaces

 This is a _very_ common thing we all had to learn; what you're seeing is the
 result of the _query parser_, not the analysis chain. Anything like
 proj_name_sort:term1 term2 gets split at the query parser level. Attaching
 debug=query to the URL should show, down in the parsed query section,
 something like:

 proj_name_sort:term1 default_search_field:term2

 To get things through the query parser intact, enclose them in double quotes,
 escape the space, and such. That'll get the terms _as a single token_ to the
 analysis chain for that field, where the behavior will be what you expect.

 Best,
 Erick

 On Wed, Mar 25, 2015 at 9:26 AM, Vadim Gorlovetsky vadim...@amdocs.com 
 wrote:
 Hello,

 solr.KeywordTokenizerFactory seems to split on whitespace, though according
 to the Solr documentation it shouldn't do that.


 For example I have the following configuration for the fields proj_name 
 and proj_name_sort:

 <field name="proj_name" type="sortable_text_general" indexed="true" stored="true"/>
 <field name="proj_name_sort" type="string_sort" indexed="true" stored="false"/>
 ..

 <copyField source="proj_name" dest="proj_name_sort" />
 ..

 <fieldType name="string_sort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
   <analyzer>
     <!-- KeywordTokenizer does no actual tokenizing, so the entire
          input string is preserved as a single token
       -->
     <tokenizer class="solr.KeywordTokenizerFactory"/>
     <!-- The LowerCase TokenFilter does what you expect, which can be
          when you want your sorting to be case insensitive
       -->
     <filter class="solr.LowerCaseFilterFactory" />
     <!-- The TrimFilter removes any leading or trailing whitespace -->
     <filter class="solr.TrimFilterFactory" />
   </analyzer>
 </fieldType>

 There are 3 indexed documents having the respective field values:
 proj_name:
 Test1008
 CR610070 Test1
 CR610070 Another Test2

 Searching on proj_name_sort gives me the following results:

 1. Query:    proj_name_sort : CR610070 Test1
    Expected: CR610070 Test1
    Real:     CR610070 Test1
    Comments: As expected; seems to search on the exact un-tokenized value

 2. Query:    proj_name_sort : CR610070 Te
    Expected: None
    Real:     None
    Comments: As expected; seems to search on the exact un-tokenized value

 3. Query:    proj_name_sort : CR610070 Te*
    Expected: CR610070 Test1
    Real:     CR610070 Test1, Test1008, CR610070 Another Test2
    Comments: Seems to split into tokens on whitespace?

 4. Query:    proj_name_sort : CR610070 An*
    Expected: CR610070 Another Test2
    Real:     CR610070 Another Test2
    Comments: As expected; seems to apply the wildcard to the un-tokenized value

 5. Query:    proj_name_sort : CR610070 Another Te*
    Expected: CR610070 Another Test2
    Real:     CR610070 Test1, Test1008, CR610070 Another Test2
    Comments: Seems to split into tokens on whitespace?

 6. Query:    proj_name_sort : CR610070 Another Test1*
    Expected: None
    Real:     CR610070 Test1, Test1008, CR610070

RE: [MARKETING] Re: KeywordTokenizerFactory splits by whitespaces

2015-03-25 Thread Vadim Gorlovetsky
Thanks for the quick response.

It is a bit confusing that a query-type analyzer configured to use
KeywordTokenizerFactory does not keep the query criteria un-tokenized.
I guess whitespace is the one special case, because it separates clauses in
a query and the query parser runs before analysis.

Actually, I am already handling queries the way you recommend:
double quotes for exact matching and escaped whitespace for values with
wildcards (double quotes do not work there, probably because the * wildcard
is then treated as part of the criteria value).

Thanks
Vadim

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Wednesday, March 25, 2015 6:34 PM
To: solr-user@lucene.apache.org
Subject: [MARKETING] Re: KeywordTokenizerFactory splits by whitespaces

This is a _very_ common thing we all had to learn; what you're seeing is the
result of the _query parser_, not the analysis chain. Anything like
proj_name_sort:term1 term2 gets split at the query parser level. Attaching
debug=query to the URL should show, down in the parsed query section,
something like:

proj_name_sort:term1 default_search_field:term2

To get things through the query parser intact, enclose them in double quotes,
escape the space, and such. That'll get the terms _as a single token_ to the
analysis chain for that field, where the behavior will be what you expect.
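
For illustration, the two options (quoting and escaping the space) could be sketched in Python like this. Both helper names are made up, and the escaping is deliberately minimal: the wildcard helper only handles spaces, so a real app would also need to escape the other query-syntax characters (such as : or parentheses) if they can occur in values.

```python
def quoted_term(field, value):
    """field:"value" for an exact match on the whole value (hypothetical helper)."""
    # Escape backslashes first, then embedded double quotes
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return field + ':"' + escaped + '"'

def escaped_wildcard_term(field, value):
    """field:escaped-value* with spaces backslash-escaped (hypothetical helper)."""
    # Escaping the space keeps the parser from splitting the value into
    # separate clauses; the trailing * stays unescaped so it acts as a wildcard.
    return field + ":" + value.replace(" ", "\\ ") + "*"

print(quoted_term("proj_name_sort", "CR610070 Test1"))
print(escaped_wildcard_term("proj_name_sort", "CR610070 Te"))
```

The first form suits exact matches; the second is for when a trailing wildcard is needed, since inside double quotes the * would not act as a wildcard.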

Best,
Erick

On Wed, Mar 25, 2015 at 9:26 AM, Vadim Gorlovetsky vadim...@amdocs.com wrote:
 Hello,

 solr.KeywordTokenizerFactory seems to split on whitespace, though according
 to the Solr documentation it shouldn't do that.


 For example I have the following configuration for the fields proj_name and 
 proj_name_sort:

 <field name="proj_name" type="sortable_text_general" indexed="true" stored="true"/>
 <field name="proj_name_sort" type="string_sort" indexed="true" stored="false"/>
 ..

 <copyField source="proj_name" dest="proj_name_sort" />
 ..

 <fieldType name="string_sort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
   <analyzer>
     <!-- KeywordTokenizer does no actual tokenizing, so the entire
          input string is preserved as a single token
       -->
     <tokenizer class="solr.KeywordTokenizerFactory"/>
     <!-- The LowerCase TokenFilter does what you expect, which can be
          when you want your sorting to be case insensitive
       -->
     <filter class="solr.LowerCaseFilterFactory" />
     <!-- The TrimFilter removes any leading or trailing whitespace -->
     <filter class="solr.TrimFilterFactory" />
   </analyzer>
 </fieldType>

 There are 3 indexed documents having the respective field values:
 proj_name:
 Test1008
 CR610070 Test1
 CR610070 Another Test2

 Searching on proj_name_sort gives me the following results:

 1. Query:    proj_name_sort : CR610070 Test1
    Expected: CR610070 Test1
    Real:     CR610070 Test1
    Comments: As expected; seems to search on the exact un-tokenized value

 2. Query:    proj_name_sort : CR610070 Te
    Expected: None
    Real:     None
    Comments: As expected; seems to search on the exact un-tokenized value

 3. Query:    proj_name_sort : CR610070 Te*
    Expected: CR610070 Test1
    Real:     CR610070 Test1, Test1008, CR610070 Another Test2
    Comments: Seems to split into tokens on whitespace?

 4. Query:    proj_name_sort : CR610070 An*
    Expected: CR610070 Another Test2
    Real:     CR610070 Another Test2
    Comments: As expected; seems to apply the wildcard to the un-tokenized value

 5. Query:    proj_name_sort : CR610070 Another Te*
    Expected: CR610070 Another Test2
    Real:     CR610070 Test1, Test1008, CR610070 Another Test2
    Comments: Seems to split into tokens on whitespace?

 6. Query:    proj_name_sort : CR610070 Another Test1*
    Expected: None
    Real:     CR610070 Test1, Test1008, CR610070 Another Test2
    Comments: Seems to split into tokens on whitespace?


 Please advise on how to search un-tokenized fields using partial
 criteria and wildcards.

 Thanks
 Vadim


 This message and the information contained herein is proprietary and 
 confidential and subject to the Amdocs policy statement, you may review at 
 http://www.amdocs.com/email_disclaimer.asp
