Andrea: Works for me, admittedly through the browser....
I suspect the problem is here:
ClientUtils.escapeQueryChars

That doesn't do anything about escaping the spaces;
it just handles characters that have special
meaning to the query syntax, things like + and -, etc.

Using your field definition, this:
http://localhost:8983/solr/select?wt=json&q=ab\ cd\ ef&debug=query&defType=edismax&qf=name eoe
produced this output:

- parsedquery_toString: "+(eoe:abcdef | (name:ab name:cd name:ef))",

where the field "eoe" is your isbn_issn type.

Best,
Erick


On Mon, Aug 26, 2013 at 4:55 AM, Andrea Gazzarini <
andrea.gazzar...@gmail.com> wrote:

> Hi Erick,
> escaping spaces doesn't work...
>
> Briefly,
>
> - In a document I have an ISBN field whose stored value is
> 978-90-04-23560-1
> - In the index I have this value: 9789004235601
>
> Now, I want to be able to search the document by using:
>
> 1) q=978-90-04-23560-1
> 2) q=978 90 04 23560 1
> 3) q=9789004235601
>
> 1 and 3 work perfectly, 2 doesn't work.
>
> My code is:
>
> SolrQuery query = new SolrQuery(ClientUtils.escapeQueryChars(req.getParameter("q")));
>
> isbn is declared in this way
>
> <fieldtype name="isbn_issn" class="solr.TextField"
> positionIncrementGap="100">
> <analyzer>
> <tokenizer class="solr.KeywordTokenizerFactory"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="0" generateNumberParts="0" catenateWords="0"
> catenateNumbers="0" catenateAll="1" splitOnCaseChange="0"/>
> </analyzer>
> </fieldtype>
> <field name="isbn_issn_search" type="issn_isbn" indexed="true"/>
>
> search handler is:
>
> <requestHandler name="any_bc" class="solr.SearchHandler"
> default="true">
> <lst name="defaults">
> <str name="defType">dismax</str>
> <str name="mm">100%</str>
> <str name="qf">
> isbn_issn_search^10000
> </str>
> <str name="pf">
> isbn_issn_search^100000
> </str>
> <int name="ps">0</int>
> <float name="tie">0.1</float>
> ...
> </requestHandler>
>
> This is what I get:
>
> 1) 978-90-04-23560-1
> path=/select params={start=0&q=978\-90\-04\-23560\-1&sfield=&qt=any_bc&wt=javabin&rows=10&version=2} hits=1 status=0 QTime=5
>
> 2) 9789004235601
> webapp=/solr path=/select params={start=0&q=9789004235601&sfield=&qt=any_bc&wt=javabin&rows=10&version=2} hits=1 status=0 QTime=5
>
> 3) 978 90 04 23560 1
> path=/select params={start=0&q=978\+90\+04\+23560\+1&sfield=&qt=any_bc&wt=javabin&rows=10&version=2} hits=0 status=0 QTime=2
>
> Extract from debugQuery=true:
>
> <str name="q">978\ 90\ 04\ 23560\ 1</str>
> ...
> <str name="rawquerystring">978\ 90\ 04\ 23560\ 1</str>
> <str name="querystring">978\ 90\ 04\ 23560\ 1</str>
> ...
> <str name="parsedquery">
> +((DisjunctionMaxQuery((isbn_issn_search:978^10000.0)~0.1)
> DisjunctionMaxQuery((isbn_issn_search:90^10000.0)~0.1)
> DisjunctionMaxQuery((isbn_issn_search:04^10000.0)~0.1)
> DisjunctionMaxQuery((isbn_issn_search:23560^10000.0)~0.1)
> DisjunctionMaxQuery((isbn_issn_search:1^10000.0)~0.1))~5)
> DisjunctionMaxQuery((isbn_issn_search:9789004235601^100000.0)~0.1)
> </str>
>
> ------------------------------------------------
> Probably this is a very stupid question but I'm going crazy. In this page
>
> http://wiki.apache.org/solr/DisMaxQParserPlugin
>
> Query Structure
>
> For each "word" in the query string, dismax builds a DisjunctionMaxQuery
> object for that word across all of the fields in the qf param...
>
> And that seems to be exactly what it is doing...but what is a "word"? How can I
> force (without using double quotes) spaces to be
> considered part of the word?
>
> Many many many thanks
> Andrea
>
>
> On 08/13/2013 04:18 PM, Erick Erickson wrote:
>
>> I think you can get what you want by escaping the space with a
>> backslash....
>>
>> YMMV of course.
>> Erick
>>
>>
>> On Tue, Aug 13, 2013 at 9:11 AM, Andrea Gazzarini <
>> andrea.gazzar...@gmail.com> wrote:
>>
>> Hi Erick,
>>> sorry if that wasn't clear: this is what I'm actually observing in my
>>> application.
>>>
>>> I wrote the first post after looking at the explain (debugQuery=true):
>>> the query
>>>
>>> q=mag 778 G 69
>>>
>>> is translated as follows:
>>>
>>> +((DisjunctionMaxQuery((myfield:mag^3000.0)~0.1)
>>> DisjunctionMaxQuery((myfield:778^3000.0)~0.1)
>>> DisjunctionMaxQuery((myfield:g^3000.0)~0.1)
>>> DisjunctionMaxQuery((myfield:69^3000.0)~0.1))~4)
>>> DisjunctionMaxQuery((myfield:mag778g69^30000.0)~0.1)
>>>
>>> It seems that, although I declare myfield with this type
>>>
>>> <fieldtype name="type1" class="solr.TextField" >
>>> <analyzer>
>>> <tokenizer class="solr.KeywordTokenizerFactory" />
>>> <filter class="solr.LowerCaseFilterFactory" />
>>> <filter class="solr.WordDelimiterFilterFactory"
>>> generateWordParts="0" generateNumberParts="0"
>>> catenateWords="0" catenateNumbers="0" catenateAll="1"
>>> splitOnCaseChange="0" />
>>> </analyzer>
>>> </fieldtype>
>>>
>>> SOLR is tokenizing it anyway, producing several tokens
>>> (mag, 778, g, 69).
>>>
>>> And I can't put double quotes on the query (q="mag 778 G 69") because the
>>> request handler also searches in other fields (with different
>>> configuration chains).
>>>
>>> As I understand it, the query parser (i.e. at query time) does a whitespace
>>> tokenization on its own before invoking my (query-time) chain. The same
>>> doesn't happen at index time...this is my problem...because at index time
>>> the field is analyzed exactly as I want...but unfortunately I cannot say
>>> the same at query time.
>>>
>>> Sorry for my wonderful english, did you get the point?
>>>
>>>
>>> On 08/13/2013 02:18 PM, Erick Erickson wrote:
>>>
>>> On a quick scan I don't see a problem here. Attach
>>>> &debug=query to your url and that'll show you the
>>>> parsed query, which will in turn show you what's been
>>>> pushed through the analysis chain you've defined.
>>>>
>>>> You haven't stated whether you've tried this and it's
>>>> not working or you're looking for guidance as to how
>>>> to accomplish this so it's a little unclear how to
>>>> respond.
>>>>
>>>> BTW, the admin/analysis page is your friend here....
>>>>
>>>> Best
>>>> Erick
>>>>
>>>>
>>>> On Mon, Aug 12, 2013 at 12:52 PM, Andrea Gazzarini <
>>>> andrea.gazzar...@gmail.com> wrote:
>>>>
>>>> Clear, thanks for the response.
>>>>
>>>>> So, if I have two fields
>>>>>
>>>>> <fieldtype name="type1" class="solr.TextField" >
>>>>> <analyzer>
>>>>> <tokenizer class="solr.KeywordTokenizerFactory" />
>>>>> <filter class="solr.LowerCaseFilterFactory" />
>>>>> <filter class="solr.WordDelimiterFilterFactory"
>>>>> generateWordParts="0" generateNumberParts="0"
>>>>> catenateWords="0" catenateNumbers="0" catenateAll="1"
>>>>> splitOnCaseChange="0" />
>>>>> </analyzer>
>>>>> </fieldtype>
>>>>> <fieldtype name="type2" class="solr.TextField" >
>>>>> <analyzer>
>>>>> <charFilter class="solr.MappingCharFilterFactory"
>>>>> mapping="mapping-FoldToASCII.txt"/>
>>>>> <tokenizer class="solr.WhitespaceTokenizerFactory" />
>>>>> <filter class="solr.LowerCaseFilterFactory" />
>>>>> <filter class="solr.WordDelimiterFilterFactory" .../>
>>>>> </analyzer>
>>>>> </fieldtype>
>>>>>
>>>>> (with the first field type, "Mag. 78 D 99" becomes "mag78d99", while the
>>>>> second field type ends up with several tokens)
>>>>>
>>>>> And I want to use the same request handler to query against both of
>>>>> them.
>>>>> I mean I want the user to search for something like
>>>>>
>>>>> http//..../search?q=Mag 78 D 99
>>>>>
>>>>> and this search should look within both the first field (with type1) and
>>>>> the second (with type2), matching
>>>>>
>>>>> - a document which has field_with_type1 equal to mag78d99 or
>>>>> - a document which has field_with_type2 containing a text like "go
>>>>> to mag 78, class d and subclass 99"
>>>>>
>>>>>
>>>>> <requestHandler ....>
>>>>> ...
>>>>> <str name="defType">dismax</str>
>>>>> ...
>>>>> <str name="mm">100%</str>
>>>>> <str name="qf">
>>>>> field_with_type1
>>>>> field_with_type_2
>>>>> </str>
>>>>> ...
>>>>> </requestHandler>
>>>>>
>>>>> is not possible? If so, is it possible to do that in some other way?
>>>>>
>>>>> Sorry for the long email and thanks again
>>>>> Andrea
>>>>>
>>>>>
>>>>> On 08/12/2013 04:01 PM, Jack Krupansky wrote:
>>>>>
>>>>> Quoted phrases will be passed to the analyzer as one string, so
>>>>>> there a white space tokenizer is needed.
>>>>>>
>>>>>> -- Jack Krupansky
>>>>>>
>>>>>> -----Original Message----- From: Andrea Gazzarini
>>>>>> Sent: Monday, August 12, 2013 6:52 AM
>>>>>> To: solr-user@lucene.apache.org
>>>>>> Subject: Re: Tokenization at query time
>>>>>>
>>>>>> Hi Tanguy,
>>>>>> thanks for the fast response. What you are saying corresponds perfectly
>>>>>> with the behaviour I'm observing.
>>>>>> Now, other than having a big problem (I have several other fields both
>>>>>> in the pf and qf where spaces don't matter, field types like the
>>>>>> "text_en" field type in the example schema) what I'm wondering is:
>>>>>>
>>>>>> "The query parser splits the input query on white spaces, and then
>>>>>> each token is analysed according to your configuration"
>>>>>>
>>>>>> Is there a valid reason to declare a WhiteSpaceTokenizer in a query
>>>>>> analyzer? If the input query is already parsed (i.e. whitespace
>>>>>> tokenized) what is its effect?
>>>>>>
>>>>>> Thank you very much for the help
>>>>>> Andrea
>>>>>>
>>>>>> On 08/12/2013 12:37 PM, Tanguy Moal wrote:
>>>>>>
>>>>>> Hello Andrea,
>>>>>>
>>>>>>> I think you face a rather common issue involving keyword tokenization
>>>>>>> and query parsing in Lucene:
>>>>>>> The query parser splits the input query on white spaces, and then
>>>>>>> each token is analysed according to your configuration.
>>>>>>> So those queries with a whitespace won't behave as expected because
>>>>>>> each token is analysed separately. Consequently, the catenated version
>>>>>>> of the reference cannot be generated.
>>>>>>> I think you could try surrounding your query with double quotes or
>>>>>>> escaping the space characters in your query using a backslash so that
>>>>>>> the whole sequence is analysed in the same analyser and the catenation
>>>>>>> occurs.
>>>>>>> You should be aware that this approach has a drawback: you will
>>>>>>> probably not be able to combine the search for Mag. 778 G 69 with other
>>>>>>> words in other fields unless you are able to identify which spaces are
>>>>>>> to be escaped:
>>>>>>> For example, if the input query is:
>>>>>>> Awesome Mag. 778 G 69
>>>>>>> you would want to transform it to:
>>>>>>> Awesome Mag.\ 778\ G\ 69 // spaces are escaped in the reference only
>>>>>>> or
>>>>>>> Awesome "Mag. 778 G 69" // only the reference is turned into a phrase query
>>>>>>>
>>>>>>> Do you get the point?
>>>>>>>
>>>>>>> Look at the differences between what you tried and the following
>>>>>>> examples which should all do what you want:
>>>>>>> http://localhost:8983/solr/collection1/select?q=%22Mag.%20778%20G%2069%22&debugQuery=on&qf=text%20myfield&defType=dismax
>>>>>>> OR
>>>>>>> http://localhost:8983/solr/collection1/select?q=myfield:Mag.\%20778\%20G\%2069&debugQuery=on
>>>>>>> OR
>>>>>>> http://localhost:8983/solr/collection1/select?q=Mag.\%20778\%20G\%2069&debugQuery=on&qf=text%20myfield&defType=edismax
>>>>>>>
>>>>>>> I hope this helps
>>>>>>>
>>>>>>> Tanguy
>>>>>>>
>>>>>>> On Aug 12, 2013, at 11:13 AM, Andrea Gazzarini <
>>>>>>> andrea.gazzar...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I have a field (among others) in my schema defined like this:
>>>>>>>>
>>>>>>>> <fieldtype name="mytype" class="solr.TextField"
>>>>>>>> positionIncrementGap="100">
>>>>>>>> <analyzer>
>>>>>>>> <tokenizer class="solr.KeywordTokenizerFactory" />
>>>>>>>> <filter class="solr.LowerCaseFilterFactory" />
>>>>>>>> <filter class="solr.WordDelimiterFilterFactory"
>>>>>>>> generateWordParts="0"
>>>>>>>> generateNumberParts="0"
>>>>>>>> catenateWords="0"
>>>>>>>> catenateNumbers="0"
>>>>>>>> catenateAll="1"
>>>>>>>> splitOnCaseChange="0" />
>>>>>>>> </analyzer>
>>>>>>>> </fieldtype>
>>>>>>>>
>>>>>>>> <field name="myfield" type="mytype" indexed="true"/>
>>>>>>>>
>>>>>>>> Basically, both at index and query time the field value is
>>>>>>>> normalized like this.
>>>>>>>>
>>>>>>>> Mag. 778 G 69 => mag778g69
>>>>>>>>
>>>>>>>> Now, in my solrconfig I'm using a search handler like this:
>>>>>>>>
>>>>>>>> <requestHandler ....>
>>>>>>>> ...
>>>>>>>> <str name="defType">dismax</str>
>>>>>>>> ...
>>>>>>>> <str name="mm">100%</str>
>>>>>>>> <str name="qf">myfield^3000</str>
>>>>>>>> <str name="pf">myfield^30000</str>
>>>>>>>>
>>>>>>>> </requestHandler>
>>>>>>>>
>>>>>>>> What I'm expecting is that if I index a document with a value for my
>>>>>>>> field "Mag. 778 G 69", I will be able to get this document by
>>>>>>>> querying
>>>>>>>>
>>>>>>>> 1. Mag. 778 G 69
>>>>>>>> 2. mag 778 g69
>>>>>>>> 3. mag778g69
>>>>>>>>
>>>>>>>> But that doesn't work: I'm able to get the document if and only if
>>>>>>>> I use the "normalized" form: mag778g69
>>>>>>>>
>>>>>>>> After doing a little bit of debugging, I see that, even though I used a
>>>>>>>> KeywordTokenizer in my field type declaration, SOLR is doing
>>>>>>>> something like this:
>>>>>>>>
>>>>>>>> +((DisjunctionMaxQuery((myfield:mag^3000.0)~0.1)
>>>>>>>> DisjunctionMaxQuery((myfield:778^3000.0)~0.1)
>>>>>>>> DisjunctionMaxQuery((myfield:g^3000.0)~0.1)
>>>>>>>> DisjunctionMaxQuery((myfield:69^3000.0)~0.1))~4)
>>>>>>>> DisjunctionMaxQuery((myfield:mag778g69^30000.0)~0.1)
>>>>>>>>
>>>>>>>>
>>>>>>>> That is, it is tokenizing the original query string (mag + 778 + g +
>>>>>>>> 69) and obviously querying the field for separate tokens doesn't
>>>>>>>> match anything (at least this is what I think)
>>>>>>>>
>>>>>>>> Could anybody please explain that to me?
>>>>>>>>
>>>>>>>> Thanks in advance
>>>>>>>> Andrea
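
For reference, the backslash-escaping of spaces suggested in this thread can be sketched in plain Java as below. This is only an illustration under stated assumptions: WhitespaceEscaper and escapeSpaces are hypothetical names, not part of SolrJ, and the sketch handles whitespace only (other query-syntax characters would still need their own escaping). Whether the escaped space is then kept as part of a single term depends on the query parser: in the debug output quoted above, the dismax handler still split "978\ 90\ 04\ 23560\ 1" into separate clauses, while the edismax example produced the single term eoe:abcdef.

```java
public class WhitespaceEscaper {

    // Hypothetical helper (not a SolrJ API): prefix every whitespace
    // character with a backslash so that a parser which honours the
    // escape hands the whole value to the field's analyzer as one term.
    public static String escapeSpaces(String userInput) {
        StringBuilder escaped = new StringBuilder();
        for (int i = 0; i < userInput.length(); i++) {
            char c = userInput.charAt(i);
            if (Character.isWhitespace(c)) {
                escaped.append('\\');
            }
            escaped.append(c);
        }
        return escaped.toString();
    }

    public static void main(String[] args) {
        // The two references discussed in the thread.
        System.out.println(escapeSpaces("978 90 04 23560 1"));
        System.out.println(escapeSpaces("Mag. 778 G 69"));
    }
}
```

The escaped string would then be sent as the q parameter; adding debugQuery=on to the request, as suggested earlier in the thread, shows whether the parser in use kept it as one term.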