Jack,
I don't think SOLR-3261 describes this issue.
I ran the same experiment with Solr 3.6, and the score for all the matches was 0.1626374.
The newly released Solr 4.0.0 also returns a suboptimal score of 0.14764866.

Kuro

On 10/12/12 2:03 PM, Jack Krupansky wrote:
I don't have a Solr 3.5 to check, but SOLR-3261, which was fixed in Solr 3.6, may be your culprit.

See:
https://issues.apache.org/jira/browse/SOLR-3261

So, try Solr 3.6, 3.6.1, or 4.0 to see if your issue goes away.

-- Jack Krupansky

-----Original Message----- From: T. Kuro Kurosaka
Sent: Friday, October 12, 2012 3:15 PM
To: solr-user@lucene.apache.org
Subject: Re: Any filter to map multiple tokens into one?

Jack,
It goes like this:

http://myhost:8983/solr/select?indent=on&version=2.2&q=*%3A*&fq=&start=0&rows=10&fl=*%2Cscore&qt=&wt=&debugQuery=on

and edismax is the default query parser in solrconfig.xml.

There is a field named text_jpn that uses a tokenizer we developed
as a product, which we can't share here.

But I can simulate our situation using NGramTokenizer.
After indexing the Solr sample docs normally, stop Solr and insert this field type into schema.xml:

<fieldtype name="text_fake" class="solr.TextField"
positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.NGramTokenizerFactory"
           maxGramSize="1"
           minGramSize="1" />
</analyzer>
</fieldtype>

Replace the field definition for "name", for example:
<field name="name" type="text_fake" indexed="true" stored="true"/>

In solrconfig.xml, change the default search handler's definition like this:
<str name="defType">edismax</str>
<str name="pf">name^0.5</str>
(I guess I could just have these in the URL.)
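For instance (untested here, just to illustrate the standard defType and pf request parameters), something like this could be used instead of editing solrconfig.xml:

http://localhost:8983/solr/select?q=*%3A*&defType=edismax&pf=name%5E0.5&fl=*%2Cscore&debugQuery=on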

Start Solr and give this URL:

http://localhost:8983/solr/select?indent=on&version=2.2&q=*%3A*&fq=&start=0&rows=10&fl=*%2Cscore&qt=&wt=&debugQuery=on&explainOther=&hl.fl=

Hopefully you'll see
<floatname="score">0.3663672</float>
and
+MatchAllDocsQuery(*:*) DisjunctionMaxQuery((name:"* : *"^0.5))

in the debug output.

The score calculation shouldn't be done when the query is *:*, which has
a special meaning, should it?
And even if the score calculation is done, "*:*" shouldn't be fed to
tokenizers, should it?

On 10/12/12 9:44 AM, Jack Krupansky wrote:
Okay, let's back up. First, hold off mixing in your proposed solution
until after we understand the actual, original problem:

1. What is your field and field type (with analyzer details)?
2. What is your query parser (defType)?
3. What is your query request URL?
4. What is the parsed query (add &debugQuery=true to your query
request)? (Actually, I think you gave us that.)

I just tried the following query with the fresh 4.0 release and it
works fine:

http://localhost:8983/solr/collection1/select?q=*:*&wt=xml&debugQuery=true&defType=edismax


<str name="rawquerystring">*:*</str>

The parsed query is:

<str name="parsedquery">(+MatchAllDocsQuery(*:*))/no_coord</str>

And this was with the 4.0 example schema, adding *.xml and books.json
documents.

If you could try your scenario with 4.0, that would help. If it's
a bug in 3.5 that is fixed now... oh well. I mean, feel free to check
the revision history for edismax since the 3.5 release.

-- Jack Krupansky

-----Original Message----- From: T. Kuro Kurosaka
Sent: Friday, October 12, 2012 11:54 AM
To: solr-user@lucene.apache.org
Subject: Re: Any filter to map multiple tokens into one?

On 10/11/12 4:47 PM, Jack Krupansky wrote:
The ":" which normally separates a field name from a term (or quoted
string or parenthesized sub-query) is "parsed" by the query parser
before analysis gets called, and "*:*" is recognized before analysis
as well. So, any attempt to recreate "*:*" in analysis will be too
late to affect query parsing and other pre-analysis processing.
That's why I suspect a bug in Solr. The tokenizer shouldn't play any role
here, but it is affecting the score calculation. I am seeing evidence
that "*:*" is being passed to my tokenizer.
I'm trying to find a way to work around this by reconstructing "*:*" in
the analysis chain.

But, what is it you are really trying to do? What's the real problem?
(This sounds like a proverbial "XY Problem".)

-- Jack Krupansky

-----Original Message----- From: T. Kuro Kurosaka
Sent: Thursday, October 11, 2012 7:35 PM
To: solr-user@lucene.apache.org
Subject: Any filter to map multiple tokens into one?

I am looking for a way to fold a particular sequence of tokens into one
token.
Concretely, I'd like to detect a three-token sequence of "*", ":" and
"*", and replace it with a token of the text "*:*".
I tried SynonymFilter, but it seems it can only deal with a single input
token. "* : * => *:*" seems to be interpreted
as one input token of 5 characters: "*", space, ":", space, and "*".

I'm using Solr 3.5.

Background:
My tokenizer separates the three-character sequence "*:*" into 3 tokens
of one character each.
The edismax parser, when given the query "*:*" (i.e. match every doc),
seems to pass the entire string "*:*" to the query analyzer (I suspect
a bug), and feed the tokenized result to a DisjunctionMaxQuery object,
according to this debug output:

<lst name="debug">
<str name="rawquerystring">*:*</str>
<str name="querystring">*:*</str>
<str name="parsedquery">+MatchAllDocsQuery(*:*)
DisjunctionMaxQuery((body:"* : *"~100^0.5 | title:"* :
*"~100^1.2)~0.01)</str>
<str name="parsedquery_toString">+*:* (body:"* : *"~100^0.5 | title:"* :
*"~100^1.2)~0.01</str>

Notice that there is a space between * and : in
DisjunctionMaxQuery((body:"* : *" ...).

Probably because of this, the hit score is as low as 0.109, whereas it is
1.000 if an analyzer that doesn't break up "*:*" is used.
So I'd like to stitch together "*", ":", "*" into "*:*" again to make
DisjunctionMaxQuery happy.
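For illustration only, a minimal sketch (untested; the class name is hypothetical) of a custom Lucene TokenFilter that collapses the three-token sequence "*", ":", "*" back into a single "*:*" token might look like the code below. It ignores offsets and position increments for brevity, and it would still need a TokenFilterFactory wrapper to be usable from schema.xml:

import java.io.IOException;
import java.util.LinkedList;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical filter: merges the token sequence "*", ":", "*" into "*:*".
// Simplified sketch; offsets and position increments are not adjusted.
public final class MatchAllJoinFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  // Token texts consumed while looking ahead that still need to be emitted.
  private final LinkedList<String> pending = new LinkedList<String>();

  public MatchAllJoinFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    // First flush anything buffered during an earlier look-ahead.
    if (!pending.isEmpty()) {
      termAtt.setEmpty().append(pending.removeFirst());
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    if (!"*".contentEquals(termAtt)) {
      return true;                       // not the start of "* : *"; pass through
    }
    String first = termAtt.toString();   // saw "*"; look ahead for ":" then "*"
    if (!input.incrementToken()) {
      termAtt.setEmpty().append(first);
      return true;
    }
    String second = termAtt.toString();
    if (!":".equals(second) || !input.incrementToken()) {
      pending.add(second);               // no match; emit "*" now, buffer the rest
      termAtt.setEmpty().append(first);
      return true;
    }
    String third = termAtt.toString();
    if (!"*".equals(third)) {
      pending.add(second);
      pending.add(third);
      termAtt.setEmpty().append(first);
      return true;
    }
    termAtt.setEmpty().append("*:*");    // matched "*", ":", "*"; emit one token
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pending.clear();
  }
}

In the query analyzer for the field, such a filter would sit right after the tokenizer. Whether it actually helps is another matter, since (as noted earlier in the thread) the query parser has already decided how to build the query before analysis runs.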


Thanks.


T. "Kuro" Kurosaka



