has_parent + compound query scoring problem

2015-02-17 Thread Csaba Dezsényi

Hi All,

I found a weird scoring behavior in Elasticsearch when using a has_parent 
query together with a normal query.

I have two document types in the index:

   - *document*: Normal document
   - *review*: Reviews that are child elements of the documents
   

I would like to create a compound query that searches both types. The 
query should basically:

   - Search in properties of document (title, author, abstract, ...)
   - Search in properties of review (reviewer, review text, ...)
   - Also search in the parent properties of a review, i.e. the properties 
   of the parent document joined by the has_parent query (title, abstract, 
   ...)
   - Have multiple query parts connected by bool or dis_max, e.g. a basic 
   query_string query plus an additional proximity boost part (phrase match 
   + slop) or a fuzzy part

I would like to have a harmonized scoring behavior, i.e. if I search for a 
term in a document title, I would like to receive the matching document and 
all of its reviews with the same score values.
This seemed to work when I started, but after a while it became weird, and 
I got many inexplicable scores.
Unfortunately, the nice explain="true" feature cannot be used for the 
has_parent query ("not implemented..."), so my options for debugging the 
problem are limited.

I've created a really small curl-based example on gist:
https://gist.github.com/dezsenyi/6a73e953ea6c78bf7774

The last two queries represent the main problem:

   - *Test query 1*: Two query_string queries combined with dis_max; one is 
   for documents, the other is for reviews and is therefore wrapped in 
   has_parent. It works fine: the document and the review have the same 
   score (0.375).
   - *Test query 2*: The very same query as above applied twice, combined 
   with a bool query. I expected the scores to be equal again, but the 
   resulting scores are different for the document and for the review (???).

After checking the scores, it seems to me that the problem is related to the 
*query_norm* value, which may be different in the has_parent parts.
For "Test query 1" the resulting score is 0.375 for both (document, review).
For "Test query 2" the matched review (by has_parent) got exactly 2 x 0.375 
= 0.75, while the document score is lower, which, I guess, comes mainly from 
a smaller query_norm value.
However, I could not confirm this, since I cannot see the explain output for 
the has_parent parts...
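
To make the query_norm hypothesis above concrete, here is a minimal Python 
sketch of the arithmetic. It assumes Lucene's classic TF-IDF similarity, 
where queryNorm = 1 / sqrt(sumOfSquaredWeights). The clause weight of 1.0 and 
the helper name query_norm are made up for illustration, and coord and all 
other factors are ignored. It does not reproduce what has_parent does 
internally, and the "shared norm" score it prints is illustrative, not a 
measured value.

```python
import math

def query_norm(clause_weights):
    """Classic Lucene query norm: 1 / sqrt(sum of squared clause weights)."""
    return 1.0 / math.sqrt(sum(w * w for w in clause_weights))

# Hypothetical weight of a single clause; only the ratios below matter.
w = 1.0

norm_once = query_norm([w])       # the clause appears once
norm_twice = query_norm([w, w])   # the same clause appears twice in a bool

single_score = 0.375  # score reported for "Test query 1"

# Case A (assumption): both copies share the smaller norm of the enclosing
# bool query, so the sum grows by sqrt(2) instead of 2.
shared_norm_score = 2 * single_score * (norm_twice / norm_once)

# Case B (assumption): each copy keeps the norm it had on its own, so the
# sum is exactly twice the single score, as observed for the review.
separate_norm_score = 2 * single_score

print(round(shared_norm_score, 4))
print(separate_norm_score)
```

If the has_parent parts were normalized on their own (case B) while the 
plain document parts shared the norm of the outer bool query (case A), the 
review would end up with the higher score, which matches the direction of 
the discrepancy described above.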

Can anybody help me?
Thank you in advance!

Regards,
Csaba D.

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/b9da5514-5c7f-4314-8f40-f3dad8764f6a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Impossible to implement real custom boost query when the weight is in the child document?

2014-06-10 Thread Csaba Dezsényi
Thanks Ivan for the tip, but I think the boost_mode is just fine in my 
queries. The problem is that I can only access the field of the child 
document if I add a bool clause with the has_child query inside. That is 
what causes the sum. The custom score is multiplied with the has_child 
query score, which is correct.

I also think that this is a bug.

Thanks,
Csaba

On Friday, June 6, 2014 at 18:52:39 UTC+2, Ivan Brusic wrote:
>
> Did you change the boost_mode of your function score script? The default 
> should be "multiply", which is the behavior you want, not "sum", which is 
> what you are experiencing.
>
>
> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html
>
> I have never used it with nested documents, so perhaps it is a bug (or a 
> feature :) )
>
> -- 
> Ivan
>
>
> On Fri, Jun 6, 2014 at 3:55 AM, Csaba Dezsényi wrote:
>
>> I could find only one related post:
>>
>> https://groups.google.com/forum/#!msg/elasticsearch/EGCeJZbhVtA/i32ROGVmFswJ
>> But that post asks a different question...
>>
>
>



Re: Impossible to implement real custom boost query when the weight is in the child document?

2014-06-06 Thread Csaba Dezsényi
I could find only one related post:
https://groups.google.com/forum/#!msg/elasticsearch/EGCeJZbhVtA/i32ROGVmFswJ
But that post asks a different question...



Impossible to implement real custom boost query when the weight is in the child document?

2014-06-06 Thread Csaba Dezsényi
Hello Everyone,

I would like to implement a popularity-based boost in my elasticsearch 
engine. I calculate custom popularity boost factors for documents 
periodically, but I store these float numbers in a child document, because 
I want to avoid the full reindex of the main article documents.

The mapping of the child document is the following:

{
  "document_boost": {
    "_parent": {
      "type": "document"
    },
    "popular_boost_total": {
      "type": "float"
    },
    "popular_boost_recent": {
      "type": "float"
    },
    "last_updated": {
      "type": "date"
    }
  }
}

I would like to create a query that:

   - executes the main query provided by the end users
   - attaches the child document (1-1 relation to the parent)
   - boosts the score of the main query by multiplying it with the custom 
   boost factors that are read from the child document (popular_boost_total, 
   popular_boost_recent)
   
I have been struggling with this for a while and could not find a really 
nice solution. The best solution that I could find is the following 
(simplified):

GET index/document/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "title": "basketball"
          }
        }
      ],
      "should": [
        {
          "has_child": {
            "type": "document_boost",
            "query": {
              "function_score": {
                "script_score": {
                  "script": "doc['document_boost.popular_boost_total'].value"
                }
              }
            }
          }
        }
      ]
    }
  }
}

However, this is not a real boost, because the second bool part contributes 
an additional score, not a multiplication of the primary query score! In 
this case, the amount of boost cannot be expressed as a clean percentage; it 
is a noisy additive score, and the real boosting factor depends on the 
absolute score value of the particular query. So I think it is wrong.
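
A minimal Python sketch of the difference between the two behaviors follows. 
The boost value, the base scores, and the helper names are hypothetical and 
chosen only for illustration; coord and query normalization are ignored. It 
shows how the effective boost factor of an added bool clause shrinks as the 
main query score grows, while a multiplied boost stays constant.

```python
def additive_factor(base_score, boost):
    """Effective multiplier when the boost is added as a second bool clause."""
    return (base_score + boost) / base_score

def multiplicative_factor(base_score, boost):
    """Effective multiplier when the boost multiplies the base score."""
    return (base_score * boost) / base_score

boost = 1.5  # hypothetical popularity value stored in the child document

# Hypothetical main-query scores for three different searches.
for base_score in (0.5, 2.0, 8.0):
    print(
        base_score,
        additive_factor(base_score, boost),
        multiplicative_factor(base_score, boost),
    )
```

With these numbers the additive variant amplifies a low-scoring match far 
more than a high-scoring one, whereas the multiplicative variant applies the 
same factor to every match.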
I would be able to solve it if the custom boost factors were stored not in 
child documents but in fields of the parent document:

GET index/document/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "title": "basketball"
        }
      },
      "script_score": {
        "script": "doc['popular_boost_recent'].value"
      }
    }
  }
}

Obviously, in the above case we do not need the has_child query.
I also tried it without the bool query:

GET index/document/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "title": "basketball"
        }
      },
      "functions": [
        {
          "filter": {
            "has_child": {
              "type": "document_boost",
              "query": {"match_all": {}}
            }
          },
          "script_score": {
            "script": "doc['document_boost.popular_boost_recent'].value"
          }
        }
      ]
    }
  }
}

In the above case, the script reads the value from the parent document, not 
from the child! This seems to be a bug, since I explicitly specify the fully 
qualified field name.

Considering the possibilities of the query API syntax, I think the last 
query above would be the solution for real multiplicative boosting, but it 
simply does not work.
Another solution would be the ability to define the score mode for the bool 
query, i.e. to tell Elasticsearch not to add but to multiply the scores of 
the parts.

Are there others facing the same issue? I think popularity-based and other 
kinds of custom boosts are a common requirement nowadays.
Can somebody give me a hint? I hope I have just misunderstood something...

Thanks!

Regards,
Csaba







Re: maxDocs different between primary and replica shards

2014-05-08 Thread Csaba Dezsényi
I have exactly the same issue!
Does someone have a solution for this?

Thanks,
Csaba

On Thursday, November 28, 2013 at 14:26:51 UTC+1, Klaus Brunner wrote:
>
> We're running Elasticsearch (currently 0.90.6) in what I'd call a 
> "replicated" architecture: our indexes are quite small (tens of thousands 
> of documents) and fit easily on a single machine, so we allocate a single 
> shard per index. However, we make sure that they are replicated to each 
> node of our cluster. The whole approach ensures that each application 
> server has its own "local" ES with all data of an index and can keep 
> working autonomously if others fail. This works alright so far.
>
> Now, we're seeing small but visible score discrepancies between ES nodes, 
> specifically between the primary shard and the replicas. Using explain, we 
> found out that the difference is in the maxDocs value. As known and 
> documented, deleted documents may still contribute to the maxDocs value 
> (and thus, affect TF-IDF scores). That's not a problem per se. 
>
> The problem is rather that maxDocs is different between the primary and 
> the replica shards (until we restart ES or force a merge using the optimize 
> call). Depending on whether the primary or a replica is hit with the exact 
> same query, we get different scores because the maxDocs value is different 
> by exactly the number of documents that have been deleted previously.
>
> Is there any way to ensure that maxDocs is the same on primary and replica 
> shards, short of forcing a costly merge?
>
> (Using DFS queries or not makes no difference, as I would expect from my 
> understanding of them - the index isn't really distributed, it's 
> replicated.)
>
> Thanks
>
> Klaus
>
>
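
The effect described in the quoted message can be sketched in a few lines of 
Python. This assumes Lucene's classic TF-IDF similarity, where 
idf = 1 + ln(maxDocs / (docFreq + 1)); all numbers and the helper name 
classic_idf are hypothetical. Which shard copy carries more unmerged deletes 
will vary in practice.

```python
import math

def classic_idf(doc_freq, max_docs):
    """Classic Lucene idf: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))

doc_freq = 100             # hypothetical number of documents with the term
max_docs_primary = 20000   # hypothetical: deletes already merged away
max_docs_replica = 20500   # hypothetical: 500 deleted docs still counted

idf_primary = classic_idf(doc_freq, max_docs_primary)
idf_replica = classic_idf(doc_freq, max_docs_replica)

# idf enters the classic term score squared, so a small gap in maxDocs
# produces a small but visible gap in the final score.
print(idf_primary ** 2, idf_replica ** 2)
```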
