I have been trying to figure out exactly how the more_like_this query 
behaves. The docs say "Under the hood, more_like_this simply creates 
multiple should clauses in a bool query of interesting terms extracted from 
some provided text." But I found several results that I could not explain. 
The following example illustrates one of them:

I am using elasticsearch-1.4.0. I am creating an index like this (no 
mapping defined before):
curl -XPUT 'localhost:9200/twitter/tweet/1' -d '{"user" : "user1", "message" : "aaa"}'
curl -XPUT 'localhost:9200/twitter/tweet/2' -d '{"user" : "user1", "message" : "aaa bbb"}'
curl -XPUT 'localhost:9200/twitter/tweet/3' -d '{"user" : "user1", "message" : "bbb aaa"}'
curl -XPUT 'localhost:9200/twitter/tweet/4' -d '{"user" : "user2", "message" : "bbb"}'
curl -XPUT 'localhost:9200/twitter/tweet/5' -d '{"user" : "user2", "message" : "aaa bbb"}'
curl -XPUT 'localhost:9200/twitter/tweet/6' -d '{"user" : "user2", "message" : "bbb aaa"}'

Then I query it:
curl -XGET 'http://localhost:9200/twitter/tweet/_search?pretty=true&size=10' -d '{
    "query": {
        "more_like_this_field": {
            "message": {
                "like_text": "aaa bbb",
                "percent_terms_to_match": 1,
                "min_term_freq": 1,
                "max_query_terms": 3,
                "min_doc_freq": 1
            }
        }
    }
}'

The response is:
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 5,
    "max_score" : 14.4000225,
    "hits" : [ {
      "_index" : "twitter",
      "_type" : "tweet",
      "_id" : "4",
      "_score" : 14.4000225,
      "_source":{"user" : "user2", "message" : "bbb"}
    }, {
      "_index" : "twitter",
      "_type" : "tweet",
      "_id" : "2",
      "_score" : 12.729599,
      "_source":{"user" : "user1", "message" : "aaa bbb"}
    }, {
      "_index" : "twitter",
      "_type" : "tweet",
      "_id" : "5",
      "_score" : 12.72813,
      "_source":{"user" : "user2", "message" : "aaa bbb"}
    }, {
      "_index" : "twitter",
      "_type" : "tweet",
      "_id" : "3",
      "_score" : 12.728111,
      "_source":{"user" : "user1", "message" : "bbb aaa"}
    }, {
      "_index" : "twitter",
      "_type" : "tweet",
      "_id" : "6",
      "_score" : 12.5501995,
      "_source":{"user" : "user2", "message" : "bbb aaa"}
    } ]
  }
}

So document 1 ("aaa") is missing. I get the same result if I use "like_text": 
"bbb aaa" in the above query. However, if I use "like_text": "aaa" I get 
what I would expect: all documents except document 4 ("bbb") are returned.
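
For reference, the single-term variant mentioned above can be written roughly as in the sketch below. Only like_text changes; everything else is copied from the query above. ES_URL, MLT_BODY and run_variant are names made up for this example, and the sketch assumes the same local node and index as in the setup above.

```shell
# Variant of the more_like_this_field query above: only like_text differs
# ("aaa" instead of "aaa bbb"). ES_URL, MLT_BODY and run_variant are
# illustrative names, not part of Elasticsearch.
ES_URL='http://localhost:9200/twitter/tweet/_search?pretty=true&size=10'

MLT_BODY='{
    "query": {
        "more_like_this_field": {
            "message": {
                "like_text": "aaa",
                "percent_terms_to_match": 1,
                "min_term_freq": 1,
                "max_query_terms": 3,
                "min_doc_freq": 1
            }
        }
    }
}'

# Send the query; this needs the local node and the six documents indexed above.
run_variant() {
    curl -XGET "$ES_URL" -d "$MLT_BODY"
}
```

Calling run_variant against the index above is the case that behaves as I expect: every document is returned except the one whose message is only "bbb".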

What kind of should-query is generated by more_like_this in the above 
example? I would have expected:
curl -XGET 'http://localhost:9200/twitter/tweet/_search?pretty=true&size=10' -d '{
    "query": {
        "bool": {
            "should": [
                {
                    "match": {
                        "message": "aaa"
                    }
                },
                {
                    "match": {
                        "message": "bbb"
                    }
                }
            ],
            "minimum_should_match": 2
        }
    }
}'
but this, as expected, returns neither document 1 ("aaa") nor document 4 
("bbb"), because both should clauses have to match.


Why does the above more_like_this query return "bbb" but not "aaa"?
