Re: MoreLikeThis can't identify that 2 documents with exactly same attachments are duplicates

Zoran Jeremic Mon, 05 May 2014 11:09:12 -0700

Hi Alex,

Thank you for your explanation. It makes sense now. However, I'm not sure I 
understood your proposal.


> So I would adjust the mlt_fields accordingly, and possibly extract the 
> relevant portions of texts manually

What do you mean by adjusting mlt_fields? The only shared field that is 
guaranteed to be the same is file. Different users could give documents 
different titles but attach the same, or almost the same, files. If I compare 
documents on the other fields, they will not necessarily match, 
even though the attached files are exactly the same.
I'm also not sure what you meant by extracting the relevant portions of 
text manually. How would I do that, and what would I do with the result?

Thanks,
Zoran
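
One possible reading of "extract the relevant portions of text manually", sketched below in Python for the HTML case only: decode the base64 attachment, strip the markup, and index the resulting plain text in an ordinary string field that mlt_fields can then point at. The helper name extract_text and the idea of a separate plain-text field are assumptions made for illustration, not something stated in the thread. PDF or DOC attachments would need a real extractor (for example Tika, which the attachment plugin itself relies on).

```python
import base64
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects visible text, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(file_b64: str, encoding: str = "utf-8") -> str:
    """Hypothetical helper: base64-encoded HTML in, plain text out."""
    html = base64.b64decode(file_b64).decode(encoding, errors="replace")
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.chunks)
```

The returned string could be stored alongside the attachment at index time, so that similarity is computed on readable text rather than on the encoded bytes.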
 

On Monday, 5 May 2014 01:23:49 UTC-7, Alex Ksikes wrote:
>
> Hi Zoran,
>
> Using the attachment type, you can text search over the attached document 
> meta-data, but not its actual content, as it is base 64 encoded. So I would 
> adjust the mlt_fields accordingly, and possibly extract the relevant 
> portions of texts manually. Also set percent_terms_to_match = 0, to ensure 
> that all boolean clauses match. Let me know how this works out for you.
>
> Cheers,
>
> Alex
>
> On Monday, May 5, 2014 5:50:07 AM UTC+2, Zoran Jeremic wrote:
>>
>> Hi guys,
>>
>> I have a document that stores the content of an HTML file, PDF, DOC, or other 
>> textual document in one of its fields as a byte array, using the attachment 
>> plugin. The mapping is as follows:
>>
>> { "document": {
>>     "properties": {
>>       "title": { "type": "string", "store": true },
>>       "description": { "type": "string", "store": "yes" },
>>       "contentType": { "type": "string", "store": "yes" },
>>       "url": { "type": "string", "store": "yes" },
>>       "visibility": { "type": "string", "store": "yes" },
>>       "ownerId": { "type": "long", "store": "yes" },
>>       "relatedToType": { "type": "string", "store": "yes" },
>>       "relatedToId": { "type": "long", "store": "yes" },
>>       "file": {
>>         "path": "full", "type": "attachment",
>>         "fields": {
>>           "author": { "type": "string" },
>>           "title": { "type": "string", "store": true },
>>           "keywords": { "type": "string" },
>>           "file": { "type": "string", "store": true, "term_vector": "with_positions_offsets" },
>>           "name": { "type": "string" },
>>           "content_length": { "type": "integer" },
>>           "date": { "type": "date", "format": "dateOptionalTime" },
>>           "content_type": { "type": "string" }
>>         }
>>       }
>>     }
>> } }
>>
>> And the code I'm using to store the document is:
>>
>> VisibilityType.PUBLIC
>>
>> These files seem to be stored fine and I can search the content. However, I 
>> need to identify whether there are duplicates of web pages or files stored in 
>> ES, so that I don't return the same documents to the user as search or 
>> recommendation results. My expectation was that I could use MoreLikeThis 
>> after a document was indexed to find duplicates of that 
>> document and mark it as a duplicate accordingly. However, the results look 
>> weird to me, or I don't understand very well how MoreLikeThis works.
>>
>> For example, I indexed the web page http://en.wikipedia.org/wiki/Linguistics 
>> 3 times, and all 3 documents in ES have exactly the same binary content 
>> under file. Then for the following query:
>>
>> http://localhost:9200/documents/document/WpkcK-ZjSMi_l6iRq0Vuhg/_mlt?mlt_fields=file&min_doc_freq=1
>> where the ID is the id of one of these documents, I got these results:
>> http://en.wikipedia.org/wiki/Linguistics with score 0.6633003
>> http://en.wikipedia.org/wiki/Linguistics with score 0.6197818
>> http://en.wikipedia.org/wiki/Computational_linguistics with score 0.48509508
>> ...
>>
>> For some other examples, scores for identical documents are much lower, 
>> and sometimes (though not that often) the duplicates are not in the first 
>> positions. I would expect a score of 1.0 or higher for documents 
>> that are exactly the same, but that's not the case, and I can't figure out 
>> how I could identify duplicates in the Elasticsearch index.
>>
>> I would appreciate it if somebody could explain whether this is expected 
>> behaviour or whether I didn't use it properly.
>>
>> Thanks,
>> Zoran
>>
>>
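
For attachments that are byte-for-byte identical, like the three copies of the Linguistics page described above, one option that does not depend on relevance scores is to compare a hash of the decoded file content. This is not a technique proposed in the thread; it is a minimal sketch, and the function names content_fingerprint and group_duplicates, as well as the input shape (a dict of document id to base64 string), are hypothetical. It only catches exact copies; "almost the same" documents would still need a similarity query. Lucene relevance scores are not normalized, so an exact copy is not guaranteed to score 1.0.

```python
import base64
import hashlib
from collections import defaultdict


def content_fingerprint(file_b64: str) -> str:
    """SHA-256 hex digest of the decoded attachment bytes."""
    raw = base64.b64decode(file_b64)
    return hashlib.sha256(raw).hexdigest()


def group_duplicates(docs: dict) -> list:
    """Return lists of document ids whose attachments are byte-identical.

    `docs` maps a document id to its base64-encoded attachment
    (an assumed input shape, for illustration only).
    """
    groups = defaultdict(list)
    for doc_id, file_b64 in docs.items():
        groups[content_fingerprint(file_b64)].append(doc_id)
    # Only fingerprints shared by two or more documents are duplicates.
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]
```

If the fingerprint were stored as its own non-analyzed field at index time, an exact-match lookup on that field before indexing a new document could flag the duplicate directly.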
