Hi Zoran,

If you are looking for exact duplicates, then hashing the file content and searching for that hash would do the job. If you are looking for near duplicates, then I would recommend extracting whatever text you have in your HTML, PDF, or DOC files, indexing that, and running more_like_this with like_text set to that content. Additionally, you can perform an MLT search on more fields, including the metadata fields extracted with the attachment plugin.

Hope this helps.
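For example, something along these lines should work (an untested sketch against your documents/document index; the contentHash field is just for illustration, you would have to add it to your mapping as a not_analyzed string and compute the hash yourself, e.g. a SHA-1 over the raw file bytes, at index time):

    # Exact duplicates: look up the pre-computed hash with a term query.
    # ("contentHash" is a hypothetical field, not part of your current mapping.)
    curl -XGET 'http://localhost:9200/documents/document/_search' -d '{
      "query": {
        "term": { "contentHash": "<sha1-of-the-raw-file-bytes>" }
      }
    }'

    # Near duplicates: extract the plain text from the html/pdf/doc yourself
    # and pass it to more_like_this as like_text.
    curl -XGET 'http://localhost:9200/documents/document/_search' -d '{
      "query": {
        "more_like_this": {
          "fields": ["file"],
          "like_text": "<plain text extracted from the attachment>",
          "min_term_freq": 1,
          "min_doc_freq": 1,
          "percent_terms_to_match": 0
        }
      }
    }'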
Alex

On Monday, May 5, 2014 8:08:30 PM UTC+2, Zoran Jeremic wrote:
>
> Hi Alex,
>
> Thank you for your explanation. It makes sense now. However, I'm not sure I understood your proposal.
>
> "So I would adjust the mlt_fields accordingly, and possibly extract the relevant portions of text manually"
>
> What do you mean by adjusting mlt_fields? The only shared field that is guaranteed to be the same is file. Different users could add different titles to their documents, but attach the same or almost the same files. If I compare documents based on the other fields, it doesn't mean they will match, even though the attached files are exactly the same.
>
> I'm also not sure what you meant by extracting the relevant portions of text manually. How would I do that, and what would I do with it?
>
> Thanks,
> Zoran
>
> On Monday, 5 May 2014 01:23:49 UTC-7, Alex Ksikes wrote:
>>
>> Hi Zoran,
>>
>> Using the attachment type, you can text search over the attached document meta-data, but not its actual content, as it is base64 encoded. So I would adjust the mlt_fields accordingly, and possibly extract the relevant portions of text manually. Also set percent_terms_to_match = 0 to ensure that all boolean clauses match. Let me know how this works out for you.
>>
>> Cheers,
>>
>> Alex
>>
>> On Monday, May 5, 2014 5:50:07 AM UTC+2, Zoran Jeremic wrote:
>>>
>>> Hi guys,
>>>
>>> I have a document that stores the content of an HTML, PDF, DOC or other textual file in one of its fields as a byte array, using the attachment plugin. The mapping is as follows:
>>>
>>> {
>>>   "document": {
>>>     "properties": {
>>>       "title": { "type": "string", "store": true },
>>>       "description": { "type": "string", "store": "yes" },
>>>       "contentType": { "type": "string", "store": "yes" },
>>>       "url": { "store": "yes", "type": "string" },
>>>       "visibility": { "store": "yes", "type": "string" },
>>>       "ownerId": { "type": "long", "store": "yes" },
>>>       "relatedToType": { "type": "string", "store": "yes" },
>>>       "relatedToId": { "type": "long", "store": "yes" },
>>>       "file": {
>>>         "path": "full",
>>>         "type": "attachment",
>>>         "fields": {
>>>           "author": { "type": "string" },
>>>           "title": { "store": true, "type": "string" },
>>>           "keywords": { "type": "string" },
>>>           "file": { "store": true, "term_vector": "with_positions_offsets", "type": "string" },
>>>           "name": { "type": "string" },
>>>           "content_length": { "type": "integer" },
>>>           "date": { "format": "dateOptionalTime", "type": "date" },
>>>           "content_type": { "type": "string" }
>>>         }
>>>       }
>>>     }
>>>   }
>>> }
>>>
>>> And the code I'm using to store the document is:
>>>
>>> VisibilityType.PUBLIC
>>>
>>> These files seem to be stored fine and I can search their content. However, I need to identify whether there are duplicates of web pages or files stored in ES, so that I don't return the same documents to the user as a search or recommendation result. My expectation was that I could use MoreLikeThis after the document was indexed to identify whether there are duplicates of that document and accordingly mark it as a duplicate. However, the results look weird to me, or I don't understand very well how MoreLikeThis works.
>>>
>>> For example, I indexed the web page http://en.wikipedia.org/wiki/Linguistics 3 times, and all 3 documents in ES have exactly the same binary content under file.
>>> Then for the following query:
>>>
>>> http://localhost:9200/documents/document/WpkcK-ZjSMi_l6iRq0Vuhg/_mlt?mlt_fields=file&min_doc_freq=1
>>>
>>> where the ID is the id of one of these documents, I got these results:
>>>
>>> http://en.wikipedia.org/wiki/Linguistics with score 0.6633003
>>> http://en.wikipedia.org/wiki/Linguistics with score 0.6197818
>>> http://en.wikipedia.org/wiki/Computational_linguistics with score 0.48509508
>>> ...
>>>
>>> For some other examples, the scores for the same documents are much lower, and sometimes (though not that often) I don't get the duplicates in the first positions. I would expect a score of 1.0 or higher here for documents that are exactly the same, but that's not the case, and I can't figure out how I could identify whether there are duplicates in the Elasticsearch index.
>>>
>>> I would appreciate it if somebody could explain whether this is expected behaviour or whether I'm not using it properly.
>>>
>>> Thanks,
>>> Zoran