Re: Need help on similarity ranking approach
Also, this plugin could provide a solution to your problem: http://yannbrrd.github.io/

On Thursday, May 29, 2014 10:42:47 AM UTC+2, Rgs wrote:
> hi,
>
> What I did now is: I have created a custom similarity & similarity provider
> class which extend DefaultSimilarity and AbstractSimilarityProvider
> respectively, and overridden the idf() method to return 1.
>
> Now I'm getting some percentage values like 1, 0.987, 0.876 etc. and
> interpret them as 100%, 98%, 87% etc.
>
> Can you please confirm whether this approach can be taken for finding the
> percentage of similarity?
>
> sorry for the late reply.
>
> Thanks
> Rgs
>
> --
> View this message in context:
> http://elasticsearch-users.115913.n3.nabble.com/Need-help-on-similarity-ranking-approach-tp4054847p4056680.html
> Sent from the ElasticSearch Users mailing list archive at Nabble.com.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/d4a2ee12-b9af-4142-a2e9-71b85cc9141c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Re: Need help on similarity ranking approach
Hello,

I am not sure that would work. I'd first index your document, and then use mlt with this document id and include set to true (added in the latest ES release). Then you'll know how "far" your documents are from the queried document. Also, make sure to pick up most of the terms, by setting percent_terms_to_match=0, max_query_terms=<high value> and min_doc_freq=1. In order to know which terms from the queried document have matched in the response, you can use explain.

Alex

On Thursday, May 29, 2014 10:42:47 AM UTC+2, Rgs wrote:
> [...]
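The suggestion above can be sketched as a request against the 1.x-era More Like This endpoint (`GET /{index}/{type}/{id}/_mlt`, visible elsewhere in this thread). This is an illustrative sketch only: the host, index, type, and `mlt_fields` values are assumptions, not taken from the original poster's setup.

```python
from urllib.parse import urlencode

def build_mlt_url(host, index, doc_type, doc_id,
                  percent_terms_to_match=0,
                  max_query_terms=500,     # "high value" so most terms are picked up
                  min_doc_freq=1,
                  include=True):
    """Build the URL for a More Like This API request (ES 1.x style)."""
    params = {
        "mlt_fields": "file",                # field(s) to compare; assumed name
        "percent_terms_to_match": percent_terms_to_match,
        "max_query_terms": max_query_terms,
        "min_doc_freq": min_doc_freq,
        "include": str(include).lower(),     # include the queried doc in results
    }
    return "%s/%s/%s/%s/_mlt?%s" % (host, index, doc_type, doc_id, urlencode(params))

print(build_mlt_url("http://localhost:9200", "documents", "document", "some-id"))
```

With `include=true`, the queried document should come back as the top hit, which is what makes the "how far are my documents" comparison possible.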
Re: MoreLikeThis can't identify that 2 documents with exactly same attachments are duplicates
On May 8, 2014 8:09 AM, "Zoran Jeremic" wrote:
> Hi Alex,
>
> Thank you for this explanation. This really helped me to understand how it
> works, and now I managed to get the results I was expecting just after
> setting the max_query_terms value to 0 or some very high value. With these
> results in my tests I was able to identify duplicates. I noticed a couple
> of things though.
>
> - I got much better results with web pages when I indexed the attachment
>   as HTML source and used text extracted by Jsoup in the query, than when
>   I indexed text extracted from the web page as the attachment and used
>   text in the query. I suppose the difference is related to the fact that
>   Jsoup does not extract text the same way as the Tika parser used by ES
>   does.
> - There was a significant improvement in the results in the second test,
>   when I indexed 50 web pages, compared to the first test, when I indexed
>   10 web pages. I deleted the index before each test. I suppose that this
>   is related to tf*idf. If so, does it make sense to provide some training
>   set for Elasticsearch that will be used to populate the index before the
>   system starts to be used?

Perhaps you are asking for a background dataset to bias the selection of interesting terms. This could make sense depending on your application.

>> Could you please define "relevant" in your setting? In a corpus of very
>> similar documents, is your goal to find the ones which are oddly
>> different? Have you looked into ES significant terms?
>
> I have a service that recommends documents to students based on their
> current learning context. It creates a tokenized string from the titles,
> descriptions and keywords of the course lessons the student is working on
> at the moment. I'm using this string as input to mlt_like_text to find
> some interesting resources that could help them. I want to avoid having
> duplicates (or very similar documents) among the top documents that are
> recommended.
> My idea was that during document uploading (before I index it with
> Elasticsearch) I find whether a duplicate of it already exists, and store
> this information as an ES document field. Later, in the query, I can
> specify that duplicates are not recommended.
>
>> Here you should probably strip the html tags, and solely index the text
>> in its own field.
>
> As I already mentioned, this didn't give me good results for some reason.
>
> Do you think this approach would work fine with large textual documents,
> e.g. pdf documents having a couple of hundred pages? My main concern is
> related to the performance of these queries using like_text, so that's
> why I was trying to avoid this approach and use mlt with a document id as
> input.

I don't think this approach would work well in this case, but you should try. I think what you are after is to either extract good features from your PDF documents and search on that, or fingerprinting. This could be achieved by playing with analyzers.

> Thanks,
> Zoran
>
> On Wednesday, 7 May 2014 06:14:56 UTC-7, Alex Ksikes wrote:
>> [...]
Re: more like this on numbers
Hi Valentin,

For these types of searches, have you looked into range queries, perhaps combined in a boolean query?

Alex

On May 7, 2014 4:14 PM, "Valentin" wrote:
> Hi Alex,
>
> thanks. Good idea to convert the numbers into strings. But converting the
> number fields to string won't exactly solve my problem. Only if there were
> an analyzer which breaks down numbers into multiple tokens, e.g. 300 into
> "100", "200", "300".
>
> Cheers,
> Valentin
>
> On Tuesday, May 6, 2014 12:04:53 PM UTC+2, Alex Ksikes wrote:
>> Hi Valentin,
>>
>> As you know, you can only perform mlt on fields which are analyzed.
>> However, you can convert your other fields (number, ..) to text using a
>> multi field with type string at indexing time.
>>
>> Cheers,
>>
>> Alex
>>
>> On Thursday, March 27, 2014 4:31:58 PM UTC+1, Valentin wrote:
>>> Hi,
>>>
>>> as far as I understand it, the more like this query allows finding
>>> documents where the same tokens are used. I wonder if there is a
>>> possibility to find documents where a particular field is compared
>>> based on its value (number).
>>>
>>> Regards,
>>> Valentin
>>>
>>> PS: elasticsearch rocks!
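The range-query suggestion above can be sketched as a query body that combines text similarity with a numeric "closeness" window inside a bool query. The field names (`description`, `price`) and the tolerance are invented for illustration; they are not from the original poster's mapping.

```python
# Sketch: instead of running mlt on a numeric field, pair a more_like_this
# clause on a text field with a range clause that brackets the number.
def similar_with_numeric_tolerance(like_text, value, tolerance=100):
    """Query body: text similarity plus a numeric window of +/- tolerance."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"range": {"price": {          # assumed numeric field
                        "gte": value - tolerance,
                        "lte": value + tolerance,
                    }}}
                ],
                "should": [
                    {"more_like_this": {
                        "fields": ["description"],  # assumed text field
                        "like_text": like_text,
                        "min_term_freq": 1,
                        "min_doc_freq": 1,
                    }}
                ],
            }
        }
    }
```

So a document with `price` 300 would match a search around 250 but not one around 1000, while the mlt clause still ranks by textual similarity.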
Re: MoreLikeThis can't identify that 2 documents with exactly same attachments are duplicates
Hi Zoran,

In a nutshell, 'more like this' creates a large boolean disjunctive query of 'max_query_terms' number of interesting terms from a text specified in 'like_text'. The interesting terms are picked up with respect to their tf-idf scores in the whole corpus. This selection can be tuned with the 'min_term_freq', 'min_doc_freq', and 'max_doc_freq' parameters. The number of boolean clauses that must match is controlled by 'percent_terms_to_match'. In the case of specifying only one field in 'fields', the analyzer used to pick up the terms in 'like_text' is the one associated with the field, unless overridden by 'analyzer'. So as an example, the default is to create a boolean query of 25 interesting terms where only 30% of the should clauses must match.

On Wednesday, May 7, 2014 5:14:11 AM UTC+2, Zoran Jeremic wrote:
> Hi Alex,
>
>> If you are looking for exact duplicates then hashing the file content,
>> and doing a search for that hash would do the job.
>
> This trick won't work for me as these are not exact duplicates. For
> example, I have 10 students working on the same 100-page Word document.
> Each of these students could change only one sentence and upload the
> document. The hash will be different, but it's 99.99% the same document.
> I have another service that uses mlt_like_text to recommend relevant
> documents, and my problem is that if this document has the best score,
> then all duplicates will be among the top hits, and instead of
> recommending several most relevant documents I will recommend 10
> instances of the same document.

Could you please define "relevant" in your setting? In a corpus of very similar documents, is your goal to find the ones which are oddly different? Have you looked into ES significant terms?

>> If you are looking for near duplicates, then I would recommend
>> extracting whatever text you have in your html, pdf, doc, indexing that
>> and running more like this with like_text set to that content.
> I tried that as well, and the results are very disappointing, though I'm
> not sure if that would be a good idea, having in mind that long textual
> documents could be used. For testing purposes, I made a simple test with
> 10 web pages. Maybe I'm making some mistake there. What I did is to index
> 10 web pages and store them in documents as attachments. Content is
> stored as byte[]. Then I'm using the same 10 pages, extracting content
> using Jsoup, and trying to find similar web pages. Here is the code that
> I used to find web pages similar to the provided one:
>
>     System.out.println("Duplicates for link:" + link);
>     String indexName = ESIndexNames.INDEX_DOCUMENTS;
>     String indexType = ESIndexTypes.DOCUMENT;
>     String mapping = copyToStringFromClasspath(
>             "/org/prosolo/services/indexing/document-mapping.json");
>     client.admin().indices().putMapping(
>             putMappingRequest(indexName).type(indexType).source(mapping)).actionGet();
>     URL url = new URL(link);
>     org.jsoup.nodes.Document doc = Jsoup.connect(link).get();
>     String html = doc.html(); // doc.text();
>     // create the query
>     QueryBuilder qb = QueryBuilders.moreLikeThisQuery("file")
>             .likeText(html).minTermFreq(0).minDocFreq(0);
>     SearchResponse sr = client.prepareSearch(ESIndexNames.INDEX_DOCUMENTS)
>             .setQuery(qb)
>             .addFields("url", "title", "contentType")
>             .setFrom(0).setSize(5).execute().actionGet();
>     if (sr != null) {
>         SearchHits searchHits = sr.getHits();
>         Iterator<SearchHit> hitsIter = searchHits.iterator();
>         while (hitsIter.hasNext()) {
>             SearchHit searchHit = hitsIter.next();
>             System.out.println("Duplicate:" + searchHit.getId()
>                     + " title:" + searchHit.getFields().get("url").getValue()
>                     + " score:" + searchHit.getScore());
>         }
>     }
>
> And the results of executing this for each of the 10 urls are:
>
>     Duplicates for link:http://en.wikipedia.org/wiki/Mathematical_logic
>     Duplicate:Crwk_36bTUCEso1ambs0bA URL:http://en.wikipedia.org/wiki/Mathematical_logic score:0.3335998
>     Duplicate:--3l-WRuQL2osXg71ixw7A URL:http://en.wikipedia.org/wiki/Chemistry score:0.16319205
>     Duplicate:8dDa6HsBS12HrI0XgFVLvA URL:http://en.wikipedia.org/wiki/Formal_science score:0.13035104
>     Duplicate:1APeDW0KQnWRv_8mihrz4A URL:http://en.wikipedia.org/wiki/Star score:0.12292466
>     Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:http://en.wikipedia.org/wiki/Crystallography score:0.117023855
>
>     Duplicates for link:http://en.wikipedia.org/wiki/Mathematical_statistics
>     [...]
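As a rough illustration of the mechanics discussed in this thread (pick the top 'max_query_terms' terms by tf-idf, then build a disjunctive bool query where only 'percent_terms_to_match' of the clauses must match), here is a toy sketch. The tf-idf scoring is a simplified stand-in for Lucene's actual term selection, and all names are invented.

```python
import math
import re
from collections import Counter

def interesting_terms(like_text, doc_freqs, num_docs,
                      max_query_terms=25, min_term_freq=2, min_doc_freq=5):
    """Toy version of MLT term selection: keep the highest tf-idf terms."""
    tf = Counter(re.findall(r"\w+", like_text.lower()))
    scored = []
    for term, freq in tf.items():
        df = doc_freqs.get(term, 0)
        if freq < min_term_freq or df < min_doc_freq:
            continue  # frequency cutoffs, as with min_term_freq / min_doc_freq
        idf = math.log(num_docs / (1 + df))
        scored.append((freq * idf, term))
    return [t for _, t in sorted(scored, reverse=True)[:max_query_terms]]

def mlt_as_bool_query(terms, field="body", percent_terms_to_match=0.3):
    """Disjunctive bool query over the selected terms."""
    return {
        "bool": {
            "should": [{"term": {field: t}} for t in terms],
            "minimum_should_match": "%d%%" % int(percent_terms_to_match * 100),
        }
    }
```

With the defaults this produces at most 25 term clauses of which 30% must match, mirroring the default behaviour described above.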
Re: More like this scoring algorithm unclear
Hi Maarten,

Your 'like_text' is analyzed the same way your 'product_id' field is analyzed, unless specified by 'analyzer'. I would recommend setting 'percent_terms_to_match' to 0. However, if you are only searching over product ids, then a simple boolean query would do. If not, then I would create a boolean query where each clause is a 'more like this field' for each field of the queried document. This is actually what the mlt API does.

Cheers,

Alex

On Wednesday, January 8, 2014 7:20:05 PM UTC+1, Maarten Roosendaal wrote:
> The scoring algorithm is still vague, but I got the query to act like the
> API, although the results are different, so I'm still doing it wrong.
> Here's an example:
>
>     {
>       "explain": true,
>       "query": {
>         "more_like_this": {
>           "fields": ["PRODUCT_ID"],
>           "like_text": "104004855475 1001004002067765 100200494210 1002004004499883",
>           "min_term_freq": 1,
>           "min_doc_freq": 1,
>           "max_query_terms": 1,
>           "percent_terms_to_match": 0.5
>         }
>       },
>       "from": 0,
>       "size": 50,
>       "sort": [],
>       "facets": {}
>     }
>
> The like_text contains product_id's from a wishlist for which I want to
> find similar lists.
>
> On Wednesday, January 8, 2014 4:50:53 PM UTC+1, Maarten Roosendaal wrote:
>> Hi,
>>
>> Thanks, I'm not quite sure how to do that. I'm using:
>> http://localhost:9200/lists/list/[id of list]/_mlt?mlt_field=product_id&min_term_freq=1&min_doc_freq=1
>>
>> The body does not seem to be respected (I'm using the elasticsearch head
>> plugin) if I add:
>>
>>     { "explain": true }
>>
>> I've been trying to rewrite the mlt API as an mlt query but no luck so
>> far. Any suggestions?
>>
>> Thanks,
>> Maarten
>>
>> On Wednesday, January 8, 2014 4:14:25 PM UTC+1, Justin Treher wrote:
>>> Hey Maarten,
>>>
>>> I would use the "explain": true option to see just why your documents
>>> are being scored higher than others. MoreLikeThis uses the same
>>> fulltext scoring as far as I know, so term position would affect score.
>>> http://lucene.apache.org/core/3_0_3/api/contrib-queries/org/apache/lucene/search/similar/MoreLikeThis.html
>>>
>>> Justin
>>>
>>> On Wednesday, January 8, 2014 3:04:47 AM UTC-5, Maarten Roosendaal wrote:
>>>> Hi,
>>>>
>>>> I have a question about why the 'more like this' algorithm scores some
>>>> documents higher than others, while they are (at first glance) the
>>>> same. What I've done is index wishlist documents which contain one
>>>> property, product_id; this property contains an array of product_id's
>>>> (e.g. [1234, , , ]). What I'm trying to do is find similar wishlists
>>>> for a given wishlist with id x. The MLT API seems to work: it returns
>>>> other documents which contain at least 1 of the product_id's from the
>>>> original list.
>>>>
>>>> But what I see is that, for example, I get 10 hits, and the first 6
>>>> hits contain the same (and only 1) product_id, which is present in the
>>>> original wishlist. What I would expect is that the score of the first
>>>> 6 is the same. However, only the first 2 have the same score, the next
>>>> 2 a lower score, and the next 2 an even lower one. Why is this?
>>>>
>>>> Also, I'm trying to write the MLT API as an MLT query, but somehow it
>>>> doesn't work. I would expect that I need to take the entire content of
>>>> the original product_id property and feed it as input to 'like_text'.
>>>> The documentation is not very clear and doesn't provide examples, so
>>>> I'm a little lost. Hope someone can give some pointers.
>>>>
>>>> Thanks,
>>>> Maarten
Re: Interesting Terms for MoreLikeThis Query in ElasticSearch
You could always use explain to find out the best matching terms of any query. In order to get all the interesting terms, you could run a query where the top result document has matched itself. Also, the new significant terms aggregation might be of interest to you:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html

On Thursday, January 30, 2014 9:59:02 PM UTC+1, api...@clearedgeit.com wrote:
> I have been trying to figure out how to get interesting terms using the
> MLT query. Does ElasticSearch have this functionality similar to Solr,
> or if not, is there a workaround?
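The significant terms suggestion above can be sketched as a search body with a `significant_terms` aggregation attached. The index/field names and the query text are placeholders, not anything from the original poster's setup.

```python
# Sketch: a match query whose result set feeds a significant_terms
# aggregation, surfacing terms that are unusually frequent in the matched
# documents compared to the background corpus.
def significant_terms_query(query_text, field="body", size=10):
    return {
        "query": {"match": {field: query_text}},
        "aggregations": {
            "keywords": {
                "significant_terms": {"field": field, "size": size}
            }
        },
    }
```

The aggregation's buckets then play a role similar to Solr's "interesting terms" output.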
Re: Elastic Search MLT API, how to use fields with weights.
I'd like to add to this that the mlt API is the same as a boolean query (in the query DSL) made of multiple 'more like this field' clauses, where each clause is set to the content of a field of the queried document.

On Thursday, February 20, 2014 4:20:36 PM UTC+1, Binh Ly wrote:
> I do not believe you can boost individual fields/terms separately in a
> MLT query. Your best bet is probably to run a bool query of multiple MLT
> queries, each with a different field and boost, but you'll need to first
> extract the MLT text before you can do this.
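The equivalence described above can be sketched as follows: fetch the queried document's fields, then build one 'more like this field' clause per field, which also gives a place to hang per-field boosts. The field names, boosts, and the helper itself are illustrative assumptions.

```python
# Sketch: simulate the mlt API as a bool query of more_like_this_field
# clauses (ES 1.x-era query name), one per field of the queried document,
# with optional per-field boosts.
def mlt_bool_query(field_texts, boosts=None):
    """field_texts: {field_name: field_content} from a prior GET of the doc."""
    boosts = boosts or {}
    clauses = []
    for field, text in field_texts.items():
        clauses.append({
            "more_like_this_field": {
                field: {
                    "like_text": text,
                    "min_term_freq": 1,
                    "min_doc_freq": 1,
                    "boost": boosts.get(field, 1.0),
                }
            }
        })
    return {"query": {"bool": {"should": clauses}}}
```

For example, `mlt_bool_query({"title": t, "body": b}, boosts={"title": 2.0})` weights title similarity twice as heavily as body similarity.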
Re: MoreLikeThis ignores queries?
Hello Alexey,

You should use the query DSL and not the more like this API. You can create a boolean query where one clause is your more like this query and the other one is your ignore-category query (better to use a filter here if you can).

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html

However, the more like this query of the DSL only takes a like_text parameter; you cannot pass the id of the document. This will change in a subsequent version of ES. For now, to simulate this functionality, you can use multiple mlt queries with like_text set to the value of each field of the queried document, inside a boolean query.

Let me know if this helps.

Alex

On Wednesday, March 19, 2014 5:01:06 AM UTC+1, Alexey Bagryancev wrote:
> Anyone can help me? It really does not work...
>
> On Wednesday, March 19, 2014 2:05:49 AM UTC+7, Alexey Bagryancev wrote:
>> Hi,
>>
>> I am trying to filter moreLikeThis results by adding an additional
>> query, but it seems to be ignored entirely.
>>
>> I tried to run my ignore query separately and it works fine, but how do
>> I make it work with moreLikeThis? Please help me.
>>
>>     $ignoreQuery = $this->IgnoreCategoryQuery('movies');
>>
>>     $this->resultsSet = $this->index->moreLikeThis(
>>         new \Elastica\Document($id),
>>         array_merge($this->mlt_fields, array(
>>             'search_size' => $this->size,
>>             'search_from' => $this->from)),
>>         $ignoreQuery);
>>
>> My IgnoreCategoryQuery function:
>>
>>     public function IgnoreCategoryQuery($category = 'main')
>>     {
>>         $categoriesTermQuery = new \Elastica\Query\Term();
>>         $categoriesTermQuery->setTerm('categories', $category);
>>
>>         $categoriesBoolQuery = new \Elastica\Query\Bool();
>>         $categoriesBoolQuery->addMustNot($categoriesTermQuery);
>>
>>         return $categoriesBoolQuery;
>>     }
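The combination suggested above (mlt clause plus a category-exclusion clause inside one bool query) can be sketched as a raw query body. Field and category names are illustrative only.

```python
# Sketch: a bool query whose 'must' clause is the more_like_this query and
# whose 'must_not' clause excludes a category, so the similarity search and
# the exclusion run as a single query.
def mlt_excluding_category(like_text, category, fields=("title", "body")):
    return {
        "query": {
            "bool": {
                "must": [{"more_like_this": {
                    "fields": list(fields),
                    "like_text": like_text,
                }}],
                "must_not": [{"term": {"categories": category}}],
            }
        }
    }
```

In ES versions of that era, the exclusion could also be expressed as a filter for better cacheability, as the reply notes.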
Re: more like this on numbers
Hi Valentin,

As you know, you can only perform mlt on fields which are analyzed. However, you can convert your other fields (number, ..) to text using a multi field with type string at indexing time.

Cheers,

Alex

On Thursday, March 27, 2014 4:31:58 PM UTC+1, Valentin wrote:
> Hi,
>
> as far as I understand it, the more like this query allows finding
> documents where the same tokens are used. I wonder if there is a
> possibility to find documents where a particular field is compared based
> on its value (number).
>
> Regards,
> Valentin
>
> PS: elasticsearch rocks!
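The multi-field suggestion above can be sketched as a mapping fragment, assuming ES 1.x multi-field syntax: the numeric field keeps its native type, and a string sub-field (the name `as_string` is invented) indexes the same value as analyzed text so that mlt can pick terms from it.

```python
# Sketch of a 1.x-style multi field: 'price' stays a long for range queries
# and sorting, while 'price.as_string' is an analyzed string usable by mlt.
number_as_text_mapping = {
    "properties": {
        "price": {                                  # illustrative field name
            "type": "long",
            "fields": {
                "as_string": {"type": "string"}     # queried as price.as_string
            },
        }
    }
}
```

Note this only makes the number searchable as a token; it does not give the "300 matches 100/200/300" decomposition from the question, which would need a custom analyzer.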
Re: Need help on similarity ranking approach
Hello,

What you want to know is the score of the document that has matched itself using more like this. The mlt API excludes the queried document. However, it is equivalent to running a boolean query of 'more like this field' clauses, one for each field of the queried document. This will give you, as the top result, the document that has matched itself, so that you can compute the percentage of similarity of the remaining matched documents.

Alex

On Friday, May 2, 2014 3:22:34 PM UTC+2, Rgs wrote:
> Thanks Binh Ly and Ivan Brusic for your replies.
>
> I need to find the similarity in percentage of a document against other
> documents, and this will be considered for grouping the documents.
>
> Is it possible to get the similarity percentage using a more like this
> query? Or is there any other way to calculate the percentage of
> similarity from the query result?
>
> Eg: document1 is 90% similar to document2.
>     document1 is 45% similar to document3
>     etc.
>
> Thanks
>
> --
> View this message in context:
> http://elasticsearch-users.115913.n3.nabble.com/Need-help-on-similarity-ranking-approach-tp4054847p4055227.html
> Sent from the ElasticSearch Users mailing list archive at Nabble.com.
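The normalization described above can be sketched with a small helper: once the query is built so that the queried document matches itself and comes back as the top hit, every other hit's score can be expressed as a percentage of that self-match score. The hit shapes and scores below are invented for illustration.

```python
# Sketch: turn raw Lucene scores into percentages relative to the
# self-match score, which serves as the "100% similar" reference point.
def similarity_percentages(hits):
    """hits: list of {'_id': ..., '_score': ...}; self-match expected first."""
    if not hits:
        return {}
    top = hits[0]["_score"]
    return {h["_id"]: round(100.0 * h["_score"] / top, 1) for h in hits}

hits = [
    {"_id": "doc1", "_score": 2.4},    # the queried document matching itself
    {"_id": "doc2", "_score": 2.16},
    {"_id": "doc3", "_score": 1.08},
]
print(similarity_percentages(hits))    # -> doc1: 100.0, doc2: 90.0, doc3: 45.0
```

This matches the "document1 is 90% similar to document2" framing from the question, with the caveat that the ratios are relative relevance scores, not a formal similarity measure.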
Re: MoreLikeThis can't identify that 2 documents with exactly same attachments are duplicates
Hi Zoran,

If you are looking for exact duplicates, then hashing the file content and doing a search for that hash would do the job. If you are looking for near duplicates, then I would recommend extracting whatever text you have in your html, pdf, doc, indexing that, and running more like this with like_text set to that content. Additionally, you can perform an mlt search on more fields, including the meta-data fields extracted with the attachment plugin. Hope this helps.

Alex

On Monday, May 5, 2014 8:08:30 PM UTC+2, Zoran Jeremic wrote:
> Hi Alex,
>
> Thank you for your explanation. It makes sense now. However, I'm not sure
> I understood your proposal.
>
>> So I would adjust the mlt_fields accordingly, and possibly extract the
>> relevant portions of texts manually
>
> What do you mean by adjusting mlt_fields? The only shared field that is
> guaranteed to be the same is file. Different users could add different
> titles to documents, but attach the same or almost the same documents. If
> I compare documents based on the other fields, it doesn't mean that they
> will match, even though the attached files are exactly the same.
> I'm also not sure what you meant by extracting the relevant portions of
> text manually. How would I do that, and what would I do with it?
>
> Thanks,
> Zoran
>
> On Monday, 5 May 2014 01:23:49 UTC-7, Alex Ksikes wrote:
>> Hi Zoran,
>>
>> Using the attachment type, you can text search over the attached
>> document's meta-data, but not its actual content, as it is base64
>> encoded. So I would adjust the mlt_fields accordingly, and possibly
>> extract the relevant portions of texts manually. Also set
>> percent_terms_to_match = 0, to ensure that all boolean clauses match.
>> Let me know how this works out for you.
>> Cheers,
>>
>> Alex
>>
>> On Monday, May 5, 2014 5:50:07 AM UTC+2, Zoran Jeremic wrote:
>>> Hi guys,
>>>
>>> I have a document that stores the content of an html file, pdf, doc or
>>> other textual document in one of its fields as a byte array, using the
>>> attachment plugin. The mapping is as follows:
>>>
>>>     { "document": {
>>>         "properties": {
>>>           "title":         { "type": "string", "store": true },
>>>           "description":   { "type": "string", "store": "yes" },
>>>           "contentType":   { "type": "string", "store": "yes" },
>>>           "url":           { "type": "string", "store": "yes" },
>>>           "visibility":    { "type": "string", "store": "yes" },
>>>           "ownerId":       { "type": "long", "store": "yes" },
>>>           "relatedToType": { "type": "string", "store": "yes" },
>>>           "relatedToId":   { "type": "long", "store": "yes" },
>>>           "file": {
>>>             "path": "full", "type": "attachment",
>>>             "fields": {
>>>               "author":         { "type": "string" },
>>>               "title":          { "type": "string", "store": true },
>>>               "keywords":       { "type": "string" },
>>>               "file":           { "type": "string", "store": true,
>>>                                   "term_vector": "with_positions_offsets" },
>>>               "name":           { "type": "string" },
>>>               "content_length": { "type": "integer" },
>>>               "date":           { "type": "date", "format": "dateOptionalTime" },
>>>               "content_type":   { "type": "string" }
>>>             }
>>>           }
>>>         }
>>>     }}
>>>
>>> And the code I'm using to store the document is:
>>>
>>>     VisibilityType.PUBLIC
>>>
>>> These files seem to be stored fine and I can search the content.
>>> However, I need to identify whether there are duplicates of web pages
>>> or files stored in ES, so I don't return the same documents to the
>>> user as search or recommendation results. [...]
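The exact-duplicate shortcut mentioned at the top of this reply (hash the file content and search for the hash) can be sketched as follows. The field name `content_sha1` is an invented example, not part of the poster's mapping.

```python
import hashlib

def content_digest(data: bytes) -> str:
    """Digest of the raw file bytes; identical uploads share the digest."""
    return hashlib.sha1(data).hexdigest()

def exact_duplicate_query(data: bytes):
    """Term query on a stored digest field to find byte-identical uploads."""
    return {"query": {"term": {"content_sha1": content_digest(data)}}}
```

The digest would be computed at upload time and stored as a not_analyzed field; this catches only byte-identical files, which is exactly why the near-duplicate case above needs mlt instead.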
Re: MoreLikeThis can't identify that 2 documents with exactly same attachments are duplicates
Hi Zoran,

Using the attachment type, you can text search over the attached document's meta-data, but not its actual content, as it is base64 encoded. So I would adjust the mlt_fields accordingly, and possibly extract the relevant portions of texts manually. Also set percent_terms_to_match = 0, to ensure that all boolean clauses match. Let me know how this works out for you.

Cheers,

Alex

On Monday, May 5, 2014 5:50:07 AM UTC+2, Zoran Jeremic wrote:
> Hi guys,
>
> I have a document that stores the content of an html file, pdf, doc or
> other textual document in one of its fields as a byte array, using the
> attachment plugin. The mapping is as follows:
>
>     { "document": {
>         "properties": {
>           "title":         { "type": "string", "store": true },
>           "description":   { "type": "string", "store": "yes" },
>           "contentType":   { "type": "string", "store": "yes" },
>           "url":           { "type": "string", "store": "yes" },
>           "visibility":    { "type": "string", "store": "yes" },
>           "ownerId":       { "type": "long", "store": "yes" },
>           "relatedToType": { "type": "string", "store": "yes" },
>           "relatedToId":   { "type": "long", "store": "yes" },
>           "file": {
>             "path": "full", "type": "attachment",
>             "fields": {
>               "author":         { "type": "string" },
>               "title":          { "type": "string", "store": true },
>               "keywords":       { "type": "string" },
>               "file":           { "type": "string", "store": true,
>                                   "term_vector": "with_positions_offsets" },
>               "name":           { "type": "string" },
>               "content_length": { "type": "integer" },
>               "date":           { "type": "date", "format": "dateOptionalTime" },
>               "content_type":   { "type": "string" }
>             }
>           }
>         }
>     }}
>
> And the code I'm using to store the document is:
>
>     VisibilityType.PUBLIC
>
> These files seem to be stored fine and I can search the content. However,
> I need to identify whether there are duplicates of web pages or files
> stored in ES, so I don't return the same documents to the user as search
> or recommendation results.
> My expectation was that I could use MoreLikeThis after the document was
> indexed to identify whether there are duplicates of that document, and
> accordingly mark it as a duplicate. However, the results look weird to
> me, or I don't understand very well how MoreLikeThis works.
>
> For example, I indexed the web page
> http://en.wikipedia.org/wiki/Linguistics 3 times, and all 3 documents in
> ES have exactly the same binary content under file. Then for the
> following query:
>
>     http://localhost:9200/documents/document/WpkcK-ZjSMi_l6iRq0Vuhg/_mlt?mlt_fields=file&min_doc_freq=1
>
> where the ID is the id of one of these documents, I got these results:
>
>     http://en.wikipedia.org/wiki/Linguistics with score 0.6633003
>     http://en.wikipedia.org/wiki/Linguistics with score 0.6197818
>     http://en.wikipedia.org/wiki/Computational_linguistics with score 0.48509508
>     ...
>
> For some other examples, scores for the same documents are much lower,
> and sometimes (though not that often) I don't get duplicates in the first
> positions. I would expect a score of 1.0 or higher for documents that are
> exactly the same, but that's not the case, and I can't figure out how I
> could identify whether there are duplicates in the Elasticsearch index.
>
> I would appreciate it if somebody could explain whether this is expected
> behaviour or I didn't use it properly.
>
> Thanks,
> Zoran