Hi Zoran,

In a nutshell, 'more like this' creates a large boolean disjunctive query of 'max_query_terms' interesting terms picked from the text specified in 'like_text'. The interesting terms are selected according to their tf-idf scores in the whole corpus. This selection can be tuned with the 'min_term_freq', 'min_doc_freq', and 'max_doc_freq' parameters. The number of boolean clauses that must match is controlled by 'percent_terms_to_match'. When only one field is specified in 'fields', the analyzer used to pick up the terms in 'like_text' is the one associated with that field, unless overridden by 'analyzer'. So, as an example, the default is to create a boolean query of 25 interesting terms where only 30% of the 'should' clauses must match.
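To make that concrete, a sketch of such a query in the REST DSL could look like the following. 'max_query_terms': 25 and 'percent_terms_to_match': 0.3 are the defaults described above; the other values and the field name 'content' are purely illustrative tuning knobs, not something from your setup:

```json
{
  "query": {
    "more_like_this": {
      "fields": ["content"],
      "like_text": "the text whose interesting terms should be extracted",
      "max_query_terms": 25,
      "percent_terms_to_match": 0.3,
      "min_term_freq": 2,
      "min_doc_freq": 5,
      "max_doc_freq": 10000
    }
  }
}
```

Raising 'min_doc_freq' / lowering 'max_doc_freq' filters out terms that are too rare or too common to be interesting.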
On Wednesday, May 7, 2014 5:14:11 AM UTC+2, Zoran Jeremic wrote:

> Hi Alex,
>
> > If you are looking for exact duplicates then hashing the file content,
> > and doing a search for that hash would do the job.
>
> This trick won't work for me as these are not exact duplicates. For
> example, I have 10 students working on the same 100-page-long Word
> document. Each of these students could change only one sentence and upload
> a document. The hash will be different, but it's 99.99% the same document.
> I have another service that uses mlt_like_text to recommend relevant
> documents, and my problem is that if this document has the best score, then
> all duplicates will be among the top hits, and instead of recommending
> several of the most relevant documents, I will recommend 10 instances of
> the same document.

Could you please define "relevant" in your setting? In a corpus of very similar documents, is your goal to find the ones which are oddly different? Have you looked into ES significant terms?

> > If you are looking for near duplicates, then I would recommend extracting
> > whatever text you have in your html, pdf, doc, indexing that and running
> > more like this with like_text set to that content.
>
> I tried that as well, and the results are very disappointing, though I'm
> not sure if that would be a good idea having in mind that long textual
> documents could be used. For testing purposes, I made a simple test with 10
> web pages. Maybe I'm making some mistake there. What I did is to index 10
> web pages and store each in a document as an attachment. Content is stored
> as byte[]. Then I use the same 10 pages, extract the content using Jsoup,
> and try to find similar web pages.
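As an aside, for the exact-duplicate case the hashing trick can be sketched with nothing but the JDK: hash the extracted text, index the hex digest in a not_analyzed field, and do a term search for it before indexing a new upload. This is a minimal illustration, not code from your project:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class ContentHash {

    // Hash the extracted text of a document. Index the returned hex digest
    // in a not_analyzed field and query it with a term query to detect
    // exact duplicates cheaply.
    public static String sha256Hex(String content) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Changing a single character yields a completely different digest,
        // which is exactly why this only catches *exact* duplicates.
        System.out.println(sha256Hex("the same document"));
        System.out.println(sha256Hex("the same document!"));
    }
}
```

For your 99.99%-similar case this fails by design, which is why a similarity-based approach (more_like_this over the extracted text) is the better fit.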
> Here is the code that I used to find similar web pages to the provided one:
>
>     System.out.println("Duplicates for link:" + link);
>     System.out.println("************************************************");
>     String indexName = ESIndexNames.INDEX_DOCUMENTS;
>     String indexType = ESIndexTypes.DOCUMENT;
>     String mapping = copyToStringFromClasspath(
>             "/org/prosolo/services/indexing/document-mapping.json");
>     client.admin().indices().putMapping(putMappingRequest(indexName)
>             .type(indexType).source(mapping)).actionGet();
>     URL url = new URL(link);
>     org.jsoup.nodes.Document doc = Jsoup.connect(link).get();
>     String html = doc.html(); // doc.text();
>     QueryBuilder qb = null;
>     // create the query
>     qb = QueryBuilders.moreLikeThisQuery("file")
>             .likeText(html).minTermFreq(0).minDocFreq(0);
>     SearchResponse sr = client.prepareSearch(ESIndexNames.INDEX_DOCUMENTS)
>             .setQuery(qb).addFields("url", "title", "contentType")
>             .setFrom(0).setSize(5).execute().actionGet();
>     if (sr != null) {
>         SearchHits searchHits = sr.getHits();
>         Iterator<SearchHit> hitsIter = searchHits.iterator();
>         while (hitsIter.hasNext()) {
>             SearchHit searchHit = hitsIter.next();
>             System.out.println("Duplicate:" + searchHit.getId()
>                     + " title:" + searchHit.getFields().get("url").getValue()
>                     + " score:" + searchHit.getScore());
>         }
>     }
>
> And the results of the execution of this for each of the 10 urls are:
>
> Duplicates for link:http://en.wikipedia.org/wiki/Mathematical_logic
> ************************************************
> Duplicate:Crwk_36bTUCEso1ambs0bA URL:http://en.wikipedia.org/wiki/Mathematical_logic score:0.3335998
> Duplicate:--3l-WRuQL2osXg71ixw7A URL:http://en.wikipedia.org/wiki/Chemistry score:0.16319205
> Duplicate:8dDa6HsBS12HrI0XgFVLvA URL:http://en.wikipedia.org/wiki/Formal_science score:0.13035104
> Duplicate:1APeDW0KQnWRv_8mihrz4A URL:http://en.wikipedia.org/wiki/Star score:0.12292466
> Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:http://en.wikipedia.org/wiki/Crystallography score:0.117023855
>
> Duplicates for link:http://en.wikipedia.org/wiki/Mathematical_statistics
> ************************************************
> Duplicate:Crwk_36bTUCEso1ambs0bA URL:http://en.wikipedia.org/wiki/Mathematical_logic score:0.1570246
> Duplicate:pPJdo7TAQhWzTdMAHyPWkA URL:http://en.wikipedia.org/wiki/Mathematical_statistics score:0.1498403
> Duplicate:--3l-WRuQL2osXg71ixw7A URL:http://en.wikipedia.org/wiki/Chemistry score:0.09323166
> Duplicate:1APeDW0KQnWRv_8mihrz4A URL:http://en.wikipedia.org/wiki/Star score:0.09279101
> Duplicate:8dDa6HsBS12HrI0XgFVLvA URL:http://en.wikipedia.org/wiki/Formal_science score:0.08606046
>
> Duplicates for link:http://en.wikipedia.org/wiki/Formal_science
> ************************************************
> Duplicate:8dDa6HsBS12HrI0XgFVLvA URL:http://en.wikipedia.org/wiki/Formal_science score:0.12439237
> Duplicate:--3l-WRuQL2osXg71ixw7A URL:http://en.wikipedia.org/wiki/Chemistry score:0.11299215
> Duplicate:Crwk_36bTUCEso1ambs0bA URL:http://en.wikipedia.org/wiki/Mathematical_logic score:0.107585154
> Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:http://en.wikipedia.org/wiki/Crystallography score:0.07795183
> Duplicate:pPJdo7TAQhWzTdMAHyPWkA URL:http://en.wikipedia.org/wiki/Mathematical_statistics score:0.076521285
>
> Duplicates for link:http://en.wikipedia.org/wiki/Star
> ************************************************
> Duplicate:1APeDW0KQnWRv_8mihrz4A URL:http://en.wikipedia.org/wiki/Star score:0.21684575
> Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:http://en.wikipedia.org/wiki/Crystallography score:0.15316588
> Duplicate:vFf9IdJyQ-yfPnqzYRm9Ig URL:http://en.wikipedia.org/wiki/Cosmology score:0.123572096
> Duplicate:--3l-WRuQL2osXg71ixw7A URL:http://en.wikipedia.org/wiki/Chemistry score:0.1177105
> Duplicate:Crwk_36bTUCEso1ambs0bA URL:http://en.wikipedia.org/wiki/Mathematical_logic score:0.11373919
>
> Duplicates for link:http://en.wikipedia.org/wiki/Chemistry
> ************************************************
> Duplicate:--3l-WRuQL2osXg71ixw7A URL:http://en.wikipedia.org/wiki/Chemistry score:0.13033955
> Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:http://en.wikipedia.org/wiki/Crystallography score:0.121021904
> Duplicate:8dDa6HsBS12HrI0XgFVLvA URL:

Here you should probably strip the html tags and index only the text in its own field.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/c30400c5-ce33-4cb7-9335-759b3923ae14%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
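P.S. A sketch of the "own field" suggestion: alongside the attachment field, keep the Jsoup-extracted text (doc.text() rather than doc.html()) in a dedicated analyzed string field and point more_like_this at it. A possible mapping fragment, with illustrative field names and assuming the mapper-attachments plugin you already use for the attachment type:

```json
{
  "document": {
    "properties": {
      "file":    { "type": "attachment" },
      "content": { "type": "string", "analyzer": "standard" },
      "url":     { "type": "string", "index": "not_analyzed" },
      "title":   { "type": "string" }
    }
  }
}
```

With that in place the query side becomes moreLikeThisQuery("content").likeText(doc.text()), so the interesting terms are picked from clean text instead of markup.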