Thank you, Alex. At the moment it works fine even with large documents, but
I'll test whether I can reach similar results with interesting terms.
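For reference while testing, this is roughly how those knobs look in the 1.x REST query DSL; the field name and like_text are only placeholders, and the parameter values shown are the defaults mentioned further down in the thread:

```json
{
  "query": {
    "more_like_this": {
      "fields": ["file"],
      "like_text": "text extracted from the uploaded document",
      "max_query_terms": 25,
      "min_term_freq": 2,
      "min_doc_freq": 5,
      "percent_terms_to_match": 0.3
    }
  }
}
```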
Best,
Zoran

On Thursday, 8 May 2014 02:02:24 UTC-7, Alex Ksikes wrote:
> On May 8, 2014 8:09 AM, "Zoran Jeremic" <zoran....@gmail.com> wrote:
> >
> > Hi Alex,
> >
> > Thank you for this explanation. It really helped me understand how it
> > works, and I managed to get the results I was expecting once I set
> > max_query_terms to 0 or to some very high value. With these results I
> > was able to identify duplicates in my tests. I noticed a couple of
> > things, though:
> >
> > - I got much better results with web pages when I indexed the HTML
> >   source as the attachment and used the text extracted by Jsoup in
> >   the query than when I indexed the extracted text as the attachment
> >   and used that text in the query. I suppose the difference is that
> >   Jsoup does not extract text in the same way as the Tika parser used
> >   by ES does.
> >
> > - There was a significant improvement in the results in the second
> >   test, where I indexed 50 web pages, compared to the first test,
> >   where I indexed 10 web pages. I deleted the index before each test.
> >   I suppose this is related to tf*idf.
> >
> > If so, does it make sense to provide some training set for
> > elasticsearch that would be used to populate the index before the
> > system goes into use?
>
> Perhaps you are asking for a background dataset to bias the selection
> of interesting terms. This could make sense depending on your
> application.
>
> > > Could you please define "relevant" in your setting? In a corpus of
> > > very similar documents, is your goal to find the ones which are
> > > oddly different? Have you looked into ES significant terms?
> >
> > I have a service that recommends documents to students based on
> > their current learning context. It creates a tokenized string from
> > the titles, descriptions and keywords of the course lessons the
> > student is working on at the moment. I'm using this string as input
> > to mlt_like_text to find interesting resources that could help them.
> >
> > I want to avoid having duplicates (or very similar documents) among
> > the top documents that are recommended. My idea was that during
> > document upload (before I index it with elasticsearch) I would find
> > out whether its duplicate already exists, and store this information
> > in an ES document field. Later, in the query, I can specify that
> > duplicates are not recommended.
>
> > > Here you should probably strip the html tags, and solely index the
> > > text in its own field.
> >
> > As I already mentioned, this didn't give me good results for some
> > reason.
> >
> > Do you think this approach would work fine with large textual
> > documents, e.g. pdf documents of a couple of hundred pages? My main
> > concern is the performance of these queries using like_text, which
> > is why I was trying to avoid this approach and use mlt with a
> > document id as input.
>
> I don't think this approach would work well in this case, but you
> should try. I think what you are after is to either extract good
> features for your PDF documents and search on those, or
> fingerprinting. This could be achieved by playing with analyzers.
>
> > Thanks,
> > Zoran
> >
> > On Wednesday, 7 May 2014 06:14:56 UTC-7, Alex Ksikes wrote:
> >> Hi Zoran,
> >>
> >> In a nutshell, 'more like this' creates a large boolean disjunctive
> >> query of 'max_query_terms' number of interesting terms from a text
> >> specified in 'like_text'. The interesting terms are picked with
> >> respect to their tf-idf scores in the whole corpus. This selection
> >> can be tuned with the 'min_term_freq', 'min_doc_freq' and
> >> 'max_doc_freq' parameters. The number of boolean clauses that must
> >> match is controlled by 'percent_terms_to_match'. In the case of
> >> specifying only one field in 'fields', the analyzer used to pick
> >> the terms in 'like_text' is the one associated with that field,
> >> unless overridden by 'analyzer'. So as an example, the default is
> >> to create a boolean query of 25 interesting terms where only 30% of
> >> the should clauses must match.
> >>
> >> On Wednesday, May 7, 2014 5:14:11 AM UTC+2, Zoran Jeremic wrote:
> >>> Hi Alex,
> >>>
> >>> > If you are looking for exact duplicates then hashing the file
> >>> > content, and doing a search for that hash would do the job.
> >>>
> >>> This trick won't work for me, as these are not exact duplicates.
> >>> For example, I have 10 students working on the same 100-page Word
> >>> document. Each of them could change only one sentence and upload
> >>> the document. The hash will be different, but the documents are
> >>> 99.99% the same. I have another service that uses mlt_like_text to
> >>> recommend relevant documents, and my problem is that if such a
> >>> document has the best score, then all of its duplicates will be
> >>> among the top hits, and instead of recommending several of the
> >>> most relevant documents I will recommend 10 instances of the same
> >>> document.
> >>
> >> Could you please define "relevant" in your setting? In a corpus of
> >> very similar documents, is your goal to find the ones which are
> >> oddly different? Have you looked into ES significant terms?
> >>
> >>> > If you are looking for near duplicates, then I would recommend
> >>> > extracting whatever text you have in your html, pdf, doc,
> >>> > indexing that, and running more like this with like_text set to
> >>> > that content.
> >>>
> >>> I tried that as well, and the results are very disappointing,
> >>> though I'm not sure it would be a good idea anyway, bearing in
> >>> mind that long textual documents could be used. For testing
> >>> purposes, I made a simple test with 10 web pages. Maybe I'm making
> >>> some mistake there. What I did was index 10 web pages and store
> >>> each in a document as an attachment. The content is stored as
> >>> byte[]. Then I took the same 10 pages, extracted the content using
> >>> Jsoup, and tried to find similar web pages.
> >>> Here is the code that I used to find web pages similar to the
> >>> provided one:
> >>>
> >>>   System.out.println("Duplicates for link:" + link);
> >>>   System.out.println("************************************************");
> >>>   String indexName = ESIndexNames.INDEX_DOCUMENTS;
> >>>   String indexType = ESIndexTypes.DOCUMENT;
> >>>   String mapping = copyToStringFromClasspath(
> >>>       "/org/prosolo/services/indexing/document-mapping.json");
> >>>   client.admin().indices().putMapping(putMappingRequest(indexName)
> >>>       .type(indexType).source(mapping)).actionGet();
> >>>   URL url = new URL(link);
> >>>   org.jsoup.nodes.Document doc = Jsoup.connect(link).get();
> >>>   String html = doc.html(); // doc.text();
> >>>   // create the query
> >>>   QueryBuilder qb = QueryBuilders.moreLikeThisQuery("file")
> >>>       .likeText(html).minTermFreq(0).minDocFreq(0);
> >>>   SearchResponse sr = client.prepareSearch(ESIndexNames.INDEX_DOCUMENTS)
> >>>       .setQuery(qb).addFields("url", "title", "contentType")
> >>>       .setFrom(0).setSize(5).execute().actionGet();
> >>>   if (sr != null) {
> >>>       SearchHits searchHits = sr.getHits();
> >>>       Iterator<SearchHit> hitsIter = searchHits.iterator();
> >>>       while (hitsIter.hasNext()) {
> >>>           SearchHit searchHit = hitsIter.next();
> >>>           System.out.println("Duplicate:" + searchHit.getId()
> >>>               + " title:" + searchHit.getFields().get("url").getValue()
> >>>               + " score:" + searchHit.getScore());
> >>>       }
> >>>   }
> >>>
> >>> And the results of executing this for each of the 10 urls are:
> >>>
> >>> Duplicates for link:http://en.wikipedia.org/wiki/Mathematical_logic
> >>> ************************************************
> >>> Duplicate:Crwk_36bTUCEso1ambs0bA URL:http://en.wikipedia.org/wiki/Mathematical_logic score:0.3335998
> >>> Duplicate:--3l-WRuQL2osXg71ixw7A URL:http://en.wikipedia.org/wiki/Chemistry score:0.16319205
> >>> Duplicate:8dDa6HsBS12HrI0XgFVLvA URL:http://en.wikipedia.org/wiki/Formal_science score:0.13035104
> >>> Duplicate:1APeDW0KQnWRv_8mihrz4A URL:http://en.wikipedia.org/wiki/Star score:0.12292466
> >>> Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:http://en.wikipedia.org/wiki/Crystallography score:0.117023855
> >>>
> >>> Duplicates for link:http://en.wikipedia.org/wiki/Mathematical_statistics
> >>> ************************************************
> >>> Duplicate:Crwk_36bTUCEso1ambs0bA URL:http://en.wikipedia.org/wiki/Mathematical_logic score:0.1570246
> >>> Duplicate:pPJdo7TAQhWzTdMAHyPWkA URL:http://en.wikipedia.org/wiki/Mathematical_statistics score:0.1498403
> >>> Duplicate:--3l-WRuQL2osXg71ixw7A URL:http://en.wikipedia.org/wiki/Chemistry score:0.09323166
> >>> Duplicate:1APeDW0KQnWRv_8mihrz4A URL:http://en.wikipedia.org/wiki/Star score:0.09279101
> >>> Duplicate:8dDa6HsBS12HrI0XgFVLvA URL:http://en.wikipedia.org/wiki/Formal_science score:0.08606046
> >>>
> >>> Duplicates for link:http://en.wikipedia.org/wiki/Formal_science
> >>> ************************************************
> >>> Duplicate:8dDa6HsBS12HrI0XgFVLvA URL:http://en.wikipedia.org/wiki/Formal_science score:0.12439237
> >>> Duplicate:--3l-WRuQL2osXg71ixw7A URL:http://en.wikipedia.org/wiki/Chemistry score:0.11299215
> >>> Duplicate:Crwk_36bTUCEso1ambs0bA URL:http://en.wikipedia.org/wiki/Mathematical_logic score:0.107585154
> >>> Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:http://en.wikipedia.org/wiki/Crystallography score:0.07795183
> >>> Duplicate:pPJdo7TAQhWzTdMAHyPWkA URL:http://en.wikipedia.org/wiki/Mathematical_statistics score:0.076521285
> >>>
> >>> Duplicates for link:http://en.wikipedia.org/wiki/Star
> >>> ************************************************
> >>> Duplicate:1APeDW0KQnWRv_8mihrz4A URL:http://en.wikipedia.org/wiki/Star score:0.21684575
> >>> Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:http://en.wikipedia.org/wiki/Crystallography score:0.15316588
> >>> Duplicate:vFf9IdJyQ-yfPnqzYRm9Ig URL:http://en.wikipedia.org/wiki/Cosmology score:0.123572096
> >>> Duplicate:--3l-WRuQL2osXg71ixw7A URL:http://en.wikipedia.org/wiki/Chemistry score:0.1177105
> >>> Duplicate:Crwk_36bTUCEso1ambs0bA URL:http://en.wikipedia.org/wiki/Mathematical_logic score:0.11373919
> >>>
> >>> Duplicates for link:http://en.wikipedia.org/wiki/Chemistry
> >>> ************************************************
> >>> Duplicate:--3l-WRuQL2osXg71ixw7A URL:http://en.wikipedia.org/wiki/Chemistry score:0.13033955
> >>> Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:http://en.wikipedia.org/wiki/Crystallography score:0.121021904
> >>> Duplicate:8dDa6HsBS12HrI0XgFVLvA URL: [...]
> >>
> >> Here you should probably strip the html tags, and solely index the
> >> text in its own field.

--
You received this message because you are subscribed to the Google
Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/1842c492-29d8-4339-b490-2c7235535495%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
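To make the fingerprinting suggestion from the thread a bit more concrete, here is a minimal, self-contained sketch (the class name, the 3-word shingle size, and the example texts are illustrative choices, not from the discussion): each document is reduced to a set of word shingles, and two documents are compared by their Jaccard overlap, so a long document with one changed sentence still scores close to 1.0 while unrelated pages score near 0.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class NearDuplicate {

    // Reduce a text to its set of k-word shingles (k consecutive words).
    static Set<String> shingles(String text, int k) {
        String[] words = text.toLowerCase().split("\\W+");
        Set<String> result = new HashSet<>();
        for (int i = 0; i + k <= words.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(words, i, i + k)));
        }
        return result;
    }

    // Jaccard similarity of two shingle sets: |intersection| / |union|.
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        // One word changed out of nine: most shingles still overlap.
        String original = "the quick brown fox jumps over the lazy dog";
        String edited   = "the quick brown fox leaps over the lazy dog";
        double sim = jaccard(shingles(original, 3), shingles(edited, 3));
        System.out.println(sim); // well above 0.0 despite the edit
    }
}
```

In the setting discussed above, a newly uploaded document would be flagged as a duplicate before indexing whenever its Jaccard score against some existing document exceeds a chosen threshold (e.g. 0.9); for large collections one would normally compare hashed shingle sketches (MinHash/LSH) rather than full shingle sets.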