Thank you, Alex. At the moment it works fine even with large documents, but
I'll test whether I can reach similar results with interesting terms.
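For reference while testing, this is roughly how those knobs look in the 1.x REST query DSL; the field name and like_text are only placeholders, and the parameter values shown are the defaults mentioned further down in the thread:

```json
{
  "query": {
    "more_like_this": {
      "fields": ["file"],
      "like_text": "text extracted from the uploaded document",
      "max_query_terms": 25,
      "min_term_freq": 2,
      "min_doc_freq": 5,
      "percent_terms_to_match": 0.3
    }
  }
}
```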
Best,
Zoran

On Thursday, 8 May 2014 02:02:24 UTC-7, Alex Ksikes wrote:
> On May 8, 2014 8:09 AM, "Zoran Jeremic" <zoran....@gmail.com> wrote:
> >
> > Hi Alex,
> >
> > Thank you for this explanation. It really helped me understand how it
> > works, and I managed to get the results I was expecting once I set
> > max_query_terms to 0 or to some very high value. With these results I
> > was able to identify duplicates in my tests. I noticed a couple of
> > things, though:
> >
> > - I got much better results with web pages when I indexed the HTML
> >   source as the attachment and used the text extracted by Jsoup in
> >   the query than when I indexed the extracted text as the attachment
> >   and used that text in the query. I suppose the difference is that
> >   Jsoup does not extract text in the same way as the Tika parser used
> >   by ES does.
> >
> > - There was a significant improvement in the results in the second
> >   test, where I indexed 50 web pages, compared to the first test,
> >   where I indexed 10 web pages. I deleted the index before each test.
> >   I suppose this is related to tf*idf.
> >
> > If so, does it make sense to provide some training set for
> > elasticsearch that would be used to populate the index before the
> > system goes into use?
>
> Perhaps you are asking for a background dataset to bias the selection
> of interesting terms. This could make sense depending on your
> application.
>
> > > Could you please define "relevant" in your setting? In a corpus of
> > > very similar documents, is your goal to find the ones which are
> > > oddly different? Have you looked into ES significant terms?
> >
> > I have a service that recommends documents to students based on
> > their current learning context. It creates a tokenized string from
> > the titles, descriptions and keywords of the course lessons the
> > student is working on at the moment. I'm using this string as input
> > to mlt_like_text to find interesting resources that could help them.
> >
> > I want to avoid having duplicates (or very similar documents) among
> > the top documents that are recommended. My idea was that during
> > document upload (before I index it with elasticsearch) I would find
> > out whether its duplicate already exists, and store this information
> > in an ES document field. Later, in the query, I can specify that
> > duplicates are not recommended.
>
> > > Here you should probably strip the html tags, and solely index the
> > > text in its own field.
> >
> > As I already mentioned, this didn't give me good results for some
> > reason.
> >
> > Do you think this approach would work fine with large textual
> > documents, e.g. pdf documents of a couple of hundred pages? My main
> > concern is the performance of these queries using like_text, which
> > is why I was trying to avoid this approach and use mlt with a
> > document id as input.
>
> I don't think this approach would work well in this case, but you
> should try. I think what you are after is to either extract good
> features for your PDF documents and search on those, or
> fingerprinting. This could be achieved by playing with analyzers.
>
> > Thanks,
> > Zoran
> >
> > On Wednesday, 7 May 2014 06:14:56 UTC-7, Alex Ksikes wrote:
> >> Hi Zoran,
> >>
> >> In a nutshell, 'more like this' creates a large boolean disjunctive
> >> query of 'max_query_terms' number of interesting terms from a text
> >> specified in 'like_text'. The interesting terms are picked with
> >> respect to their tf-idf scores in the whole corpus. This selection
> >> can be tuned with the 'min_term_freq', 'min_doc_freq' and
> >> 'max_doc_freq' parameters. The number of boolean clauses that must
> >> match is controlled by 'percent_terms_to_match'. In the case of
> >> specifying only one field in 'fields', the analyzer used to pick
> >> the terms in 'like_text' is the one associated with that field,
> >> unless overridden by 'analyzer'. So as an example, the default is
> >> to create a boolean query of 25 interesting terms where only 30% of
> >> the should clauses must match.
> >>
> >> On Wednesday, May 7, 2014 5:14:11 AM UTC+2, Zoran Jeremic wrote:
> >>> Hi Alex,
> >>>
> >>> > If you are looking for exact duplicates then hashing the file
> >>> > content, and doing a search for that hash would do the job.
> >>>
> >>> This trick won't work for me, as these are not exact duplicates.
> >>> For example, I have 10 students working on the same 100-page Word
> >>> document. Each of them could change only one sentence and upload
> >>> the document. The hash will be different, but the documents are
> >>> 99.99% the same. I have another service that uses mlt_like_text to
> >>> recommend relevant documents, and my problem is that if such a
> >>> document has the best score, then all of its duplicates will be
> >>> among the top hits, and instead of recommending several of the
> >>> most relevant documents I will recommend 10 instances of the same
> >>> document.
> >>
> >> Could you please define "relevant" in your setting? In a corpus of
> >> very similar documents, is your goal to find the ones which are
> >> oddly different? Have you looked into ES significant terms?
> >>
> >>> > If you are looking for near duplicates, then I would recommend
> >>> > extracting whatever text you have in your html, pdf, doc,
> >>> > indexing that, and running more like this with like_text set to
> >>> > that content.
> >>>
> >>> I tried that as well, and the results are very disappointing,
> >>> though I'm not sure it would be a good idea anyway, bearing in
> >>> mind that long textual documents could be used. For testing
> >>> purposes, I made a simple test with 10 web pages. Maybe I'm making
> >>> some mistake there. What I did was index 10 web pages and store
> >>> each in a document as an attachment. The content is stored as
> >>> byte[]. Then I took the same 10 pages, extracted the content using
> >>> Jsoup, and tried to find similar web pages.
> >>> Here is the code that I used to find web pages similar to the
> >>> provided one:
> >>>
> >>>   System.out.println("Duplicates for link:" + link);
> >>>   System.out.println("************************************************");
> >>>   String indexName = ESIndexNames.INDEX_DOCUMENTS;
> >>>   String indexType = ESIndexTypes.DOCUMENT;
> >>>   String mapping = copyToStringFromClasspath(
> >>>       "/org/prosolo/services/indexing/document-mapping.json");
> >>>   client.admin().indices().putMapping(putMappingRequest(indexName)
> >>>       .type(indexType).source(mapping)).actionGet();
> >>>   URL url = new URL(link);
> >>>   org.jsoup.nodes.Document doc = Jsoup.connect(link).get();
> >>>   String html = doc.html(); // doc.text();
> >>>   // create the query
> >>>   QueryBuilder qb = QueryBuilders.moreLikeThisQuery("file")
> >>>       .likeText(html).minTermFreq(0).minDocFreq(0);
> >>>   SearchResponse sr = client.prepareSearch(ESIndexNames.INDEX_DOCUMENTS)
> >>>       .setQuery(qb).addFields("url", "title", "contentType")
> >>>       .setFrom(0).setSize(5).execute().actionGet();
> >>>   if (sr != null) {
> >>>       SearchHits searchHits = sr.getHits();
> >>>       Iterator<SearchHit> hitsIter = searchHits.iterator();
> >>>       while (hitsIter.hasNext()) {
> >>>           SearchHit searchHit = hitsIter.next();
> >>>           System.out.println("Duplicate:" + searchHit.getId()
> >>>               + " title:" + searchHit.getFields().get("url").getValue()
> >>>               + " score:" + searchHit.getScore());
> >>>       }
> >>>   }
> >>>
> >>> And the results of executing this for each of the 10 urls are:
> >>>
> >>> Duplicates for link:http://en.wikipedia.org/wiki/Mathematical_logic
> >>> ************************************************
> >>> Duplicate:Crwk_36bTUCEso1ambs0bA URL:http://en.wikipedia.org/wiki/Mathematical_logic score:0.3335998
> >>> Duplicate:--3l-WRuQL2osXg71ixw7A URL:http://en.wikipedia.org/wiki/Chemistry score:0.16319205
> >>> Duplicate:8dDa6HsBS12HrI0XgFVLvA URL:http://en.wikipedia.org/wiki/Formal_science score:0.13035104
> >>> Duplicate:1APeDW0KQnWRv_8mihrz4A URL:http://en.wikipedia.org/wiki/Star score:0.12292466
> >>> Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:http://en.wikipedia.org/wiki/Crystallography score:0.117023855
> >>>
> >>> Duplicates for link:http://en.wikipedia.org/wiki/Mathematical_statistics
> >>> ************************************************
> >>> Duplicate:Crwk_36bTUCEso1ambs0bA URL:http://en.wikipedia.org/wiki/Mathematical_logic score:0.1570246
> >>> Duplicate:pPJdo7TAQhWzTdMAHyPWkA URL:http://en.wikipedia.org/wiki/Mathematical_statistics score:0.1498403
> >>> Duplicate:--3l-WRuQL2osXg71ixw7A URL:http://en.wikipedia.org/wiki/Chemistry score:0.09323166
> >>> Duplicate:1APeDW0KQnWRv_8mihrz4A URL:http://en.wikipedia.org/wiki/Star score:0.09279101
> >>> Duplicate:8dDa6HsBS12HrI0XgFVLvA URL:http://en.wikipedia.org/wiki/Formal_science score:0.08606046
> >>>
> >>> Duplicates for link:http://en.wikipedia.org/wiki/Formal_science
> >>> ************************************************
> >>> Duplicate:8dDa6HsBS12HrI0XgFVLvA URL:http://en.wikipedia.org/wiki/Formal_science score:0.12439237
> >>> Duplicate:--3l-WRuQL2osXg71ixw7A URL:http://en.wikipedia.org/wiki/Chemistry score:0.11299215
> >>> Duplicate:Crwk_36bTUCEso1ambs0bA URL:http://en.wikipedia.org/wiki/Mathematical_logic score:0.107585154
> >>> Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:http://en.wikipedia.org/wiki/Crystallography score:0.07795183
> >>> Duplicate:pPJdo7TAQhWzTdMAHyPWkA URL:http://en.wikipedia.org/wiki/Mathematical_statistics score:0.076521285
> >>>
> >>> Duplicates for link:http://en.wikipedia.org/wiki/Star
> >>> ************************************************
> >>> Duplicate:1APeDW0KQnWRv_8mihrz4A URL:http://en.wikipedia.org/wiki/Star score:0.21684575
> >>> Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:http://en.wikipedia.org/wiki/Crystallography score:0.15316588
> >>> Duplicate:vFf9IdJyQ-yfPnqzYRm9Ig URL:http://en.wikipedia.org/wiki/Cosmology score:0.123572096
> >>> Duplicate:--3l-WRuQL2osXg71ixw7A URL:http://en.wikipedia.org/wiki/Chemistry score:0.1177105
> >>> Duplicate:Crwk_36bTUCEso1ambs0bA URL:http://en.wikipedia.org/wiki/Mathematical_logic score:0.11373919
> >>>
> >>> Duplicates for link:http://en.wikipedia.org/wiki/Chemistry
> >>> ************************************************
> >>> Duplicate:--3l-WRuQL2osXg71ixw7A URL:http://en.wikipedia.org/wiki/Chemistry score:0.13033955
> >>> Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:http://en.wikipedia.org/wiki/Crystallography score:0.121021904
> >>> Duplicate:8dDa6HsBS12HrI0XgFVLvA URL: [...]
> >>
> >> Here you should probably strip the html tags, and solely index the
> >> text in its own field.

--
You received this message because you are subscribed to the Google
Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/1842c492-29d8-4339-b490-2c7235535495%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
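To make the fingerprinting suggestion from the thread a bit more concrete, here is a minimal, self-contained sketch (the class name, the 3-word shingle size, and the example texts are illustrative choices, not from the discussion): each document is reduced to a set of word shingles, and two documents are compared by their Jaccard overlap, so a long document with one changed sentence still scores close to 1.0 while unrelated pages score near 0.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class NearDuplicate {

    // Reduce a text to its set of k-word shingles (k consecutive words).
    static Set<String> shingles(String text, int k) {
        String[] words = text.toLowerCase().split("\\W+");
        Set<String> result = new HashSet<>();
        for (int i = 0; i + k <= words.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(words, i, i + k)));
        }
        return result;
    }

    // Jaccard similarity of two shingle sets: |intersection| / |union|.
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        // One word changed out of nine: most shingles still overlap.
        String original = "the quick brown fox jumps over the lazy dog";
        String edited   = "the quick brown fox leaps over the lazy dog";
        double sim = jaccard(shingles(original, 3), shingles(edited, 3));
        System.out.println(sim); // well above 0.0 despite the edit
    }
}
```

In the setting discussed above, a newly uploaded document would be flagged as a duplicate before indexing whenever its Jaccard score against some existing document exceeds a chosen threshold (e.g. 0.9); for large collections one would normally compare hashed shingle sketches (MinHash/LSH) rather than full shingle sets.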