On May 8, 2014 8:09 AM, "Zoran Jeremic" <zoran.jere...@gmail.com> wrote:
>
> Hi Alex,
>
> Thank you for this explanation. It really helped me understand how this
works, and I managed to get the results I was expecting after setting
max_query_terms to 0 or some very high value. With these results I was able
to identify duplicates in my tests. I noticed a couple of things, though.
>
> - I got much better results with web pages when I indexed the attachment
as HTML source and used the text extracted by Jsoup in the query, than when
I indexed the text extracted from the web page as the attachment and used
that text in the query. I suppose the difference is that Jsoup does not
extract text the same way as the Tika parser used by ES does.
> - There was a significant improvement in the results in the second test,
when I indexed 50 web pages, compared to the first test, when I indexed 10
web pages. I deleted the index before each test. I suppose this is related
to tf*idf.
> If so, does it make sense to provide some training set for elasticsearch
that would be used to populate the index before the system goes into use?

Perhaps you are asking for a background dataset to bias the selection of
interesting terms. This could make sense depending on your application.
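For example, you could bulk index a representative background corpus before
going live, so that document frequencies (the idf part of tf-idf) are stable
from the start. A rough, untested sketch with the Java bulk API; the
index/type names and the backgroundCorpus collection are hypothetical:

import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;

// Pre-populate the index so term statistics are meaningful before real use.
BulkRequestBuilder bulk = client.prepareBulk();
for (WebPage page : backgroundCorpus) { // hypothetical background documents
    bulk.add(client.prepareIndex("documents", "document")
            .setSource("url", page.getUrl(), "file", page.getText()));
}
BulkResponse response = bulk.execute().actionGet();
if (response.hasFailures()) {
    System.err.println(response.buildFailureMessage());
}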

> Could you please define "relevant" in your setting? In a corpus of very
similar documents, is your goal to find the ones which are oddly different?
Have you looked into ES significant terms?
> I have a service that recommends documents to students based on their
current learning context. It creates a tokenized string from the titles,
descriptions and keywords of the course lessons the student is working on at
the moment. I'm using this string as input to mlt_like_text to find
interesting resources that could help them.
> I want to avoid having duplicates (or very similar documents) among the
top recommended documents.
> My idea was that during document upload (before I index it with
elasticsearch) I check whether a duplicate of it already exists, and store
this information as an ES document field. Later, in the query, I can specify
that duplicates are not recommended.
>
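That could work. Here is a rough, untested sketch of the upload-time check,
assuming the extracted text is indexed in the 'file' field as in your code,
and using a hypothetical boolean 'duplicate' field:

// At upload time: look for near-duplicates of the new document's text.
SearchResponse sr = client.prepareSearch("documents")
        .setQuery(QueryBuilders.moreLikeThisQuery("file")
                .likeText(extractedText))
        .setSize(1).execute().actionGet();
boolean isDuplicate = sr.getHits().getTotalHits() > 0
        && sr.getHits().getAt(0).getScore() > 0.9f; // arbitrary threshold

// Store the flag on the document itself.
client.prepareIndex("documents", "document")
        .setSource("file", extractedText, "duplicate", isDuplicate)
        .execute().actionGet();

// At recommendation time: filter flagged duplicates out.
QueryBuilder recommend = QueryBuilders.filteredQuery(
        QueryBuilders.moreLikeThisQuery("file").likeText(contextString),
        FilterBuilders.termFilter("duplicate", false));

Note that raw MLT scores are not normalized, so an absolute threshold like
0.9 is fragile; comparing the top score to the score the document gets when
matched against itself is usually more robust.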
> Here you should probably strip the HTML tags, and index only the text in
its own field.
> As I already mentioned, this didn't give me good results for some reason.
>
> Do you think this approach would work well with large textual documents,
e.g. PDF documents of a couple of hundred pages? My main concern is the
performance of these queries using like_text, which is why I was trying to
avoid this approach and use mlt with a document id as input.

I don't think this approach would work well in this case, but you should
try. I think what you are after is either extracting good features from your
PDF documents and searching on those, or fingerprinting. This could be
achieved by playing with analyzers.
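For instance, a shingle-based field makes near-duplicate documents share
many multi-word terms. A minimal sketch; the analyzer and field names are my
own, and the settings are only a starting point:

// Create an index with a word-shingle analyzer and a fingerprint field.
String settings = "{\"analysis\":{\"analyzer\":{\"shingler\":{"
        + "\"tokenizer\":\"standard\","
        + "\"filter\":[\"lowercase\",\"shingle\"]}}}}";
String fpMapping = "{\"document\":{\"properties\":{"
        + "\"fingerprint\":{\"type\":\"string\",\"analyzer\":\"shingler\"}}}}";
client.admin().indices().prepareCreate("documents")
        .setSettings(settings)
        .addMapping("document", fpMapping)
        .execute().actionGet();

Running more like this against such a field rewards documents that share
whole phrases rather than isolated words, which is much closer to
near-duplicate detection.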

> Thanks,
> Zoran
>
>
>
> On Wednesday, 7 May 2014 06:14:56 UTC-7, Alex Ksikes wrote:
>>
>> Hi Zoran,
>>
>> In a nutshell, 'more like this' creates a large boolean disjunctive
query of at most 'max_query_terms' interesting terms from the text specified
in 'like_text'. The interesting terms are picked according to their tf-idf
scores in the whole corpus. This selection can be tuned with the
'min_term_freq', 'min_doc_freq', and 'max_doc_freq' parameters. The number
of boolean clauses that must match is controlled by
'percent_terms_to_match'. If only one field is specified in 'fields', the
analyzer used to pick the terms in 'like_text' is the one associated with
that field, unless overridden with 'analyzer'. So, as an example, the
default is to create a boolean query of 25 interesting terms where only 30%
of the should clauses must match.
>>
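In the Java API those defaults translate to something like this (assuming
the 'file' field from your code; the values shown are the documented
defaults):

// 25 interesting terms, of which 30% of the should clauses must match.
QueryBuilder qb = QueryBuilders.moreLikeThisQuery("file")
        .likeText(text)
        .maxQueryTerms(25)
        .percentTermsToMatch(0.3f)
        .minTermFreq(2)  // ignore terms appearing less often in the input
        .minDocFreq(5);  // ignore terms appearing in fewer documents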
>> On Wednesday, May 7, 2014 5:14:11 AM UTC+2, Zoran Jeremic wrote:
>>>
>>> Hi Alex,
>>>
>>>
>>> If you are looking for exact duplicates then hashing the file content,
and doing a search for that hash would do the job.
>>> This trick won't work for me, as these are not exact duplicates. For
example, I have 10 students working on the same 100-page Word document. Each
of them could change only one sentence and upload the document. The hash
will be different, but the documents are 99.99% the same.
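For reference, the exact-duplicate trick I had in mind is just this; SHA-256
and the 'content_hash' field name are my choices, and the field should be
mapped not_analyzed:

import java.security.MessageDigest;

// Hash the raw uploaded bytes and store the digest with the document.
MessageDigest md = MessageDigest.getInstance("SHA-256");
StringBuilder hex = new StringBuilder();
for (byte b : md.digest(fileContent)) {
    hex.append(String.format("%02x", b));
}
String hash = hex.toString();

// An exact duplicate is then a cheap term lookup on the stored hash.
SearchResponse sr = client.prepareSearch("documents")
        .setQuery(QueryBuilders.termQuery("content_hash", hash))
        .execute().actionGet();

But as you say, this only catches byte-identical files, not near-duplicates.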
>>> I have another service that uses mlt_like_text to recommend relevant
documents, and my problem is that if this document gets the best score, all
of its duplicates will be among the top hits, and instead of recommending
several of the most relevant documents I will recommend 10 instances of the
same document.
>>
>>
>> Could you please define "relevant" in your setting? In a corpus of very
similar documents, is your goal to find the ones which are oddly different?
Have you looked into ES significant terms?
>>
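If you want to try significant terms, here is a minimal aggregation sketch
(ES 1.1+; the 'file' field and the example query are placeholders):

import org.elasticsearch.search.aggregations.AggregationBuilders;

// Surface terms that are unusually frequent in the matching documents
// compared to the corpus as a whole.
SearchResponse sr = client.prepareSearch("documents")
        .setQuery(QueryBuilders.matchQuery("file", "mathematical logic"))
        .addAggregation(AggregationBuilders.significantTerms("sig")
                .field("file"))
        .setSize(0)
        .execute().actionGet();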
>>>
>>> If you are looking for near duplicates, then I would recommend
extracting whatever text you have in your html, pdf, doc, indexing that and
running more like this with like_text set to that content.
>>> I tried that as well, and the results were very disappointing, though
I'm not sure it would be a good idea anyway, bearing in mind that long
textual documents could be used. For testing purposes, I made a simple test
with 10 web pages. Maybe I'm making some mistake there. What I did was index
10 web pages and store the content in the document as an attachment. The
content is stored as byte[]. Then I took the same 10 pages, extracted the
content using Jsoup, and tried to find similar web pages. Here is the code I
used to find web pages similar to the provided one:
>>> System.out.println("Duplicates for link:" + link);
>>> System.out.println("************************************************");
>>> String indexName = ESIndexNames.INDEX_DOCUMENTS;
>>> String indexType = ESIndexTypes.DOCUMENT;
>>> String mapping = copyToStringFromClasspath(
>>>         "/org/prosolo/services/indexing/document-mapping.json");
>>> client.admin().indices().putMapping(putMappingRequest(indexName)
>>>         .type(indexType).source(mapping)).actionGet();
>>> // fetch the page and use its full HTML source as the mlt input
>>> org.jsoup.nodes.Document doc = Jsoup.connect(link).get();
>>> String html = doc.html(); // doc.text();
>>> // create the more-like-this query
>>> QueryBuilder qb = QueryBuilders.moreLikeThisQuery("file")
>>>         .likeText(html).minTermFreq(0).minDocFreq(0);
>>> SearchResponse sr = client.prepareSearch(ESIndexNames.INDEX_DOCUMENTS)
>>>         .setQuery(qb).addFields("url", "title", "contentType")
>>>         .setFrom(0).setSize(5).execute().actionGet();
>>> if (sr != null) {
>>>     SearchHits searchHits = sr.getHits();
>>>     Iterator<SearchHit> hitsIter = searchHits.iterator();
>>>     while (hitsIter.hasNext()) {
>>>         SearchHit searchHit = hitsIter.next();
>>>         System.out.println("Duplicate:" + searchHit.getId()
>>>                 + " URL:" + searchHit.getFields().get("url").getValue()
>>>                 + " score:" + searchHit.getScore());
>>>     }
>>> }
>>>
>>> And the results of executing this for each of the 10 URLs were:
>>>
>>> Duplicates for link:http://en.wikipedia.org/wiki/Mathematical_logic
>>> ************************************************
>>> Duplicate:Crwk_36bTUCEso1ambs0bA URL:
http://en.wikipedia.org/wiki/Mathematical_logic score:0.3335998
>>> Duplicate:--3l-WRuQL2osXg71ixw7A URL:
http://en.wikipedia.org/wiki/Chemistry score:0.16319205
>>> Duplicate:8dDa6HsBS12HrI0XgFVLvA URL:
http://en.wikipedia.org/wiki/Formal_science score:0.13035104
>>> Duplicate:1APeDW0KQnWRv_8mihrz4A URL:
http://en.wikipedia.org/wiki/Star score:0.12292466
>>> Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:
http://en.wikipedia.org/wiki/Crystallography score:0.117023855
>>>
>>> Duplicates for link:http://en.wikipedia.org/wiki/Mathematical_statistics
>>> ************************************************
>>> Duplicate:Crwk_36bTUCEso1ambs0bA URL:
http://en.wikipedia.org/wiki/Mathematical_logic score:0.1570246
>>> Duplicate:pPJdo7TAQhWzTdMAHyPWkA URL:
http://en.wikipedia.org/wiki/Mathematical_statistics score:0.1498403
>>> Duplicate:--3l-WRuQL2osXg71ixw7A URL:
http://en.wikipedia.org/wiki/Chemistry score:0.09323166
>>> Duplicate:1APeDW0KQnWRv_8mihrz4A URL:
http://en.wikipedia.org/wiki/Star score:0.09279101
>>> Duplicate:8dDa6HsBS12HrI0XgFVLvA URL:
http://en.wikipedia.org/wiki/Formal_science score:0.08606046
>>>
>>> Duplicates for link:http://en.wikipedia.org/wiki/Formal_science
>>> ************************************************
>>> Duplicate:8dDa6HsBS12HrI0XgFVLvA URL:
http://en.wikipedia.org/wiki/Formal_science score:0.12439237
>>> Duplicate:--3l-WRuQL2osXg71ixw7A URL:
http://en.wikipedia.org/wiki/Chemistry score:0.11299215
>>> Duplicate:Crwk_36bTUCEso1ambs0bA URL:
http://en.wikipedia.org/wiki/Mathematical_logic score:0.107585154
>>> Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:
http://en.wikipedia.org/wiki/Crystallography score:0.07795183
>>> Duplicate:pPJdo7TAQhWzTdMAHyPWkA URL:
http://en.wikipedia.org/wiki/Mathematical_statistics score:0.076521285
>>>
>>> Duplicates for link:http://en.wikipedia.org/wiki/Star
>>> ************************************************
>>> Duplicate:1APeDW0KQnWRv_8mihrz4A URL:
http://en.wikipedia.org/wiki/Star score:0.21684575
>>> Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:
http://en.wikipedia.org/wiki/Crystallography score:0.15316588
>>> Duplicate:vFf9IdJyQ-yfPnqzYRm9Ig URL:
http://en.wikipedia.org/wiki/Cosmology score:0.123572096
>>> Duplicate:--3l-WRuQL2osXg71ixw7A URL:
http://en.wikipedia.org/wiki/Chemistry score:0.1177105
>>> Duplicate:Crwk_36bTUCEso1ambs0bA URL:
http://en.wikipedia.org/wiki/Mathematical_logic score:0.11373919
>>>
>>> Duplicates for link:http://en.wikipedia.org/wiki/Chemistry
>>> ************************************************
>>> Duplicate:--3l-WRuQL2osXg71ixw7A URL:
http://en.wikipedia.org/wiki/Chemistry score:0.13033955
>>> Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:
http://en.wikipedia.org/wiki/Crystallography score:0.121021904
>>> Duplicate:8dDa6HsBS12HrI0XgFVLvA URL:
http://en.wikipedia.org/wiki/Formal_science
>>
>>
>> Here you should probably strip the HTML tags, and index only the text in
its own field.
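Concretely, at indexing time, something like this (field names are my own):

// Extract the plain text with Jsoup and index it into its own analyzed
// field, instead of indexing the raw HTML source as an attachment.
org.jsoup.nodes.Document doc = Jsoup.connect(link).get();
client.prepareIndex("documents", "document")
        .setSource("url", link, "content", doc.text())
        .execute().actionGet();

Then run more like this against the 'content' field only.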
>
