1) Diego's observation about IDF is absolutely correct here, but I don't
think he was pointing it out as a negative aspect of your new approach;
I think he just wanted to warn you about it.
The way BM25 uses the IDF of a term is to estimate how important the term
is in the context (
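For reference, recent versions of Lucene's BM25Similarity compute the per-term
idf roughly as below; this is just a sketch of the standard formula, not code
taken from this thread:

    // Sketch of the idf factor in Lucene's BM25Similarity.
    // docCount = number of documents with a value for the field,
    // docFreq  = number of documents containing the term.
    // Rare terms get a large idf, very common terms an idf close to zero.
    static double bm25Idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }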
Hi everyone.
Today I replaced all the IntPoint insertions and range queries in my code
with NumericDocValuesField ones, since I did not find a way to update the
value of an IntPoint. After these replacements, the previous test cases
still pass and the overall performance seems to be the same,
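In case it is useful to others on the thread: with doc values you can also skip
the delete-and-re-add cycle and change just the numeric value through
IndexWriter.updateNumericDocValue. A minimal sketch, with hypothetical field
names ("id", "crawlTime"):

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.NumericDocValuesField;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;

    // writer is an already-open IndexWriter over the page index
    static void markCrawled(IndexWriter writer) throws IOException {
        // index the page once, with a doc-values field that will change later
        Document doc = new Document();
        doc.add(new StringField("id", "page-42", Field.Store.YES));
        doc.add(new NumericDocValuesField("crawlTime", 0L));
        writer.addDocument(doc);

        // later: rewrite only the doc-values field, locating the doc by its "id" term
        writer.updateNumericDocValue(new Term("id", "page-42"), "crawlTime", 1517349207L);

        // range queries then go against the doc values (slower than IntPoint/LongPoint,
        // but they see the updated values); available in Lucene 7.x
        Query crawled = NumericDocValuesField.newSlowRangeQuery("crawlTime", 1L, Long.MAX_VALUE);
    }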
Hi Luigi, I don't know that part of Lucene very well; I would check blog posts
and the code to see whether you can use NumericDocValues (my gut says yes).
Also, I don't know if it is important, but please note that if you index all
the documents at the beginning, your scores will be different -
Hi,
I did not check the code, but based on earlier comments on the mailing list,
it seems that in-place updates are not what they sound like - they rewrite the
doc values for the segment that is updated. If you really want to avoid index
changes, you could maybe use an external field:
Reading from the wiki [1]:
" An atomic update operation is performed using this approach only when the
fields to be updated meet these three conditions:
are non-indexed (indexed="false"), non-stored (stored="false"), single
valued (multiValued="false") numeric docValues (docValues="true")
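In practice that means declaring the updatable field as a docValues-only numeric
field in the schema, and then sending an ordinary atomic "set" for it. A rough
SolrJ sketch, assuming a hypothetical crawl_priority field defined with
indexed="false" stored="false" docValues="true" in a core named "pages":

    import java.util.Collections;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    // Send an atomic "set" for a docValues-only numeric field, so Solr can apply it
    // as an in-place update instead of re-indexing the whole document.
    static void setPriority() throws Exception {
        SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/pages").build();
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "page-42");
        doc.addField("crawl_priority", Collections.singletonMap("set", 3.5));
        client.add(doc);
        client.commit();
        client.close();
    }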
Hi,
first of all, thank you for your answers.
@Rick: the reason is that the set of pages stored on disk represents just a
static view of the Web, so that my experiments are fully replicable. I need
to run simulations of different crawlers on top of it, each working on
Luigi
Is there a reason for not indexing all of your on-disk pages? That seems to be
the first step. But I do not understand what your goal is.
Cheers -- Rick
On January 30, 2018 1:33:27 PM EST, Luigi Caiazza wrote:
>Hello,
>
>I am working on a project that simulates a
I am not sure I fully understood your use case, but let me suggest a few
different possible solutions:
1) Query-time join approach (see the sketch below): you keep 2 collections,
one static with all the pages, one that just stores lightweight documents
containing the crawling interactions:
1) Id, content -> Pages
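To make the join idea concrete, something along these lines should work with
Lucene's JoinUtil; the field names ("pageId" in the crawl-interaction index,
"id" in the pages index) and the filter query are placeholders, and the join
field typically needs doc values:

    import java.io.IOException;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.join.JoinUtil;
    import org.apache.lucene.search.join.ScoreMode;

    // Find pages (in the static "pages" index) that have a matching crawl-interaction
    // document (in the lightweight "crawl" index).
    static TopDocs pagesSeenByCrawler(IndexSearcher crawlSearcher, IndexSearcher pagesSearcher)
            throws IOException {
        Query fromQuery = new TermQuery(new Term("crawler", "crawler-A")); // hypothetical filter
        Query joined = JoinUtil.createJoinQuery(
                "pageId",   // join field in the crawl-interaction index
                false,      // one page id per interaction document
                "id",       // join field in the pages index
                fromQuery,
                crawlSearcher,
                ScoreMode.None);
        return pagesSearcher.search(joined, 10);
    }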
Hello,
I am working on a project that simulates selective, large-scale crawling.
The system adapts its behaviour according to external user queries received
at crawling time. Briefly, it analyzes the already crawled pages that appear
in the top-k results for each query, and prioritizes the visit of