Re: Searching for an efficient and scalable way to filter query results using non-indexed and dynamic range values

2018-02-02 Thread Alessandro Benedetti
1) Diego's observation about IDF is absolutely correct here, but I don't think he was pointing it out as a negative aspect of your new approach. I think he just wanted to warn you about it. BM25 uses the IDF of a term to estimate how important the term is in the context (
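
To make the IDF point concrete, here is a minimal sketch, assuming the idf formula used by Lucene's BM25Similarity, ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)). The class name, method name and the document counts are hypothetical and only serve the illustration; in a real index docFreq would also change when more documents are added.

```java
// Sketch only: reproduces the BM25 idf formula for illustration.
// Bm25Idf and idf(...) are hypothetical names, not Lucene API.
final class Bm25Idf {

    // docFreq:  number of documents containing the term
    // docCount: number of documents in the index (for that field)
    static double idf(long docFreq, long docCount) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        long docFreq = 50; // assumed value, kept fixed to isolate the effect of docCount

        // Same term statistics, two different index sizes (made-up numbers).
        System.out.println("idf, 1,000 docs indexed:     " + idf(docFreq, 1_000));
        System.out.println("idf, 1,000,000 docs indexed: " + idf(docFreq, 1_000_000));
    }
}
```

With docFreq held fixed, the idf grows as docCount grows, so the same query can score the same page differently depending on how many pages are already in the index.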

Re: Searching for an efficient and scalable way to filter query results using non-indexed and dynamic range values

2018-02-01 Thread Luigi Caiazza
Hi everyone. Today I replaced all the IntPoint insertions and range queries in my code with their NumericDocValuesField counterparts, since I did not find a way to update the value of an IntPoint. After these replacements, the previous test cases still work and the overall performance seems to be the same,

Re: Searching for an efficient and scalable way to filter query results using non-indexed and dynamic range values

2018-02-01 Thread Diego Ceccarelli (BLOOMBERG/ LONDON)
Hi Luigi, I don't know that part of Lucene well; I would check blog posts and the code to understand whether you can use NumericDocValues (my gut says yes). Also, I don't know if it is important, but please note that if you index all the documents at the beginning your scores will be different -

Re: Searching for an efficient and scalable way to filter query results using non-indexed and dynamic range values

2018-02-01 Thread Emir Arnautović
Hi, I did not check it in the code, but based on earlier comments on the mailing list, it seems that in-place updates are not quite what the name suggests: they rewrite the doc values for the segment that is updated. If you really want to avoid index changes, you can maybe use external field:

Re: Searching for an efficient and scalable way to filter query results using non-indexed and dynamic range values

2018-02-01 Thread Alessandro Benedetti
Reading from the wiki [1]: "An atomic update operation is performed using this approach only when the fields to be updated meet these three conditions: are non-indexed (indexed="false"), non-stored (stored="false"), single valued (multiValued="false") numeric docValues (docValues="true")
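
The conditions quoted above can be sketched as a simple check. This is only an illustration of the part of the quote visible here (the quote is cut off, so there may be further requirements); the class and method names are hypothetical and are not Solr API.

```java
// Sketch only: encodes the field conditions visible in the quoted wiki text.
// InPlaceUpdateCheck and meetsQuotedConditions are hypothetical names.
final class InPlaceUpdateCheck {

    static boolean meetsQuotedConditions(boolean indexed,
                                         boolean stored,
                                         boolean multiValued,
                                         boolean numericDocValues) {
        // non-indexed, non-stored, single valued, numeric docValues
        return !indexed && !stored && !multiValued && numericDocValues;
    }

    public static void main(String[] args) {
        // A docValues-only numeric field passes the check.
        System.out.println(meetsQuotedConditions(false, false, false, true));
        // The same field with indexed="true" does not.
        System.out.println(meetsQuotedConditions(true, false, false, true));
    }
}
```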

Re: Searching for an efficient and scalable way to filter query results using non-indexed and dynamic range values

2018-01-31 Thread Luigi Caiazza
Hi, first of all, thank you for your answers. @Rick: the reason is that the set of pages stored on disk represents just a static view of the Web, so that my experiments are fully replicable. My need is to run simulations of different crawlers on top of it, each working on

Re: Searching for an efficient and scalable way to filter query results using non-indexed and dynamic range values

2018-01-31 Thread Rick Leir
Luigi, is there a reason for not indexing all of your on-disk pages? That seems to be the first step. But I do not understand what your goal is. Cheers -- Rick On January 30, 2018 1:33:27 PM EST, Luigi Caiazza wrote: >Hello, > >I am working on a project that simulates a

Re: Searching for an efficient and scalable way to filter query results using non-indexed and dynamic range values

2018-01-31 Thread Alessandro Benedetti
I am not sure I fully understood your use case, but let me suggest a few possible solutions: 1) Query-time join approach: you keep 2 collections, one static with all the pages, and one that just stores lightweight documents containing the crawling interaction: 1) Id, content -> Pages

Searching for an efficient and scalable way to filter query results using non-indexed and dynamic range values

2018-01-30 Thread Luigi Caiazza
Hello, I am working on a project that simulates selective, large-scale crawling. The system adapts its behaviour according to external user queries received at crawling time. Briefly, it analyzes the already crawled pages in the top-k results for each query, and prioritizes the visit of