Re: Searching for an efficient and scalable way to filter query results using non-indexed and dynamic range values

Luigi Caiazza Thu, 01 Feb 2018 12:46:06 -0800

Hi everyone.

Today I replaced all the IntPoint insertions and range queries in my code
with their NumericDocValuesField counterparts, since I did not find a way
to update the value of an IntPoint. After these replacements, the previous
test cases still pass and the overall performance seems to be the same, so
who cares! :)
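
To make the mechanism concrete, here is a minimal plain-Java model of the
idea. It is not Lucene code: the class and method names are hypothetical,
and the array merely stands in for a per-document numeric column. In Lucene
itself the update would presumably go through
IndexWriter.updateNumericDocValue and the filter through a doc-values range
query such as NumericDocValuesField.newSlowRangeQuery.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of a per-document numeric column used as a "crawled at cycle x" flag. */
final class CrawlCycleColumn {
    static final long NOT_CRAWLED = -1L;

    // One slot per document ID; the text of the documents is never touched.
    private final long[] cycleByDoc;

    CrawlCycleColumn(int numDocs) {
        cycleByDoc = new long[numDocs];
        Arrays.fill(cycleByDoc, NOT_CRAWLED);
    }

    /** Marks a document as crawled by overwriting only its column value. */
    void markCrawled(int docId, long cycle) {
        if (cycle <= 0) throw new IllegalArgumentException("cycle must be positive");
        cycleByDoc[docId] = cycle;
    }

    /** Returns the IDs of the documents whose cycle lies in [fromCycle, toCycle]. */
    List<Integer> crawledBetween(long fromCycle, long toCycle) {
        List<Integer> hits = new ArrayList<>();
        for (int docId = 0; docId < cycleByDoc.length; docId++) {
            long cycle = cycleByDoc[docId];
            if (cycle >= fromCycle && cycle <= toCycle) hits.add(docId);
        }
        return hits;
    }

    public static void main(String[] args) {
        CrawlCycleColumn column = new CrawlCycleColumn(5);
        column.markCrawled(1, 1);
        column.markCrawled(3, 2);
        column.markCrawled(4, 7);
        System.out.println(column.crawledBetween(1, Long.MAX_VALUE)); // every crawled doc
        System.out.println(column.crawledBetween(2, 7));              // cycles 2..7 only
    }
}
```

The "positive value" condition and the cycle range query are the same
operation here: asking for cycles from 1 upwards selects exactly the
documents that have been crawled.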


@ Diego: I ran a test in which I built two indices: a static one that
stores the first 100K documents of my collection in one shot, with the
NumericDocValuesField initialized to -1, and a dynamic one that starts
empty. Then I simulated crawling the first 1K documents, which I indexed
twice: in the static index by just updating the doc value of the current
doc ID to a positive integer, and in the dynamic one by writing the
document from scratch (as I have done so far). Even with these small
numbers, updating the static index is on average 60 times faster than
writing to the dynamic index. Unfortunately, you had a point in your last
reply. I submitted a query with two common terms, "contact home", to both
indices (of course in AND with the condition of a positive doc value). The
two returned top-10 lists are different. In my project this can be
neglected at evaluation time, but it is very relevant at crawling time,
since I infer the importance of the discovered links to crawl in the next
cycles by counting how many times my collected pages appear in the top-k
for the input queries. This probably means that I have to rewrite the part
of the scoring function that computes the idf. Tomorrow I will look for a
smart way to do this, but please tell me if you already know what I have
to do.
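
As a rough illustration of why the two top-10 lists can differ, the sketch
below assumes the default BM25Similarity, whose idf is
ln(1 + (N - df + 0.5) / (df + 0.5)), where N and df are taken from the whole
index; a filter clause does not change these statistics. All the numbers,
the two documents A and B, and the simplified score (idf times a
per-document term weight) are hypothetical.

```java
/** Shows how BM25 idf, and hence a ranking, depends on index-wide statistics. */
final class IdfDemo {
    /** idf in the BM25 form: ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)). */
    static double bm25Idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** Simplified score: sum over the two terms of idf times a per-document weight. */
    static double score(double wContact, double wHome, double idfContact, double idfHome) {
        return wContact * idfContact + wHome * idfHome;
    }

    public static void main(String[] args) {
        // Hypothetical statistics: {docFreq("contact"), docFreq("home"), docCount}.
        long[] staticIndex = {50_000, 10_000, 100_000};
        long[] dynamicIndex = {100, 900, 1_000};
        for (long[] s : new long[][] {staticIndex, dynamicIndex}) {
            double idfContact = bm25Idf(s[0], s[2]);
            double idfHome = bm25Idf(s[1], s[2]);
            // Document A leans on "contact", document B leans on "home".
            double a = score(2.0, 1.0, idfContact, idfHome);
            double b = score(1.0, 2.0, idfContact, idfHome);
            System.out.printf("N=%d  A=%.3f  B=%.3f  -> %s first%n",
                    s[2], a, b, a > b ? "A" : "B");
        }
    }
}
```

With these made-up figures the same two documents swap places between the
two indices, because the rarer of the two terms is different in each.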

@ Alessandro: to be honest, I already have a worst-case implementation in
mind for that part of the problem, and I have not focused on it yet. Since
I need to distinguish the experiments only at evaluation time, where I just
need to know which pages were crawled by whom and when, I was thinking of
storing this information in an external data structure. Of course, a way to
manage this association directly in Lucene would reduce the amount of code
to write. I will try to come up with a better solution for this issue as
well, but if you have an idea that keeps the queries simple at crawling
time, which is currently my first priority, please let me know.
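
One possible shape for such an external structure, purely as a sketch: a
map from crawler to a sorted map from cycle to the URLs fetched in that
cycle. The class name, the method names and the idea of identifying a run
by a crawler name are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical external record of which crawler fetched which page at which cycle. */
final class CrawlLog {
    // crawler name -> (cycle -> URLs fetched in that cycle), cycles kept sorted.
    private final Map<String, TreeMap<Integer, List<String>>> byCrawler = new HashMap<>();

    void record(String crawler, int cycle, String url) {
        byCrawler.computeIfAbsent(crawler, k -> new TreeMap<>())
                 .computeIfAbsent(cycle, k -> new ArrayList<>())
                 .add(url);
    }

    /** URLs fetched by the given crawler from cycle i to cycle j, both inclusive. */
    List<String> fetchedBetween(String crawler, int i, int j) {
        TreeMap<Integer, List<String>> cycles = byCrawler.get(crawler);
        if (cycles == null || i > j) return Collections.emptyList();
        List<String> urls = new ArrayList<>();
        for (List<String> perCycle : cycles.subMap(i, true, j, true).values()) {
            urls.addAll(perCycle);
        }
        return urls;
    }
}
```

The sorted map answers the evaluation-time range question (cycle i to
cycle j) without touching the index, so the crawling-time queries stay
unchanged.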

Thank you again for your support.

Cheers.

2018-02-01 11:24 GMT+01:00 Diego Ceccarelli (BLOOMBERG/ LONDON) <
dceccarel...@bloomberg.net>:

> Hi Luigi, I don't know that part of Lucene well; I would check blog posts
> and the code to understand whether you can use NumericDocValues (my gut
> says yes).
>
> Also, I don't know if it is important, but please note that if you index
> all the documents at the beginning your scores will be different - since
> idf will be computed on all the documents that you have in the collection.
>
> Cheers,
> Diego
>
>
> From: solr-user@lucene.apache.org At: 01/31/18 20:12:16 To:
> solr-user@lucene.apache.org
> Subject: Re: Searching for an efficient and scalable way to filter query
> results using non-indexed and dynamic range values
>
> Hi,
>
> first of all, thank you for your answers.
>
> @ Rick: the reason is that the set of pages stored on disk represents just
> a static view of the Web, so that my experiments are fully replicable. My
> need is to run simulations of different crawlers on top of it, each
> working on those pages as if they came from the real Web. During a
> simulation, the crawler receives a set of unpredictable user queries from
> an external module. Then, it changes the visit priorities of the
> discovered but uncrawled pages according to the current top-k results for
> those queries, given the contents of the pages "crawled" so far.
> Moreover, distinct runs explore different parts of the Web graph and
> receive different user queries. That's why I need to build a separate index
> of crawled contents for each run. The observation is that, since I am
> working with a snapshot of the Web, my indexing process could be engineered
> such that all the Web pages are already stored in the index and a flag
> makes a page retrievable only if it has been crawled in the current
> experiment. In this way, I save some time that I could use to increase the
> scale of the crawling simulation, and/or to run other experiments.
>
> @ Alessandro: your approach of using a static and a dynamic index and then
> merging the results by means of query joins was what I had in mind at
> first glance. It could still do the job, but you already highlighted a
> performance limitation on the static index. Moreover, even if I store just
> the IDs and the crawling cycles, the dynamic index will still be populated
> by some millions of entries as the experiment proceeds. Atomic updates were
> another option that I investigated before asking for your help, but since
> they eventually rewrite the entire document I was hoping to find a more
> efficient solution.
>
> @ Diego: your idea of using the NumericDocValues sounds interesting.
> Probably this is the solution, but, if I get the point, a NumericDocValues
> has some features in common with the IntPoint that I am currently using in
> my index [1]. Among them: the storage of primitive data types instead of
> strings only, and the storage in a data structure other than the inverted
> index. Now I am asking: is there a chance to use the IntPoint in the same
> way?
>
> Cheers.
>
> [1]
> https://lucene.apache.org/core/7_2_1/core/org/apache/lucene/document/IntPoint.html
>
> 2018-01-31 13:45 GMT+01:00 Rick Leir <rl...@leirtech.com>:
>
> > Luigi
> > Is there a reason for not indexing all of your on-disk pages? That seems
> > to be the first step. But I do not understand what your goal is.
> > Cheers -- Rick
> >
> > On January 30, 2018 1:33:27 PM EST, Luigi Caiazza <lcaiazz...@gmail.com>
> > wrote:
> > >Hello,
> > >
> > >I am working on a project that simulates selective, large-scale
> > >crawling. The system adapts its behaviour according to some external
> > >user queries received at crawling time. Briefly, it analyzes the already
> > >crawled pages in the top-k results for each query, and prioritizes the
> > >visit of the discovered links accordingly. In a generic experiment, I
> > >measure time as the number of crawling cycles completed so far, i.e., as
> > >an integer value. Finally, I evaluate the experiment by analyzing the
> > >documents fetched over the crawling cycles. In this work I am using
> > >Lucene 7.2.1, but this should not be an issue since I just need some
> > >conceptual help.
> > >
> > >In my current implementation, an experiment starts with an empty index.
> > >When a Web page is fetched during the crawling cycle *x*, the system
> > >builds a document with the URL as a StringField, the title and the body
> > >as TextFields, and *x* as an IntPoint. When I get an external user
> > >query, I submit it to get the top-k relevant documents crawled so far.
> > >When I need to retrieve the documents indexed from cycle *i* to cycle
> > >*j*, I execute a range query over this last IntPoint field. This
> > >strategy does the job, but of course the write operations take some
> > >hours overall for a single experiment, even if I crawl just half a
> > >million Web pages.
> > >
> > >Since I am not crawling real-time data, but I am working over a static
> > >set of many billions of Web pages (whose contents are already stored on
> > >disk), I am investigating some opportunities to reduce the number of
> > >writes during an experiment. For instance, I could avoid indexing
> > >everything from scratch for each run. I would be happy to index all the
> > >static contents of my dataset (i.e., URL, title and body of a Web page)
> > >once and for all. Then, for a single experiment, I would mark a document
> > >as crawled at cycle *x* without storing this information permanently, in
> > >order both to filter out the documents that have not been crawled in the
> > >current simulation when processing the external queries, and to still
> > >perform the range queries at evaluation time. Do you have any idea on
> > >how to do that?
> > >
> > >Thank you in advance for your support.
> >
> > --
> > Sorry for being brief. Alternate email is rickleir at yahoo dot com
>
>
>
