> David Larochelle wrote:
> > Currently, the driver process periodically queries a database to get a
> > list of URLs to [crawl]. It then stores these URLs to be downloaded
> > in a complex in-memory [structure?] and pipes them to separate
> > processes that do the actual downloading.
> >
> > The problem is that the database queries are slow and block the driver
> > process.    
> 
> Your description leaves out a lot of details about what sort of data is
being
> passed back and forth between the different processing stages.

I just want to point out that it seems to me that everyone looking at this
may be excessively Perl-centric in their solutions, and that neither the
problem nor the solution need lie in that vicinity. David says quite plainly
"The problem is that the database queries are slow..." -- so let's consider
that.

If the database queries are slow, then perhaps the solution is to make
faster database queries -- or rather, to make database queries that get data
faster. It is true that there is not a lot of detail in this problem
statement, but I shall assume this is a SQL database -- not because it's
what I would do -- I would not -- but because it's what most people seem
inclined to do.

Instead of starting off with a query such as "select url from somewhere" to
return all of the URLs, start off with a limited query to rapidly get enough
results to feed the pipeline, such as "select url from somewhere limit 20".
This query will return as soon as it has those 20 URLs, which can be used to
kick off a number of download processes. If the downloads will take quite a long
time, then you might leap directly to the longer, full query; otherwise you
might next run a query with "... limit 200" or so -- grabbing as many
results as you can in about the time it will take to process the first
batch.  Once you have enough results to keep your downloading active for the
duration of the full query, you can run it.  Depending on the various
timings, this could involve a number of incremental increases, but for
reasonably asymmetric rates, only a few.
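
To make that concrete, here is a rough sketch of the ramp-up using DBI. The
table and column names ("downloads", "url"), the connection details, and the
queue_for_download() hand-off are all inventions of mine for illustration,
not anything from David's actual setup:

  #!/usr/bin/perl
  use strict;
  use warnings;
  use DBI;

  # Connection details and schema are assumptions for this sketch.
  my $dbh = DBI->connect( 'dbi:Pg:dbname=crawler', 'user', 'pass',
      { RaiseError => 1 } );

  # Stand-in for the real hand-off to a downloader process.
  sub queue_for_download { print "would download: $_[0]\n" }

  my %seen;    # URLs already handed off, so later batches skip them
  for my $limit ( 20, 200, undef ) {    # undef == the full, unlimited query
      my $sql = 'SELECT url FROM downloads';
      $sql .= " LIMIT $limit" if defined $limit;

      my $urls = $dbh->selectcol_arrayref($sql);
      for my $url (@$urls) {
          next if $seen{$url}++;    # skip anything from an earlier batch
          queue_for_download($url);
      }
  }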

You will note that these queries will include the prior results -- asking
for 200 will include the first 20 -- but you wouldn't want to repeat those.
"Obviously", you should just skip the first 20 items in the second result
and the first 200 of the third result -- assuming you're getting the data in
a consistent order, and assuming that new data has not been inserted between
queries. Determining the best approach here depends a lot on your database
schema and usage.  You could simply keep track locally of the past several
hundred URLs you've processed, and skip anything which is found in that
bucket -- perhaps storing a timestamp to simplify detecting out-of-date
entries.
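
A minimal sketch of that last idea, assuming a plain in-memory hash and an
arbitrary thirty-minute retention window (neither of which comes from the
original post):

  use strict;
  use warnings;

  my %recently_processed;    # url => epoch seconds when it was handed off

  sub mark_processed    { $recently_processed{ $_[0] } = time }
  sub already_processed { exists $recently_processed{ $_[0] } }

  # Drop entries older than the retention window so the bucket stays small.
  sub evict_stale {
      my $cutoff = time - 30 * 60;    # arbitrary 30-minute window
      delete @recently_processed{
          grep { $recently_processed{$_} < $cutoff } keys %recently_processed
      };
  }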

This approach assumes that your query is not inherently inefficient -- if
your query is something like "select url from something order by
some_non-indexed_column", then limiting the results to the top 20 will not
help, because the ordering requires sorting the entire data set prior to
providing any results.  If that's what you're doing, then stop
doing that. (Ensure the columns you filter and sort on are indexed, don't
order the results, don't use complex functions in subordinate clauses, etc...)
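
For example, reusing the $dbh handle and the hypothetical "downloads" table
from the sketch above, the difference is between a query that must sort
everything before returning anything and one that can hand back rows
immediately:

  # Bad: the ORDER BY on an unindexed column forces a full sort before the
  # first row comes back, so the LIMIT buys you nothing.
  my $slow = $dbh->selectcol_arrayref(
      'SELECT url FROM downloads ORDER BY some_unindexed_column LIMIT 20');

  # Better: no ordering, so the database can return the first 20 rows it
  # finds.  If you genuinely need an ordering, index that column first,
  # e.g. CREATE INDEX downloads_priority_idx ON downloads (priority).
  my $fast = $dbh->selectcol_arrayref('SELECT url FROM downloads LIMIT 20');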

In terms of the Perl aspects... if you really had _large_ data structures,
you would not be using Perl data structures -- they're simply too
inefficient.  You would be storing your data in a file with a section of
fixed-width records and, optionally, a section of offset-linked blocks to
hold variable-width data, and you would mmap that data and overlay a data
structure layout on the records to access data elements.  That is, you'd
implement something like a file-system. (While Perl gained the ability to
use mmap on some systems a while ago, it is problematic regarding the
_efficient_ use of such techniques, somewhat defeating the purpose.) The
Perl-friendliest approach akin to this would be to store data within an
existing file-system, and to break it up into units which optimize both
access and utility -- don't store just one URL per file, but also don't
store all the URLs in a single file.  Store enough URLs in any given file
that the time it takes to read the file into Perl data is negligible
compared to the time it will take to download those URLs. This can also be
used to facilitate a crash-recovery strategy, making a single file of URLs
the transactional unit of work to be restarted on failure.
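
Here is a sketch of that batching; the batch size, spool directory, and
file-naming scheme are arbitrary choices for illustration:

  use strict;
  use warnings;
  use File::Spec;

  my $batch_size = 500;          # tune so reading is cheap vs. downloading
  my $spool_dir  = 'url_spool';  # hypothetical spool directory

  # Write a list of URLs out as numbered batch files.
  sub write_batches {
      my (@urls) = @_;
      my $batch_no = 0;
      while ( my @batch = splice @urls, 0, $batch_size ) {
          my $file = File::Spec->catfile( $spool_dir,
              sprintf( 'batch-%06d.todo', $batch_no++ ) );
          open my $fh, '>', $file or die "Cannot write $file: $!";
          print {$fh} "$_\n" for @batch;
          close $fh or die "Cannot close $file: $!";
      }
  }

  # Read one batch file back in as the downloader's unit of work.
  sub read_batch {
      my ($file) = @_;
      open my $fh, '<', $file or die "Cannot read $file: $!";
      chomp( my @urls = <$fh> );
      return @urls;
  }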

Also remember that writing data is still much slower than reading it, so it
may be much more efficient for you to write results as a second, smaller
file which you keep adjacent to the first one, rather than updating and then
re-persisting any data structure containing the URLs.  This would generally
be true if any results were significantly smaller than the input URL data --
and in the trivial case where the result is simply "completion", there is no
need to write any data -- simply rename the file to indicate the change in
status.
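
Continuing the same hypothetical file layout, completion can be recorded
with a bare rename, and small results can go into a companion file:

  use strict;
  use warnings;

  # Mark a batch as finished by renaming it: batch-000001.todo -> .done
  sub mark_batch_done {
      my ($file) = @_;
      ( my $done = $file ) =~ s/\.todo$/.done/;
      rename $file, $done or die "Cannot rename $file to $done: $!";
  }

  # Or write a small companion results file next to the batch file.
  sub write_batch_results {
      my ( $file, $results ) = @_;    # $results: hashref of url => status
      open my $fh, '>', "$file.results"
          or die "Cannot write $file.results: $!";
      print {$fh} "$_\t$results->{$_}\n" for sort keys %$results;
      close $fh or die "Cannot close $file.results: $!";
  }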

One could rant on, but I'll stop here and broadly say that when looking at
issues of performance, it is necessary to consider the entire system -- both
to find inefficient implementations and to find inherent inefficiencies in
the problem itself -- and to determine if the problem
itself can be expressed in a more machine-tractable formulation.



