Hi Sebastian, thanks for the response.

The numbers I gave were for a single reduce task, not a whole job. I'll try to 
give a better picture.

crawldb/current holds 161.4 GB of data, covering about 1.6 billion URLs. I don't
know how many hosts or domains that spans, but I assume it is many millions.

The cluster currently has 6 worker nodes, each with 32 GB RAM, a 1 TB SSD and a
4 TB HDD. The crawldb, linkdb and OS are on SSD; the other HDFS directories are
on HDD.

The generate jobs are given 16 reduces and a split size of 512 MB, which yields
about 330 maps.
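
As a sanity check on that map count:

    161.4 GB / 512 MB per split ≈ 323 splits

which lines up with the ~330 maps observed.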


The per-reduce-task spill count is about 100 million records. For the whole job,
there are about 3.1 billion spilled records, half attributed to maps and half to
reduces. While the job totals about 1.6 billion map input records and map output
records, there are just 6.5 million reduce input records.
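
If I'm reading the counters right, that works out to roughly two spills per
record:

    3.1e9 spilled records / 1.6e9 map output records ≈ 2
    (about one map-side write plus one reduce-side merge write per record)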

Does the number of spilled records make sense for a job this size, or is it
something I should try to decrease? The reason I ask: my dedicated Nutch
cluster is disk-bound much of the time, and I'm wondering whether there are
ways to do more in memory to save on disk IO.
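
For reference, the spill-related knobs I've been looking at are the map-side
sort buffer settings. The values below are only illustrative guesses on my
part, not tested settings:

    # Illustrative only: a larger sort buffer and merge factor let each task
    # hold more records in memory before spilling sorted runs to disk.
    bin/nutch generate \
      -D mapreduce.task.io.sort.mb=512 \
      -D mapreduce.task.io.sort.factor=64 \
      ... (rest of the usual generate arguments)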

On Friday, April 13, 2018, 1:19:53 AM PDT, Sebastian Nagel <> wrote: 

Hi Michael,

> reducer spills a lot of records

The job counter "Spilled Records" is not for the reducers alone.

> 255K input records

Does your CrawlDb only contain 250,000 entries?

Also, how many hosts (resp. domains/IPs, depending on partition.url.mode)
are in the CrawlDb? Note: the counts per host/domain/IP are kept in a
HashMap, which does not scale to hundreds of millions of hosts.
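
A rough back-of-envelope (assuming on the order of 150 bytes per HashMap
entry for the host string plus JVM object overhead, which is only a guess):

    10 million hosts  x ~150 bytes ≈ 1.5 GB
    100 million hosts x ~150 bytes ≈ 15 GB

So a CrawlDb spanning tens of millions of hosts already eats a large share
of a reducer heap.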

> 100 million spilled records
> 13G file bytes written

With these numbers, my estimate would be tens or hundreds of millions of
CrawlDb items. Something is wrong if the CrawlDb is really that small.

> -D generate.update.crawldb=true

That's expensive if your CrawlDb is large.

> Do I have to increase  mapreduce.reduce.java.opts and 
> mapreduce.reduce.memory.mb?

If it's about a large number of hosts, this might help.
But you could also try to make sure that all data (HDFS and temporary) is on
SSDs, and try different compression settings (CrawlDb and temporary data), see:
  mapreduce.output.fileoutputformat.compress.codec
  mapreduce.map.output.compress
  mapreduce.map.output.compress.codec
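
For example, a starting point might look like this (Snappy is chosen here
only as a common fast codec; use whichever codec your cluster has native
libraries for):

    # Illustrative only: compress intermediate map output and the job output.
    bin/nutch generate \
      -D mapreduce.map.output.compress=true \
      -D mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
      -D mapreduce.output.fileoutputformat.compress=true \
      -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
      ... (your other options)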

Best,
Sebastian


On 04/13/2018 02:52 AM, Michael Coffey wrote:
> Greetings Nutchlings,
> I would like to make my generate jobs go faster, and I see that the reducer 
> spills a lot of records.
> Here are the numbers for a typical long-running reduce task of the
> generate-select job: 100 million spilled records, 255K input records, 90K
> output records, 13G file bytes written, only 3G committed heap usage.
> mapreduce.reduce.java.opts is 8000M, mapreduce.reduce.memory.mb is 12000.
> Do I have to increase  mapreduce.reduce.java.opts and 
> mapreduce.reduce.memory.mb? If so, how can I compute how big they should be? 
> Also, are there other settings changes needed?
> My actual command line is:
>
> apache-nutch-1.12/runtime/deploy/bin/nutch generate \
>   -D mapreduce.job.reduces=16 \
>   -D mapreduce.input.fileinputformat.split.minsize=536870912 \
>   -D mapreduce.reduce.memory.mb=12000 \
>   -D mapreduce.reduce.java.opts=-Xmx8000m \
>   -D db.fetch.interval.default=5184000 \
>   -D db.fetch.schedule.adaptive.min_interval=3888000 \
>   -D generate.update.crawldb=true \
>   -D generate.max.count=25 \
>   /crawls/popular/data/crawldb /crawls/popular/data/segments/ \
>   -topN 60000 -numFetchers 2 -noFilter -maxNumSegments 24