Some more observations
4. When 'mapreduce.framework.name' is set to yarn, counters are present for the
Injector job. The following is example output showing all of the Injector job
counters.
2020-12-21 12:13:58,674 INFO mapreduce.Job: Job job_1608581566698_0001
completed successfully
2020-12-21 12:13:58,760 INFO mapreduce.Job: Counters: 52
File System Counters
FILE: Number of bytes read=1456826
FILE: Number of bytes written=3699396
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1163333
HDFS: Number of bytes written=794148
HDFS: Number of read operations=15
HDFS: Number of large read operations=0
HDFS: Number of write operations=6
Job Counters
Launched map tasks=2
Launched reduce tasks=1
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=19002
Total time spent by all reduces in occupied slots (ms)=7305
Total time spent by all map tasks (ms)=6334
Total time spent by all reduce tasks (ms)=2435
Total vcore-milliseconds taken by all map tasks=6334
Total vcore-milliseconds taken by all reduce tasks=2435
Total megabyte-milliseconds taken by all map tasks=9501000
Total megabyte-milliseconds taken by all reduce tasks=3652500
Map-Reduce Framework
Map input records=22845
Map output records=22765
Map output bytes=1411290
Map output materialized bytes=1456832
Input split bytes=572
Combine input records=0
Combine output records=0
Reduce input groups=11322
Reduce shuffle bytes=1456832
Reduce input records=22765
Reduce output records=11322
Spilled Records=45530
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=114
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=1046478848
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
injector
urls_filtered=80
urls_injected=11443
urls_merged=11322
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=794148
2020-12-21 12:13:58,793 INFO crawl.Injector: Injector: Total urls rejected by
filters: 80
2020-12-21 12:13:58,793 INFO crawl.Injector: Injector: Total urls injected
after normalization and filtering: 11443
2020-12-21 12:13:58,793 INFO crawl.Injector: Injector: Total urls injected but
already in CrawlDb: 11322
2020-12-21 12:13:58,793 INFO crawl.Injector: Injector: Total new urls injected:
121
2020-12-21 12:13:58,794 INFO crawl.Injector: Injector: Total urls with status
gone removed from CrawlDb (db.update.purge.404): 0
2020-12-21 12:13:58,804 INFO crawl.Injector: Injector: finished at 2020-12-21
12:13:58, elapsed: 00:00:36
5. When 'mapreduce.framework.name' is set to 'yarn-tez' I am observing the
following runtimes
* 1st run: elapsed: 00:00:42
* 2nd run: elapsed: 00:00:13
* 3rd run: elapsed: 00:00:14
5. When 'mapreduce.framework.name' is set to 'yarn' I am observing the
following runtimes
* 1st run: elapsed: 00:00:34
* 2nd run: elapsed: 00:00:32
* 3rd run: elapsed: 00:00:34
After the first DAG run, it looks like there is a marked runtime improvement
running the Injector job on Tez. It would be great if I could somehow explain
this but my Tez knowledge is still very limited. I will keep digging though!
I'm also going to create a tutorial on the Nutch wiki covering this entire
experience. I'll maybe pull together a YouTube video or something as well.
lewismc
On 2020/12/21 20:11:59, Lewis John McGibbney <[email protected]> wrote:
> Hi László,
> Thank you for the additional explanation. Adapting my configuration based on
> your suggestions results in successful job execution as DAG's now. A huge
> thank you :)
>
> A couple of notes
> 1. I was unable to use the Tez minimal distribution. I had to use the full
> 0.10.0-SNAPSHOT due to the absence of the jetty-http-9.4.20.v20190813.jar
> dependency in minimal distribution.
> 2. Looking at the syslog for the DAG, I have observed
> java.lang.NoSuchMethodException:
> java.nio.channels.ClosedByInterruptException. The full paste can be seen at
> https://paste.apache.org/mjw0c. Is this normal behavior?
> 3. We added some useful counters to the Injector job which are printed to the
> application log. Example is as follows
>
> 2020-12-21 12:06:35,242 INFO mapreduce.Job: Job job_1608580287657_0003
> completed successfully
> 2020-12-21 12:06:35,249 INFO mapreduce.Job: Counters: 0
> 2020-12-21 12:06:35,282 INFO crawl.Injector: Injector: Total urls rejected by
> filters: 0
> 2020-12-21 12:06:35,282 INFO crawl.Injector: Injector: Total urls injected
> after normalization and filtering: 0
> 2020-12-21 12:06:35,282 INFO crawl.Injector: Injector: Total urls injected
> but already in CrawlDb: 0
> 2020-12-21 12:06:35,282 INFO crawl.Injector: Injector: Total new urls
> injected: 0
> 2020-12-21 12:06:35,282 INFO crawl.Injector: Injector: Total urls with status
> gone removed from CrawlDb (db.update.purge.404): 0
> 2020-12-21 12:06:35,293 INFO crawl.Injector: Injector: finished at 2020-12-21
> 12:06:35, elapsed: 00:00:14
>
> As you can see, apparently the counters are not correctly representing the
> relevant entries in the newly created crawl database. I can verify this by
> running the following read database job
>
> nutch readdb crawldb -stats
>
> 2020-12-21 12:08:06,558 INFO mapreduce.Job: Counters: 0
> 2020-12-21 12:08:06,586 INFO crawl.CrawlDbReader: Statistics for CrawlDb:
> crawldb
> 2020-12-21 12:08:06,586 INFO crawl.CrawlDbReader: TOTAL urls: 11322
> 2020-12-21 12:08:06,597 INFO crawl.CrawlDbReader: shortest fetch interval:
> 30 days, 00:00:00
> 2020-12-21 12:08:06,597 INFO crawl.CrawlDbReader: avg fetch interval: 30
> days, 00:00:00
> 2020-12-21 12:08:06,598 INFO crawl.CrawlDbReader: longest fetch interval:
> 30 days, 00:00:00
> 2020-12-21 12:08:06,600 INFO crawl.CrawlDbReader: earliest fetch time:
> Mon Dec 21 12:06:00 PST 2020
> 2020-12-21 12:08:06,600 INFO crawl.CrawlDbReader: avg of fetch times: Mon Dec
> 21 12:06:00 PST 2020
> 2020-12-21 12:08:06,600 INFO crawl.CrawlDbReader: latest fetch time: Mon Dec
> 21 12:06:00 PST 2020
> 2020-12-21 12:08:06,600 INFO crawl.CrawlDbReader: retry 0: 11322
> 2020-12-21 12:08:06,605 INFO crawl.CrawlDbReader: score quantile 0.01:
> 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.05:
> 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.1: 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.2: 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.25:
> 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.3: 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.4: 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.5: 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.6: 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.7: 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.75:
> 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.8: 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.9: 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.95:
> 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.99:
> 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: min score: 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: avg score: 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: max score: 1.0
> 2020-12-21 12:08:06,610 INFO crawl.CrawlDbReader: status 1 (db_unfetched):
> 11322
> 2020-12-21 12:08:06,610 INFO crawl.CrawlDbReader: CrawlDb statistics: done
>
> I'm investigating this now.
>
> Again, thank you very much for your help.
>
> On 2020/12/21 00:04:36, László Bodor <[email protected]> wrote:
> > Hi!
> >
> > This is how I made it work (hadoop 3.1.3, tez 0.10.0), attached to drive:
> > here
> > <https://drive.google.com/file/d/1eFMUPSxFpJ0p7fi7IrsI3HAACa4m5s7n/view?usp=sharing>
> >
> > 1.
> > hdfs dfs -mkdir -p /apps/tez
> > hdfs dfs -put ~/Applications/apache/tez/tez.tar.gz /apps/tez
> >
> ..
>