Hi Srini,

> I am referring to the INFO messages that are printed to the console when
> Nutch 1.14 is running in distributed mode. For example

Afaics, the only way to get the logs of the job client is to redirect the 
console output to a file,
e.g.,

/mnt/nutch/runtime/deploy/bin/nutch inject /user/hadoop/crawlDIR/crawldb 
seed.txt &>inject.log
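
If you want to keep watching the output while it is captured, the same
command can be piped through tee instead (a minimal sketch, using the same
paths as above and a log file name of your choosing):

  /mnt/nutch/runtime/deploy/bin/nutch inject /user/hadoop/crawlDIR/crawldb \
    seed.txt 2>&1 | tee inject.log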

> I am running nutch from an EMR cluster.

If you're interested in the logs of task attempts, see:

http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-view-web-log-files.html
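
If the cluster was created with an S3 log URI, the container logs are also
copied to S3 after a few minutes. A rough sketch with the AWS CLI, assuming a
hypothetical bucket/prefix and cluster id, and using the application id that
corresponds to job_1500749038440_0003 from your output:

  aws s3 ls s3://my-log-bucket/logs/j-XXXXXXXXXXXXX/containers/
  aws s3 cp --recursive \
    s3://my-log-bucket/logs/j-XXXXXXXXXXXXX/containers/application_1500749038440_0003/ \
    ./application_1500749038440_0003-logs/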


Sebastian

On 07/29/2017 09:38 AM, Srinivasan Ramaswamy wrote:
> Hi Sebastian
> 
> I am referring to the INFO messages that are printed to the console when
> Nutch 1.14 is running in distributed mode. For example
> 
> Injecting seed URLs
> /mnt/nutch/runtime/deploy/bin/nutch inject /user/hadoop/crawlDIR/crawldb
> seed.txt
> 17/07/29 06:51:18 INFO crawl.Injector: Injector: starting at 2017-07-29
> 06:51:18
> 17/07/29 06:51:18 INFO crawl.Injector: Injector: crawlDb:
> /user/hadoop/crawlDIR/crawldb
> 17/07/29 06:51:18 INFO crawl.Injector: Injector: urlDir: seed.txt
> 17/07/29 06:51:18 INFO crawl.Injector: Injector: Converting injected urls
> to crawl db entries.
> 17/07/29 06:51:19 INFO client.RMProxy: Connecting to ResourceManager at
> ip-*-*-*-*.ec2.internal/*.*.*.*:8032
> 17/07/29 06:51:20 INFO input.FileInputFormat: Total input paths to process
> : 0
> 17/07/29 06:51:20 INFO input.FileInputFormat: Total input paths to process
> : 1
> .
> .
> 17/07/29 06:51:20 INFO mapreduce.Job: Running job: job_1500749038440_0003
> 17/07/29 06:51:28 INFO mapreduce.Job: Job job_1500749038440_0003 running in
> uber mode : false
> 17/07/29 06:51:28 INFO mapreduce.Job:  map 0% reduce 0%
> 17/07/29 06:51:33 INFO mapreduce.Job:  map 100% reduce 0%
> 17/07/29 06:51:38 INFO mapreduce.Job:  map 100% reduce 4%
> 17/07/29 06:51:40 INFO mapreduce.Job:  map 100% reduce 6%
> 17/07/29 06:51:41 INFO mapreduce.Job:  map 100% reduce 49%
> 17/07/29 06:51:42 INFO mapreduce.Job:  map 100% reduce 66%
> 17/07/29 06:51:43 INFO mapreduce.Job:  map 100% reduce 87%
> 17/07/29 06:51:44 INFO mapreduce.Job:  map 100% reduce 100%
> 
> I am running nutch from an EMR cluster. I checked the log directories and
> I don't see the messages I see in the console anywhere else.
> 
> One more thing I noticed: when I issue the command
> 
> *ps -ef | grep nutch*
> 
> hadoop    21616  18344  2 06:59 pts/1    00:00:09
> /usr/lib/jvm/java-1.8.0-openjdk.x86_64/bin/java -Xmx1000m -server
> -XX:OnOutOfMemoryError=kill -9 %p *-Dhadoop.log.dir=/usr/lib/hadoop/logs*
> *-Dhadoop.log.file=hadoop.log* -Dhadoop.home.dir=/usr/lib/hadoop
> -Dhadoop.id.str= *-Dhadoop.root.logger=INFO,console*
> -Djava.library.path=:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native
> -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true
> -Dhadoop.security.logger=INFO,NullAppender -Dsun.net.inetaddr.ttl=30
> org.apache.hadoop.util.RunJar
> /mnt/nutch/runtime/deploy/apache-nutch-1.14-SNAPSHOT.job
> org.apache.nutch.fetcher.Fetcher -D mapreduce.map.java.opts=-Xmx2304m -D
> mapreduce.map.memory.mb=2880 -D mapreduce.reduce.java.opts=-Xmx4608m -D
> mapreduce.reduce.memory.mb=5760 -D mapreduce.job.reduces=12 -D
> mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D
> mapreduce.map.output.compress=true -D fetcher.timelimit.mins=180
> /user/hadoop/crawlDIR/segments/20170729065841 -noParsing -threads 100
> 
> The logger mentioned in the running process is console. How do I change it
> to the log file rotated by log4j?
> 
> I tried modifying the conf/log4j.properties file to use DRFA instead of
> the cmdstdout logger, but that did not help either.
> 
> Any help would be appreciated.
> 
> Thanks
> Srini
> 
> On Mon, Jul 24, 2017 at 12:52 AM, Sebastian Nagel <
> wastl.na...@googlemail.com> wrote:
> 
>> Hi Srini,
>>
>> in distributed mode the bulk of Nutch's log output is kept in the Hadoop
>> task logs. Whether, how long, and where these logs are kept depends on the
>> configuration of your Hadoop cluster.  You can easily find tutorials and
>> examples of how to configure this if you google for "hadoop task logs".
>>
>> Be careful, the Nutch logs are usually huge.  The easiest way to get them
>> for a job is to run the following command on the master node:
>>
>>   yarn logs -applicationId <app_id>
>>
>> Best,
>> Sebastian
>>
>> On 07/21/2017 10:09 PM, Srinivasan Ramaswamy wrote:
>>> Hi
>>>
>>> I am running nutch in distributed mode. I would like to see all Nutch logs
>>> written to files. I only see the console output. Can I see the same
>>> information logged to some log files?
>>>
>>> When I run nutch in local mode I do see the logs in the runtime/local/logs
>>> directory. But when I run nutch in distributed mode, I don't see them
>>> anywhere except on the console.
>>>
>>> Can anyone help me with the settings that I need to change?
>>>
>>> Thanks
>>> Srini
>>>
>>
>>
> 
