[ 
https://issues.apache.org/jira/browse/NUTCH-2756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16992323#comment-16992323
 ] 

Lucas Pauchard edited comment on NUTCH-2756 at 12/10/19 8:52 AM:
-----------------------------------------------------------------

Hi [~snagel],

This time, the problem happened on partition 1 
(<segment>/parse_text/part-r-00001/data).

 
{panel:title=Log namenode}
/user/hadoop/crawloneokhttp/segment/20191210055117/parse_text/part-r-00001/data 
is closed by DFSClient_attempt_1575911127307_0231_r_000001_0_1139952023_
{panel}
 

The log of the task that wrote this part gave me the following stderr:

 
{panel:title=errlog}
Dec 10, 2019 6:33:34 AM 
com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
 INFO: Registering 
org.apache.hadoop.mapreduce.v2.app.webapp.JAXBContextResolver as a provider 
class
 Dec 10, 2019 6:33:34 AM 
com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
 INFO: Registering org.apache.hadoop.yarn.webapp.GenericExceptionHandler as a 
provider class
 Dec 10, 2019 6:33:34 AM 
com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
 INFO: Registering org.apache.hadoop.mapreduce.v2.app.webapp.AMWebServices as a 
root resource class
 Dec 10, 2019 6:33:34 AM 
com.sun.jersey.server.impl.application.WebApplicationImpl _initiate
 INFO: Initiating Jersey application, version 'Jersey: 1.19 02/11/2015 03:25 AM'
 Dec 10, 2019 6:33:34 AM 
com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory 
getComponentProvider
 INFO: Binding org.apache.hadoop.mapreduce.v2.app.webapp.JAXBContextResolver to 
GuiceManagedComponentProvider with the scope "Singleton"
 Dec 10, 2019 6:33:34 AM 
com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory 
getComponentProvider
 INFO: Binding org.apache.hadoop.yarn.webapp.GenericExceptionHandler to 
GuiceManagedComponentProvider with the scope "Singleton"
 Dec 10, 2019 6:33:34 AM 
com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory 
getComponentProvider
 INFO: Binding org.apache.hadoop.mapreduce.v2.app.webapp.AMWebServices to 
GuiceManagedComponentProvider with the scope "PerRequest"
 log4j:WARN No appenders could be found for logger 
(org.apache.hadoop.mapreduce.v2.app.MRAppMaster).
 log4j:WARN Please initialize the log4j system properly.
 log4j:WARN See [http://logging.apache.org/log4j/1.2/faq.html#noconfig] for 
more info.
{panel}
And the syslog of the task:
{panel:title=syslog}
[^syslog]
{panel}
As you can see, there is a "Container killed by the ApplicationMaster" entry, so 
maybe we still have some memory issues.
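
To confirm the kill reason, I can pull the aggregated container logs (a minimal sketch, assuming log aggregation is enabled; the application id is only derived from the attempt name in the namenode log above, so it may need adjusting):
{code:bash}
# Fetch the aggregated YARN logs of the job that owned the killed container
# (application id derived from attempt_1575911127307_0231_..., assumption).
yarn logs -applicationId application_1575911127307_0231 > app_0231.log

# Look for the ApplicationMaster's kill reason, e.g.
# "is running beyond physical memory limits".
grep -i -B 2 -A 5 "killed" app_0231.log | less
{code}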

So I did what you said:
{quote}is the state reproducible by running the parser job on the same segment 
again? Remove the subdirectories crawl_parse, parse_data and parse_text and run 
the parser job again
{quote}
And this time I didn't have any issues.
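
For reference, this is roughly what I did (just a sketch, using the segment path from the namenode log above; paths are relative to /user/hadoop):
{code:bash}
SEGMENT=crawloneokhttp/segment/20191210055117

# Remove the parser outputs so the segment can be parsed again.
hadoop fs -rm -r $SEGMENT/crawl_parse $SEGMENT/parse_data $SEGMENT/parse_text

# Re-run the parser job on the same segment.
bin/nutch parse $SEGMENT
{code}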

I'll try to change the memory parameters as you said:
{quote}One point (although hardly related to the problem): the task memory 
defined by mapreduce.*.memory.mb should be higher than the Java -Xmx in 
mapreduce.*.java.opts
{quote}
and see if it happens again.
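
Concretely, I read that as something like this (illustrative numbers only, not our current settings, and assuming the parse job picks up Hadoop's generic -D options):
{code:bash}
# Keep the container size (mapreduce.*.memory.mb) above the JVM heap (-Xmx)
# so off-heap usage doesn't push the container over its YARN limit.
bin/nutch parse \
  -D mapreduce.map.memory.mb=4096 \
  -D mapreduce.map.java.opts=-Xmx3277m \
  -D mapreduce.reduce.memory.mb=4096 \
  -D mapreduce.reduce.java.opts=-Xmx3277m \
  crawloneokhttp/segment/20191210055117
{code}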



> Segment Part problem with HDFS in distributed mode
> --------------------------------------------------
>
>                 Key: NUTCH-2756
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2756
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.15
>            Reporter: Lucas Pauchard
>            Priority: Major
>         Attachments: 0_byte_file_screenshot.PNG, hadoop-env.sh, 
> hdfs-site.xml, mapred-site.xml, syslog, yarn-env.sh, yarn-site.xml
>
>
> During parsing, it sometimes happens that parts of the data on HDFS are 
> missing afterwards.
> When I take a look at our HDFS, I see a file with 0 bytes (see 
> attachments).
> After that, the CrawlDB complains about this specific (corrupted?) part:
> {panel:title=log_crawl}
> 2019-12-04 22:25:57,454 INFO mapreduce.Job: Task Id : 
> attempt_1575479127636_0047_m_000017_2, Status : FAILED
> Error: java.io.EOFException: 
> hdfs://jobmaster:9000/user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00004
>  not a SequenceFile
>         at 
> org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1964)
>         at 
> org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1923)
>         at 
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1872)
>         at 
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1886)
>         at 
> org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.initialize(SequenceFileRecordReader.java:54)
>         at 
> org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:560)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:798)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
>         at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
> {panel}
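>
> As a quick check (just a sketch, using the path from the error above), the suspect part can be inspected directly on HDFS:
> {code:bash}
> # List the failing part; a size of 0 bytes would explain the EOFException.
> hadoop fs -ls hdfs://jobmaster:9000/user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00004
>
> # Try to dump it as a SequenceFile; an empty or truncated file fails with the
> # same "not a SequenceFile" error.
> hadoop fs -text hdfs://jobmaster:9000/user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00004 | head
> {code}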
> When I check the namenode logs, I don't see any error while the segment part 
> is being written, but one hour later I get the following log:
> {panel:title=log_namenode}
> 2019-12-04 23:23:13,750 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease.  
> Holder: DFSClient_attempt_1575479127636_0046_r_000004_1_1307945884_1, pending 
> creates: 2], 
> src=/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00004/index
> 2019-12-04 23:23:13,750 WARN org.apache.hadoop.hdfs.StateChange: BLOCK* 
> internalReleaseLease: All existing blocks are COMPLETE, lease removed, file 
> /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00004/index
>  closed.
> 2019-12-04 23:23:13,750 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease.  
> Holder: DFSClient_attempt_1575479127636_0046_r_000004_1_1307945884_1, pending 
> creates: 1], 
> src=/user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00004
> 2019-12-04 23:23:13,750 WARN org.apache.hadoop.hdfs.StateChange: BLOCK* 
> internalReleaseLease: All existing blocks are COMPLETE, lease removed, file 
> /user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00004 
> closed.
> {panel}
> This issue is hard to reproduce and I can't figure out what the 
> preconditions are. It seems to happen randomly.
> Maybe the problem comes from incorrect handling when the file is closed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
