[ https://issues.apache.org/jira/browse/NUTCH-2756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16992323#comment-16992323 ]
Lucas Pauchard edited comment on NUTCH-2756 at 12/10/19 8:52 AM:
-----------------------------------------------------------------

Hi [~snagel],

This time, the problem happened on partition 1 (<segment>/parse_text/part-r-00001/data).

{panel:title=Log namenode}
/user/hadoop/crawloneokhttp/segment/20191210055117/parse_text/part-r-00001/data is closed by DFSClient_attempt_1575911127307_0231_r_000001_0_1139952023_
{panel}

The log of the task writing this part gave me the following stderr:

{panel:title=errlog}
Dec 10, 2019 6:33:34 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
INFO: Registering org.apache.hadoop.mapreduce.v2.app.webapp.JAXBContextResolver as a provider class
Dec 10, 2019 6:33:34 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
INFO: Registering org.apache.hadoop.yarn.webapp.GenericExceptionHandler as a provider class
Dec 10, 2019 6:33:34 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
INFO: Registering org.apache.hadoop.mapreduce.v2.app.webapp.AMWebServices as a root resource class
Dec 10, 2019 6:33:34 AM com.sun.jersey.server.impl.application.WebApplicationImpl _initiate
INFO: Initiating Jersey application, version 'Jersey: 1.19 02/11/2015 03:25 AM'
Dec 10, 2019 6:33:34 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
INFO: Binding org.apache.hadoop.mapreduce.v2.app.webapp.JAXBContextResolver to GuiceManagedComponentProvider with the scope "Singleton"
Dec 10, 2019 6:33:34 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
INFO: Binding org.apache.hadoop.yarn.webapp.GenericExceptionHandler to GuiceManagedComponentProvider with the scope "Singleton"
Dec 10, 2019 6:33:34 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
INFO: Binding org.apache.hadoop.mapreduce.v2.app.webapp.AMWebServices to GuiceManagedComponentProvider with the scope "PerRequest"
log4j:WARN No appenders could be found for logger (org.apache.hadoop.mapreduce.v2.app.MRAppMaster).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See [http://logging.apache.org/log4j/1.2/faq.html#noconfig] for more info.
{panel}

And here is the syslog of the task:

{panel:title=syslog}
[^syslog]
{panel}

As you can see, there is a "Container killed by the ApplicationMaster", so maybe we still have some memory issues.

I did what you said:

{quote}is the state reproducible by running the parser job on the same segment again? Remove the subdirectories crawl_parse, parse_data and parse_text and run the parser job again{quote}

and this time I didn't have any issues. I'll try to change the memory parameters as you suggested:

{quote}One point (although hardly related to the problem): the task memory defined by mapreduce.*.memory.mb should be higher than the Java -Xmx in mapreduce.*.java.opts{quote}

and see if it happens again.


> Segment Part problem with HDFS on distributed mode
> --------------------------------------------------
>
>                 Key: NUTCH-2756
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2756
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.15
>            Reporter: Lucas Pauchard
>            Priority: Major
>         Attachments: 0_byte_file_screenshot.PNG, hadoop-env.sh, hdfs-site.xml, mapred-site.xml, syslog, yarn-env.sh, yarn-site.xml
>
> During the parsing, it sometimes happens that parts of the data on the HDFS are missing after the parsing.
> When I take a look at our HDFS, I've got this file with 0 bytes (see attachments).
> After that, the CrawlDB complains about this specific (corrupted?)
> part:
> {panel:title=log_crawl}
> 2019-12-04 22:25:57,454 INFO mapreduce.Job: Task Id : attempt_1575479127636_0047_m_000017_2, Status : FAILED
> Error: java.io.EOFException: hdfs://jobmaster:9000/user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00004 not a SequenceFile
>         at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1964)
>         at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1923)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1872)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1886)
>         at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.initialize(SequenceFileRecordReader.java:54)
>         at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:560)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:798)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
>         at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
> {panel}
> When I check the namenode logs, I don't see any error during the writing of the segment part, but one hour later I've got the following log:
> {panel:title=log_namenode}
> 2019-12-04 23:23:13,750 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease.  Holder: DFSClient_attempt_1575479127636_0046_r_000004_1_1307945884_1, pending creates: 2], src=/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00004/index
> 2019-12-04 23:23:13,750 WARN org.apache.hadoop.hdfs.StateChange: BLOCK* internalReleaseLease: All existing blocks are COMPLETE, lease removed, file /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00004/index closed.
> 2019-12-04 23:23:13,750 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease.  Holder: DFSClient_attempt_1575479127636_0046_r_000004_1_1307945884_1, pending creates: 1], src=/user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00004
> 2019-12-04 23:23:13,750 WARN org.apache.hadoop.hdfs.StateChange: BLOCK* internalReleaseLease: All existing blocks are COMPLETE, lease removed, file /user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00004 closed.
> {panel}
> This issue is hard to reproduce and I can't figure out what the preconditions are. It seems to happen randomly.
> Maybe the problem comes from mishandling of the file when it is closed.
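The memory advice quoted above (mapreduce.*.memory.mb higher than the -Xmx in mapreduce.*.java.opts) could be sketched in mapred-site.xml roughly as follows. The values are illustrative assumptions, not the reporter's actual settings; the idea is only that the container size leaves headroom above the JVM heap:

{code:xml}
<!-- mapred-site.xml: illustrative values only.
     The YARN container limit (memory.mb) must be larger than the JVM heap (-Xmx),
     because the container also holds JVM overhead, native buffers, etc. -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1638m</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx3276m</value>
</property>
{code}

With settings along these lines, the ApplicationMaster should only kill a container when the whole process genuinely exceeds memory.mb; keeping -Xmx at roughly 80% of memory.mb is a commonly used rule of thumb.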