[ https://issues.apache.org/jira/browse/NUTCH-2756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17000995#comment-17000995 ]
Sebastian Nagel commented on NUTCH-2756:
----------------------------------------

Hi [~lucasp], thanks for the notice! An ugly error, and ideally a task that is killed because its JVM occupies more memory than allowed by mapreduce.*.memory.mb should fail the whole job; it is a mystery why that didn't happen here. In any case, this is not a Nutch issue.
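For reference only (not part of the original comment): a minimal sketch of the usual relationship between the YARN container size and the task JVM heap. The property names are standard MRv2 settings; the values are placeholders, not taken from this issue.

{code:java}
import org.apache.hadoop.conf.Configuration;

/**
 * Minimal sketch (example values only): the container size set via
 * mapreduce.*.memory.mb must leave headroom above the JVM heap configured
 * in mapreduce.*.java.opts, otherwise YARN kills the task container.
 */
public class MemorySettingsSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // YARN container sizes for map/reduce tasks (MB) -- placeholder values
    conf.setInt("mapreduce.map.memory.mb", 2048);
    conf.setInt("mapreduce.reduce.memory.mb", 4096);
    // JVM heap roughly 80% of the container, so the process stays below the limit
    conf.set("mapreduce.map.java.opts", "-Xmx1638m");
    conf.set("mapreduce.reduce.java.opts", "-Xmx3276m");
    System.out.println("map container=" + conf.get("mapreduce.map.memory.mb")
        + " MB, map heap=" + conf.get("mapreduce.map.java.opts"));
  }
}
{code}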
> Segment Part problem with HDFS on distributed mode
> -------------------------------------------------
>
>                 Key: NUTCH-2756
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2756
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.15
>            Reporter: Lucas Pauchard
>            Priority: Major
>         Attachments: 0_byte_file_screenshot.PNG, hadoop-env.sh, hdfs-site.xml, mapred-site.xml, syslog, yarn-env.sh, yarn-site.xml
>
>
> During the parsing it sometimes happens that part of the data on HDFS is missing after the parse step.
> When I take a look at our HDFS, I see this file with 0 bytes (see attachments).
> After that, the CrawlDB complains about this specific (corrupted?) part:
> {panel:title=log_crawl}
> 2019-12-04 22:25:57,454 INFO mapreduce.Job: Task Id : attempt_1575479127636_0047_m_000017_2, Status : FAILED
> Error: java.io.EOFException: hdfs://jobmaster:9000/user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00004 not a SequenceFile
>         at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1964)
>         at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1923)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1872)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1886)
>         at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.initialize(SequenceFileRecordReader.java:54)
>         at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:560)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:798)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
>         at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
> {panel}
> When I check the namenode logs, I don't see any error during the writing of the segment part, but one hour later I get the following log:
> {panel:title=log_namenode}
> 2019-12-04 23:23:13,750 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease.  Holder: DFSClient_attempt_1575479127636_0046_r_000004_1_1307945884_1, pending creates: 2], src=/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00004/index
> 2019-12-04 23:23:13,750 WARN org.apache.hadoop.hdfs.StateChange: BLOCK* internalReleaseLease: All existing blocks are COMPLETE, lease removed, file /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00004/index closed.
> 2019-12-04 23:23:13,750 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease.  Holder: DFSClient_attempt_1575479127636_0046_r_000004_1_1307945884_1, pending creates: 1], src=/user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00004
> 2019-12-04 23:23:13,750 WARN org.apache.hadoop.hdfs.StateChange: BLOCK* internalReleaseLease: All existing blocks are COMPLETE, lease removed, file /user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00004 closed.
> {panel}
> This issue is hard to reproduce and I can't figure out what the preconditions are; it seems to happen randomly.
> Maybe the problem comes from improper handling when the file is closed.
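Not part of the original report, but as an illustration of where the failure surfaces: a small sketch (the segment path is an assumption taken from the logs above) that checks whether each part file of a segment subdirectory such as crawl_parse is non-empty and has a readable SequenceFile header, so a corrupted part is noticed before the CrawlDB update job fails on it.

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

/**
 * Hedged sketch: walk a segment subdirectory (e.g. crawl_parse) and verify
 * that every part-* file is non-empty and can be opened as a SequenceFile.
 * The default path below is only an example lifted from the logs above.
 */
public class CheckSegmentParts {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Path dir = new Path(args.length > 0 ? args[0]
        : "hdfs://jobmaster:9000/user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse");
    FileSystem fs = dir.getFileSystem(conf);
    for (FileStatus status : fs.listStatus(dir)) {
      Path part = status.getPath();
      if (!part.getName().startsWith("part-")) continue;
      if (status.getLen() == 0) {
        System.err.println("EMPTY part file: " + part);
        continue;
      }
      // Opening the reader is enough to validate the SequenceFile header.
      try (SequenceFile.Reader reader =
          new SequenceFile.Reader(conf, SequenceFile.Reader.file(part))) {
        System.out.println("OK: " + part + " (" + status.getLen() + " bytes)");
      } catch (IOException e) {
        System.err.println("NOT a SequenceFile: " + part + " (" + e.getMessage() + ")");
      }
    }
  }
}
{code}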