[ https://issues.apache.org/jira/browse/NUTCH-2756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16989868#comment-16989868 ]
Sebastian Nagel commented on NUTCH-2756:
----------------------------------------

Hi [~lucasp],

> But today we had again the same problem

Again on partition 4?

> files are closed in the order for the 1,2,3 and 5 but the 4th is closed
> before all of them.

Closing should happen when a task is finished. I checked our namenode logs and got a more or less random ordering of closed partitions: 7, 55, 23, 85, ... If in doubt, the time stamps in the task logs, e.g.
{noformat}
2019-12-03 05:00:27,375 INFO [main] org.apache.hadoop.mapred.Task: Task:attempt_1575288763586_0015_r_000007_0 is done. And is in the process of committing
{noformat}
and the closing messages in the namenode logs should roughly correspond:
{noformat}
2019-12-03 05:00:27,109 INFO hdfs.StateChange: DIR* completeFile: /user/ubuntu/seedcrawl/segments/20191202224221/crawl_fetch/part-r-00007/data is closed by DFSClient_attempt_1575288763586_0015_r_000007_0_1313775845_1
2019-12-03 05:00:27,121 INFO hdfs.StateChange: BLOCK* allocate blk_1074004836_264337, replicas=10.67.67.123:9866, 10.67.67.85:9866, 10.67.67.62:9866 for /user/ubuntu/seedcrawl/segments/20191202224221/crawl_fetch/part-r-00007/index
2019-12-03 05:00:27,197 INFO hdfs.StateChange: DIR* completeFile: /user/ubuntu/seedcrawl/segments/20191202224221/crawl_fetch/part-r-00007/index is closed by DFSClient_attempt_1575288763586_0015_r_000007_0_1313775845_1
2019-12-03 05:00:27,201 INFO hdfs.StateChange: DIR* completeFile: /user/ubuntu/seedcrawl/segments/20191202224221/parse_text/part-r-00007/data is closed by DFSClient_attempt_1575288763586_0015_r_000007_0_1313775845_1
2019-12-03 05:00:27,204 INFO hdfs.StateChange: BLOCK* allocate blk_1074004837_264338, replicas=10.67.67.123:9866, 10.67.67.146:9866, 10.67.67.6:9866 for /user/ubuntu/seedcrawl/segments/20191202224221/parse_text/part-r-00007/index
2019-12-03 05:00:27,262 INFO hdfs.StateChange: DIR* completeFile: /user/ubuntu/seedcrawl/segments/20191202224221/parse_text/part-r-00007/index is closed by DFSClient_attempt_1575288763586_0015_r_000007_0_1313775845_1
2019-12-03 05:00:27,279 INFO hdfs.StateChange: DIR* completeFile: /user/ubuntu/seedcrawl/segments/20191202224221/parse_data/part-r-00007/data is closed by DFSClient_attempt_1575288763586_0015_r_000007_0_1313775845_1
2019-12-03 05:00:27,285 INFO hdfs.StateChange: BLOCK* allocate blk_1074004839_264340, replicas=10.67.67.123:9866, 10.67.67.178:9866, 10.67.67.5:9866 for /user/ubuntu/seedcrawl/segments/20191202224221/parse_data/part-r-00007/index
2019-12-03 05:00:27,331 INFO hdfs.StateChange: DIR* completeFile: /user/ubuntu/seedcrawl/segments/20191202224221/parse_data/part-r-00007/index is closed by DFSClient_attempt_1575288763586_0015_r_000007_0_1313775845_1
2019-12-03 05:00:27,343 INFO hdfs.StateChange: DIR* completeFile: /user/ubuntu/seedcrawl/segments/20191202224221/crawl_parse/part-r-00007 is closed by DFSClient_attempt_1575288763586_0015_r_000007_0_1313775845_1
{noformat}
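If it helps to compare the two, the correlation can be scripted. Below is a rough sketch (not part of Nutch; the log file locations and the exact line formats are assumptions based on the snippets above) that pairs the "is done" messages from the task logs with the completeFile messages from the namenode log by attempt id:
{code:python}
#!/usr/bin/env python3
# Rough sketch (not part of Nutch): correlate task completion messages with
# namenode completeFile messages by attempt id. The log line formats are
# assumed to match the snippets quoted in this comment.
import re
import sys
from collections import defaultdict

# "... org.apache.hadoop.mapred.Task: Task:attempt_..._r_000007_0 is done. ..."
TASK_DONE = re.compile(r'^(\S+ \S+).*Task:(attempt_\S+) is done')
# "... completeFile: /path is closed by DFSClient_attempt_..._r_000007_0_1313775845_1"
COMPLETE_FILE = re.compile(
    r'^(\S+ \S+).*completeFile: (\S+) is closed by '
    r'DFSClient_(attempt_[0-9a-z_]+?)_\d+_\d+\s*$')

def main(task_log, namenode_log):
    # attempt id -> timestamp of the "is done" message
    done = {}
    with open(task_log) as f:
        for line in f:
            m = TASK_DONE.search(line)
            if m:
                done[m.group(2)] = m.group(1)

    # attempt id -> list of (timestamp, path) of files closed by that attempt
    closed = defaultdict(list)
    with open(namenode_log) as f:
        for line in f:
            m = COMPLETE_FILE.search(line)
            if m:
                closed[m.group(3)].append((m.group(1), m.group(2)))

    for attempt, ts in sorted(done.items()):
        print(f'{attempt} done at {ts}')
        for close_ts, path in closed.get(attempt, []):
            print(f'  closed {close_ts}  {path}')

if __name__ == '__main__':
    main(sys.argv[1], sys.argv[2])
{code}
It expects plain-text copies of the aggregated reduce task log and the namenode log, e.g. {{python3 correlate_close_times.py tasklog.txt namenode.log}} (the script name is just an example).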
> Segment Part problem with HDFS on distributed mode
> ---------------------------------------------------
>
>                 Key: NUTCH-2756
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2756
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.15
>            Reporter: Lucas Pauchard
>            Priority: Major
>         Attachments: 0_byte_file_screenshot.PNG, hadoop-env.sh, hdfs-site.xml, mapred-site.xml, yarn-env.sh, yarn-site.xml
>
> During the parsing, it sometimes happens that parts of the data on HDFS are missing after the parsing.
> When I take a look at our HDFS, I've got this file with 0 bytes (see attachments).
> After that, the CrawlDb complains about this specific (corrupted?) part:
> {panel:title=log_crawl}
> 2019-12-04 22:25:57,454 INFO mapreduce.Job: Task Id : attempt_1575479127636_0047_m_000017_2, Status : FAILED
> Error: java.io.EOFException: hdfs://jobmaster:9000/user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00004 not a SequenceFile
>         at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1964)
>         at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1923)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1872)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1886)
>         at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.initialize(SequenceFileRecordReader.java:54)
>         at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:560)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:798)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
>         at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
> {panel}
> When I check the namenode logs, I don't see any error during the writing of the segment part, but one hour later I get the following log:
> {panel:title=log_namenode}
> 2019-12-04 23:23:13,750 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease. Holder: DFSClient_attempt_1575479127636_0046_r_000004_1_1307945884_1, pending creates: 2], src=/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00004/index
> 2019-12-04 23:23:13,750 WARN org.apache.hadoop.hdfs.StateChange: BLOCK* internalReleaseLease: All existing blocks are COMPLETE, lease removed, file /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00004/index closed.
> 2019-12-04 23:23:13,750 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease. Holder: DFSClient_attempt_1575479127636_0046_r_000004_1_1307945884_1, pending creates: 1], src=/user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00004
> 2019-12-04 23:23:13,750 WARN org.apache.hadoop.hdfs.StateChange: BLOCK* internalReleaseLease: All existing blocks are COMPLETE, lease removed, file /user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00004 closed.
> {panel}
> This issue is hard to reproduce and I can't figure out what the preconditions are. It seems to just happen randomly.
> Maybe the problem comes from bad handling when the file is closed.
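As a side note on the "not a SequenceFile" error quoted above: a pre-check along the following lines (a rough sketch, not a Nutch tool; the listing format of {{hdfs dfs -ls -R}} and the "SEQ" magic header at the start of every SequenceFile are the assumptions) could flag empty or truncated part files in a segment before the CrawlDb update job reads it:
{code:python}
#!/usr/bin/env python3
# Rough sketch (not a Nutch tool): flag segment part files that are empty or
# do not start with the SequenceFile magic bytes "SEQ". The "hdfs dfs" command
# is assumed to be on the PATH.
import subprocess
import sys

def list_files(segment_dir):
    # "hdfs dfs -ls -R" prints: permissions replicas owner group size date time path
    out = subprocess.run(['hdfs', 'dfs', '-ls', '-R', segment_dir],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        parts = line.split()
        if len(parts) < 8 or parts[0].startswith('d'):
            continue                      # skip directories and header lines
        size, path = int(parts[4]), parts[-1]
        if path.rsplit('/', 1)[-1].startswith(('_', '.')):
            continue                      # skip markers such as _SUCCESS
        yield size, path

def head_bytes(path, n=3):
    # Read only the first few bytes of the file via the HDFS CLI.
    p = subprocess.run('hdfs dfs -cat %s | head -c %d' % (path, n),
                       shell=True, capture_output=True)
    return p.stdout

def main(segment_dir):
    bad = 0
    for size, path in list_files(segment_dir):
        if size == 0 or head_bytes(path) != b'SEQ':
            print('suspicious part file (%d bytes): %s' % (size, path))
            bad += 1
    sys.exit(1 if bad else 0)

if __name__ == '__main__':
    main(sys.argv[1])
{code}
Invocation would be e.g. {{python3 check_segment_parts.py /user/hadoop/crawlmultiokhttp/segment/20191204221308}} (script name is hypothetical, path taken from the logs above); a non-zero exit code means at least one suspicious part file was found.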