[ https://issues.apache.org/jira/browse/NUTCH-2756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16989519#comment-16989519 ]
Lucas Pauchard edited comment on NUTCH-2756 at 12/6/19 8:34 AM: ---------------------------------------------------------------- Hi, thanks for your fast response. Here are the details you asked for:

{panel:title=Hadoop version}
Hadoop 3.1.1
Source code repository [https://github.com/apache/hadoop] -r 2b9a8c1d3a2caf1e733d57f346af3ff0d5ba529c
Compiled by leftnoteasy on 2018-08-02T04:26Z
Compiled with protoc 2.5.0
From source with checksum f76ac55e5b5ff0382a9f7df36a3ca5a0
This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-3.1.1.jar
{panel}

For the HDFS configuration, I'm not sure which files you really need, so I attached "hdfs-site.xml" together with the configuration files we changed because of memory issues we had. We also made changes to the log4j file, but I don't think that file matters here. [^hdfs-site.xml] [^hadoop-env.sh] [^mapred-site.xml] [^yarn-env.sh] [^yarn-site.xml]

Unfortunately, we don't keep job logs for more than 2 days, so I can't provide them. But the same problem occurred again today, and here are the logs:

{panel:title=job Parser logs}
2019-12-06 06:34:45,903 INFO parse.ParseSegment: ParseSegment: starting at 2019-12-06 06:34:45
2019-12-06 06:34:45,917 INFO parse.ParseSegment: ParseSegment: segment: crawloneokhttp/segment/20191206055043
2019-12-06 06:34:45,994 INFO client.RMProxy: Connecting to ResourceManager at jobmaster/79.137.20.6:8032
2019-12-06 06:34:46,223 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/hadoop/.staging/job_1575565527267_0231
2019-12-06 06:35:00,461 INFO input.FileInputFormat: Total input files to process : 6
2019-12-06 06:35:00,583 INFO mapreduce.JobSubmitter: number of splits:6
2019-12-06 06:35:00,686 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
2019-12-06 06:35:00,796 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1575565527267_0231
2019-12-06 06:35:00,797 INFO mapreduce.JobSubmitter: Executing with tokens: []
2019-12-06 06:35:00,927 INFO conf.Configuration: resource-types.xml not found
2019-12-06 06:35:00,928 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2019-12-06 06:35:00,976 INFO impl.YarnClientImpl: Submitted application application_1575565527267_0231
2019-12-06 06:35:01,006 INFO mapreduce.Job: The url to track the job: [http://x.x.x.x:y/proxy/application_1575565527267_0231/|http://jobmaster:8088/proxy/application_1575565527267_0231/]
2019-12-06 06:35:01,007 INFO mapreduce.Job: Running job: job_1575565527267_0231
2019-12-06 06:36:04,205 INFO mapreduce.Job: Job job_1575565527267_0231 running in uber mode : false
2019-12-06 06:36:04,207 INFO mapreduce.Job: map 0% reduce 0%
2019-12-06 06:36:33,548 INFO mapreduce.Job: map 19% reduce 0%
2019-12-06 06:36:35,670 INFO mapreduce.Job: map 33% reduce 0%
2019-12-06 06:36:36,675 INFO mapreduce.Job: map 41% reduce 0%
2019-12-06 06:36:39,688 INFO mapreduce.Job: map 60% reduce 0%
2019-12-06 06:36:40,692 INFO mapreduce.Job: map 78% reduce 0%
2019-12-06 06:36:41,697 INFO mapreduce.Job: map 85% reduce 0%
2019-12-06 06:36:42,702 INFO mapreduce.Job: map 93% reduce 0%
2019-12-06 06:36:43,706 INFO mapreduce.Job: map 100% reduce 0%
2019-12-06 06:36:49,727 INFO mapreduce.Job: map 100% reduce 33%
2019-12-06 06:37:00,763 INFO mapreduce.Job: map 100% reduce 83%
2019-12-06 06:37:01,767 INFO mapreduce.Job: map 100% reduce 100%
2019-12-06 06:37:01,772 INFO mapreduce.Job: Job job_1575565527267_0231 completed successfully
2019-12-06 06:37:01,850 INFO mapreduce.Job: Counters: 57
  File System Counters
    FILE: Number of bytes read=108082746
    FILE: Number of bytes written=219258714
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=115122772
    HDFS: Number of bytes written=44282864
    HDFS: Number of read operations=30
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=42
  Job Counters
    Killed map tasks=1
    Killed reduce tasks=1
    Launched map tasks=6
    Launched reduce tasks=7
    Data-local map tasks=4
    Rack-local map tasks=2
    Total time spent by all maps in occupied slots (ms)=203550
    Total time spent by all reduces in occupied slots (ms)=117558
    Total time spent by all map tasks (ms)=203550
    Total time spent by all reduce tasks (ms)=117558
    Total vcore-milliseconds taken by all map tasks=203550
    Total vcore-milliseconds taken by all reduce tasks=117558
    Total megabyte-milliseconds taken by all map tasks=1250611200
    Total megabyte-milliseconds taken by all reduce tasks=722276352
  Map-Reduce Framework
    Map input records=13798
    Map output records=13798
    Map output bytes=108027516
    Map output materialized bytes=108082926
    Input split bytes=972
    Combine input records=0
    Combine output records=0
    Reduce input groups=13798
    Reduce shuffle bytes=108082926
    Reduce input records=13798
    Reduce output records=13798
    Spilled Records=27596
    Shuffled Maps =36
    Failed Shuffles=0
    Merged Map outputs=36
    GC time elapsed (ms)=2638
    CPU time spent (ms)=184400
    Physical memory (bytes) snapshot=14585151488
    Virtual memory (bytes) snapshot=63409967104
    Total committed heap usage (bytes)=22992650240
    Peak Map Physical memory (bytes)=1402261504
    Peak Map Virtual memory (bytes)=5282394112
    Peak Reduce Physical memory (bytes)=1077035008
    Peak Reduce Virtual memory (bytes)=5297762304
  ParserStatus
    success=13798
  Shuffle Errors
    BAD_ID=0
    CONNECTION=0
    IO_ERROR=0
    WRONG_LENGTH=0
    WRONG_MAP=0
    WRONG_REDUCE=0
  File Input Format Counters
    Bytes Read=115121800
  File Output Format Counters
    Bytes Written=0
2019-12-06 06:37:01,853 INFO parse.ParseSegment: ParseSegment: finished at 2019-12-06 06:37:01, elapsed: 00:02:15
{panel}

And here are the relevant entries from
the namenode logs {panel:title=Logs creation and closing of the segment file} 2019-12-04 22:22:54,768 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1074202201_461385, replicas=x.x.x.x:y, x.x.x.x:y for /user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00000 2019-12-04 22:22:54,909 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1074202202_461386, replicas=x.x.x.x:y, x.x.x.x:y for /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00000/data 2019-12-04 22:22:54,934 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1074202203_461387, replicas=x.x.x.x:y, x.x.x.x:y for /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00000/data 2019-12-04 22:23:00,489 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00000/data is closed by DFSClient_attempt_1575479127636_0046_r_000000_0_1430165290_1 2019-12-04 22:23:00,493 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1074202204_461388, replicas=x.x.x.x:y, x.x.x.x:y for /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00000/index 2019-12-04 22:23:00,506 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00000/index is closed by DFSClient_attempt_1575479127636_0046_r_000000_0_1430165290_1 2019-12-04 22:23:00,515 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00000/data is closed by DFSClient_attempt_1575479127636_0046_r_000000_0_1430165290_1 2019-12-04 22:23:00,517 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1074202205_461389, replicas=x.x.x.x:y, x.x.x.x:y for /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00000/index 2019-12-04 22:23:01,558 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: 
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00000/index is closed by DFSClient_attempt_1575479127636_0046_r_000000_0_1430165290_1 2019-12-04 22:23:01,563 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* blk_1074202201_461385 is COMMITTED but not COMPLETE(numNodes= 0 < minimum = 1) in file /user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00000 2019-12-04 22:23:01,964 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00000 is closed by DFSClient_attempt_1575479127636_0046_r_000000_0_1430165290_1 2019-12-04 22:23:06,437 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1074202206_461390, replicas=x.x.x.x:y, x.x.x.x:y for /user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00004 2019-12-04 22:23:06,703 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1074202207_461391, replicas=x.x.x.x:y, x.x.x.x:y for /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00004/data 2019-12-04 22:23:06,722 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1074202208_461392, replicas=x.x.x.x:y, x.x.x.x:y for /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00004/data 2019-12-04 22:23:10,444 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1074202209_461393, replicas=x.x.x.x:y, x.x.x.x:y for /user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00001 2019-12-04 22:23:10,650 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1074202210_461394, replicas=x.x.x.x:y, x.x.x.x:y for /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00001/data 2019-12-04 22:23:10,684 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1074202211_461395, replicas=x.x.x.x:y, x.x.x.x:y for /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00001/data 2019-12-04 22:23:10,698 INFO 
org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1074202212_461396, replicas=x.x.x.x:y, x.x.x.x:y for /user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00002 2019-12-04 22:23:10,896 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1074202213_461397, replicas=x.x.x.x:y, x.x.x.x:y for /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00002/data 2019-12-04 22:23:10,933 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1074202214_461398, replicas=x.x.x.x:y, x.x.x.x:y for /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00002/data 2019-12-04 22:23:11,790 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1074202215_461399, replicas=x.x.x.x:y, x.x.x.x:y for /user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00003 2019-12-04 22:23:11,814 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00004/data is closed by DFSClient_attempt_1575479127636_0046_r_000004_0_836874622_1 2019-12-04 22:23:11,819 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1074202216_461400, replicas=x.x.x.x:y, x.x.x.x:y for /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00004/index 2019-12-04 22:23:11,838 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00004/index is closed by DFSClient_attempt_1575479127636_0046_r_000004_0_836874622_1 2019-12-04 22:23:11,843 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00004/data is closed by DFSClient_attempt_1575479127636_0046_r_000004_0_836874622_1 2019-12-04 22:23:11,846 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1074202217_461401, replicas=x.x.x.x:y, x.x.x.x:y for 
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00004/index 2019-12-04 22:23:11,888 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00004/index is closed by DFSClient_attempt_1575479127636_0046_r_000004_0_836874622_1 2019-12-04 22:23:11,896 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00004 is closed by DFSClient_attempt_1575479127636_0046_r_000004_0_836874622_1 2019-12-04 22:23:12,027 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1074202218_461402, replicas=x.x.x.x:y, x.x.x.x:y for /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00003/data 2019-12-04 22:23:12,060 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1074202219_461403, replicas=x.x.x.x:y, x.x.x.x:y for /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00004/index 2019-12-04 22:23:12,064 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1074202220_461404, replicas=x.x.x.x:y, x.x.x.x:y for /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00003/data 2019-12-04 22:23:12,221 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00004/index is closed by DFSClient_attempt_1575479127636_0046_r_000004_1_1307945884_1 2019-12-04 22:23:12,229 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1074202221_461405, replicas=x.x.x.x:y, x.x.x.x:y for /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00004/data 2019-12-04 22:23:12,258 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00004/data is closed by DFSClient_attempt_1575479127636_0046_r_000004_1_1307945884_1 2019-12-04 22:23:12,284 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
allocate blk_1074202222_461406, replicas=x.x.x.x:y, x.x.x.x:y for /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00004/data 2019-12-04 22:23:12,298 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00004/data is closed by DFSClient_attempt_1575479127636_0046_r_000004_1_1307945884_1 2019-12-04 22:23:13,521 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1074202223_461407, replicas=x.x.x.x:y, x.x.x.x:y for /user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00005 2019-12-04 22:23:13,681 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1074202224_461408, replicas=x.x.x.x:y, x.x.x.x:y for /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00005/data 2019-12-04 22:23:13,717 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1074202225_461409, replicas=x.x.x.x:y, x.x.x.x:y for /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00005/data 2019-12-04 22:23:15,160 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00001/data is closed by DFSClient_attempt_1575479127636_0046_r_000001_0_-1169959789_1 2019-12-04 22:23:15,164 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1074202226_461410, replicas=x.x.x.x:y, x.x.x.x:y for /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00001/index 2019-12-04 22:23:15,199 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00001/index is closed by DFSClient_attempt_1575479127636_0046_r_000001_0_-1169959789_1 2019-12-04 22:23:15,204 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00001/data is closed by DFSClient_attempt_1575479127636_0046_r_000001_0_-1169959789_1 2019-12-04 
22:23:15,207 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1074202227_461411, replicas=x.x.x.x:y, x.x.x.x:y for /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00001/index 2019-12-04 22:23:15,216 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00001/index is closed by DFSClient_attempt_1575479127636_0046_r_000001_0_-1169959789_1 2019-12-04 22:23:15,222 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00001 is closed by DFSClient_attempt_1575479127636_0046_r_000001_0_-1169959789_1 2019-12-04 22:23:15,312 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00002/data is closed by DFSClient_attempt_1575479127636_0046_r_000002_0_74990015_1 2019-12-04 22:23:15,317 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1074202228_461412, replicas=x.x.x.x:y, x.x.x.x:y for /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00002/index 2019-12-04 22:23:15,337 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00002/index is closed by DFSClient_attempt_1575479127636_0046_r_000002_0_74990015_1 2019-12-04 22:23:15,344 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00002/data is closed by DFSClient_attempt_1575479127636_0046_r_000002_0_74990015_1 2019-12-04 22:23:15,347 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1074202229_461413, replicas=x.x.x.x:y, x.x.x.x:y for /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00002/index 2019-12-04 22:23:15,367 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: 
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00002/index is closed by DFSClient_attempt_1575479127636_0046_r_000002_0_74990015_1 2019-12-04 22:23:15,372 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* blk_1074202212_461396 is COMMITTED but not COMPLETE(numNodes= 0 < minimum = 1) in file /user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00002 2019-12-04 22:23:15,774 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00002 is closed by DFSClient_attempt_1575479127636_0046_r_000002_0_74990015_1 2019-12-04 22:23:16,276 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00003/data is closed by DFSClient_attempt_1575479127636_0046_r_000003_0_489729318_1 2019-12-04 22:23:16,289 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1074202230_461414, replicas=x.x.x.x:y, x.x.x.x:y for /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00003/index 2019-12-04 22:23:16,298 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00003/index is closed by DFSClient_attempt_1575479127636_0046_r_000003_0_489729318_1 2019-12-04 22:23:16,303 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00003/data is closed by DFSClient_attempt_1575479127636_0046_r_000003_0_489729318_1 2019-12-04 22:23:16,306 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1074202231_461415, replicas=x.x.x.x:y, x.x.x.x:y for /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00003/index 2019-12-04 22:23:16,316 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00003/index is closed by 
DFSClient_attempt_1575479127636_0046_r_000003_0_489729318_1 2019-12-04 22:23:16,321 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00003 is closed by DFSClient_attempt_1575479127636_0046_r_000003_0_489729318_1 2019-12-04 22:23:18,100 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00005/data is closed by DFSClient_attempt_1575479127636_0046_r_000005_0_1606887826_1 2019-12-04 22:23:18,104 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1074202232_461416, replicas=x.x.x.x:y, x.x.x.x:y for /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00005/index 2019-12-04 22:23:18,113 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00005/index is closed by DFSClient_attempt_1575479127636_0046_r_000005_0_1606887826_1 2019-12-04 22:23:18,118 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00005/data is closed by DFSClient_attempt_1575479127636_0046_r_000005_0_1606887826_1 2019-12-04 22:23:18,120 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1074202233_461417, replicas=x.x.x.x:y, x.x.x.x:y for /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00005/index 2019-12-04 22:23:18,130 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00005/index is closed by DFSClient_attempt_1575479127636_0046_r_000005_0_1606887826_1 2019-12-04 22:23:18,135 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00005 is closed by DFSClient_attempt_1575479127636_0046_r_000005_0_1606887826_1 {panel}

What I find strange when I look at these logs is
that the crawl_parse/part-r-0000x files are closed in order for 1, 2, 3 and 5, but the 4th is closed before all of them. Tell me if you need more information.
22:23:15,207 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1074202227_461411, replicas=x.x.x.x:y, x.x.x.x:y for /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00001/index 2019-12-04 22:23:15,216 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00001/index is closed by DFSClient_attempt_1575479127636_0046_r_000001_0_-1169959789_1 2019-12-04 22:23:15,222 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00001 is closed by DFSClient_attempt_1575479127636_0046_r_000001_0_-1169959789_1 2019-12-04 22:23:15,312 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00002/data is closed by DFSClient_attempt_1575479127636_0046_r_000002_0_74990015_1 2019-12-04 22:23:15,317 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1074202228_461412, replicas=x.x.x.x:y, x.x.x.x:y for /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00002/index 2019-12-04 22:23:15,337 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00002/index is closed by DFSClient_attempt_1575479127636_0046_r_000002_0_74990015_1 2019-12-04 22:23:15,344 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00002/data is closed by DFSClient_attempt_1575479127636_0046_r_000002_0_74990015_1 2019-12-04 22:23:15,347 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1074202229_461413, replicas=x.x.x.x:y, x.x.x.x:y for /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00002/index 2019-12-04 22:23:15,367 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: 
/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00002/index is closed by DFSClient_attempt_1575479127636_0046_r_000002_0_74990015_1 2019-12-04 22:23:15,372 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* blk_1074202212_461396 is COMMITTED but not COMPLETE(numNodes= 0 < minimum = 1) in file /user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00002 2019-12-04 22:23:15,774 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00002 is closed by DFSClient_attempt_1575479127636_0046_r_000002_0_74990015_1 2019-12-04 22:23:16,276 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00003/data is closed by DFSClient_attempt_1575479127636_0046_r_000003_0_489729318_1 2019-12-04 22:23:16,289 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1074202230_461414, replicas=x.x.x.x:y, x.x.x.x:y for /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00003/index 2019-12-04 22:23:16,298 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00003/index is closed by DFSClient_attempt_1575479127636_0046_r_000003_0_489729318_1 2019-12-04 22:23:16,303 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00003/data is closed by DFSClient_attempt_1575479127636_0046_r_000003_0_489729318_1 2019-12-04 22:23:16,306 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1074202231_461415, replicas=x.x.x.x:y, x.x.x.x:y for /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00003/index 2019-12-04 22:23:16,316 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00003/index is closed by 
DFSClient_attempt_1575479127636_0046_r_000003_0_489729318_1 2019-12-04 22:23:16,321 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00003 is closed by DFSClient_attempt_1575479127636_0046_r_000003_0_489729318_1 2019-12-04 22:23:18,100 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00005/data is closed by DFSClient_attempt_1575479127636_0046_r_000005_0_1606887826_1 2019-12-04 22:23:18,104 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1074202232_461416, replicas=x.x.x.x:y, x.x.x.x:y for /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00005/index 2019-12-04 22:23:18,113 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_text/part-r-00005/index is closed by DFSClient_attempt_1575479127636_0046_r_000005_0_1606887826_1 2019-12-04 22:23:18,118 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00005/data is closed by DFSClient_attempt_1575479127636_0046_r_000005_0_1606887826_1 2019-12-04 22:23:18,120 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1074202233_461417, replicas=x.x.x.x:y, x.x.x.x:y for /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00005/index 2019-12-04 22:23:18,130 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00005/index is closed by DFSClient_attempt_1575479127636_0046_r_000005_0_1606887826_1 2019-12-04 22:23:18,135 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00005 is closed by DFSClient_attempt_1575479127636_0046_r_000005_0_1606887826_1 {panel} Tell me if you need more informations. 
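As a side note on diagnosis: the "not a SequenceFile" error in the issue below is what `SequenceFile$Reader.init` throws when a part file does not begin with a valid SequenceFile header, and a 0-byte file can never pass that check. A quick sanity scan over the segment's `crawl_parse/part-r-*` files right after the parse job could flag such truncated parts before the CrawlDb update consumes them. Here is a minimal local-filesystem sketch (the header constant is the real SequenceFile magic; on HDFS one would fetch the first bytes via `hdfs dfs -cat <path> | head -c 4` or the Java `FileSystem` API instead of `open()`):

```python
# Minimal sketch: detect part files that cannot be valid SequenceFiles.
# A Hadoop SequenceFile begins with the 3-byte magic "SEQ" followed by a
# one-byte version number (6 in current Hadoop releases). An empty or
# truncated file fails this check, which surfaces later as
# "java.io.EOFException: ... not a SequenceFile".

SEQ_MAGIC = b"SEQ"

def looks_like_sequencefile(path: str) -> bool:
    """Return True if the file starts with a plausible SequenceFile header."""
    try:
        with open(path, "rb") as f:
            header = f.read(4)
    except OSError:
        return False
    # Need the magic plus the version byte; a 0-byte part file reads b"".
    return len(header) == 4 and header.startswith(SEQ_MAGIC)
```

This only proves a file is *not* a SequenceFile; a file passing the header check can still be damaged further in, so it is a cheap pre-flight filter, not a full validation.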
> Segment Part problem with HDFS on distributed mode
> -------------------------------------------------
>
>                 Key: NUTCH-2756
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2756
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.15
>            Reporter: Lucas Pauchard
>            Priority: Major
>         Attachments: 0_byte_file_screenshot.PNG, hadoop-env.sh, hdfs-site.xml, mapred-site.xml, yarn-env.sh, yarn-site.xml
>
> During the parsing, it sometimes happens that parts of the data on HDFS are missing after the parsing.
> When I take a look at our HDFS, I see this file with 0 bytes (see attachments).
> After that, the CrawlDb complains about this specific (corrupted?) part:
> {panel:title=log_crawl}
> 2019-12-04 22:25:57,454 INFO mapreduce.Job: Task Id : attempt_1575479127636_0047_m_000017_2, Status : FAILED
> Error: java.io.EOFException: hdfs://jobmaster:9000/user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00004 not a SequenceFile
>         at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1964)
>         at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1923)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1872)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1886)
>         at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.initialize(SequenceFileRecordReader.java:54)
>         at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:560)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:798)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
>         at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
> {panel}
> When I check the namenode logs, I don't see any error while the segment part is written, but one hour later I get the following log:
> {panel:title=log_namenode}
> 2019-12-04 23:23:13,750 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease. Holder: DFSClient_attempt_1575479127636_0046_r_000004_1_1307945884_1, pending creates: 2], src=/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00004/index
> 2019-12-04 23:23:13,750 WARN org.apache.hadoop.hdfs.StateChange: BLOCK* internalReleaseLease: All existing blocks are COMPLETE, lease removed, file /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00004/index closed.
> 2019-12-04 23:23:13,750 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease. Holder: DFSClient_attempt_1575479127636_0046_r_000004_1_1307945884_1, pending creates: 1], src=/user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00004
> 2019-12-04 23:23:13,750 WARN org.apache.hadoop.hdfs.StateChange: BLOCK* internalReleaseLease: All existing blocks are COMPLETE, lease removed, file /user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00004 closed.
> {panel}
> This issue is hard to reproduce and I can't figure out what the preconditions are. It seems to happen randomly.
> Maybe the problem comes from improper handling when the file is closed.

-- This message was sent by Atlassian Jira (v8.3.4#803005)