Hi all,
Have a good day! I used the code below to append data to an HDFS file from a local file. The local file is 85 MB, and the Hadoop cluster (CDH 5.4.2, HDFS 2.6, replication factor 3) has about 140 GB free. Inside a while loop I do:

    FSDataOutputStream out = fs.append(outFile);
    out.write(buffer, 0, bytesRead);
    out.close();

so each iteration appends 1024 bytes from the local file to the HDFS file. This loop runs my cluster out of storage before the program can finish. Here's the full code:

    import java.io.*;
    import java.net.URI;
    import java.net.URISyntaxException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.*;

    public class writeflushexisted {
        public static void main(String[] argv) throws IOException, URISyntaxException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(new URI("hdfs://192.168.94.185:8020"), conf);
            Path inFile = new Path("testdata.txt");
            Path outFile = new Path("/myhdfs/testdata.txt");
            File localFile = new File(inFile.toString());

            // Read from the local file and append to the existing HDFS file
            FileInputStream in = new FileInputStream(localFile);
            int i = 0;
            byte buffer[] = new byte[1024];
            try {
                int bytesRead = 0;
                while ((bytesRead = in.read(buffer)) > 0) {
                    // Open, write one 1024-byte chunk, and close the append
                    // stream on every iteration
                    FSDataOutputStream out = fs.append(outFile);
                    out.write(buffer, 0, bytesRead);
                    out.close();
                    i++;
                }
            } catch (IOException e) {
                System.out.println("Error while copying file: " + e.getMessage());
            } finally {
                in.close();
                System.out.println("Number of loop:" + i);
            }
        }
    }

Here's the information before I ran this code:

---------------------------------------------------------------
[hdfs@chdhost125 current]$ hadoop fs -df -h
Filesystem                            Size     Used    Available  Use%
hdfs://chdhost185.vitaldev.com:8020   266.4 G  38.2 G  139.8 G    14%
---------------------------------------------------------------
[hdfs@chdhost125 lib]$ hadoop fs -du -h /
67.7 M   1.3 G    /hbase
0        0        /myhdfs
0        0        /solr
1.8 G    5.4 G    /tmp
10.6 G   31.4 G   /user

And here's the information while the code was running:

---------------------------------------------------------------
Filesystem                            Size     Used     Available  Use%
hdfs://chdhost185.vitaldev.com:8020   266.4 G  170.2 G  95.9 G     64%
---------------------------------------------------------------
[hdfs@chdhost125 lib]$ hadoop fs -du -h /
67.7 M   1.3 G    /hbase
32.9 M   384 M    /myhdfs
0        0        /solr
1.8 G    5.4 G    /tmp
10.6 G   31.4 G   /user

After 10 minutes the cluster is out of storage and my program throws an exception with this error:

    Error while copying file: Failed to replace a bad datanode on the existing
    pipeline due to no more good datanodes being available to try.
    (Nodes: current=[192.168.94.185:50010, 192.168.94.27:50010],
    original=[192.168.94.185:50010, 192.168.94.27:50010]).
    The current failed datanode replacement policy is DEFAULT, and a client may
    configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy'
    in its configuration.

So why does appending a file in small chunks (1024 bytes) run my cluster out of space? The local file is only 85 MB, but HDFS consumes ~140 GB while appending it. Is there a problem with my code? I know that appending in small pieces is not recommended, but I would like to understand why HDFS consumes so much space.

Thanks and Regards,
Quan Nguyen
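
P.S. For comparison, here is the variant I would normally write: open the append stream once, write the whole file, and close it once at the end. This is just a minimal sketch reusing the same address and paths as above (the class name WriteOnceAppend is mine); I haven't verified whether it avoids the space blow-up.

    import java.io.*;
    import java.net.URI;
    import java.net.URISyntaxException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.*;

    public class WriteOnceAppend {
        public static void main(String[] argv) throws IOException, URISyntaxException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(new URI("hdfs://192.168.94.185:8020"), conf);
            FileInputStream in = new FileInputStream("testdata.txt");
            // Open the append stream once instead of once per 1024-byte chunk
            FSDataOutputStream out = fs.append(new Path("/myhdfs/testdata.txt"));
            try {
                byte[] buffer = new byte[1024];
                int bytesRead;
                while ((bytesRead = in.read(buffer)) > 0) {
                    out.write(buffer, 0, bytesRead); // no per-chunk open/close
                }
            } finally {
                out.close(); // closed once, after the whole file is appended
                in.close();
            }
        }
    }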
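
P.P.S. Regarding the replacement policy named in the error: if I read the message correctly, a client can change it in its own Configuration. A sketch of what I think that would look like (the 'enable' property and the NEVER value are my assumptions; I haven't tested these settings on my cluster):

    // Client-side settings mentioned in the error message; values are for
    // illustration only. On a 3-node cluster there may be no spare datanode to
    // swap into the pipeline, so NEVER keeps the original pipeline instead of
    // failing the write.
    conf.setBoolean("dfs.client.block.write.replace-datanode-on-failure.enable", true);
    conf.set("dfs.client.block.write.replace-datanode-on-failure.policy", "NEVER");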