Re: Why does a 3.8 T dataset take up 11.59 Tb on HDFS

2015-11-26 Thread Gylfi
HDFS has a default replication factor of 3.

Re: Why does a 3.8 T dataset take up 11.59 Tb on HDFS

2015-11-25 Thread Alex Gittens
Thanks, the issue was indeed the dfs replication factor. To fix it without entirely clearing out HDFS and rebooting, I first ran hdfs dfs -setrep -R -w 1 / to reduce all the current files' replication factor to 1 recursively from the root, then I changed the dfs.replication factor in

Re: Why does a 3.8 T dataset take up 11.59 Tb on HDFS

2015-11-25 Thread Ilya Ganelin
Turning off replication sacrifices durability of your data, so if a node goes down the data is lost - in case that's not obvious. On Wed, Nov 25, 2015 at 8:43 AM Alex Gittens wrote: > Thanks, the issue was indeed the dfs replication factor. To fix it without > entirely

Why does a 3.8 T dataset take up 11.59 Tb on HDFS

2015-11-24 Thread AlexG
I downloaded a 3.8 T dataset from S3 to a freshly launched spark-ec2 cluster with 16.73 Tb storage, using distcp. The dataset is a collection of tar files of about 1.7 Tb each. Nothing else was stored in the HDFS, but after completing the download, the namenode page says that 11.59 Tb are in use.

Re: Why does a 3.8 T dataset take up 11.59 Tb on HDFS

2015-11-24 Thread Koert Kuipers
What is your HDFS replication set to? On Wed, Nov 25, 2015 at 1:31 AM, AlexG wrote: > I downloaded a 3.8 T dataset from S3 to a freshly launched spark-ec2 cluster with 16.73 Tb storage, using distcp. The dataset is a collection of tar files of about 1.7 Tb each. >

Re: Why does a 3.8 T dataset take up 11.59 Tb on HDFS

2015-11-24 Thread Ye Xianjin
Hi AlexG: Files (blocks, more specifically) have 3 copies on HDFS by default, so 3.8 * 3 = 11.4 TB. -- Ye Xianjin On Wednesday, November 25, 2015 at 2:31 PM, AlexG wrote: > I downloaded a 3.8 T dataset from S3 to a freshly launched
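
The arithmetic in this reply can be checked with a minimal sketch. It assumes raw usage is simply the logical size times the replication factor and ignores any other overhead. The function names are hypothetical and only illustrate the calculation; the 3.8 and 11.59 figures come from the original question.

```python
def expected_hdfs_usage_tb(logical_tb: float, replication: int = 3) -> float:
    """Raw space consumed if every block is stored `replication` times."""
    return logical_tb * replication


def implied_replication(used_tb: float, logical_tb: float) -> float:
    """Replication factor suggested by the reported usage."""
    return used_tb / logical_tb


logical_tb = 3.8   # dataset size reported by the poster
used_tb = 11.59    # usage shown on the namenode page

# Expected raw usage at the default replication factor of 3
print(f"expected usage: {expected_hdfs_usage_tb(logical_tb):.2f} TB")

# Ratio of reported usage to logical size; a value near 3 points at replication
print(f"implied replication: {implied_replication(used_tb, logical_tb):.2f}")
```

The implied ratio comes out close to 3, which is consistent with the replication explanation. The thread does not account for the small remaining gap between 11.4 and 11.59.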