Turning off replication sacrifices the durability of your data: if a node goes down, that data is lost - in case that's not obvious.

On Wed, Nov 25, 2015 at 8:43 AM Alex Gittens <swift...@gmail.com> wrote:
> Thanks, the issue was indeed the dfs replication factor. To fix it without
> entirely clearing out HDFS and rebooting, I first ran
>
>     hdfs dfs -setrep -R -w 1 /
>
> to reduce all the current files' replication factor to 1, recursively from
> the root. Then I changed dfs.replication in
> ephemeral-hdfs/conf/hdfs-site.xml and ran ephemeral-hdfs/sbin/stop-all.sh
> and start-all.sh.
>
> Alex
>
> On Tue, Nov 24, 2015 at 10:43 PM, Ye Xianjin <advance...@gmail.com> wrote:
>
>> Hi AlexG:
>>
>> Files (blocks, more specifically) have 3 copies on HDFS by default, so
>> 3.8 * 3 = 11.4 TB.
>>
>> --
>> Ye Xianjin
>>
>> On Wednesday, November 25, 2015 at 2:31 PM, AlexG wrote:
>>
>> I downloaded a 3.8 TB dataset from S3 to a freshly launched spark-ec2
>> cluster with 16.73 TB of storage, using distcp. The dataset is a
>> collection of tar files of about 1.7 TB each. Nothing else was stored in
>> HDFS, but after completing the download, the namenode page says that
>> 11.59 TB are in use. When I use hdfs dfs -du -h -s, I see that the
>> dataset only takes up 3.8 TB as expected. I navigated through the entire
>> HDFS hierarchy from /, and don't see where the missing space is. Any
>> ideas what is going on and how to rectify it?
>>
>> I'm using the spark-ec2 script to launch, with the command
>>
>> spark-ec2 -k key -i ~/.ssh/key.pem -s 29 --instance-type=r3.8xlarge
>> --placement-group=pcavariants --copy-aws-credentials
>> --hadoop-major-version=yarn --spot-price=2.8 --region=us-west-2 launch
>> conversioncluster
>>
>> and am not modifying any configuration files for Hadoop.
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Why-does-a-3-8-T-dataset-take-up-11-59-Tb-on-HDFS-tp25471.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
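The arithmetic in Ye Xianjin's reply is the whole story: the namenode page reports raw capacity consumed (logical bytes times the replication factor), while `hdfs dfs -du` reports logical size. A minimal sketch of that accounting (the helper function is illustrative only, not an HDFS API):

```python
# Illustrative only: models how the HDFS namenode accounts for space.
# Raw usage = logical data size x replication factor (3 by default).

def raw_usage_tb(logical_tb: float, replication: int = 3) -> float:
    """Raw HDFS capacity consumed by a dataset of `logical_tb` terabytes."""
    return logical_tb * replication

print(f"{raw_usage_tb(3.8):.1f} TB")     # 11.4 TB -- close to the 11.59 TB the namenode showed
print(f"{raw_usage_tb(3.8, 1):.1f} TB")  # 3.8 TB once replication is reduced to 1
```

The small remaining gap between 11.4 TB and the reported 11.59 TB presumably comes from other files or metadata on the cluster, or rounding in the namenode UI.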
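For readers who want the concrete change Alex describes: a minimal `hdfs-site.xml` fragment setting the replication factor to 1 (with the caveat from the top of the thread that unreplicated data does not survive a node failure) might look like:

```xml
<!-- ephemeral-hdfs/conf/hdfs-site.xml: newly written files get 1 replica.
     Existing files keep their old factor until changed with setrep. -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
```

After editing, restart HDFS (as in the thread, via stop-all.sh and start-all.sh) for the new default to take effect.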