Things were going smoothly... until I hit the following:

py4j.protocol.Py4JJavaError: An error occurred while calling o1282.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED
Any ideas why this might occur? This is while running ALS on a subset of Netflix. It seems like Spark is pouring tons of data onto disk for each slave. Each slave looks roughly like this:

Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      7.9G  4.4G  3.4G  57% /
tmpfs            15G   12K   15G   1% /dev/shm
/dev/xvdb       151G   25G  118G  18% /mnt
/dev/xvdf       151G   24G  120G  17% /mnt2
/dev/xvdv       500G  1.8G  498G   1% /vol

On Thu, Jul 17, 2014 at 1:27 PM, Bill Jay <bill.jaypeter...@gmail.com> wrote:
> Hi,
>
> I also have some issues with repartition. In my program, I consume data
> from Kafka. After I consume the data, I call repartition(N). However,
> although I set N to 120, only around 18 executors are allocated for my
> reduce stage. I am not sure how the repartition command works to ensure
> parallelism.
>
> Bill
>
> On Thu, Jul 17, 2014 at 12:56 AM, Xiangrui Meng <men...@gmail.com> wrote:
>
>> Set N to be the total number of cores on the cluster, or less. sc.textFile
>> doesn't always give you that number; it depends on the block size. For
>> MovieLens, I think the default behavior gives you 2~3 partitions, so you
>> need to call repartition to ensure the right number of partitions.
>>
>> Which EC2 instance type did you use? I usually use m3.2xlarge or c?
>> instances that come with SSDs and a 1G or 10G network. For those
>> instances, you should see local drives mounted at /mnt, /mnt2, /mnt3,
>> ... Make sure there was no error when you used the ec2 script to
>> launch the cluster.
>>
>> It is a little strange to see 94% of / used on a slave. Maybe the
>> shuffle data went to /. I'm not sure which settings went wrong. I
>> recommend re-launching a cluster with m3.2xlarge instances, using the
>> default settings (don't set anything in SparkConf), and submitting the
>> application with --driver-memory 20g.
>>
>> The running times are slower than what I remember, but that depends on
>> the instance type.
>>
>> Best,
>> Xiangrui
>>
>> On Wed, Jul 16, 2014 at 10:18 PM, Chris DuBois <chris.dub...@gmail.com> wrote:
>> > Hi Xiangrui,
>> >
>> > I will try this shortly. When using N partitions, do you recommend N be the
>> > number of cores on each slave or the number of cores on the master? Forgive
>> > my ignorance, but is this best achieved as an argument to sc.textFile?
>> >
>> > The slaves on the EC2 clusters start with only 8 GB of storage, and it
>> > doesn't seem that /mnt/spark and /mnt2/spark are mounted anywhere else by
>> > default. Looking at spark-ec2/setup-slaves.sh, it appears that these are
>> > only mounted if the instance type begins with r3. (Or am I not reading that
>> > right?) My slaves are a different instance type, and currently look like
>> > this:
>> >
>> > Filesystem      Size  Used Avail Use% Mounted on
>> > /dev/xvda1      7.9G  7.3G  515M  94% /
>> > tmpfs           7.4G  4.0K  7.4G   1% /dev/shm
>> > /dev/xvdv       500G  2.5G  498G   1% /vol
>> >
>> > I have been able to finish ALS on MovieLens 10M only twice, taking 221s and
>> > 315s for 20 iterations at K=20 and lambda=0.01. Does that timing sound about
>> > right, or does it point to a poor configuration? The same script with
>> > MovieLens 1M runs fine in about 30-40s with the same settings. (In both
>> > cases I'm training on 70% of the data.)
>> >
>> > Thanks for your help!
>> > Chris
>> >
>> > On Wed, Jul 16, 2014 at 4:29 PM, Xiangrui Meng <men...@gmail.com> wrote:
>> >>
>> >> For ALS, I would recommend repartitioning the ratings to match the
>> >> number of CPU cores, or even less.
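As a concrete illustration of the repartitioning advice in this reply, here is a minimal PySpark sketch; the file path, the value of N, and the ALS parameters are placeholders rather than anything confirmed in the thread:

from pyspark import SparkConf, SparkContext
from pyspark.mllib.recommendation import ALS

sc = SparkContext(conf=SparkConf().setAppName("ALSRepartitionSketch"))

# Aim for roughly the total number of cores on the cluster, or fewer.
N = 32  # placeholder value

# minPartitions is only a hint; the actual partition count depends on the
# block size, so call repartition to get exactly N partitions.
lines = sc.textFile("/vol/ratings.dat", minPartitions=N)  # placeholder path
ratings = (lines.map(lambda l: l.split("::"))
                .map(lambda p: (int(p[0]), int(p[1]), float(p[2])))
                .repartition(N)
                .cache())

model = ALS.train(ratings, rank=20, iterations=20, lambda_=0.01)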
>> >> ALS is not computation-heavy for small k, but it is communication-heavy,
>> >> so having a small number of partitions may help. For EC2 clusters, we use
>> >> /mnt/spark and /mnt2/spark as the default local directories because they
>> >> are local hard drives. Did your last run of ALS on MovieLens 10M-100K
>> >> with the default settings succeed? -Xiangrui
>> >>
>> >> On Wed, Jul 16, 2014 at 8:00 AM, Chris DuBois <chris.dub...@gmail.com> wrote:
>> >> > Hi Xiangrui,
>> >> >
>> >> > I accidentally did not send df -i for the master node. Here it is at the
>> >> > moment of failure:
>> >> >
>> >> > Filesystem      Inodes   IUsed     IFree IUse% Mounted on
>> >> > /dev/xvda1      524288  280938    243350   54% /
>> >> > tmpfs          3845409       1   3845408    1% /dev/shm
>> >> > /dev/xvdb     10002432    1027  10001405    1% /mnt
>> >> > /dev/xvdf     10002432      16  10002416    1% /mnt2
>> >> > /dev/xvdv    524288000      13 524287987    1% /vol
>> >> >
>> >> > I am using the default settings now, but is there a way to make sure
>> >> > that the proper directories are being used? How many blocks/partitions
>> >> > do you recommend?
>> >> >
>> >> > Chris
>> >> >
>> >> > On Wed, Jul 16, 2014 at 1:09 AM, Chris DuBois <chris.dub...@gmail.com> wrote:
>> >> >>
>> >> >> Hi Xiangrui,
>> >> >>
>> >> >> Here is the result on the master node:
>> >> >> $ df -i
>> >> >> Filesystem      Inodes   IUsed     IFree IUse% Mounted on
>> >> >> /dev/xvda1      524288  273997    250291   53% /
>> >> >> tmpfs          1917974       1   1917973    1% /dev/shm
>> >> >> /dev/xvdv    524288000      30 524287970    1% /vol
>> >> >>
>> >> >> I have reproduced the error while using the MovieLens 10M data set on a
>> >> >> newly created cluster.
>> >> >>
>> >> >> Thanks for the help.
>> >> >> Chris
>> >> >>
>> >> >> On Wed, Jul 16, 2014 at 12:22 AM, Xiangrui Meng <men...@gmail.com> wrote:
>> >> >>>
>> >> >>> Hi Chris,
>> >> >>>
>> >> >>> Could you also try `df -i` on the master node? How many
>> >> >>> blocks/partitions did you set?
>> >> >>>
>> >> >>> In the current implementation, ALS doesn't clean the shuffle data,
>> >> >>> because the operations are chained together. But it shouldn't run out
>> >> >>> of disk space on the MovieLens dataset, which is small. The spark-ec2
>> >> >>> script sets /mnt/spark and /mnt2/spark as the local.dir by default; I
>> >> >>> would recommend leaving this setting at its default value.
>> >> >>>
>> >> >>> Best,
>> >> >>> Xiangrui
>> >> >>>
>> >> >>> On Wed, Jul 16, 2014 at 12:02 AM, Chris DuBois <chris.dub...@gmail.com> wrote:
>> >> >>> > Thanks for the quick responses!
>> >> >>> >
>> >> >>> > I used your final -Dspark.local.dir suggestion, but I see this
>> >> >>> > during the initialization of the application:
>> >> >>> >
>> >> >>> > 14/07/16 06:56:08 INFO storage.DiskBlockManager: Created local
>> >> >>> > directory at /vol/spark-local-20140716065608-7b2a
>> >> >>> >
>> >> >>> > I would have expected something in /mnt/spark/.
>> >> >>> >
>> >> >>> > Thanks,
>> >> >>> > Chris
>> >> >>> >
>> >> >>> > On Tue, Jul 15, 2014 at 11:44 PM, Chris Gore <cdg...@cdgore.com> wrote:
>> >> >>> >>
>> >> >>> >> Hi Chris,
>> >> >>> >>
>> >> >>> >> I've encountered this error when running Spark's ALS methods too.
>> >> >>> >> In my case, it was because I had set spark.local.dir improperly, and
>> >> >>> >> every time there was a shuffle, it would spill many GB of data onto
>> >> >>> >> the local drive.
>> >> >>> >> What fixed it was setting it to use the /mnt directory, where a
>> >> >>> >> network drive is mounted. For example, setting an environment variable:
>> >> >>> >>
>> >> >>> >> export SPACE=$(mount | grep mnt | awk '{print $3"/spark/"}' | xargs | sed 's/ /,/g')
>> >> >>> >>
>> >> >>> >> Then adding -Dspark.local.dir=$SPACE, or simply
>> >> >>> >> -Dspark.local.dir=/mnt/spark/,/mnt2/spark/, when you run your
>> >> >>> >> driver application.
>> >> >>> >>
>> >> >>> >> Chris
>> >> >>> >>
>> >> >>> >> On Jul 15, 2014, at 11:39 PM, Xiangrui Meng <men...@gmail.com> wrote:
>> >> >>> >> >
>> >> >>> >> > Check the number of inodes (df -i). The assembly build may create
>> >> >>> >> > many small files. -Xiangrui
>> >> >>> >> >
>> >> >>> >> > On Tue, Jul 15, 2014 at 11:35 PM, Chris DuBois <chris.dub...@gmail.com> wrote:
>> >> >>> >> >> Hi all,
>> >> >>> >> >>
>> >> >>> >> >> I am encountering the following error:
>> >> >>> >> >>
>> >> >>> >> >> INFO scheduler.TaskSetManager: Loss was due to java.io.IOException:
>> >> >>> >> >> No space left on device [duplicate 4]
>> >> >>> >> >>
>> >> >>> >> >> For each slave, df -h looks roughly like this, which makes the
>> >> >>> >> >> above error surprising:
>> >> >>> >> >>
>> >> >>> >> >> Filesystem      Size  Used Avail Use% Mounted on
>> >> >>> >> >> /dev/xvda1      7.9G  4.4G  3.5G  57% /
>> >> >>> >> >> tmpfs           7.4G  4.0K  7.4G   1% /dev/shm
>> >> >>> >> >> /dev/xvdb        37G  3.3G   32G  10% /mnt
>> >> >>> >> >> /dev/xvdf        37G  2.0G   34G   6% /mnt2
>> >> >>> >> >> /dev/xvdv       500G   33M  500G   1% /vol
>> >> >>> >> >>
>> >> >>> >> >> I'm on an EC2 cluster (c3.xlarge + 5 x m3) that I launched using
>> >> >>> >> >> the spark-ec2 scripts and a clone of Spark from today. The job I
>> >> >>> >> >> am running closely resembles the collaborative filtering example.
>> >> >>> >> >> This issue happens with the 1M version as well as the 10-million-
>> >> >>> >> >> rating version of the MovieLens dataset.
>> >> >>> >> >>
>> >> >>> >> >> I have seen previous questions, but they haven't helped yet. For
>> >> >>> >> >> example, I tried setting the Spark tmp directory to the EBS volume
>> >> >>> >> >> at /vol/, both by editing the Spark conf file (and copy-dir'ing it
>> >> >>> >> >> to the slaves) as well as through the SparkConf. Yet I still get
>> >> >>> >> >> the above error. Here is my current Spark config below. Note that
>> >> >>> >> >> I'm launching via ~/spark/bin/spark-submit.
>> >> >>> >> >>
>> >> >>> >> >> conf = SparkConf()
>> >> >>> >> >> conf.setAppName("RecommendALS").set("spark.local.dir",
>> >> >>> >> >>     "/vol/").set("spark.executor.memory", "7g").set("spark.akka.frameSize",
>> >> >>> >> >>     "100").setExecutorEnv("SPARK_JAVA_OPTS", " -Dspark.akka.frameSize=100")
>> >> >>> >> >> sc = SparkContext(conf=conf)
>> >> >>> >> >>
>> >> >>> >> >> Thanks for any advice,
>> >> >>> >> >> Chris
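Pulling the thread's suggestions together, here is a minimal sketch of the driver-side configuration the replies converge on: keep shuffle spill on the instance-local drives that spark-ec2 mounts at /mnt/spark and /mnt2/spark, rather than on the small root volume or the EBS volume at /vol/. The paths assume the spark-ec2 default mounts, and the memory and frame-size values simply mirror the settings already shown above:

from pyspark import SparkConf, SparkContext

# Sketch only: point spark.local.dir at the instance-local drives
# (spark-ec2's own defaults) instead of / or /vol/.
conf = (SparkConf()
        .setAppName("RecommendALS")
        .set("spark.local.dir", "/mnt/spark,/mnt2/spark")
        .set("spark.executor.memory", "7g")
        .set("spark.akka.frameSize", "100"))
sc = SparkContext(conf=conf)

Simpler still, per Xiangrui's advice above: leave spark.local.dir unset entirely and let the spark-ec2 defaults apply.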