Things were going smoothly... until I hit the following:

py4j.protocol.Py4JJavaError: An error occurred while calling o1282.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED
Any ideas why this might occur? This is while running ALS on a subset of Netflix. It seems like Spark is pouring tons of data onto disk for each slave. Each slave looks roughly like this:

Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      7.9G  4.4G  3.4G  57% /
tmpfs            15G   12K   15G   1% /dev/shm
/dev/xvdb       151G   25G  118G  18% /mnt
/dev/xvdf       151G   24G  120G  17% /mnt2
/dev/xvdv       500G  1.8G  498G   1% /vol

On Thu, Jul 17, 2014 at 1:27 PM, Bill Jay <bill.jaypeter...@gmail.com> wrote:
> Hi,
>
> I also have some issues with repartition. In my program, I consume data
> from Kafka. After I consume the data, I call repartition(N). However,
> although I set N to 120, only around 18 executors are allocated for my
> reduce stage. I am not sure how the repartition command works to ensure
> parallelism.
>
> Bill
>
> On Thu, Jul 17, 2014 at 12:56 AM, Xiangrui Meng <men...@gmail.com> wrote:
>
>> Set N to be the total number of cores on the cluster, or less. sc.textFile
>> doesn't always give you that number; it depends on the block size. For
>> MovieLens, I think the default behavior gives you 2~3 partitions, so you
>> need to call repartition to ensure the right number of partitions.
>>
>> Which EC2 instance type did you use? I usually use m3.2xlarge or c?
>> instances that come with SSDs and a 1G or 10G network. For those
>> instances, you should see local drives mounted at /mnt, /mnt2, /mnt3,
>> ... Make sure there was no error when you used the ec2 script to
>> launch the cluster.
>>
>> It is a little strange to see 94% of / used on a slave. Maybe the
>> shuffle data went to /. I'm not sure which settings went wrong. I
>> recommend re-launching a cluster with m3.2xlarge instances, using the
>> default settings (don't set anything in SparkConf), and submitting the
>> application with --driver-memory 20g.
>>
>> The running times are slower than what I remember, but that depends on
>> the instance type.
>>
>> Best,
>> Xiangrui
>>
>> On Wed, Jul 16, 2014 at 10:18 PM, Chris DuBois <chris.dub...@gmail.com> wrote:
>> > Hi Xiangrui,
>> >
>> > I will try this shortly. When using N partitions, do you recommend N be the
>> > number of cores on each slave or the number of cores on the master? Forgive
>> > my ignorance, but is this best achieved as an argument to sc.textFile?
>> >
>> > The slaves on the EC2 clusters start with only 8 GB of storage, and it
>> > doesn't seem that /mnt/spark and /mnt2/spark are mounted anywhere else by
>> > default. Looking at spark-ec2/setup-slaves.sh, it appears that these are
>> > only mounted if the instance type begins with r3. (Or am I not reading that
>> > right?) My slaves are a different instance type, and currently look like
>> > this:
>> >
>> > Filesystem      Size  Used Avail Use% Mounted on
>> > /dev/xvda1      7.9G  7.3G  515M  94% /
>> > tmpfs           7.4G  4.0K  7.4G   1% /dev/shm
>> > /dev/xvdv       500G  2.5G  498G   1% /vol
>> >
>> > I have been able to finish ALS on MovieLens 10M only twice, taking 221s and
>> > 315s for 20 iterations at K=20 and lambda=0.01. Does that timing sound about
>> > right, or does it point to a poor configuration? The same script with
>> > MovieLens 1M runs fine in about 30-40s with the same settings. (In both
>> > cases I'm training on 70% of the data.)
>> >
>> > Thanks for your help!
>> > Chris
>> >
>> > On Wed, Jul 16, 2014 at 4:29 PM, Xiangrui Meng <men...@gmail.com> wrote:
>> >>
>> >> For ALS, I would recommend repartitioning the ratings to match the
>> >> number of CPU cores, or even less.
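As a concrete illustration of the repartitioning advice in this reply, here is a minimal PySpark sketch; the file path, the value of N, and the ALS parameters are placeholders rather than anything confirmed in the thread:

from pyspark import SparkConf, SparkContext
from pyspark.mllib.recommendation import ALS

sc = SparkContext(conf=SparkConf().setAppName("ALSRepartitionSketch"))

# Aim for roughly the total number of cores on the cluster, or fewer.
N = 32  # placeholder value

# minPartitions is only a hint; the actual partition count depends on the
# block size, so call repartition to get exactly N partitions.
lines = sc.textFile("/vol/ratings.dat", minPartitions=N)  # placeholder path
ratings = (lines.map(lambda l: l.split("::"))
                .map(lambda p: (int(p[0]), int(p[1]), float(p[2])))
                .repartition(N)
                .cache())

model = ALS.train(ratings, rank=20, iterations=20, lambda_=0.01)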
>> >> ALS is not computation-heavy for small k, but it is communication-heavy,
>> >> so having a small number of partitions may help. For EC2 clusters, we use
>> >> /mnt/spark and /mnt2/spark as the default local directories because they
>> >> are local hard drives. Did your last run of ALS on MovieLens 10M-100K
>> >> with the default settings succeed? -Xiangrui
>> >>
>> >> On Wed, Jul 16, 2014 at 8:00 AM, Chris DuBois <chris.dub...@gmail.com> wrote:
>> >> > Hi Xiangrui,
>> >> >
>> >> > I accidentally did not send df -i for the master node. Here it is at the
>> >> > moment of failure:
>> >> >
>> >> > Filesystem      Inodes   IUsed     IFree IUse% Mounted on
>> >> > /dev/xvda1      524288  280938    243350   54% /
>> >> > tmpfs          3845409       1   3845408    1% /dev/shm
>> >> > /dev/xvdb     10002432    1027  10001405    1% /mnt
>> >> > /dev/xvdf     10002432      16  10002416    1% /mnt2
>> >> > /dev/xvdv    524288000      13 524287987    1% /vol
>> >> >
>> >> > I am using the default settings now, but is there a way to make sure
>> >> > that the proper directories are being used? How many blocks/partitions
>> >> > do you recommend?
>> >> >
>> >> > Chris
>> >> >
>> >> > On Wed, Jul 16, 2014 at 1:09 AM, Chris DuBois <chris.dub...@gmail.com> wrote:
>> >> >>
>> >> >> Hi Xiangrui,
>> >> >>
>> >> >> Here is the result on the master node:
>> >> >> $ df -i
>> >> >> Filesystem      Inodes   IUsed     IFree IUse% Mounted on
>> >> >> /dev/xvda1      524288  273997    250291   53% /
>> >> >> tmpfs          1917974       1   1917973    1% /dev/shm
>> >> >> /dev/xvdv    524288000      30 524287970    1% /vol
>> >> >>
>> >> >> I have reproduced the error while using the MovieLens 10M data set on a
>> >> >> newly created cluster.
>> >> >>
>> >> >> Thanks for the help.
>> >> >> Chris
>> >> >>
>> >> >> On Wed, Jul 16, 2014 at 12:22 AM, Xiangrui Meng <men...@gmail.com> wrote:
>> >> >>>
>> >> >>> Hi Chris,
>> >> >>>
>> >> >>> Could you also try `df -i` on the master node? How many
>> >> >>> blocks/partitions did you set?
>> >> >>>
>> >> >>> In the current implementation, ALS doesn't clean the shuffle data,
>> >> >>> because the operations are chained together. But it shouldn't run out
>> >> >>> of disk space on the MovieLens dataset, which is small. The spark-ec2
>> >> >>> script sets /mnt/spark and /mnt2/spark as the local.dir by default; I
>> >> >>> would recommend leaving this setting at its default value.
>> >> >>>
>> >> >>> Best,
>> >> >>> Xiangrui
>> >> >>>
>> >> >>> On Wed, Jul 16, 2014 at 12:02 AM, Chris DuBois <chris.dub...@gmail.com> wrote:
>> >> >>> > Thanks for the quick responses!
>> >> >>> >
>> >> >>> > I used your final -Dspark.local.dir suggestion, but I see this
>> >> >>> > during the initialization of the application:
>> >> >>> >
>> >> >>> > 14/07/16 06:56:08 INFO storage.DiskBlockManager: Created local
>> >> >>> > directory at /vol/spark-local-20140716065608-7b2a
>> >> >>> >
>> >> >>> > I would have expected something in /mnt/spark/.
>> >> >>> >
>> >> >>> > Thanks,
>> >> >>> > Chris
>> >> >>> >
>> >> >>> > On Tue, Jul 15, 2014 at 11:44 PM, Chris Gore <cdg...@cdgore.com> wrote:
>> >> >>> >>
>> >> >>> >> Hi Chris,
>> >> >>> >>
>> >> >>> >> I've encountered this error when running Spark's ALS methods too.
>> >> >>> >> In my case, it was because I had set spark.local.dir improperly, and
>> >> >>> >> every time there was a shuffle, it would spill many GB of data onto
>> >> >>> >> the local drive.
>> >> >>> >> What fixed it was setting it to use the /mnt directory, where a
>> >> >>> >> network drive is mounted. For example, setting an environment variable:
>> >> >>> >>
>> >> >>> >> export SPACE=$(mount | grep mnt | awk '{print $3"/spark/"}' | xargs | sed 's/ /,/g')
>> >> >>> >>
>> >> >>> >> Then adding -Dspark.local.dir=$SPACE, or simply
>> >> >>> >> -Dspark.local.dir=/mnt/spark/,/mnt2/spark/, when you run your
>> >> >>> >> driver application.
>> >> >>> >>
>> >> >>> >> Chris
>> >> >>> >>
>> >> >>> >> On Jul 15, 2014, at 11:39 PM, Xiangrui Meng <men...@gmail.com> wrote:
>> >> >>> >> >
>> >> >>> >> > Check the number of inodes (df -i). The assembly build may create
>> >> >>> >> > many small files. -Xiangrui
>> >> >>> >> >
>> >> >>> >> > On Tue, Jul 15, 2014 at 11:35 PM, Chris DuBois <chris.dub...@gmail.com> wrote:
>> >> >>> >> >> Hi all,
>> >> >>> >> >>
>> >> >>> >> >> I am encountering the following error:
>> >> >>> >> >>
>> >> >>> >> >> INFO scheduler.TaskSetManager: Loss was due to java.io.IOException:
>> >> >>> >> >> No space left on device [duplicate 4]
>> >> >>> >> >>
>> >> >>> >> >> For each slave, df -h looks roughly like this, which makes the
>> >> >>> >> >> above error surprising:
>> >> >>> >> >>
>> >> >>> >> >> Filesystem      Size  Used Avail Use% Mounted on
>> >> >>> >> >> /dev/xvda1      7.9G  4.4G  3.5G  57% /
>> >> >>> >> >> tmpfs           7.4G  4.0K  7.4G   1% /dev/shm
>> >> >>> >> >> /dev/xvdb        37G  3.3G   32G  10% /mnt
>> >> >>> >> >> /dev/xvdf        37G  2.0G   34G   6% /mnt2
>> >> >>> >> >> /dev/xvdv       500G   33M  500G   1% /vol
>> >> >>> >> >>
>> >> >>> >> >> I'm on an EC2 cluster (c3.xlarge + 5 x m3) that I launched using
>> >> >>> >> >> the spark-ec2 scripts and a clone of Spark from today. The job I
>> >> >>> >> >> am running closely resembles the collaborative filtering example.
>> >> >>> >> >> This issue happens with the 1M version as well as the 10-million-
>> >> >>> >> >> rating version of the MovieLens dataset.
>> >> >>> >> >>
>> >> >>> >> >> I have seen previous questions, but they haven't helped yet. For
>> >> >>> >> >> example, I tried setting the Spark tmp directory to the EBS volume
>> >> >>> >> >> at /vol/, both by editing the Spark conf file (and copy-dir'ing it
>> >> >>> >> >> to the slaves) as well as through the SparkConf. Yet I still get
>> >> >>> >> >> the above error. Here is my current Spark config below. Note that
>> >> >>> >> >> I'm launching via ~/spark/bin/spark-submit.
>> >> >>> >> >>
>> >> >>> >> >> conf = SparkConf()
>> >> >>> >> >> conf.setAppName("RecommendALS").set("spark.local.dir",
>> >> >>> >> >>     "/vol/").set("spark.executor.memory", "7g").set("spark.akka.frameSize",
>> >> >>> >> >>     "100").setExecutorEnv("SPARK_JAVA_OPTS", " -Dspark.akka.frameSize=100")
>> >> >>> >> >> sc = SparkContext(conf=conf)
>> >> >>> >> >>
>> >> >>> >> >> Thanks for any advice,
>> >> >>> >> >> Chris
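Pulling the thread's suggestions together, here is a minimal sketch of the driver-side configuration the replies converge on: keep shuffle spill on the instance-local drives that spark-ec2 mounts at /mnt/spark and /mnt2/spark, rather than on the small root volume or the EBS volume at /vol/. The paths assume the spark-ec2 default mounts, and the memory and frame-size values simply mirror the settings already shown above:

from pyspark import SparkConf, SparkContext

# Sketch only: point spark.local.dir at the instance-local drives
# (spark-ec2's own defaults) instead of / or /vol/.
conf = (SparkConf()
        .setAppName("RecommendALS")
        .set("spark.local.dir", "/mnt/spark,/mnt2/spark")
        .set("spark.executor.memory", "7g")
        .set("spark.akka.frameSize", "100"))
sc = SparkContext(conf=conf)

Simpler still, per Xiangrui's advice above: leave spark.local.dir unset entirely and let the spark-ec2 defaults apply.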