You are using yarn-client mode, so the driver does not run in a YARN container and cannot use yarn.nodemanager.local-dirs; it can only use spark.local.dir, which is /tmp by default. But the driver usually does not consume much disk, so it should be fine to leave /tmp on the driver side.
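A minimal sketch of moving the driver side off /tmp, assuming the setting goes into conf/spark-defaults.conf on the machine that launches the application (the /data01/tmp path is just the example disk from this thread):

    # conf/spark-defaults.conf on the driver machine
    # (yarn-client mode: this affects only the driver;
    # executors keep using yarn.nodemanager.local-dirs)
    spark.local.dir    /data01/tmp

    # or per invocation:
    spark-shell --conf spark.local.dir=/data01/tmp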
On Tue, Mar 1, 2016 at 4:57 PM, Alexander Pivovarov <apivova...@gmail.com> wrote:
> spark 1.6.0 uses /tmp in the following places
> # spark.local.dir is not set
> yarn.nodemanager.local-dirs=/data01/yarn/nm,/data02/yarn/nm
>
> 1. spark-shell on start
> 16/03/01 08:33:48 INFO storage.DiskBlockManager: Created local directory at
> /tmp/blockmgr-ffd3143d-b47f-4844-99fd-2d51c6a05d05
>
> 2. spark-shell on start
> 16/03/01 08:33:50 INFO yarn.Client: Uploading resource
> file:/tmp/spark-456184c9-d59f-48f4-a9b0-560b7d310655/__spark_conf__6943938018805427428.zip
> -> hdfs://ip-10-101-124-30:8020/user/hadoop/.sparkStaging/application_1456776184284_0047/__spark_conf__6943938018805427428.zip
>
> 3. spark-shell / spark-sql (Hive) on start
> 16/03/01 08:34:06 INFO session.SessionState: Created local directory:
> /tmp/01705299-a384-4e85-923b-e858017cf351_resources
> 16/03/01 08:34:06 INFO session.SessionState: Created HDFS directory:
> /tmp/hive/hadoop/01705299-a384-4e85-923b-e858017cf351
> 16/03/01 08:34:06 INFO session.SessionState: Created local directory:
> /tmp/hadoop/01705299-a384-4e85-923b-e858017cf351
> 16/03/01 08:34:06 INFO session.SessionState: Created HDFS directory:
> /tmp/hive/hadoop/01705299-a384-4e85-923b-e858017cf351/_tmp_space.db
>
> 4. Spark executor container uses hadoop.tmp.dir /data01/tmp/hadoop-${user.name}
> for s3 output
>
> scala> sc.parallelize(1 to 10).saveAsTextFile("s3n://my_bucket/test/p10_13");
>
> 16/03/01 08:41:13 INFO s3native.NativeS3FileSystem: OutputStream for key
> 'test/p10_13/part-00000' writing to tempfile
> '/data01/tmp/hadoop-hadoop/s3/output-7399167152756918334.tmp'
>
> --------------------------------------------------
>
> If I set spark.local.dir=/data01/tmp then #1 and #2 use /data01/tmp instead
> of /tmp
>
> --------------------------------------------------
>
> 1. 16/03/01 08:47:03 INFO storage.DiskBlockManager: Created local directory at
> /data01/tmp/blockmgr-db88dbd2-0ef4-433a-95ea-b33392bbfb7f
>
> 2. 16/03/01 08:47:05 INFO yarn.Client: Uploading resource
> file:/data01/tmp/spark-aa3e619c-a368-4f95-bd41-8448a78ae456/__spark_conf__368426817234224667.zip
> -> hdfs://ip-10-101-124-30:8020/user/hadoop/.sparkStaging/application_1456776184284_0050/__spark_conf__368426817234224667.zip
>
> 3. spark-sql (hive) still uses /tmp
>
> 16/03/01 08:47:20 INFO session.SessionState: Created local directory:
> /tmp/d315926f-39d7-4dcb-b3fa-60e9976f7197_resources
> 16/03/01 08:47:20 INFO session.SessionState: Created HDFS directory:
> /tmp/hive/hadoop/d315926f-39d7-4dcb-b3fa-60e9976f7197
> 16/03/01 08:47:20 INFO session.SessionState: Created local directory:
> /tmp/hadoop/d315926f-39d7-4dcb-b3fa-60e9976f7197
> 16/03/01 08:47:20 INFO session.SessionState: Created HDFS directory:
> /tmp/hive/hadoop/d315926f-39d7-4dcb-b3fa-60e9976f7197/_tmp_space.db
>
> 4. executor uses hadoop.tmp.dir for s3 output
>
> 16/03/01 08:50:01 INFO s3native.NativeS3FileSystem: OutputStream for key
> 'test/p10_16/_SUCCESS' writing to tempfile
> '/data01/tmp/hadoop-hadoop/s3/output-2541604454681305094.tmp'
>
> 5. /data0X/yarn/nm used for usercache
>
> 16/03/01 08:41:12 INFO storage.DiskBlockManager: Created local directory at
> /data01/yarn/nm/usercache/hadoop/appcache/application_1456776184284_0047/blockmgr-af5
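A side note on items #3 and #4 above: those locations come from Hive and Hadoop settings, not from spark.local.dir. Below is a sketch of the properties that would move them, assuming the standard property names (hive.exec.local.scratchdir, hive.exec.scratchdir, hive.downloaded.resources.dir, hadoop.tmp.dir, fs.s3.buffer.dir); check the defaults for your Hive/Hadoop versions:

    # hive-site.xml settings, written as property=value for brevity
    hive.exec.local.scratchdir=/data01/tmp/hive      # local session dirs seen in #3
    hive.downloaded.resources.dir=/data01/tmp/${hive.session.id}_resources
    hive.exec.scratchdir=/tmp/hive                   # the HDFS dirs in #3 (on HDFS, not local /tmp)

    # core-site.xml settings
    hadoop.tmp.dir=/data01/tmp/hadoop-${user.name}
    # NativeS3FileSystem buffers uploads under fs.s3.buffer.dir,
    # which defaults to ${hadoop.tmp.dir}/s3; that explains the
    # tempfile paths in #4
    fs.s3.buffer.dir=${hadoop.tmp.dir}/s3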
> On Mon, Feb 29, 2016 at 3:44 PM, Jeff Zhang <zjf...@gmail.com> wrote:
>
>> In yarn mode, yarn.nodemanager.local-dirs is used instead of spark.local.dir
>> for shuffle data and block manager disk data. What do you mean by "But output
>> files to upload to s3 still created in /tmp on slaves"? You should have
>> control over where your output data is stored, if that means your job's
>> output.
>>
>> On Tue, Mar 1, 2016 at 3:12 AM, Alexander Pivovarov <apivova...@gmail.com> wrote:
>>
>>> I have Spark on yarn
>>>
>>> I defined yarn.nodemanager.local-dirs to be
>>> /data01/yarn/nm,/data02/yarn/nm
>>>
>>> When I look at a yarn executor container log I see that blockmanager
>>> files are created in /data01/yarn/nm,/data02/yarn/nm
>>>
>>> But output files to upload to s3 are still created in /tmp on slaves
>>>
>>> I do not want Spark to write heavy files to /tmp because /tmp is only 5GB
>>>
>>> Spark slaves have two big additional disks, /data01 and /data02, attached
>>>
>>> Probably I can set spark.local.dir to /data01/tmp,/data02/tmp
>>>
>>> But the spark master also writes some files to spark.local.dir,
>>> and my master box has only one additional disk, /data01
>>>
>>> So, what should I use for spark.local.dir:
>>> spark.local.dir=/data01/tmp
>>> or
>>> spark.local.dir=/data01/tmp,/data02/tmp
>>> ?
>>
>> --
>> Best Regards
>>
>> Jeff Zhang

--
Best Regards

Jeff Zhang
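Tying the thread together: on YARN the executors ignore spark.local.dir (the NodeManager hands them LOCAL_DIRS derived from yarn.nodemanager.local-dirs, which is why #5 above shows blockmgr under the usercache), so the value only matters on the box that runs the driver. A sketch of an answer to the original question, assuming the disk layout described above:

    # The master/driver box has only /data01, so one directory is enough:
    spark-shell --conf spark.local.dir=/data01/tmp

    # spark.local.dir does accept a comma-separated list, so a driver box
    # that had both disks could spread its local files with:
    #   --conf spark.local.dir=/data01/tmp,/data02/tmp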