Dear Experts,
We have a Spark cluster (standalone mode) in which the master and workers
are started from the root account. Everything runs correctly up to the point
where we try operations such as

dataFrame.select("name", "age").save(ofile, "parquet")

or

rdd.saveAsPickleFile(ofile)

where ofile is a path on a network-exported filesystem visible on all
nodes (in our case Lustre; I expect the effect would be similar on NFS).
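
For reference, the kind of job that triggers this looks roughly like the
following (a minimal sketch of our test case on the Spark 1.3 API; the output
paths and sample data are just placeholders for locations on the shared
Lustre mount):

from pyspark import SparkContext
from pyspark.sql import SQLContext

# Minimal sketch of the test job; paths below are placeholders.
sc = SparkContext(appName="save-test")
sqlContext = SQLContext(sc)

rows = [("Alice", 30), ("Bob", 25)]
dataFrame = sqlContext.createDataFrame(rows, ["name", "age"])

ofile = "file:///mnt/lustre/bigdata/med_home/tmp/test/namesAndAges.parquet"
dataFrame.select("name", "age").save(ofile, "parquet")

# The plain RDD variant fails the same way:
rdd = sc.parallelize(rows)
rdd.saveAsPickleFile("file:///mnt/lustre/bigdata/med_home/tmp/test/names.pickle")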
Unsurprisingly, the temporary files created on the workers are owned by root,
which then leads to a crash (see [1] below). Is there a solution or workaround
for this, e.g. controlling the file creation mode of the temporary files, along
the lines sketched below?
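
By "controlling the file creation mode" I mean something like the following.
This is only an untested guess on my side: fs.permissions.umask-mode is the
standard Hadoop 2.x umask property and spark.hadoop.* settings are copied
into the Hadoop configuration, but I don't know whether the local-filesystem
writers used for file:// output actually honour it here.

from pyspark import SparkConf, SparkContext

# Untested idea: try to force a permissive umask for the temporary
# files/directories the tasks create, by forwarding a Hadoop property
# through the spark.hadoop.* mechanism.
conf = (SparkConf()
        .setAppName("umask-test")
        .set("spark.hadoop.fs.permissions.umask-mode", "000"))
sc = SparkContext(conf=conf)

The hope would be that the root-owned _temporary directories end up
world-writable, so the final rename done for the submitting user can succeed.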
Cheers,
Tomasz
PS: I've tried to google this problem; there are a couple of similar reports,
but I found no clear answer or solution.
PS2: For completeness, running the master/workers as a regular user solves
the problem only for that user. For other users submitting to this
master, the result is shown in [2] below.
[0] Cluster details:
Master/workers: CentOS 6.5
Spark 1.3.1 prebuilt for Hadoop 2.4 (same behaviour with the Hadoop 2.6 build)
[1]
##################################################################
  File "/mnt/home/tfruboes/2015.05.SparkLocal/spark-1.3.1-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o27.save.
: java.io.IOException: Failed to rename DeprecatedRawLocalFileStatus{path=file:/mnt/lustre/bigdata/med_home/tmp/test19EE/namesAndAges.parquet2/_temporary/0/task_201505191540_0009_r_000001/part-r-00002.parquet; isDirectory=false; length=534; replication=1; blocksize=33554432; modification_time=1432042832000; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false} to file:/mnt/lustre/bigdata/med_home/tmp/test19EE/namesAndAges.parquet2/part-r-00002.parquet
    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:346)
    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:362)
    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310)
    at parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:43)
    at org.apache.spark.sql.parquet.ParquetRelation2.insert(newParquet.scala:690)
    at org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:129)
    at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:240)
    at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1196)
    at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1181)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)
##################################################################
[2]
##################################################################
15/05/19 14:45:19 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 3, wn23023.cis.gov.pl): java.io.IOException: Mkdirs failed to create file:/mnt/lustre/bigdata/med_home/tmp/test18/namesAndAges.parquet2/_temporary/0/_temporary/attempt_201505191445_0009_r_000000_0
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:438)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:887)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:784)
    at parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:154)
    at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:279)
    at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
    at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:667)
    at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:689)
    at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:689)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
##################################################################