[ https://issues.apache.org/jira/browse/SPARK-25293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16614245#comment-16614245 ]
omkar puttagunta edited comment on SPARK-25293 at 9/14/18 2:03 AM:
-------------------------------------------------------------------

[~hyukjin.kwon] tested with 2.1.3, got the same issue. My Stack Overflow question got answers saying that this is due to the lack of a shared file system. Is that the real reason? I am running Spark in standalone mode, with no HDFS or any other distributed file system. If I use FileOutputCommitter version 2, will I get the desired result?

[https://stackoverflow.com/questions/52089208/spark-dataframe-write-to-csv-creates-temporary-directory-file-in-standalone-clu]

was (Author: omkar999):
[~hyukjin.kwon] tested with 2.1.3, got the same issue. My Stack Overflow question got answers saying that this is due to the lack of a shared file system. Is that the real reason? If I use FileOutputCommitter version 2, will I get the desired result?

[https://stackoverflow.com/questions/52089208/spark-dataframe-write-to-csv-creates-temporary-directory-file-in-standalone-clu]

> Dataframe write to csv saves part files in outputDireotry/task-xx/part-xxx
> instead of directly saving in outputDir
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-25293
>                 URL: https://issues.apache.org/jira/browse/SPARK-25293
>             Project: Spark
>          Issue Type: Bug
>          Components: EC2, Java API, Spark Shell, Spark Submit
>    Affects Versions: 2.0.2, 2.1.3
>            Reporter: omkar puttagunta
>            Priority: Major
>
> [https://stackoverflow.com/questions/52108335/why-spark-dataframe-writes-part-files-to-temporary-in-instead-directly-creating]
>
> {quote}Running Spark 2.0.2 in Standalone Cluster Mode; 2 workers and 1 master node on AWS EC2{quote}
>
> Simple test: reading a pipe-delimited file and writing the data to CSV.
> Commands below are executed in spark-shell with the master URL set:
> {{val df = spark.sqlContext.read.option("delimiter","|").option("quote","\u0000").csv("/home/input-files/")}}
> {{val emailDf = df.filter("_c3='EML'")}}
> {{emailDf.repartition(100).write.csv("/opt/outputFile/")}}
>
> After executing the commands above:
> {quote}On {{worker1}} -> each part file is created in {{/opt/outputFile/_temporary/task-xxxxx-xxx/part-xxx-xxx}}
> On {{worker2}} -> {{/opt/outputFile/part-xxx}} => part files are generated directly under the output directory specified during write.{quote}
>
> *The same thing happens with coalesce(100), or without specifying repartition/coalesce at all. Tried with Java as well!*
>
> *_Question_*
> 1) Why doesn't the {{/opt/outputFile/}} output directory on {{worker1}} have {{part-xxxx}} files just like on {{worker2}}? Why is a {{_temporary}} directory created, with the {{part-xxx-xx}} files residing in the {{task-xxx}} directories?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
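[Editor's note] The comment above asks whether FileOutputCommitter version 2 would change the observed layout. A minimal sketch of how that setting can be applied from spark-shell is below, assuming Spark 2.x; the reproduction steps (delimiter, quote, filter, paths) are taken from the report, while the `SparkSession`-based wiring is illustrative, not from the original thread. Note that v2 only changes *when* task output is moved into the destination directory (at task commit rather than job commit); it does not replicate files between workers, so it cannot by itself substitute for a shared file system.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("csv-write-committer-v2-sketch")
  // FileOutputCommitter algorithm version 2: each task moves its output
  // directly into the final output directory when the task commits, so
  // successful runs should not leave task-xxx subdirectories behind.
  // (Hadoop configs are passed through with the spark.hadoop. prefix.)
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()

// Reproduction steps from the report.
val df = spark.read
  .option("delimiter", "|")
  .option("quote", "\u0000")
  .csv("/home/input-files/")

val emailDf = df.filter("_c3='EML'")

// Without a shared file system, each executor still writes its part files
// to its own local /opt/outputFile/ -- the committer version does not
// merge output across machines.
emailDf.repartition(100).write.csv("/opt/outputFile/")
```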