[jira] [Commented] (SPARK-38115) No spark conf to control the path of _temporary when writing to target filesystem

2022-02-22 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17496022#comment-17496022
 ] 

Steve Loughran commented on SPARK-38115:


bq. Is there any config to stop using FileOutputCommitter? We didn't 
explicitly set any conf to use the committers.

Yes; see the S3A committers documentation:
https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/committers.html

bq. I am just looking to see whether I can use conf/options to set the 
temporary location as a staging area and keep the target path as the primary.

No, because the commit-by-rename mechanism is broken on S3; tuning the temp 
dir location isn't going to fix that.
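
To illustrate why moving the temp dir does not help, here is a toy model, not real S3 or Hadoop code. It assumes, as is the case for object stores such as S3, that there is no native rename, so a "directory rename" is emulated as a per-object copy followed by a delete. ToyObjectStore, rename_prefix and the paths are hypothetical names made up for this sketch.

```python
# Toy model only: a dict stands in for an S3 bucket (key -> bytes).
# ToyObjectStore and its methods are hypothetical, not an S3 or Hadoop API.


class ToyObjectStore:
    def __init__(self):
        self.objects = {}
        self.bytes_copied = 0  # total data moved by emulated renames

    def put(self, key, data):
        self.objects[key] = data

    def rename_prefix(self, src, dst, fail_after=None):
        """Emulate a directory rename as copy-then-delete, one object at a time.

        fail_after simulates a crash once that many objects have been moved.
        """
        moved = 0
        # sorted() materializes the key list, so deleting inside the loop is safe.
        for key in sorted(k for k in self.objects if k.startswith(src)):
            if fail_after is not None and moved >= fail_after:
                raise RuntimeError("simulated worker failure mid-rename")
            data = self.objects[key]
            self.objects[dst + key[len(src):]] = data  # copy to the new key
            self.bytes_copied += len(data)  # cost grows with data size
            del self.objects[key]  # then delete the source object
            moved += 1
        return moved


def demo():
    store = ToyObjectStore()
    tmp = "out/_temporary/0/task_0001/"
    store.put(tmp + "part-00000.parquet", b"a" * 100)
    store.put(tmp + "part-00001.parquet", b"b" * 200)
    try:
        # Crash after one object: the "rename" is left half done.
        store.rename_prefix(tmp, "out/", fail_after=1)
    except RuntimeError:
        pass
    return store
```

In this model the cost of the commit is proportional to the bytes written, and a failure part way through leaves some files in the destination and some still under _temporary. Pointing the temporary prefix at a different bucket changes neither property.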

> No spark conf to control the path of _temporary when writing to target 
> filesystem
> -
>
> Key: SPARK-38115
> URL: https://issues.apache.org/jira/browse/SPARK-38115
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.4.8, 3.2.1
> Reporter: kk
> Priority: Minor
> Labels: spark, spark-conf, spark-sql, spark-submit
>
> No default spark conf or param to control the '_temporary' path when writing 
> to filesystem.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38115) No spark conf to control the path of _temporary when writing to target filesystem

2022-02-15 Thread kk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492824#comment-17492824
 ] 

kk commented on SPARK-38115:


Is there any config to stop using FileOutputCommitter? We didn't explicitly 
set any conf to use the committers.

Moreover, when overwriting on s3:// I don't have a problem with _temporary. 
The problem arises only if our path uses s3a://.

I am just looking to see whether I can use conf/options to set the temporary 
location as a staging area and keep the target path as the primary.




[jira] [Commented] (SPARK-38115) No spark conf to control the path of _temporary when writing to target filesystem

2022-02-15 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492810#comment-17492810
 ] 

Steve Loughran commented on SPARK-38115:


* Stop using the classic FileOutputCommitter for your work, unless you like 
waiting a long time for your jobs to complete, along with a risk of corrupt 
data in the presence of worker failures.
* The choice of where temporary paths go is a function of the committer, not 
the Spark codebase. The S3A staging committer uses the local fs, for example.
* The magic committer does work under _temporary, but it doesn't write the 
final data there. It's "magic", after all.
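
As a rough sketch of what switching committers could look like from PySpark: the helper below only builds a dict of settings. s3a_committer_conf and apply_conf are hypothetical names. The configuration keys are the ones I understand the Spark cloud-integration and Hadoop S3A committer docs to describe; they assume the spark-hadoop-cloud module is on the classpath and should be checked against the versions in use.

```python
# Sketch only: builds Spark settings that select an S3A committer.
# s3a_committer_conf and apply_conf are hypothetical helpers; verify the keys
# against the Spark and Hadoop documentation for your versions.


def s3a_committer_conf(name="directory"):
    """Return Spark conf entries selecting the named S3A committer."""
    if name not in ("directory", "partitioned", "magic"):
        raise ValueError("unknown S3A committer: " + name)
    conf = {
        # Which S3A committer to use: a staging variant or "magic".
        "spark.hadoop.fs.s3a.committer.name": name,
        # Route Spark's commit protocol through the Hadoop PathOutputCommitter API.
        "spark.sql.sources.commitProtocolClass":
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
        # Parquet needs a binding committer class to pick up the S3A committer.
        "spark.sql.parquet.output.committer.class":
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
    }
    if name == "magic":
        # Older Hadoop releases need the magic committer enabled explicitly.
        conf["spark.hadoop.fs.s3a.committer.magic.enabled"] = "true"
    return conf


def apply_conf(builder, conf):
    """Apply each setting to a SparkSession builder through .config(key, value)."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder
```

With pyspark installed, this could be used as apply_conf(SparkSession.builder, s3a_committer_conf("magic")).getOrCreate(). The same keys can equally be passed with --conf on spark-submit.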




[jira] [Commented] (SPARK-38115) No spark conf to control the path of _temporary when writing to target filesystem

2022-02-15 Thread kk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492785#comment-17492785
 ] 

kk commented on SPARK-38115:


Hello [~hyukjin.kwon], did you get a chance to look into this?




[jira] [Commented] (SPARK-38115) No spark conf to control the path of _temporary when writing to target filesystem

2022-02-07 Thread kk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17488445#comment-17488445
 ] 

kk commented on SPARK-38115:


Thanks [~hyukjin.kwon] for responding.

Basically, I am trying to write data to S3 from a Spark DataFrame, and Spark 
uses FileOutputCommitter for this.

[https://stackoverflow.com/questions/46665299/spark-avoid-creating-temporary-directory-in-s3]

My requirement is to either change the *_temporary* path, by setting a Spark 
conf or a parameter in the write step, so that it is written to a different S3 
bucket and then copied to the original bucket,

or 

stop creating *_temporary* when writing to S3. 

As we have a versioning-enabled bucket, _temporary is kept in the version 
history even though it is no longer physically present.

Below is the write step:

df.coalesce(1).write.format('parquet').mode('overwrite').save('s3a://outpath')
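
For context on where _temporary comes from, the sketch below reproduces the directory layout that the classic FileOutputCommitter (v1 algorithm) derives from the destination path, as far as I understand the Hadoop implementation. The function names and the example IDs are hypothetical, and the layout should be checked against the Hadoop version in use.

```python
# Sketch of the classic FileOutputCommitter (v1) directory layout.
# Helper names are hypothetical; the layout is my reading of the Hadoop
# implementation and should be verified for your version.
import posixpath


def job_attempt_path(dest, app_attempt_id=0):
    # All pending work for a job attempt lives under <dest>/_temporary/<attempt>.
    return posixpath.join(dest, "_temporary", str(app_attempt_id))


def task_attempt_path(dest, task_attempt_id, app_attempt_id=0):
    # A running task attempt writes its files here.
    return posixpath.join(
        job_attempt_path(dest, app_attempt_id), "_temporary", task_attempt_id
    )


def committed_task_path(dest, task_id, app_attempt_id=0):
    # Task commit renames the attempt directory to this path;
    # job commit later renames its contents into <dest>.
    return posixpath.join(job_attempt_path(dest, app_attempt_id), task_id)


if __name__ == "__main__":
    print(task_attempt_path("s3a://outpath", "attempt_example_m_000000_0"))
    print(committed_task_path("s3a://outpath", "task_example_m_000000"))
```

Every path is computed relative to the destination, which is why the temporary objects end up in the same (versioned) bucket as the output.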




[jira] [Commented] (SPARK-38115) No spark conf to control the path of _temporary when writing to target filesystem

2022-02-06 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17487846#comment-17487846
 ] 

Hyukjin Kwon commented on SPARK-38115:
--

It would be great if you could elaborate on the use case here.




[jira] [Commented] (SPARK-38115) No spark conf to control the path of _temporary when writing to target filesystem

2022-02-06 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17487845#comment-17487845
 ] 

Hyukjin Kwon commented on SPARK-38115:
--

Okay, I guess you are referring to 
https://github.com/apache/spark/blob/0494dc90af48ce7da0625485a4dc6917a244d580/hadoop-cloud/src/hadoop-3/test/scala/org/apache/spark/internal/io/cloud/StubPathOutputCommitter.scala#L103?

> No spark conf to control the path of _temporary when writing to target 
> filesystem
> -
>
> Key: SPARK-38115
> URL: https://issues.apache.org/jira/browse/SPARK-38115
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, Spark Core, Spark Shell, Spark Submit
> Affects Versions: 2.4.8, 3.2.1
> Reporter: kk
> Priority: Major
> Labels: spark, spark-conf, spark-sql, spark-submit
>
> No default spark conf or param to control the '_temporary' path when writing 
> to filesystem.






[jira] [Commented] (SPARK-38115) No spark conf to control the path of _temporary when writing to target filesystem

2022-02-06 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17487844#comment-17487844
 ] 

Hyukjin Kwon commented on SPARK-38115:
--

What is "_temporary", and where is this used? Do you have any reproducer?
