[jira] [Commented] (SPARK-20049) Writing data to Parquet with partitions takes very long after the job finishes

2019-04-27 Thread Piero Cinquegrana (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-20049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16827645#comment-16827645 ]

Piero Cinquegrana commented on SPARK-20049:
---

[~yumwang] and [~jdramosf] we encountered a similar issue when writing 
partitioned Parquet files to S3 from many gzipped source files. The workaround 
was to write unpartitioned Parquet files as an intermediate job. 
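
For reference, a minimal sketch of that two-stage workaround (the paths and the 
JSON source format here are illustrative assumptions, not our actual job):
{code}
# stage 1: consolidate the many gzipped inputs as unpartitioned Parquet;
# gzip files are not splittable, so reading them yields one partition
# per file, which multiplies the files the partitioned write must move
df = spark.read.json("s3a://bucket/raw/*.json.gz")  # hypothetical source
df.write.parquet("s3a://bucket/intermediate/")

# stage 2: read the consolidated Parquet back and write it partitioned
spark.read.parquet("s3a://bucket/intermediate/") \
    .write.partitionBy("date").parquet("s3a://bucket/final/")
{code}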

> Writing data to Parquet with partitions takes very long after the job finishes
> --
>
> Key: SPARK-20049
> URL: https://issues.apache.org/jira/browse/SPARK-20049
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, PySpark, SQL
>Affects Versions: 2.1.0
> Environment: Spark 2.1.0, CDH 5.8, Python 3.4, Java 8, Debian 
> GNU/Linux 8.7 (jessie)
>Reporter: Jakub Nowacki
>Priority: Minor
>
> I was testing writing a DataFrame to partitioned Parquet files. The command is 
> quite straightforward and the data set is really a sample from a larger 
> Parquet data set; the job is done in PySpark on YARN and written to HDFS:
> {code}
> # there is column 'date' in df
> df.write.partitionBy("date").parquet("dest_dir")
> {code}
> The reading part took as long as usual, but after the job had been marked as 
> finished in PySpark and in the UI, the Python interpreter was still showing it 
> as busy. Indeed, when I checked the HDFS folder I noticed that the files were 
> still being moved from {{dest_dir/_temporary}} to all the {{dest_dir/date=*}} 
> folders. 
> First of all, it takes much longer than saving the same set without 
> partitioning. Second, it is done in the background, without visible progress 
> of any kind. 



[jira] [Commented] (SPARK-20049) Writing data to Parquet with partitions takes very long after the job finishes

2019-02-25 Thread Yuming Wang (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-20049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16776677#comment-16776677 ]

Yuming Wang commented on SPARK-20049:
-

Could you try to set 
{{spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2}}?
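
A minimal PySpark sketch of applying that setting (the app name is just a 
placeholder):
{code}
from pyspark.sql import SparkSession

# the spark.hadoop.* prefix forwards the key into the Hadoop
# configuration, switching the output committer to the v2 algorithm,
# which moves task output to the destination at task commit instead of
# doing a second, serial move of everything at job commit time
spark = (SparkSession.builder
         .appName("parquet-partitioned-write")  # placeholder name
         .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
         .getOrCreate())
{code}
The same key can also be passed on the command line via {{--conf}} with 
{{spark-submit}}.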


[jira] [Commented] (SPARK-20049) Writing data to Parquet with partitions takes very long after the job finishes

2019-02-04 Thread Juan Ramos Fuentes (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-20049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16760123#comment-16760123 ]

Juan Ramos Fuentes commented on SPARK-20049:


Was there ever a fix for this issue? I'm running into the same problem, but 
when writing to S3.

I'll try repartitioning first and see if that improves performance.


[jira] [Commented] (SPARK-20049) Writing data to Parquet with partitions takes very long after the job finishes

2017-03-22 Thread Jakub Nowacki (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-20049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15936421#comment-15936421 ]

Jakub Nowacki commented on SPARK-20049:
---

I did a bit more digging and, as it turned out, both the writing and the 
reading performance were low, most likely due to the number of files per 
partition. Namely, every partition folder contained as many files as the saved 
DataFrame had partitions, which was just over 3000 in my case. Repartitioning 
like:
{code}
# there is column 'date' in df
df.repartition("date").write.partitionBy("date").parquet("dest_dir")
{code}
fixes the issue, though it creates a single file per partition value, which is 
a bit too large in my case, but this can be addressed, e.g.:
{code}
# there is column 'date' in df; hour() comes from pyspark.sql.functions
from pyspark.sql.functions import hour
df.repartition("date", hour("createdAt")).write.partitionBy("date").parquet("dest_dir")
{code}
which works similarly, but the files in the partition folders are smaller.
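
As a quick sanity check, the blow-up can be anticipated from the DataFrame's 
partition count before writing (a sketch):
{code}
# each write task emits one file into every date folder for which it
# holds rows, so without repartitioning the file count per folder can
# approach the total number of partitions (just over 3000 here)
print(df.rdd.getNumPartitions())
{code}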

So IMO there are 4 issues to address:
# for some reason writing the files out on HDFS takes a long time, which is not 
indicated anywhere and takes much longer than a normal write (in my case 1.5 
hours vs the usual 5 minutes)
# some form of additional progress indicator should be included somewhere in 
the UI, logs and/or shell output
# the suggestion to repartition before using {{partitionBy}} should be 
highlighted in the documentation
# maybe automatic repartitioning before saving should be considered, though 
this can be controversial
