Jakub Nowacki created SPARK-20049:
-------------------------------------

             Summary: Writing data to Parquet with partitions takes very long after the job finishes
                 Key: SPARK-20049
                 URL: https://issues.apache.org/jira/browse/SPARK-20049
             Project: Spark
          Issue Type: Bug
          Components: Input/Output, PySpark, SQL
    Affects Versions: 2.1.0
         Environment: Spark 2.1.0, CDH 5.8, Python 3.4, Java 8, Debian GNU/Linux 8.7 (jessie)
            Reporter: Jakub Nowacki


I was testing writing a DataFrame to partitioned Parquet files. The command is quite straightforward, and the data set is a sample from a larger Parquet data set; the job runs in PySpark on YARN and writes to HDFS:
{code}
# there is a column 'date' in df
df.write.partitionBy("date").parquet("dest_dir")
{code}
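For reference, a self-contained version of the repro (the synthetic data, row count and app name are illustrative stand-ins, not the original job):
{code}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("partitioned-parquet-write")  # illustrative name
         .getOrCreate())

# Synthetic stand-in for the sampled data set: ~10M rows spread
# over 365 distinct 'date' partition values.
df = (spark.range(0, 10000000)
      .withColumn("date", F.expr("date_add('2017-01-01', cast(id % 365 as int))")))

df.write.partitionBy("date").parquet("dest_dir")
{code}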
The reading part took as long as usual, but after the job had been marked as finished in PySpark and in the UI, the Python interpreter still showed it as busy. Indeed, when I checked the HDFS folder I noticed that files were still being moved from {{dest_dir/_temporary}} to the {{dest_dir/date=*}} folders.
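The move out of {{dest_dir/_temporary}} is the job-commit phase of Hadoop's {{FileOutputCommitter}}: with the default commit algorithm (version 1) the driver renames every task's output sequentially after all tasks finish. Assuming that sequential rename is indeed the bottleneck here, a commonly suggested mitigation is commit algorithm version 2, which moves files to their final location during task commit instead; a minimal sketch:
{code}
# Untested against this exact setup: with commit algorithm v2, task
# output is moved to its final partition directory at task commit,
# so the job-commit phase does far less sequential renaming.
# The config must be set before the SparkSession is first created.
spark = (SparkSession.builder
         .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
         .getOrCreate())

df.write.partitionBy("date").parquet("dest_dir")
{code}
Note that version 2 trades atomicity for speed: if the job fails mid-commit, partial output may already be visible in the destination directories.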

First of all, this final move takes much longer than saving the same data set without partitioning. Second, it happens in the background, with no visible progress of any kind.
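To make that background move observable, one can poll the temporary directory from a second PySpark shell while the write call is blocked. This sketch goes through the internal {{_jvm}}/{{_jsc}} handles, so treat it as a debugging aid only, not a stable API:
{code}
import time

# Poll dest_dir/_temporary until the committer has moved everything out.
sc = spark.sparkContext
hadoop = sc._jvm.org.apache.hadoop.fs
fs = hadoop.FileSystem.get(sc._jsc.hadoopConfiguration())
tmp = hadoop.Path("dest_dir/_temporary")

while fs.exists(tmp):
    summary = fs.getContentSummary(tmp)
    print("files left in _temporary: %d" % summary.getFileCount())
    time.sleep(5)
{code}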


