[ https://issues.apache.org/jira/browse/SPARK-18199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15772014#comment-15772014 ]
Takeshi Yamamuro commented on SPARK-18199:
------------------------------------------

Have you checked https://github.com/apache/spark/pull/16281? The Parquet community will release v1.8.2 for backports, and Spark is planning to upgrade to v1.8.2. So, IIUC, Spark has no plan to upgrade to v1.9.0 for now.

> Support appending to Parquet files
> ----------------------------------
>
>                 Key: SPARK-18199
>                 URL: https://issues.apache.org/jira/browse/SPARK-18199
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Jeremy Smith
>
> Currently, appending to a Parquet directory involves simply creating new
> Parquet files in the directory. With many small appends (for example, in a
> streaming job with a short batch duration), this leads to an unbounded number
> of small Parquet files accumulating. These must be cleaned up periodically by
> removing them all and rewriting a new file containing all the rows.
> It would be far better if Spark supported appending to the Parquet files
> themselves. HDFS supports this, as does the Parquet format:
> * The existing footer can be read in order to obtain the necessary metadata.
> * The new rows can then be appended to the Parquet file as a row group.
> * A new footer can then be appended, containing the metadata and referencing
> the new row group as well as the previously existing row groups.
> This would cause a small amount of bloat in the file as new row groups are
> added (since duplicate metadata would accumulate), but it is hugely
> preferable to accumulating small files, which is bad for HDFS health and
> eventually leaves Spark unable to read the Parquet directory at all.
> Periodic rewriting of the file could still be performed in order to remove
> the duplicate metadata.
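For reference, newer versions of parquet-mr already expose the row-group-level
building blocks described above, just not as an in-place append:
ParquetFileWriter.appendFile copies the encoded row groups of an existing file
into a writer without decoding them, and end() writes a single footer
referencing all of them. The sketch below (class and method names are
illustrative, not part of any Spark or Parquet API) shows how the periodic
rewrite mentioned in the description could be built on those calls; a true
in-place append would additionally need HDFS append() wired into the writer.

{code:java}
import java.util.HashMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.schema.MessageType;

// Illustrative compaction sketch: copies the row groups of many small files
// into a single file and writes one footer, without decoding any records.
public class MergeSmallParquetFiles {

  public static void merge(Configuration conf, Path[] inputs, Path output)
      throws Exception {
    // Read one footer to recover the schema; all inputs must share it.
    MessageType schema = ParquetFileReader
        .readFooter(conf, inputs[0])
        .getFileMetaData()
        .getSchema();

    ParquetFileWriter writer = new ParquetFileWriter(conf, schema, output);
    writer.start();
    for (Path input : inputs) {
      // Appends the input's encoded row groups verbatim (PARQUET-382).
      writer.appendFile(conf, input);
    }
    // Writes a single footer referencing every copied row group.
    writer.end(new HashMap<String, String>());
  }
}
{code}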