[ https://issues.apache.org/jira/browse/SPARK-18199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15772014#comment-15772014 ]
Takeshi Yamamuro commented on SPARK-18199:
------------------------------------------

Have you checked https://github.com/apache/spark/pull/16281? The Parquet community will release v1.8.2 for backports, and Spark is planning to upgrade to v1.8.2. So, IIUC, Spark has no plan to upgrade to v1.9.0 for now.

> Support appending to Parquet files
> ----------------------------------
>
>                 Key: SPARK-18199
>                 URL: https://issues.apache.org/jira/browse/SPARK-18199
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Jeremy Smith
>
> Currently, appending to a Parquet directory involves simply creating new
> Parquet files in the directory. With many small appends (for example, in a
> streaming job with a short batch duration), this leads to an unbounded number
> of small Parquet files accumulating. These must be cleaned up periodically by
> removing them all and rewriting a new file containing all the rows.
> It would be far better if Spark supported appending to the Parquet files
> themselves. HDFS supports this, as does the Parquet format:
> * The existing footer can be read in order to obtain the necessary metadata.
> * The new rows can then be appended to the Parquet file as a row group.
> * A new footer can then be appended, containing the metadata and referencing
> the new row group as well as the previously existing row groups.
> This would cause a small amount of bloat in the file as new row groups are
> added (since duplicate metadata would accumulate), but it is hugely
> preferable to accumulating small files, which is bad for HDFS health and
> eventually leaves Spark unable to read the Parquet directory at all.
> Periodic rewriting of the file could still be performed in order to remove
> the duplicate metadata.
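For reference, newer versions of parquet-mr already expose the row-group-level
building blocks described above, just not as an in-place append:
ParquetFileWriter.appendFile copies the encoded row groups of an existing file
into a writer without decoding them, and end() writes a single footer
referencing all of them. The sketch below (class and method names are
illustrative, not part of any Spark or Parquet API) shows how the periodic
rewrite mentioned in the description could be built on those calls; a true
in-place append would additionally need HDFS append() wired into the writer.

{code:java}
import java.util.HashMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.schema.MessageType;

// Illustrative compaction sketch: copies the row groups of many small files
// into a single file and writes one footer, without decoding any records.
public class MergeSmallParquetFiles {

  public static void merge(Configuration conf, Path[] inputs, Path output)
      throws Exception {
    // Read one footer to recover the schema; all inputs must share it.
    MessageType schema = ParquetFileReader
        .readFooter(conf, inputs[0])
        .getFileMetaData()
        .getSchema();

    ParquetFileWriter writer = new ParquetFileWriter(conf, schema, output);
    writer.start();
    for (Path input : inputs) {
      // Appends the input's encoded row groups verbatim (PARQUET-382).
      writer.appendFile(conf, input);
    }
    // Writes a single footer referencing every copied row group.
    writer.end(new HashMap<String, String>());
  }
}
{code}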