[ https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16174660#comment-16174660 ]
Serge Smertin edited comment on SPARK-18727 at 9/21/17 12:31 PM:
-----------------------------------------------------------------

I have some use-cases similar to the ones mentioned in [#comment-15987668] by [~simeons]: adding fields to nested _struct_ fields. The application is built so that Parquet files are created and partitioned outside of Spark, and only new columns are ever added - again, mostly within a couple of nested structs.

I don't know all the potential implications of the idea, but could we just use the last element of the selected files instead of the first one, given that the FileStatus [list is already sorted lexicographically by path|https://github.com/apache/spark/blob/32fa0b81411f781173e185f4b19b9fd6d118f9fe/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L251]? It is easier to guarantee that only new columns are added over time, and the following code change does not look like a huge deviation from the current behavior, while saving a lot of time compared to {{spark.sql.parquet.mergeSchema=true}}:

{code:scala}
// ParquetFileFormat.scala (lines 232..240)
filesByType.commonMetadata.lastOption
  .orElse(filesByType.metadata.lastOption)
  .orElse(filesByType.data.lastOption)
{code}

/cc [~r...@databricks.com] [~xwu0226]

> Support schema evolution as new files are inserted into table
> --------------------------------------------------------------
>
>                 Key: SPARK-18727
>                 URL: https://issues.apache.org/jira/browse/SPARK-18727
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Eric Liang
>            Priority: Critical
>
> Now that we have pushed partition management of all tables to the catalog, one issue for scalable partition handling remains: handling schema updates. Currently, a schema update requires dropping and recreating the entire table, which does not scale well with the size of the table.
> We should support updating the schema of the table, either via ALTER TABLE, or automatically as new files with compatible schemas are appended into the table.
> cc [~rxin]
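
To make the append-only schema-evolution scenario above concrete, here is a minimal sketch of reading a Parquet directory where a later file adds a column, with and without {{mergeSchema}}. The paths and column names are hypothetical, and this only illustrates the current reader behavior, not the proposed {{lastOption}} change:

{code:scala}
// Minimal sketch: two Parquet "partitions" written at different times,
// where the newer one adds a column. Paths and column names are made up
// for illustration only.
import org.apache.spark.sql.SparkSession

object SchemaEvolutionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("schema-evolution-sketch")
      .getOrCreate()
    import spark.implicits._

    // Older file: two columns.
    Seq((1, "a")).toDF("id", "name")
      .write.parquet("/tmp/evolving/part=1")

    // Newer file: a third column has been added.
    Seq((2, "b", 3.14)).toDF("id", "name", "score")
      .write.parquet("/tmp/evolving/part=2")

    // Default read: the schema is taken from a single footer,
    // so the newly added `score` column may be missing.
    spark.read.parquet("/tmp/evolving").printSchema()

    // mergeSchema=true: all footers are read and merged, so `score`
    // shows up, at the cost of reading every file's footer.
    spark.read.option("mergeSchema", "true").parquet("/tmp/evolving").printSchema()

    spark.stop()
  }
}
{code}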