[jira] [Commented] (HUDI-1548) Fix documentation around schema evolution
[ https://issues.apache.org/jira/browse/HUDI-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17402472#comment-17402472 ]

ASF GitHub Bot commented on HUDI-1548:
--------------------------------------

vinothchandar merged pull request #3512:
URL: https://github.com/apache/hudi/pull/3512

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> Fix documentation around schema evolution
> -----------------------------------------
>
>                 Key: HUDI-1548
>                 URL: https://issues.apache.org/jira/browse/HUDI-1548
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: Docs
>            Reporter: sivabalan narayanan
>            Assignee: Nishith Agarwal
>            Priority: Blocker
>              Labels: ', pull-request-available, sev:high, user-support-issues
>             Fix For: 0.9.0
>
>
> Clearly call out what kind of schema evolution is supported by hudi in
> documentation .
> Context: https://github.com/apache/hudi/issues/2331

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (HUDI-1548) Fix documentation around schema evolution
[ https://issues.apache.org/jira/browse/HUDI-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17402074#comment-17402074 ]

ASF GitHub Bot commented on HUDI-1548:
--------------------------------------

codope opened a new pull request #3512:
URL: https://github.com/apache/hudi/pull/3512

## What is the purpose of the pull request

Move the current [schema evolution section](https://hudi.apache.org/docs/next/writing_data#schema-evolution) from `Writing Data` page to its own page in the Docs.

## Verify this pull request

Verified the change locally.

## Committer checklist

- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
[jira] [Commented] (HUDI-1548) Fix documentation around schema evolution
[ https://issues.apache.org/jira/browse/HUDI-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17385241#comment-17385241 ]

ASF GitHub Bot commented on HUDI-1548:
--------------------------------------

nsivabalan merged pull request #3257:
URL: https://github.com/apache/hudi/pull/3257
[jira] [Commented] (HUDI-1548) Fix documentation around schema evolution
[ https://issues.apache.org/jira/browse/HUDI-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17384528#comment-17384528 ]

ASF GitHub Bot commented on HUDI-1548:
--------------------------------------

nsivabalan commented on a change in pull request #3257:
URL: https://github.com/apache/hudi/pull/3257#discussion_r673561465

## File path: docs/_docs/2_2_writing_data.md ##
@@ -424,3 +424,192 @@ Here are some ways to efficiently manage the storage of your Hudi tables.
 - Intelligently tuning the [bulk insert parallelism](/docs/configurations.html#withBulkInsertParallelism), can again in nicely sized initial file groups. It is in fact critical to get this right, since the file groups once created cannot be deleted, but simply expanded as explained before.
 - For workloads with heavy updates, the [merge-on-read table](/docs/concepts.html#merge-on-read-table) provides a nice mechanism for ingesting quickly into smaller files and then later merging them into larger base files via compaction.
+
+
+## Schema Evolution
+
+Schema evolution is a very important aspect of data management.
+Hudi supports common schema evolution scenarios, such as adding a nullable field or promoting a datatype of a field, out-of-the-box.
+Furthermore, the evolved schema is queryable across engines, such as Presto, Hive and Spark SQL.
+The following table presents a summary of the types of schema changes compatible with different Hudi table types.
+
+| Schema Change | COW | MOR | Remarks |
+| --- | --- | --- | --- |
+| Add a new nullable column at root level at the end | Yes | Yes | `Yes` means that a write with evolved schema succeeds and a read following the write succeeds to read entire dataset. |
+| Add a new nullable column to inner struct (at the end) | Yes | Yes |
+| Add a new complex type field with default (map and array) | Yes | Yes | |
+| Add a new nullable column and change the ordering of fields | Yes | Yes | |

Review comment:
awesome. in line with my understanding.
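The "add a new nullable column" rows in the table quoted above boil down to one rule: a column that appears only in the evolved schema has to be nullable. A minimal sketch of that check follows, in plain Scala. `Field` and `NullableAddCheck` are hypothetical names made up for illustration; they are not part of Hudi, Spark or Avro, and the model ignores nesting and defaults.

```scala
// Simplified, hypothetical model of a flat schema; not Hudi's or Avro's API.
final case class Field(name: String, dataType: String, nullable: Boolean)

object NullableAddCheck {
  /** Names of fields that exist only in `evolved` and are declared non-nullable.
    * An empty result means every added column is nullable. */
  def nonNullableAdditions(current: Seq[Field], evolved: Seq[Field]): Seq[String] = {
    val existing = current.map(_.name).toSet
    evolved
      .filter(f => !existing.contains(f.name) && !f.nullable)
      .map(_.name)
  }
}
```

Calling `NullableAddCheck.nonNullableAdditions(current, evolved)` before a write gives a cheap pre-flight signal. It says nothing about whether a particular engine can read the result afterwards, which is what the table in the pull request actually certifies.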
[jira] [Commented] (HUDI-1548) Fix documentation around schema evolution
[ https://issues.apache.org/jira/browse/HUDI-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17384529#comment-17384529 ]

ASF GitHub Bot commented on HUDI-1548:
--------------------------------------

nsivabalan commented on a change in pull request #3257:
URL: https://github.com/apache/hudi/pull/3257#discussion_r673563019

## File path: docs/_docs/2_2_writing_data.md ##
@@ -424,3 +424,192 @@ Here are some ways to efficiently manage the storage of your Hudi tables.
 - Intelligently tuning the [bulk insert parallelism](/docs/configurations.html#withBulkInsertParallelism), can again in nicely sized initial file groups. It is in fact critical to get this right, since the file groups once created cannot be deleted, but simply expanded as explained before.
 - For workloads with heavy updates, the [merge-on-read table](/docs/concepts.html#merge-on-read-table) provides a nice mechanism for ingesting quickly into smaller files and then later merging them into larger base files via compaction.
+
+
+## Schema Evolution
+
+Schema evolution is a very important aspect of data management.
+Hudi supports common schema evolution scenarios, such as adding a nullable field or promoting a datatype of a field, out-of-the-box.
+Furthermore, the evolved schema is queryable across engines, such as Presto, Hive and Spark SQL.
+The following table presents a summary of the types of schema changes compatible with different Hudi table types.
+
+| Schema Change | COW | MOR | Remarks |
+| --- | --- | --- | --- |
+| Add a new nullable column at root level at the end | Yes | Yes | `Yes` means that a write with evolved schema succeeds and a read following the write succeeds to read entire dataset. |
+| Add a new nullable column to inner struct (at the end) | Yes | Yes |
+| Add a new complex type field with default (map and array) | Yes | Yes | |
+| Add a new nullable column and change the ordering of fields | Yes | Yes | |
+| Add a custom nullable Hudi meta column, e.g. `_hoodie_meta_col` | Yes | Yes | |
+| Promote datatype from `int` to `long` for a field at root level | Yes | Yes | For other types, Hudi supports promotion as specified in [Avro schema resolution](http://avro.apache.org/docs/current/spec.html#Schema+Resolution). |
+| Promote datatype from `int` to `long` for a nested field | Yes | Yes |
+| Promote datatype from `int` to `long` for a complex type (value of map or array) | Yes | Yes | |
+| Add a new non-nullable column at root level at the end | No | No | In case of MOR table with Spark data source, write succeeds but read fails. |
+| Add a new non-nullable column to inner struct (at the end) | No | No | |
+| Change datatype from `long` to `int` for a nested field | No | No | |
+| Change datatype from `long` to `int` for a complex type (value of map or array) | No | No | |
+
+Let us walk through an example to demonstrate the schema evolution support in Hudi.
+In the below example, we are going to add a new string field and change the datatype of a field from int to long.
+
+```java
+Welcome to
+      ____              __
+     / __/__  ___ _____/ /__
+    _\ \/ _ \/ _ `/ __/  '_/
+   /___/ .__/\_,_/_/ /_/\_\   version 3.1.2
+      /_/
+
+Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_292)
+Type in expressions to have them evaluated.
+Type :help for more information.
+
+scala> import org.apache.hudi.QuickstartUtils._
+import org.apache.hudi.QuickstartUtils._
+
+scala> import scala.collection.JavaConversions._
+import scala.collection.JavaConversions._
+
+scala> import org.apache.spark.sql.SaveMode._
+import org.apache.spark.sql.SaveMode._
+
+scala> import org.apache.hudi.DataSourceReadOptions._
+import org.apache.hudi.DataSourceReadOptions._
+
+scala> import org.apache.hudi.DataSourceWriteOptions._
+import org.apache.hudi.DataSourceWriteOptions._
+
+scala> import org.apache.hudi.config.HoodieWriteConfig._
+import org.apache.hudi.config.HoodieWriteConfig._
+
+scala> import org.apache.spark.sql.types._
+import org.apache.spark.sql.types._
+
+scala> import org.apache.spark.sql.Row
+import org.apache.spark.sql.Row
+
+scala> val tableName = "hudi_trips_cow"
+tableName: String = hudi_trips_cow
+scala> val basePath = "file:///tmp/hudi_trips_cow"
+basePath: String = file:///tmp/hudi_trips_cow
+scala> val schema = StructType( Array(
+     | StructField("rowId", StringType,true),
+     | StructField("partitionId", StringType,true),
+     | StructField("preComb", LongType,true),
+     | StructField("name", StringType,true),
+     | StructField("versionId", StringType,true),
+     | StructField("intToLong", IntegerType,true)
+     | ))
+schema: org.apache.spark.sql.types.StructType = StructType(StructField(rowId,StringType,true), StructField(partitionId,StringType,true), StructField(preComb,LongType,true), StructField(name,StringType,true), StructField(versionId,StringType
[jira] [Commented] (HUDI-1548) Fix documentation around schema evolution
[ https://issues.apache.org/jira/browse/HUDI-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17384341#comment-17384341 ]

ASF GitHub Bot commented on HUDI-1548:
--------------------------------------

codope commented on a change in pull request #3257:
URL: https://github.com/apache/hudi/pull/3257#discussion_r673228670

## File path: docs/_docs/2_2_writing_data.md ##
@@ -424,3 +424,192 @@ Here are some ways to efficiently manage the storage of your Hudi tables.
 - Intelligently tuning the [bulk insert parallelism](/docs/configurations.html#withBulkInsertParallelism), can again in nicely sized initial file groups. It is in fact critical to get this right, since the file groups once created cannot be deleted, but simply expanded as explained before.
 - For workloads with heavy updates, the [merge-on-read table](/docs/concepts.html#merge-on-read-table) provides a nice mechanism for ingesting quickly into smaller files and then later merging them into larger base files via compaction.
+
+
+## Schema Evolution
+
+Schema evolution is a very important aspect of data management.
+Hudi supports common schema evolution scenarios, such as adding a nullable field or promoting a datatype of a field, out-of-the-box.
+Furthermore, the evolved schema is queryable across engines, such as Presto, Hive and Spark SQL.
+The following table presents a summary of the types of schema changes compatible with different Hudi table types.
+
+| Schema Change | COW | MOR | Remarks |
+| --- | --- | --- | --- |
+| Add a new nullable column at root level at the end | Yes | Yes | `Yes` means that a write with evolved schema succeeds and a read following the write succeeds to read entire dataset. |
+| Add a new nullable column to inner struct (at the end) | Yes | Yes |
+| Add a new complex type field with default (map and array) | Yes | Yes | |
+| Add a new nullable column and change the ordering of fields | Yes | Yes | |

Review comment:
Ran this test with Spark datasource. The write succeeds but read failed with below stacktrace:
```
java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
    at org.apache.parquet.column.Dictionary.decodeToLong(Dictionary.java:49)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToLong(ParquetDictionary.java:36)
    at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:364)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
```
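The top two frames of the stack trace above suggest why the read failed: the column chunk was dictionary-encoded as integers, the reader asked the dictionary for a `long`, and the base `decodeToLong` rejected the call with the concrete class name as its message. The sketch below models that behaviour in plain Scala. `SketchDictionary`, `SketchIntDictionary` and `DictionaryDemo` are hypothetical stand-ins written for illustration, not Parquet or Spark classes, and the real classes may differ in detail.

```scala
// Hypothetical stand-ins for a dictionary decoder hierarchy; illustrative names only.
abstract class SketchDictionary {
  // Default decoders reject the call and report the concrete class in the message,
  // which mirrors the exception message seen in the stack trace.
  def decodeToInt(id: Int): Int =
    throw new UnsupportedOperationException(getClass.getName)
  def decodeToLong(id: Int): Long =
    throw new UnsupportedOperationException(getClass.getName)
}

// A dictionary of ints overrides only the int decoder.
final class SketchIntDictionary(values: Array[Int]) extends SketchDictionary {
  override def decodeToInt(id: Int): Int = values(id)
}

object DictionaryDemo {
  /** Reads entry `id` as a long, the way a reader holding the evolved (long) schema would. */
  def readAsLong(dict: SketchDictionary, id: Int): Either[String, Long] =
    try Right(dict.decodeToLong(id))
    catch { case e: UnsupportedOperationException => Left(e.getMessage) }
}
```

With this model, `DictionaryDemo.readAsLong(new SketchIntDictionary(Array(7)), 0)` comes back as a `Left`, while `decodeToInt(0)` on the same dictionary succeeds. A reader that widens values after decoding them as ints would avoid the failure; whether a given Spark reader path does so is not something this thread establishes.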
[jira] [Commented] (HUDI-1548) Fix documentation around schema evolution
[ https://issues.apache.org/jira/browse/HUDI-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17383129#comment-17383129 ]

ASF GitHub Bot commented on HUDI-1548:
--------------------------------------

danny0405 commented on a change in pull request #3257:
URL: https://github.com/apache/hudi/pull/3257#discussion_r672081905

## File path: docs/_docs/2_2_writing_data.md ##
@@ -424,3 +424,192 @@ Here are some ways to efficiently manage the storage of your Hudi tables.
 - Intelligently tuning the [bulk insert parallelism](/docs/configurations.html#withBulkInsertParallelism), can again in nicely sized initial file groups. It is in fact critical to get this right, since the file groups once created cannot be deleted, but simply expanded as explained before.
 - For workloads with heavy updates, the [merge-on-read table](/docs/concepts.html#merge-on-read-table) provides a nice mechanism for ingesting quickly into smaller files and then later merging them into larger base files via compaction.
+
+
+## Schema Evolution
+
+Schema evolution is a very important aspect of data management.
+Hudi supports common schema evolution scenarios, such as adding a nullable field or promoting a datatype of a field, out-of-the-box.
+Furthermore, the evolved schema is queryable across engines, such as Presto, Hive and Spark SQL.
+The following table presents a summary of the types of schema changes compatible with different Hudi table types.
+
+| Schema Change | COW | MOR | Remarks |
+| --- | --- | --- | --- |
+| Add a new nullable column at root level at the end | Yes | Yes | `Yes` means that a write with evolved schema succeeds and a read following the write succeeds to read entire dataset. |
+| Add a new nullable column to inner struct (at the end) | Yes | Yes |
+| Add a new complex type field with default (map and array) | Yes | Yes | |
+| Add a new nullable column and change the ordering of fields | Yes | Yes | |
+| Add a custom nullable Hudi meta column, e.g. `_hoodie_meta_col` | Yes | Yes | |
+| Promote datatype from `int` to `long` for a field at root level | Yes | Yes | For other types, Hudi supports promotion as specified in [Avro schema resolution](http://avro.apache.org/docs/current/spec.html#Schema+Resolution). |
+| Promote datatype from `int` to `long` for a nested field | Yes | Yes |
+| Promote datatype from `int` to `long` for a complex type (value of map or array) | Yes | Yes | |
+| Add a new non-nullable column at root level at the end | No | No | In case of MOR table with Spark data source, write succeeds but read fails. |
+| Add a new non-nullable column to inner struct (at the end) | No | No | |
+| Change datatype from `long` to `int` for a nested field | No | No | |
+| Change datatype from `long` to `int` for a complex type (value of map or array) | No | No | |
+
+Let us walk through an example to demonstrate the schema evolution support in Hudi.
+In the below example, we are going to add a new string field and change the datatype of a field from int to long.
+
+```java
+Welcome to
+      ____              __
+     / __/__  ___ _____/ /__
+    _\ \/ _ \/ _ `/ __/  '_/
+   /___/ .__/\_,_/_/ /_/\_\   version 3.1.2
+      /_/
+
+Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_292)
+Type in expressions to have them evaluated.
+Type :help for more information.
+
+scala> import org.apache.hudi.QuickstartUtils._
+import org.apache.hudi.QuickstartUtils._
+
+scala> import scala.collection.JavaConversions._
+import scala.collection.JavaConversions._
+
+scala> import org.apache.spark.sql.SaveMode._
+import org.apache.spark.sql.SaveMode._
+
+scala> import org.apache.hudi.DataSourceReadOptions._
+import org.apache.hudi.DataSourceReadOptions._
+
+scala> import org.apache.hudi.DataSourceWriteOptions._
+import org.apache.hudi.DataSourceWriteOptions._
+
+scala> import org.apache.hudi.config.HoodieWriteConfig._
+import org.apache.hudi.config.HoodieWriteConfig._
+
+scala> import org.apache.spark.sql.types._
+import org.apache.spark.sql.types._
+
+scala> import org.apache.spark.sql.Row
+import org.apache.spark.sql.Row
+
+scala> val tableName = "hudi_trips_cow"
+tableName: String = hudi_trips_cow
+scala> val basePath = "file:///tmp/hudi_trips_cow"
+basePath: String = file:///tmp/hudi_trips_cow
+scala> val schema = StructType( Array(
+     | StructField("rowId", StringType,true),
+     | StructField("partitionId", StringType,true),
+     | StructField("preComb", LongType,true),
+     | StructField("name", StringType,true),
+     | StructField("versionId", StringType,true),
+     | StructField("intToLong", IntegerType,true)
+     | ))
+schema: org.apache.spark.sql.types.StructType = StructType(StructField(rowId,StringType,true), StructField(partitionId,StringType,true), StructField(preComb,LongType,true), StructField(name,StringType,true), StructField(versionId,StringType,
[jira] [Commented] (HUDI-1548) Fix documentation around schema evolution
[ https://issues.apache.org/jira/browse/HUDI-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17383124#comment-17383124 ]

ASF GitHub Bot commented on HUDI-1548:
--------------------------------------

danny0405 commented on a change in pull request #3257:
URL: https://github.com/apache/hudi/pull/3257#discussion_r672074150

## File path: docs/_docs/2_2_writing_data.md ##
@@ -424,3 +424,192 @@ Here are some ways to efficiently manage the storage of your Hudi tables.
 - Intelligently tuning the [bulk insert parallelism](/docs/configurations.html#withBulkInsertParallelism), can again in nicely sized initial file groups. It is in fact critical to get this right, since the file groups once created cannot be deleted, but simply expanded as explained before.
 - For workloads with heavy updates, the [merge-on-read table](/docs/concepts.html#merge-on-read-table) provides a nice mechanism for ingesting quickly into smaller files and then later merging them into larger base files via compaction.
+
+
+## Schema Evolution
+
+Schema evolution is a very important aspect of data management.
+Hudi supports common schema evolution scenarios, such as adding a nullable field or promoting a datatype of a field, out-of-the-box.
+Furthermore, the evolved schema is queryable across engines, such as Presto, Hive and Spark SQL.
+The following table presents a summary of the types of schema changes compatible with different Hudi table types.
+
+| Schema Change | COW | MOR | Remarks |
+| --- | --- | --- | --- |
+| Add a new nullable column at root level at the end | Yes | Yes | `Yes` means that a write with evolved schema succeeds and a read following the write succeeds to read entire dataset. |
+| Add a new nullable column to inner struct (at the end) | Yes | Yes |
+| Add a new complex type field with default (map and array) | Yes | Yes | |
+| Add a new nullable column and change the ordering of fields | Yes | Yes | |
+| Add a custom nullable Hudi meta column, e.g. `_hoodie_meta_col` | Yes | Yes | |
+| Promote datatype from `int` to `long` for a field at root level | Yes | Yes | For other types, Hudi supports promotion as specified in [Avro schema resolution](http://avro.apache.org/docs/current/spec.html#Schema+Resolution). |
+| Promote datatype from `int` to `long` for a nested field | Yes | Yes |
+| Promote datatype from `int` to `long` for a complex type (value of map or array) | Yes | Yes | |
+| Add a new non-nullable column at root level at the end | No | No | In case of MOR table with Spark data source, write succeeds but read fails. |

Review comment:
Hmm, i plan to add a new metadata column in HUDI-1771 named `_hoodie_operation` to record the change flag. Very good news to see that this expects to be compatible.
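The promotion rows in the quoted table defer to Avro schema resolution for types other than `int` to `long`. As a rough aid, the primitive promotions that the Avro specification lists can be written down as a lookup. The sketch below is plain Scala with a hypothetical `AvroPromotion` object; it does not call the Avro library, and whether each of these promotions works end to end in Hudi is not confirmed anywhere in this thread.

```scala
object AvroPromotion {
  // Primitive promotions from the Avro specification's schema resolution rules:
  // writer type -> reader types the value may be promoted to.
  private val promotions: Map[String, Set[String]] = Map(
    "int"    -> Set("long", "float", "double"),
    "long"   -> Set("float", "double"),
    "float"  -> Set("double"),
    "string" -> Set("bytes"),
    "bytes"  -> Set("string")
  )

  /** True when data written as `writer` can be read as `reader` under these rules. */
  def canPromote(writer: String, reader: String): Boolean =
    writer == reader || promotions.getOrElse(writer, Set.empty[String]).contains(reader)
}
```

Under this lookup, `AvroPromotion.canPromote("int", "long")` holds while `AvroPromotion.canPromote("long", "int")` does not, which lines up with the `Yes` and `No` rows for those two changes in the table.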
[jira] [Commented] (HUDI-1548) Fix documentation around schema evolution
[ https://issues.apache.org/jira/browse/HUDI-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17383099#comment-17383099 ]

ASF GitHub Bot commented on HUDI-1548:
--------------------------------------

danny0405 commented on a change in pull request #3257:
URL: https://github.com/apache/hudi/pull/3257#discussion_r672063103

## File path: docs/_docs/2_2_writing_data.md ##
@@ -424,3 +424,192 @@ Here are some ways to efficiently manage the storage of your Hudi tables.
 - Intelligently tuning the [bulk insert parallelism](/docs/configurations.html#withBulkInsertParallelism), can again in nicely sized initial file groups. It is in fact critical to get this right, since the file groups once created cannot be deleted, but simply expanded as explained before.
 - For workloads with heavy updates, the [merge-on-read table](/docs/concepts.html#merge-on-read-table) provides a nice mechanism for ingesting quickly into smaller files and then later merging them into larger base files via compaction.
+
+
+## Schema Evolution
+
+Schema evolution is a very important aspect of data management.
+Hudi supports common schema evolution scenarios, such as adding a nullable field or promoting a datatype of a field, out-of-the-box.
+Furthermore, the evolved schema is queryable across engines, such as Presto, Hive and Spark SQL.
+The following table presents a summary of the types of schema changes compatible with different Hudi table types.
+
+| Schema Change | COW | MOR | Remarks |

Review comment:
Thanks, let us take a look ~
[jira] [Commented] (HUDI-1548) Fix documentation around schema evolution
[ https://issues.apache.org/jira/browse/HUDI-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382250#comment-17382250 ]

ASF GitHub Bot commented on HUDI-1548:
--------------------------------------

vinothchandar commented on pull request #3257:
URL: https://github.com/apache/hudi/pull/3257#issuecomment-881621090

@umehrot2 @rmpifer can you please review this? may be in context of some other issues.
[jira] [Commented] (HUDI-1548) Fix documentation around schema evolution
[ https://issues.apache.org/jira/browse/HUDI-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380889#comment-17380889 ]

ASF GitHub Bot commented on HUDI-1548:
--------------------------------------

nsivabalan commented on a change in pull request #3257:
URL: https://github.com/apache/hudi/pull/3257#discussion_r670009907

## File path: docs/_docs/2_2_writing_data.md ##
@@ -424,3 +424,192 @@ Here are some ways to efficiently manage the storage of your Hudi tables.
 - Intelligently tuning the [bulk insert parallelism](/docs/configurations.html#withBulkInsertParallelism), can again in nicely sized initial file groups. It is in fact critical to get this right, since the file groups once created cannot be deleted, but simply expanded as explained before.
 - For workloads with heavy updates, the [merge-on-read table](/docs/concepts.html#merge-on-read-table) provides a nice mechanism for ingesting quickly into smaller files and then later merging them into larger base files via compaction.
+
+
+## Schema Evolution
+
+Schema evolution is a very important aspect of data management.
+Hudi supports common schema evolution scenarios, such as adding a nullable field or promoting a datatype of a field, out-of-the-box.
+Furthermore, the evolved schema is queryable across engines, such as Presto, Hive and Spark SQL.
+The following table presents a summary of the types of schema changes compatible with different Hudi table types.
+
+| Schema Change | COW | MOR | Remarks |

Review comment:
I don't mean to drag this patch per se. But something to keep in mind and get it done when someone has cycles.
[jira] [Commented] (HUDI-1548) Fix documentation around schema evolution
[ https://issues.apache.org/jira/browse/HUDI-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380886#comment-17380886 ]

ASF GitHub Bot commented on HUDI-1548:
--------------------------------------

nsivabalan commented on a change in pull request #3257:
URL: https://github.com/apache/hudi/pull/3257#discussion_r670009217

## File path: docs/_docs/2_2_writing_data.md ##
@@ -424,3 +424,192 @@ Here are some ways to efficiently manage the storage of your Hudi tables.
 - Intelligently tuning the [bulk insert parallelism](/docs/configurations.html#withBulkInsertParallelism), can again in nicely sized initial file groups. It is in fact critical to get this right, since the file groups once created cannot be deleted, but simply expanded as explained before.
 - For workloads with heavy updates, the [merge-on-read table](/docs/concepts.html#merge-on-read-table) provides a nice mechanism for ingesting quickly into smaller files and then later merging them into larger base files via compaction.
+
+
+## Schema Evolution
+
+Schema evolution is a very important aspect of data management.
+Hudi supports common schema evolution scenarios, such as adding a nullable field or promoting a datatype of a field, out-of-the-box.
+Furthermore, the evolved schema is queryable across engines, such as Presto, Hive and Spark SQL.
+The following table presents a summary of the types of schema changes compatible with different Hudi table types.
+
+| Schema Change | COW | MOR | Remarks |

Review comment:
@danny0405 @yanghua @leesf : May be someone from flink can do a similar exercise (try out all these) and certify. We can add the same in line 433. Or add another column to call out the engines where certain schema evolution works.
[jira] [Commented] (HUDI-1548) Fix documentation around schema evolution
[ https://issues.apache.org/jira/browse/HUDI-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380887#comment-17380887 ]

ASF GitHub Bot commented on HUDI-1548:
--

nsivabalan commented on a change in pull request #3257:
URL: https://github.com/apache/hudi/pull/3257#discussion_r670009217

## File path: docs/_docs/2_2_writing_data.md ##
@@ -424,3 +424,192 @@ Here are some ways to efficiently manage the storage of your Hudi tables.
 - Intelligently tuning the [bulk insert parallelism](/docs/configurations.html#withBulkInsertParallelism), can again in nicely sized initial file groups. It is in fact critical to get this right, since the file groups once created cannot be deleted, but simply expanded as explained before.
 - For workloads with heavy updates, the [merge-on-read table](/docs/concepts.html#merge-on-read-table) provides a nice mechanism for ingesting quickly into smaller files and then later merging them into larger base files via compaction.
+
+
+## Schema Evolution
+
+Schema evolution is a very important aspect of data management.
+Hudi supports common schema evolution scenarios, such as adding a nullable field or promoting a datatype of a field, out-of-the-box.
+Furthermore, the evolved schema is queryable across engines, such as Presto, Hive and Spark SQL.
+The following table presents a summary of the types of schema changes compatible with different Hudi table types.
+
+| Schema Change | COW | MOR | Remarks |

Review comment:
@danny0405 @yanghua @leesf: Maybe someone from Flink can do a similar exercise (try out all of these) and certify. We can add "flink" to line 433, or add another column to call out the engines where certain schema evolutions work.
[jira] [Commented] (HUDI-1548) Fix documentation around schema evolution
[ https://issues.apache.org/jira/browse/HUDI-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380218#comment-17380218 ]

ASF GitHub Bot commented on HUDI-1548:
--

nsivabalan commented on a change in pull request #3257:
URL: https://github.com/apache/hudi/pull/3257#discussion_r669171486

## File path: docs/_docs/2_2_writing_data.md ##
@@ -424,3 +424,192 @@ Here are some ways to efficiently manage the storage of your Hudi tables.
 - Intelligently tuning the [bulk insert parallelism](/docs/configurations.html#withBulkInsertParallelism), can again in nicely sized initial file groups. It is in fact critical to get this right, since the file groups once created cannot be deleted, but simply expanded as explained before.
 - For workloads with heavy updates, the [merge-on-read table](/docs/concepts.html#merge-on-read-table) provides a nice mechanism for ingesting quickly into smaller files and then later merging them into larger base files via compaction.
+
+
+## Schema Evolution
+
+Schema evolution is a very important aspect of data management.
+Hudi supports common schema evolution scenarios, such as adding a nullable field or promoting a datatype of a field, out-of-the-box.
+Furthermore, the evolved schema is queryable across engines, such as Presto, Hive and Spark SQL.
+The following table presents a summary of the types of schema changes compatible with different Hudi table types.
+
+| Schema Change | COW | MOR | Remarks |
+| --- | --- | --- | --- |
+| Add a new nullable column at root level at the end | Yes | Yes | `Yes` means that a write with evolved schema succeeds and a read following the write succeeds to read entire dataset. |
+| Add a new nullable column to inner struct (at the end) | Yes | Yes |
+| Add a new complex type field with default (map and array) | Yes | Yes | |
+| Add a new nullable column and change the ordering of fields | Yes | Yes | |

Review comment:
I need some clarification on the ordering change; will sync up directly. Basically, I want to understand whether you tried this: commit 1 creates 2 base files, then commit 2, written with a different field ordering, updates just 1 of the 2 base files. Does the read succeed now? We could also try the case where a new commit creates a completely new base file and does not touch any of the existing base files.

## File path: docs/_docs/2_2_writing_data.md ##
@@ -424,3 +424,192 @@ Here are some ways to efficiently manage the storage of your Hudi tables.
 - Intelligently tuning the [bulk insert parallelism](/docs/configurations.html#withBulkInsertParallelism), can again in nicely sized initial file groups. It is in fact critical to get this right, since the file groups once created cannot be deleted, but simply expanded as explained before.
 - For workloads with heavy updates, the [merge-on-read table](/docs/concepts.html#merge-on-read-table) provides a nice mechanism for ingesting quickly into smaller files and then later merging them into larger base files via compaction.
+
+
+## Schema Evolution
+
+Schema evolution is a very important aspect of data management.
+Hudi supports common schema evolution scenarios, such as adding a nullable field or promoting a datatype of a field, out-of-the-box.
+Furthermore, the evolved schema is queryable across engines, such as Presto, Hive and Spark SQL.
+The following table presents a summary of the types of schema changes compatible with different Hudi table types.
+
+| Schema Change | COW | MOR | Remarks |
+| --- | --- | --- | --- |
+| Add a new nullable column at root level at the end | Yes | Yes | `Yes` means that a write with evolved schema succeeds and a read following the write succeeds to read entire dataset. |
+| Add a new nullable column to inner struct (at the end) | Yes | Yes |
+| Add a new complex type field with default (map and array) | Yes | Yes | |
+| Add a new nullable column and change the ordering of fields | Yes | Yes | |
+| Add a custom nullable Hudi meta column, e.g. `_hoodie_meta_col` | Yes | Yes | |
+| Promote datatype from `int` to `long` for a field at root level | Yes | Yes | For other types, Hudi supports promotion as specified in [Avro schema resolution](http://avro.apache.org/docs/current/spec.html#Schema+Resolution). |
+| Promote datatype from `int` to `long` for a nested field | Yes | Yes |
+| Promote datatype from `int` to `long` for a complex type (value of map or array) | Yes | Yes | |
+| Add a new non-nullable column at root level at the end | No | No | In case of MOR table with Spark data source, write succeeds but read fails. |

Review comment:
Can we add one more column at the end for notes? E.g., for this row, we can explain why this fails and what the user can do to avoid it. I mean, I don't want to give the impression that this is simply not supported as of now.
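
For readers following the table in the diff above, here is a minimal sketch in Python of the two Avro schema-resolution rules it leans on: which primitive promotions a reader may apply, and why a newly added field without a default cannot be filled in when older data is read. The helper names (`can_promote`, `unresolvable_added_fields`) and the sample fields are hypothetical, chosen only for illustration. This is not Hudi code, it ignores nested records and general unions, and it says nothing about engine-specific behavior such as the MOR/Spark case in the last row.

```python
# Primitive promotions permitted by Avro schema resolution: data written
# as the key type may be read as any type in the corresponding set.
AVRO_PROMOTIONS = {
    "int": {"long", "float", "double"},
    "long": {"float", "double"},
    "float": {"double"},
    "string": {"bytes"},
    "bytes": {"string"},
}


def can_promote(writer_type, reader_type):
    """Return True if data written as writer_type can be read as reader_type."""
    if writer_type == reader_type:
        return True
    return reader_type in AVRO_PROMOTIONS.get(writer_type, set())


def unresolvable_added_fields(writer_fields, reader_fields):
    """Names of reader fields that old data cannot populate.

    Avro matches record fields by name, so field order is irrelevant.
    A reader field missing from the writer schema is only resolvable
    when the reader schema declares a default for it.
    """
    written = {field["name"] for field in writer_fields}
    return [
        field["name"]
        for field in reader_fields
        if field["name"] not in written and "default" not in field
    ]


# Hypothetical schemas, for illustration only.
old_fields = [
    {"name": "id", "type": "int"},
    {"name": "city", "type": "string"},
]
# Reordered fields plus a nullable column with a null default.
new_nullable = [
    {"name": "city", "type": "string"},
    {"name": "zip", "type": ["null", "string"], "default": None},
    {"name": "id", "type": "long"},
]
# A new column that is neither nullable nor given a default.
new_required = old_fields + [{"name": "zip", "type": "string"}]

print(can_promote("int", "long"))
print(unresolvable_added_fields(old_fields, new_nullable))
print(unresolvable_added_fields(old_fields, new_required))
```

Under these assumptions, the reordered schema with a defaulted nullable column resolves cleanly, while the schema with a required new column leaves `zip` unresolved for previously written records, which is consistent with the `No` entries in the last table row.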
[jira] [Commented] (HUDI-1548) Fix documentation around schema evolution
[ https://issues.apache.org/jira/browse/HUDI-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17379610#comment-17379610 ]

ASF GitHub Bot commented on HUDI-1548:
--

codope commented on pull request #3257:
URL: https://github.com/apache/hudi/pull/3257#issuecomment-878800281

@vinothchandar @n3nash @nsivabalan Can you please review the doc?
[jira] [Commented] (HUDI-1548) Fix documentation around schema evolution
[ https://issues.apache.org/jira/browse/HUDI-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17379006#comment-17379006 ]

ASF GitHub Bot commented on HUDI-1548:
--

codope opened a new pull request #3257:
URL: https://github.com/apache/hudi/pull/3257

Add documentation for schema evolution with example.

## Verify this pull request

Built the docs locally and verified the changes.

## Committer checklist

- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

> Fix documentation around schema evolution
> --
>
> Key: HUDI-1548
> URL: https://issues.apache.org/jira/browse/HUDI-1548
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Docs
> Reporter: sivabalan narayanan
> Assignee: Nishith Agarwal
> Priority: Blocker
> Labels: ', sev:high, user-support-issues
> Fix For: 0.9.0
>
>
> Clearly call out what kind of schema evolution is supported by hudi in
> documentation.
> Context: https://github.com/apache/hudi/issues/2331

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1548) Fix documentation around schema evolution
[ https://issues.apache.org/jira/browse/HUDI-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17273211#comment-17273211 ]

sivabalan narayanan commented on HUDI-1548:
---

[~nishith29]: Assigning the ticket to you.

> Fix documentation around schema evolution
> --
>
> Key: HUDI-1548
> URL: https://issues.apache.org/jira/browse/HUDI-1548
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Docs
> Reporter: sivabalan narayanan
> Assignee: Nishith Agarwal
> Priority: Major
> Labels: user-support-issues
>
> Clearly call out what kind of schema evolution is supported by hudi in
> documentation.
> Context: https://github.com/apache/hudi/issues/2331

--
This message was sent by Atlassian Jira (v8.3.4#803005)