[jira] [Commented] (HUDI-1548) Fix documentation around schema evolution

2021-08-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17402472#comment-17402472
 ] 

ASF GitHub Bot commented on HUDI-1548:
--

vinothchandar merged pull request #3512:
URL: https://github.com/apache/hudi/pull/3512


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Fix documentation around schema evolution 
> --
>
> Key: HUDI-1548
> URL: https://issues.apache.org/jira/browse/HUDI-1548
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Docs
> Reporter: sivabalan narayanan
> Assignee: Nishith Agarwal
> Priority: Blocker
> Labels: ', pull-request-available, sev:high, user-support-issues
> Fix For: 0.9.0
>
>
> Clearly call out in the documentation what kinds of schema evolution
> Hudi supports.
> Context: https://github.com/apache/hudi/issues/2331



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1548) Fix documentation around schema evolution

2021-08-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17402074#comment-17402074
 ] 

ASF GitHub Bot commented on HUDI-1548:
--

codope opened a new pull request #3512:
URL: https://github.com/apache/hudi/pull/3512


   ## What is the purpose of the pull request
   
   Move the current [schema evolution section](https://hudi.apache.org/docs/next/writing_data#schema-evolution) from the `Writing Data` page to its own page in the docs.
   
   ## Verify this pull request
   
   Verified the change locally.
   
   ## Committer checklist
   
   - [ ] Has a corresponding JIRA in PR title & commit
   - [ ] Commit message is descriptive of the change
   - [ ] CI is green
   - [ ] Necessary doc changes done or have another open PR
   - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
   




[jira] [Commented] (HUDI-1548) Fix documentation around schema evolution

2021-07-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17385241#comment-17385241
 ] 

ASF GitHub Bot commented on HUDI-1548:
--

nsivabalan merged pull request #3257:
URL: https://github.com/apache/hudi/pull/3257


   




[jira] [Commented] (HUDI-1548) Fix documentation around schema evolution

2021-07-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17384528#comment-17384528
 ] 

ASF GitHub Bot commented on HUDI-1548:
--

nsivabalan commented on a change in pull request #3257:
URL: https://github.com/apache/hudi/pull/3257#discussion_r673561465



##
File path: docs/_docs/2_2_writing_data.md
##
@@ -424,3 +424,192 @@ Here are some ways to efficiently manage the storage of your Hudi tables.
  - Intelligently tuning the [bulk insert parallelism](/docs/configurations.html#withBulkInsertParallelism) can again result in nicely sized initial file groups. It is in fact critical to get this right, since the file groups once created cannot be deleted, but simply expanded as explained before.
  - For workloads with heavy updates, the [merge-on-read table](/docs/concepts.html#merge-on-read-table) provides a nice mechanism for ingesting quickly into smaller files and then later merging them into larger base files via compaction.
+
+
+## Schema Evolution
+
+Schema evolution is a very important aspect of data management.
+Hudi supports common schema evolution scenarios, such as adding a nullable field or promoting the datatype of a field, out of the box.
+Furthermore, the evolved schema is queryable across engines, such as Presto, Hive and Spark SQL.
+The following table presents a summary of the types of schema changes compatible with different Hudi table types.
+
+| Schema Change | COW | MOR | Remarks |
+| --- | --- | --- | --- |
+| Add a new nullable column at root level at the end | Yes | Yes | `Yes` means that a write with the evolved schema succeeds and a subsequent read of the entire dataset also succeeds. |
+| Add a new nullable column to inner struct (at the end) | Yes | Yes |  |
+| Add a new complex type field with default (map and array) | Yes | Yes |  |
+| Add a new nullable column and change the ordering of fields | Yes | Yes |  |

Review comment:
   awesome. in line with my understanding. 
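
As a reading aid for the rows quoted above: per the table in this PR, adding a nullable column is listed as supported (including when field order changes), while adding a non-nullable column is not. A minimal Scala sketch of that rule follows. `Field` and `AddColumnCheck` are hypothetical names used only for illustration and are not part of the Hudi API; the model ignores defaults and nested structs.

```scala
// Illustrative model only: `Field` and `AddColumnCheck` are hypothetical names,
// not part of the Hudi API. Defaults and nested structs are deliberately ignored.
final case class Field(name: String, dataType: String, nullable: Boolean)

object AddColumnCheck {
  /** Fields that appear in `evolved` but not in `current`, matched by name. */
  def addedFields(current: Seq[Field], evolved: Seq[Field]): Seq[Field] = {
    val existing = current.map(_.name).toSet
    evolved.filterNot(f => existing.contains(f.name))
  }

  /** True when every newly added field is nullable; field order is not considered. */
  def additionsSupported(current: Seq[Field], evolved: Seq[Field]): Boolean =
    addedFields(current, evolved).forall(_.nullable)

  def main(args: Array[String]): Unit = {
    val current = Seq(
      Field("rowId", "string", nullable = true),
      Field("intToLong", "int", nullable = true)
    )
    // New nullable column, placed first to show that ordering does not matter here.
    val reordered = Field("newField", "string", nullable = true) +: current
    // New non-nullable column appended at the end.
    val strict = current :+ Field("newField", "string", nullable = false)

    println(additionsSupported(current, reordered)) // true
    println(additionsSupported(current, strict))    // false
  }
}
```

The check is name-based on purpose, so it treats a reordered schema the same as an appended one; it is a sketch of the documented rule, not of how Hudi validates schemas internally.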






[jira] [Commented] (HUDI-1548) Fix documentation around schema evolution

2021-07-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17384529#comment-17384529
 ] 

ASF GitHub Bot commented on HUDI-1548:
--

nsivabalan commented on a change in pull request #3257:
URL: https://github.com/apache/hudi/pull/3257#discussion_r673563019



##
File path: docs/_docs/2_2_writing_data.md
##
@@ -424,3 +424,192 @@ Here are some ways to efficiently manage the storage of your Hudi tables.
  - Intelligently tuning the [bulk insert parallelism](/docs/configurations.html#withBulkInsertParallelism) can again result in nicely sized initial file groups. It is in fact critical to get this right, since the file groups once created cannot be deleted, but simply expanded as explained before.
  - For workloads with heavy updates, the [merge-on-read table](/docs/concepts.html#merge-on-read-table) provides a nice mechanism for ingesting quickly into smaller files and then later merging them into larger base files via compaction.
+
+
+## Schema Evolution
+
+Schema evolution is a very important aspect of data management.
+Hudi supports common schema evolution scenarios, such as adding a nullable field or promoting the datatype of a field, out of the box.
+Furthermore, the evolved schema is queryable across engines, such as Presto, Hive and Spark SQL.
+The following table presents a summary of the types of schema changes compatible with different Hudi table types.
+
+| Schema Change | COW | MOR | Remarks |
+| --- | --- | --- | --- |
+| Add a new nullable column at root level at the end | Yes | Yes | `Yes` means that a write with the evolved schema succeeds and a subsequent read of the entire dataset also succeeds. |
+| Add a new nullable column to inner struct (at the end) | Yes | Yes |  |
+| Add a new complex type field with default (map and array) | Yes | Yes |  |
+| Add a new nullable column and change the ordering of fields | Yes | Yes |  |
+| Add a custom nullable Hudi meta column, e.g. `_hoodie_meta_col` | Yes | Yes |  |
+| Promote datatype from `int` to `long` for a field at root level | Yes | Yes | For other types, Hudi supports promotion as specified in [Avro schema resolution](http://avro.apache.org/docs/current/spec.html#Schema+Resolution). |
+| Promote datatype from `int` to `long` for a nested field | Yes | Yes |  |
+| Promote datatype from `int` to `long` for a complex type (value of map or array) | Yes | Yes |  |
+| Add a new non-nullable column at root level at the end | No | No | For a MOR table with the Spark data source, the write succeeds but the read fails. |
+| Add a new non-nullable column to inner struct (at the end) | No | No |  |
+| Change datatype from `long` to `int` for a nested field | No | No |  |
+| Change datatype from `long` to `int` for a complex type (value of map or array) | No | No |  |
+
+Let us walk through an example to demonstrate the schema evolution support in Hudi.
+In the example below, we are going to add a new string field and change the datatype of a field from int to long.
+
+```scala
+Welcome to
+      ____              __
+     / __/__  ___ _____/ /__
+    _\ \/ _ \/ _ `/ __/  '_/
+   /___/ .__/\_,_/_/ /_/\_\   version 3.1.2
+      /_/
+
+Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_292)
+Type in expressions to have them evaluated.
+Type :help for more information.
+
+scala> import org.apache.hudi.QuickstartUtils._
+import org.apache.hudi.QuickstartUtils._
+
+scala> import scala.collection.JavaConversions._
+import scala.collection.JavaConversions._
+
+scala> import org.apache.spark.sql.SaveMode._
+import org.apache.spark.sql.SaveMode._
+
+scala> import org.apache.hudi.DataSourceReadOptions._
+import org.apache.hudi.DataSourceReadOptions._
+
+scala> import org.apache.hudi.DataSourceWriteOptions._
+import org.apache.hudi.DataSourceWriteOptions._
+
+scala> import org.apache.hudi.config.HoodieWriteConfig._
+import org.apache.hudi.config.HoodieWriteConfig._
+
+scala> import org.apache.spark.sql.types._
+import org.apache.spark.sql.types._
+
+scala> import org.apache.spark.sql.Row
+import org.apache.spark.sql.Row
+
+scala> val tableName = "hudi_trips_cow"
+tableName: String = hudi_trips_cow
+
+scala> val basePath = "file:///tmp/hudi_trips_cow"
+basePath: String = file:///tmp/hudi_trips_cow
+
+scala> val schema = StructType(Array(
+     |   StructField("rowId", StringType, true),
+     |   StructField("partitionId", StringType, true),
+     |   StructField("preComb", LongType, true),
+     |   StructField("name", StringType, true),
+     |   StructField("versionId", StringType, true),
+     |   StructField("intToLong", IntegerType, true)
+     | ))
+schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(rowId,StringType,true), 
StructField(partitionId,StringType,true), StructField(preComb,LongType,true), 
StructField(name,StringType,true), StructField(versionId,StringType

[jira] [Commented] (HUDI-1548) Fix documentation around schema evolution

2021-07-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17384341#comment-17384341
 ] 

ASF GitHub Bot commented on HUDI-1548:
--

codope commented on a change in pull request #3257:
URL: https://github.com/apache/hudi/pull/3257#discussion_r673228670



##
File path: docs/_docs/2_2_writing_data.md
##
@@ -424,3 +424,192 @@ Here are some ways to efficiently manage the storage of your Hudi tables.
  - Intelligently tuning the [bulk insert parallelism](/docs/configurations.html#withBulkInsertParallelism) can again result in nicely sized initial file groups. It is in fact critical to get this right, since the file groups once created cannot be deleted, but simply expanded as explained before.
  - For workloads with heavy updates, the [merge-on-read table](/docs/concepts.html#merge-on-read-table) provides a nice mechanism for ingesting quickly into smaller files and then later merging them into larger base files via compaction.
+
+
+## Schema Evolution
+
+Schema evolution is a very important aspect of data management.
+Hudi supports common schema evolution scenarios, such as adding a nullable field or promoting the datatype of a field, out of the box.
+Furthermore, the evolved schema is queryable across engines, such as Presto, Hive and Spark SQL.
+The following table presents a summary of the types of schema changes compatible with different Hudi table types.
+
+| Schema Change | COW | MOR | Remarks |
+| --- | --- | --- | --- |
+| Add a new nullable column at root level at the end | Yes | Yes | `Yes` means that a write with the evolved schema succeeds and a subsequent read of the entire dataset also succeeds. |
+| Add a new nullable column to inner struct (at the end) | Yes | Yes |  |
+| Add a new complex type field with default (map and array) | Yes | Yes |  |
+| Add a new nullable column and change the ordering of fields | Yes | Yes |  |

Review comment:
   Ran this test with the Spark datasource. The write succeeds, but the read fails with the stack trace below:
   ```
   java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
       at org.apache.parquet.column.Dictionary.decodeToLong(Dictionary.java:49)
       at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToLong(ParquetDictionary.java:36)
       at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:364)
       at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
       at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
       at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
       at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
       at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
       at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
       at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
       at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
       at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
       at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
       at org.apache.spark.scheduler.Task.run(Task.scala:131)
       at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
       at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       at java.lang.Thread.run(Thread.java:748)
   ```
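
For reference alongside the trace above, which fails while an integer Parquet dictionary is being decoded as `long`: the table in this PR defers to Avro schema resolution for type promotion. The sketch below lists the primitive promotions named in the Avro specification. `AvroPromotion` and `canRead` are hypothetical names for illustration, and the sketch makes no claim about whether a particular query engine or Parquet reader handles a promoted column correctly.

```scala
// Illustrative only: `AvroPromotion` and `canRead` are hypothetical names.
// The map mirrors the primitive promotions listed under "Schema Resolution"
// in the Avro specification (writer type -> reader types it may be read as).
object AvroPromotion {
  private val promotions: Map[String, Set[String]] = Map(
    "int"    -> Set("long", "float", "double"),
    "long"   -> Set("float", "double"),
    "float"  -> Set("double"),
    "string" -> Set("bytes"),
    "bytes"  -> Set("string")
  )

  /** True when data written as `writerType` may be read as `readerType`. */
  def canRead(writerType: String, readerType: String): Boolean =
    writerType == readerType ||
      promotions.getOrElse(writerType, Set.empty[String]).contains(readerType)

  def main(args: Array[String]): Unit = {
    println(canRead("int", "long")) // true: a widening promotion
    println(canRead("long", "int")) // false: narrowing is not a promotion
  }
}
```

This matches the `Yes` rows for `int` to `long` and the `No` rows for `long` to `int` in the table; it only models primitive types.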





[jira] [Commented] (HUDI-1548) Fix documentation around schema evolution

2021-07-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17383129#comment-17383129
 ] 

ASF GitHub Bot commented on HUDI-1548:
--

danny0405 commented on a change in pull request #3257:
URL: https://github.com/apache/hudi/pull/3257#discussion_r672081905



##
File path: docs/_docs/2_2_writing_data.md
##
@@ -424,3 +424,192 @@ Here are some ways to efficiently manage the storage of your Hudi tables.
  - Intelligently tuning the [bulk insert parallelism](/docs/configurations.html#withBulkInsertParallelism) can again result in nicely sized initial file groups. It is in fact critical to get this right, since the file groups once created cannot be deleted, but simply expanded as explained before.
  - For workloads with heavy updates, the [merge-on-read table](/docs/concepts.html#merge-on-read-table) provides a nice mechanism for ingesting quickly into smaller files and then later merging them into larger base files via compaction.
+
+
+## Schema Evolution
+
+Schema evolution is a very important aspect of data management.
+Hudi supports common schema evolution scenarios, such as adding a nullable field or promoting the datatype of a field, out of the box.
+Furthermore, the evolved schema is queryable across engines, such as Presto, Hive and Spark SQL.
+The following table presents a summary of the types of schema changes compatible with different Hudi table types.
+
+| Schema Change | COW | MOR | Remarks |
+| --- | --- | --- | --- |
+| Add a new nullable column at root level at the end | Yes | Yes | `Yes` means that a write with the evolved schema succeeds and a subsequent read of the entire dataset also succeeds. |
+| Add a new nullable column to inner struct (at the end) | Yes | Yes |  |
+| Add a new complex type field with default (map and array) | Yes | Yes |  |
+| Add a new nullable column and change the ordering of fields | Yes | Yes |  |
+| Add a custom nullable Hudi meta column, e.g. `_hoodie_meta_col` | Yes | Yes |  |
+| Promote datatype from `int` to `long` for a field at root level | Yes | Yes | For other types, Hudi supports promotion as specified in [Avro schema resolution](http://avro.apache.org/docs/current/spec.html#Schema+Resolution). |
+| Promote datatype from `int` to `long` for a nested field | Yes | Yes |  |
+| Promote datatype from `int` to `long` for a complex type (value of map or array) | Yes | Yes |  |
+| Add a new non-nullable column at root level at the end | No | No | For a MOR table with the Spark data source, the write succeeds but the read fails. |
+| Add a new non-nullable column to inner struct (at the end) | No | No |  |
+| Change datatype from `long` to `int` for a nested field | No | No |  |
+| Change datatype from `long` to `int` for a complex type (value of map or array) | No | No |  |
+
+Let us walk through an example to demonstrate the schema evolution support in Hudi.
+In the example below, we are going to add a new string field and change the datatype of a field from int to long.
+
+```scala
+Welcome to
+      ____              __
+     / __/__  ___ _____/ /__
+    _\ \/ _ \/ _ `/ __/  '_/
+   /___/ .__/\_,_/_/ /_/\_\   version 3.1.2
+      /_/
+
+Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_292)
+Type in expressions to have them evaluated.
+Type :help for more information.
+
+scala> import org.apache.hudi.QuickstartUtils._
+import org.apache.hudi.QuickstartUtils._
+
+scala> import scala.collection.JavaConversions._
+import scala.collection.JavaConversions._
+
+scala> import org.apache.spark.sql.SaveMode._
+import org.apache.spark.sql.SaveMode._
+
+scala> import org.apache.hudi.DataSourceReadOptions._
+import org.apache.hudi.DataSourceReadOptions._
+
+scala> import org.apache.hudi.DataSourceWriteOptions._
+import org.apache.hudi.DataSourceWriteOptions._
+
+scala> import org.apache.hudi.config.HoodieWriteConfig._
+import org.apache.hudi.config.HoodieWriteConfig._
+
+scala> import org.apache.spark.sql.types._
+import org.apache.spark.sql.types._
+
+scala> import org.apache.spark.sql.Row
+import org.apache.spark.sql.Row
+
+scala> val tableName = "hudi_trips_cow"
+tableName: String = hudi_trips_cow
+
+scala> val basePath = "file:///tmp/hudi_trips_cow"
+basePath: String = file:///tmp/hudi_trips_cow
+
+scala> val schema = StructType(Array(
+     |   StructField("rowId", StringType, true),
+     |   StructField("partitionId", StringType, true),
+     |   StructField("preComb", LongType, true),
+     |   StructField("name", StringType, true),
+     |   StructField("versionId", StringType, true),
+     |   StructField("intToLong", IntegerType, true)
+     | ))
+schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(rowId,StringType,true), 
StructField(partitionId,StringType,true), StructField(preComb,LongType,true), 
StructField(name,StringType,true), StructField(versionId,StringType,

[jira] [Commented] (HUDI-1548) Fix documentation around schema evolution

2021-07-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17383124#comment-17383124
 ] 

ASF GitHub Bot commented on HUDI-1548:
--

danny0405 commented on a change in pull request #3257:
URL: https://github.com/apache/hudi/pull/3257#discussion_r672074150



##
File path: docs/_docs/2_2_writing_data.md
##
@@ -424,3 +424,192 @@ Here are some ways to efficiently manage the storage of your Hudi tables.
  - Intelligently tuning the [bulk insert parallelism](/docs/configurations.html#withBulkInsertParallelism) can again result in nicely sized initial file groups. It is in fact critical to get this right, since the file groups once created cannot be deleted, but simply expanded as explained before.
  - For workloads with heavy updates, the [merge-on-read table](/docs/concepts.html#merge-on-read-table) provides a nice mechanism for ingesting quickly into smaller files and then later merging them into larger base files via compaction.
+
+
+## Schema Evolution
+
+Schema evolution is a very important aspect of data management.
+Hudi supports common schema evolution scenarios, such as adding a nullable field or promoting the datatype of a field, out of the box.
+Furthermore, the evolved schema is queryable across engines, such as Presto, Hive and Spark SQL.
+The following table presents a summary of the types of schema changes compatible with different Hudi table types.
+
+| Schema Change | COW | MOR | Remarks |
+| --- | --- | --- | --- |
+| Add a new nullable column at root level at the end | Yes | Yes | `Yes` means that a write with the evolved schema succeeds and a subsequent read of the entire dataset also succeeds. |
+| Add a new nullable column to inner struct (at the end) | Yes | Yes |  |
+| Add a new complex type field with default (map and array) | Yes | Yes |  |
+| Add a new nullable column and change the ordering of fields | Yes | Yes |  |
+| Add a custom nullable Hudi meta column, e.g. `_hoodie_meta_col` | Yes | Yes |  |
+| Promote datatype from `int` to `long` for a field at root level | Yes | Yes | For other types, Hudi supports promotion as specified in [Avro schema resolution](http://avro.apache.org/docs/current/spec.html#Schema+Resolution). |
+| Promote datatype from `int` to `long` for a nested field | Yes | Yes |  |
+| Promote datatype from `int` to `long` for a complex type (value of map or array) | Yes | Yes |  |
+| Add a new non-nullable column at root level at the end | No | No | For a MOR table with the Spark data source, the write succeeds but the read fails. |

Review comment:
   Hmm, I plan to add a new metadata column named `_hoodie_operation` in HUDI-1771
   to record the change flag. Good to see that this is expected to be compatible.






[jira] [Commented] (HUDI-1548) Fix documentation around schema evolution

2021-07-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17383099#comment-17383099
 ] 

ASF GitHub Bot commented on HUDI-1548:
--

danny0405 commented on a change in pull request #3257:
URL: https://github.com/apache/hudi/pull/3257#discussion_r672063103



##
File path: docs/_docs/2_2_writing_data.md
##
@@ -424,3 +424,192 @@ Here are some ways to efficiently manage the storage of your Hudi tables.
  - Intelligently tuning the [bulk insert parallelism](/docs/configurations.html#withBulkInsertParallelism) can again result in nicely sized initial file groups. It is in fact critical to get this right, since the file groups once created cannot be deleted, but simply expanded as explained before.
  - For workloads with heavy updates, the [merge-on-read table](/docs/concepts.html#merge-on-read-table) provides a nice mechanism for ingesting quickly into smaller files and then later merging them into larger base files via compaction.
+
+
+## Schema Evolution
+
+Schema evolution is a very important aspect of data management.
+Hudi supports common schema evolution scenarios, such as adding a nullable field or promoting the datatype of a field, out of the box.
+Furthermore, the evolved schema is queryable across engines, such as Presto, Hive and Spark SQL.
+The following table presents a summary of the types of schema changes compatible with different Hudi table types.
+
+| Schema Change | COW | MOR | Remarks |

Review comment:
   Thanks, let us take a look ~






[jira] [Commented] (HUDI-1548) Fix documentation around schema evolution

2021-07-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382250#comment-17382250
 ] 

ASF GitHub Bot commented on HUDI-1548:
--

vinothchandar commented on pull request #3257:
URL: https://github.com/apache/hudi/pull/3257#issuecomment-881621090


   @umehrot2 @rmpifer can you please review this, maybe in the context of some
   other issues?




[jira] [Commented] (HUDI-1548) Fix documentation around schema evolution

2021-07-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380889#comment-17380889
 ] 

ASF GitHub Bot commented on HUDI-1548:
--

nsivabalan commented on a change in pull request #3257:
URL: https://github.com/apache/hudi/pull/3257#discussion_r670009907



##
File path: docs/_docs/2_2_writing_data.md
##
@@ -424,3 +424,192 @@ Here are some ways to efficiently manage the storage of your Hudi tables.
  - Intelligently tuning the [bulk insert parallelism](/docs/configurations.html#withBulkInsertParallelism) can again result in nicely sized initial file groups. It is in fact critical to get this right, since the file groups once created cannot be deleted, but simply expanded as explained before.
  - For workloads with heavy updates, the [merge-on-read table](/docs/concepts.html#merge-on-read-table) provides a nice mechanism for ingesting quickly into smaller files and then later merging them into larger base files via compaction.
+
+
+## Schema Evolution
+
+Schema evolution is a very important aspect of data management.
+Hudi supports common schema evolution scenarios, such as adding a nullable field or promoting the datatype of a field, out of the box.
+Furthermore, the evolved schema is queryable across engines, such as Presto, Hive and Spark SQL.
+The following table presents a summary of the types of schema changes compatible with different Hudi table types.
+
+| Schema Change | COW | MOR | Remarks |

Review comment:
   I don't mean to hold up this patch, but this is something to keep in mind
   and get done when someone has cycles.






[jira] [Commented] (HUDI-1548) Fix documentation around schema evolution

2021-07-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380886#comment-17380886
 ] 

ASF GitHub Bot commented on HUDI-1548:
--

nsivabalan commented on a change in pull request #3257:
URL: https://github.com/apache/hudi/pull/3257#discussion_r670009217



##
File path: docs/_docs/2_2_writing_data.md
##
@@ -424,3 +424,192 @@ Here are some ways to efficiently manage the storage of 
your Hudi tables.
  - Intelligently tuning the [bulk insert 
parallelism](/docs/configurations.html#withBulkInsertParallelism), can again in 
nicely sized initial file groups. It is in fact critical to get this right, 
since the file groups
once created cannot be deleted, but simply expanded as explained before.
  - For workloads with heavy updates, the [merge-on-read 
table](/docs/concepts.html#merge-on-read-table) provides a nice mechanism for 
ingesting quickly into smaller files and then later merging them into larger 
base files via compaction.
+
+
+## Schema Evolution
+
+Schema evolution is a very important aspect of data management. 
+Hudi supports common schema evolution scenarios, such as adding a nullable 
field or promoting a datatype of a field, out-of-the-box.
+Furthermore, the evolved schema is queryable across engines, such as Presto, 
Hive and Spark SQL.
+The following table presents a summary of the types of schema changes 
compatible with different Hudi table types.
+
+|  Schema Change  | COW | MOR | Remarks |

Review comment:
   @danny0405 @yanghua @leesf: Maybe someone from Flink can do a similar 
exercise (try out all of these) and certify. We can add the same in line 433, 
or add another column to call out the engines where each schema evolution 
works. 






[jira] [Commented] (HUDI-1548) Fix documentation around schema evolution

2021-07-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380887#comment-17380887
 ] 

ASF GitHub Bot commented on HUDI-1548:
--

nsivabalan commented on a change in pull request #3257:
URL: https://github.com/apache/hudi/pull/3257#discussion_r670009217



##
File path: docs/_docs/2_2_writing_data.md
##
@@ -424,3 +424,192 @@ Here are some ways to efficiently manage the storage of 
your Hudi tables.
  - Intelligently tuning the [bulk insert 
parallelism](/docs/configurations.html#withBulkInsertParallelism), can again in 
nicely sized initial file groups. It is in fact critical to get this right, 
since the file groups
once created cannot be deleted, but simply expanded as explained before.
  - For workloads with heavy updates, the [merge-on-read 
table](/docs/concepts.html#merge-on-read-table) provides a nice mechanism for 
ingesting quickly into smaller files and then later merging them into larger 
base files via compaction.
+
+
+## Schema Evolution
+
+Schema evolution is a very important aspect of data management. 
+Hudi supports common schema evolution scenarios, such as adding a nullable 
field or promoting a datatype of a field, out-of-the-box.
+Furthermore, the evolved schema is queryable across engines, such as Presto, 
Hive and Spark SQL.
+The following table presents a summary of the types of schema changes 
compatible with different Hudi table types.
+
+|  Schema Change  | COW | MOR | Remarks |

Review comment:
   @danny0405 @yanghua @leesf: Maybe someone from Flink can do a similar 
exercise (try out all of these) and certify. We can add "flink" to line 433, 
or add another column to call out the engines where each schema evolution 
works. 






[jira] [Commented] (HUDI-1548) Fix documentation around schema evolution

2021-07-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380218#comment-17380218
 ] 

ASF GitHub Bot commented on HUDI-1548:
--

nsivabalan commented on a change in pull request #3257:
URL: https://github.com/apache/hudi/pull/3257#discussion_r669171486



##
File path: docs/_docs/2_2_writing_data.md
##
@@ -424,3 +424,192 @@ Here are some ways to efficiently manage the storage of 
your Hudi tables.
  - Intelligently tuning the [bulk insert 
parallelism](/docs/configurations.html#withBulkInsertParallelism), can again in 
nicely sized initial file groups. It is in fact critical to get this right, 
since the file groups
once created cannot be deleted, but simply expanded as explained before.
  - For workloads with heavy updates, the [merge-on-read 
table](/docs/concepts.html#merge-on-read-table) provides a nice mechanism for 
ingesting quickly into smaller files and then later merging them into larger 
base files via compaction.
+
+
+## Schema Evolution
+
+Schema evolution is a very important aspect of data management. 
+Hudi supports common schema evolution scenarios, such as adding a nullable 
field or promoting a datatype of a field, out-of-the-box.
+Furthermore, the evolved schema is queryable across engines, such as Presto, 
Hive and Spark SQL.
+The following table presents a summary of the types of schema changes 
compatible with different Hudi table types.
+
+|  Schema Change  | COW | MOR | Remarks |
+|  ---  | ---  | --- | --- |
+| Add a new nullable column at root level at the end | Yes | Yes | `Yes` means that a write with evolved schema succeeds and a read following the write succeeds to read entire dataset. |
+| Add a new nullable column to inner struct (at the end) | Yes | Yes |
+| Add a new complex type field with default (map and array) | Yes | Yes |  |
+| Add a new nullable column and change the ordering of fields | Yes | Yes |  |

Review comment:
   I have some questions about the ordering change; will sync up directly. 
   Basically, I want to understand whether you have tried this: 
   Commit 1: creates 2 base files.
   Commit 2, with a different field ordering: updates just 1 of the 2 base files. 
   Does the read succeed now? 
   Also, we could try the case where a new commit creates a completely new 
base file and does not touch any of the existing base files. 
   
   

##
File path: docs/_docs/2_2_writing_data.md
##
@@ -424,3 +424,192 @@ Here are some ways to efficiently manage the storage of 
your Hudi tables.
  - Intelligently tuning the [bulk insert 
parallelism](/docs/configurations.html#withBulkInsertParallelism), can again in 
nicely sized initial file groups. It is in fact critical to get this right, 
since the file groups
once created cannot be deleted, but simply expanded as explained before.
  - For workloads with heavy updates, the [merge-on-read 
table](/docs/concepts.html#merge-on-read-table) provides a nice mechanism for 
ingesting quickly into smaller files and then later merging them into larger 
base files via compaction.
+
+
+## Schema Evolution
+
+Schema evolution is a very important aspect of data management. 
+Hudi supports common schema evolution scenarios, such as adding a nullable 
field or promoting a datatype of a field, out-of-the-box.
+Furthermore, the evolved schema is queryable across engines, such as Presto, 
Hive and Spark SQL.
+The following table presents a summary of the types of schema changes 
compatible with different Hudi table types.
+
+|  Schema Change  | COW | MOR | Remarks |
+|  ---  | ---  | --- | --- |
+| Add a new nullable column at root level at the end | Yes | Yes | `Yes` means that a write with evolved schema succeeds and a read following the write succeeds to read entire dataset. |
+| Add a new nullable column to inner struct (at the end) | Yes | Yes |
+| Add a new complex type field with default (map and array) | Yes | Yes |  |
+| Add a new nullable column and change the ordering of fields | Yes | Yes |  |
+| Add a custom nullable Hudi meta column, e.g. `_hoodie_meta_col` | Yes | Yes |  |
+| Promote datatype from `int` to `long` for a field at root level | Yes | Yes | For other types, Hudi supports promotion as specified in [Avro schema resolution](http://avro.apache.org/docs/current/spec.html#Schema+Resolution). |
+| Promote datatype from `int` to `long` for a nested field | Yes | Yes |
+| Promote datatype from `int` to `long` for a complex type (value of map or array) | Yes | Yes |  |
+| Add a new non-nullable column at root level at the end | No | No | In case of MOR table with Spark data source, write succeeds but read fails. |

Review comment:
   Can we add one more column at the end for notes? For example, for this row, 
we can explain why this fails and what the user can do to avoid it. I mean, I 
don't want to give the impression that this is not supported as of now
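
For reference on the promotion rows in the table above, the primitive promotions permitted by the Avro schema-resolution rules can be written down as a small lookup. This is a minimal sketch: `AVRO_PROMOTIONS` and `can_promote` are hypothetical names, not part of Hudi or Avro, and only primitive types are covered.

```python
# Writer type -> reader types it may be promoted to, following the
# "Schema Resolution" section of the Avro specification.
AVRO_PROMOTIONS = {
    "int": {"long", "float", "double"},
    "long": {"float", "double"},
    "float": {"double"},
    "string": {"bytes"},
    "bytes": {"string"},
}


def can_promote(writer_type: str, reader_type: str) -> bool:
    """Return True if data written as writer_type is readable as reader_type."""
    if writer_type == reader_type:
        return True
    return reader_type in AVRO_PROMOTIONS.get(writer_type, set())


print(can_promote("int", "long"))   # the promotion listed in the table
print(can_promote("long", "int"))   # narrowing is not a promotion
```

A narrowing change such as `long` to `int` is not in the lookup, so a notes column could cite it as a change that Avro resolution does not allow.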

[jira] [Commented] (HUDI-1548) Fix documentation around schema evolution

2021-07-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17379610#comment-17379610
 ] 

ASF GitHub Bot commented on HUDI-1548:
--

codope commented on pull request #3257:
URL: https://github.com/apache/hudi/pull/3257#issuecomment-878800281


   @vinothchandar @n3nash @nsivabalan Can you please review the doc? 




[jira] [Commented] (HUDI-1548) Fix documentation around schema evolution

2021-07-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17379006#comment-17379006
 ] 

ASF GitHub Bot commented on HUDI-1548:
--

codope opened a new pull request #3257:
URL: https://github.com/apache/hudi/pull/3257


   Add documentation for schema evolution with example.
   
   ## Verify this pull request
   
   Built the docs locally and verified the changes.
   
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.




> Fix documentation around schema evolution 
> --
>
> Key: HUDI-1548
> URL: https://issues.apache.org/jira/browse/HUDI-1548
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Docs
>Reporter: sivabalan narayanan
>Assignee: Nishith Agarwal
>Priority: Blocker
>  Labels: ', sev:high, user-support-issues
> Fix For: 0.9.0
>
>
> Clearly call out what kind of schema evolution is supported by hudi in 
> documentation .
> Context: https://github.com/apache/hudi/issues/2331





[jira] [Commented] (HUDI-1548) Fix documentation around schema evolution

2021-01-27 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17273211#comment-17273211
 ] 

sivabalan narayanan commented on HUDI-1548:
---

[~nishith29]: Assigning the ticket to you. 

> Fix documentation around schema evolution 
> --
>
> Key: HUDI-1548
> URL: https://issues.apache.org/jira/browse/HUDI-1548
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Docs
>Reporter: sivabalan narayanan
>Assignee: Nishith Agarwal
>Priority: Major
>  Labels: user-support-issues
>
> Clearly call out what kind of schema evolution is supported by hudi in 
> documentation .
> Context: https://github.com/apache/hudi/issues/2331



--
This message was sent by Atlassian Jira
(v8.3.4#803005)