nsivabalan commented on a change in pull request #3257:
URL: https://github.com/apache/hudi/pull/3257#discussion_r669171486
########## File path: docs/_docs/2_2_writing_data.md ##########
@@ -424,3 +424,192 @@ Here are some ways to efficiently manage the storage of your Hudi tables.
 - Intelligently tuning the [bulk insert parallelism](/docs/configurations.html#withBulkInsertParallelism) can again result in nicely sized initial file groups. It is in fact critical to get this right, since file groups, once created, cannot be deleted, but only expanded as explained before.
 - For workloads with heavy updates, the [merge-on-read table](/docs/concepts.html#merge-on-read-table) provides a nice mechanism for ingesting quickly into smaller files and then later merging them into larger base files via compaction.
+
+
+## Schema Evolution
+
+Schema evolution is a very important aspect of data management.
+Hudi supports common schema evolution scenarios, such as adding a nullable field or promoting the datatype of a field, out of the box.
+Furthermore, the evolved schema is queryable across engines, such as Presto, Hive and Spark SQL.
+The following table presents a summary of the types of schema changes compatible with different Hudi table types.
+
+| Schema Change | COW | MOR | Remarks |
+| ----------- | ------- | ------- | ------- |
+| Add a new nullable column at root level at the end | Yes | Yes | `Yes` means that a write with the evolved schema succeeds, and a read following the write succeeds in reading the entire dataset. |
+| Add a new nullable column to inner struct (at the end) | Yes | Yes | |
+| Add a new complex type field with default (map and array) | Yes | Yes | |
+| Add a new nullable column and change the ordering of fields | Yes | Yes | |

Review comment:
I have a clarification regarding the ordering change; will sync up directly. Basically, I want to understand whether you have tried this: commit 1 creates 2 base files; commit 2, with a different field ordering, updates just 1 of the 2 base files. Does a read succeed at that point?
Also, we could try a case where a new commit creates a completely new base file and does not touch any of the existing base files.

########## File path: docs/_docs/2_2_writing_data.md ##########
@@ -424,3 +424,192 @@ Here are some ways to efficiently manage the storage of your Hudi tables.
+| Add a custom nullable Hudi meta column, e.g. `_hoodie_meta_col` | Yes | Yes | |
+| Promote datatype from `int` to `long` for a field at root level | Yes | Yes | For other types, Hudi supports promotion as specified in [Avro schema resolution](http://avro.apache.org/docs/current/spec.html#Schema+Resolution). |
+| Promote datatype from `int` to `long` for a nested field | Yes | Yes | |
+| Promote datatype from `int` to `long` for a complex type (value of map or array) | Yes | Yes | |
+| Add a new non-nullable column at root level at the end | No | No | In the case of a MOR table with the Spark data source, the write succeeds but the read fails. |

Review comment:
Can we add another, last column for notes? For example, for this row, we can explain why it fails and what the user can do to avoid it. I don't want to give the impression that this is unsupported as of now and might be fixed later; I want to make it clear that this is a backwards-incompatible change and is not allowed. Similarly for the other rows where that is applicable.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
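The compatibility rules summarized in the quoted table can be sketched as a reader-side check. The following is an illustrative sketch only, not Hudi's actual implementation: the `is_compatible` helper and the dict-based schema representation are hypothetical, and the promotion sets follow the Avro schema resolution rules the table links to.

```python
# Hypothetical sketch of the table's compatibility rules (not Hudi code).
# A schema is modeled as an ordered dict: field name -> (type, nullable).

# Avro-style promotions: data written with the key type may be read as any
# type in the value set (see the Avro spec's "Schema Resolution" section).
PROMOTIONS = {
    "int": {"int", "long", "float", "double"},
    "long": {"long", "float", "double"},
    "float": {"float", "double"},
    "string": {"string", "bytes"},
    "bytes": {"bytes", "string"},
}

def is_compatible(old_schema, new_schema):
    """Return True if data written with old_schema is readable with new_schema."""
    for name, (old_type, _nullable) in old_schema.items():
        if name not in new_schema:
            return False  # dropping a column is backwards incompatible
        new_type, _ = new_schema[name]
        allowed = PROMOTIONS.get(old_type, {old_type})
        if new_type not in allowed:
            return False  # type change outside the allowed promotions
    for name, (_type, nullable) in new_schema.items():
        if name not in old_schema and not nullable:
            return False  # old rows carry no value for a new non-nullable column
    return True

v1 = {"id": ("int", False), "name": ("string", False)}

# Add a nullable column at the end: compatible ("Yes | Yes" rows).
assert is_compatible(v1, {**v1, "email": ("string", True)})

# Promote int -> long: compatible.
assert is_compatible(v1, {"id": ("long", False), "name": ("string", False)})

# Add a non-nullable column: rejected, matching the "No | No" row.
assert not is_compatible(v1, {**v1, "score": ("long", False)})
```

This also makes the reviewer's point concrete: the non-nullable case fails not because support is pending, but because existing files contain no value for the new required column, so the change is inherently backwards incompatible.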