RussellSpitzer commented on a change in pull request #3432: URL: https://github.com/apache/iceberg/pull/3432#discussion_r741469738
########## File path: site/docs/cow-and-mor.md ########## @@ -0,0 +1,195 @@ +<!-- + - Licensed to the Apache Software Foundation (ASF) under one or more + - contributor license agreements. See the NOTICE file distributed with + - this work for additional information regarding copyright ownership. + - The ASF licenses this file to You under the Apache License, Version 2.0 + - (the "License"); you may not use this file except in compliance with + - the License. You may obtain a copy of the License at + - + - http://www.apache.org/licenses/LICENSE-2.0 + - + - Unless required by applicable law or agreed to in writing, software + - distributed under the License is distributed on an "AS IS" BASIS, + - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + - See the License for the specific language governing permissions and + - limitations under the License. + --> + +# Copy-on-Write and Merge-on-Read + +This page explains the concept of copy-on-write and merge-on-read in the context of Iceberg to provide readers more clarity around Iceberg's table spec design. + +## Introduction + +In Iceberg, copy-on-write and merge-on-read are different ways to handle row-level update and delete operations. Here are their definitions: + +- **copy-on-write (CoW)**: an update/delete directly rewrites the entire affected data files. Review comment: I would make this even more explicit: "When rows within a data file are deleted or updated, all rows within the original file will be rewritten into new files containing all of the original data but with the affected rows removed or modified." ########## File path: site/docs/cow-and-mor.md ########## @@ -0,0 +1,195 @@ +<!-- + - Licensed to the Apache Software Foundation (ASF) under one or more + - contributor license agreements. See the NOTICE file distributed with + - this work for additional information regarding copyright ownership. 
+ - The ASF licenses this file to You under the Apache License, Version 2.0 + - (the "License"); you may not use this file except in compliance with + - the License. You may obtain a copy of the License at + - + - http://www.apache.org/licenses/LICENSE-2.0 + - + - Unless required by applicable law or agreed to in writing, software + - distributed under the License is distributed on an "AS IS" BASIS, + - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + - See the License for the specific language governing permissions and + - limitations under the License. + --> + +# Copy-on-Write and Merge-on-Read + +This page explains the concept of copy-on-write and merge-on-read in the context of Iceberg to provide readers more clarity around Iceberg's table spec design. + +## Introduction + +In Iceberg, copy-on-write and merge-on-read are different ways to handle row-level update and delete operations. Here are their definitions: + +- **copy-on-write (CoW)**: an update/delete directly rewrites the entire affected data files. +- **merge-on-read (MoR)**: update/delete information is encoded in the form of delete files. The table reader can apply all delete information at read time. A compaction process takes care of merging delete files into data files asynchronously. Review comment: Similar comment to above: When rows within a data file are modified, separate additional "delete" files are created containing information about which rows were modified in the original file. These delete files represent the delta between the original file and the actual state of the table. When the table is read in the future, the "delete" files are read and their information is merged with the original data files to create the modified version of the file. ########## File path: site/docs/cow-and-mor.md ########## @@ -0,0 +1,195 @@ +<!-- + - Licensed to the Apache Software Foundation (ASF) under one or more + - contributor license agreements. 
See the NOTICE file distributed with + - this work for additional information regarding copyright ownership. + - The ASF licenses this file to You under the Apache License, Version 2.0 + - (the "License"); you may not use this file except in compliance with + - the License. You may obtain a copy of the License at + - + - http://www.apache.org/licenses/LICENSE-2.0 + - + - Unless required by applicable law or agreed to in writing, software + - distributed under the License is distributed on an "AS IS" BASIS, + - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + - See the License for the specific language governing permissions and + - limitations under the License. + --> + +# Copy-on-Write and Merge-on-Read + +This page explains the concept of copy-on-write and merge-on-read in the context of Iceberg to provide readers more clarity around Iceberg's table spec design. + +## Introduction + +In Iceberg, copy-on-write and merge-on-read are different ways to handle row-level update and delete operations. Here are their definitions: + +- **copy-on-write (CoW)**: an update/delete directly rewrites the entire affected data files. Review comment: I would make this even more explicit: "When rows within a data file are deleted or updated, all rows within the original file will be rewritten into new files containing all of the original data but with the affected rows removed or modified." Feel free to take or leave :) ########## File path: site/docs/cow-and-mor.md ########## @@ -0,0 +1,195 @@ +<!-- + - Licensed to the Apache Software Foundation (ASF) under one or more + - contributor license agreements. See the NOTICE file distributed with + - this work for additional information regarding copyright ownership. + - The ASF licenses this file to You under the Apache License, Version 2.0 + - (the "License"); you may not use this file except in compliance with + - the License. 
You may obtain a copy of the License at + - + - http://www.apache.org/licenses/LICENSE-2.0 + - + - Unless required by applicable law or agreed to in writing, software + - distributed under the License is distributed on an "AS IS" BASIS, + - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + - See the License for the specific language governing permissions and + - limitations under the License. + --> + +# Copy-on-Write and Merge-on-Read + +This page explains the concept of copy-on-write and merge-on-read in the context of Iceberg to provide readers more clarity around Iceberg's table spec design. + +## Introduction + +In Iceberg, copy-on-write and merge-on-read are different ways to handle row-level update and delete operations. Here are their definitions: + +- **copy-on-write (CoW)**: an update/delete directly rewrites the entire affected data files. +- **merge-on-read (MoR)**: update/delete information is encoded in the form of delete files. The table reader can apply all delete information at read time. A compaction process takes care of merging delete files into data files asynchronously. + +Clearly, CoW is more efficient in reading data, but MoR is more efficient in writing data. Review comment: Would just skip the word "Clearly" here; computers are hard, so it may not be clear :) Perhaps something like: Copy on write provides better read performance: Copy on write provides better performance on reading because no additional files need to be read and merged to get the current state of the table, but this comes at the cost of worse performance at write time. At write time, Copy on Write must rewrite an entire file even if only a single row is changed within that file. Data that was previously written and is unmodified must still be rewritten if any adjacent row was modified. 
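As a rough illustration of the write-side cost described here, the toy model below uses plain Python lists to stand in for data files and counts how many records each strategy writes when a single row is deleted. The helper names `cow_delete`, `mor_delete` and `mor_read` are hypothetical and are not Iceberg APIs; the sketch assumes position-style deletes only and ignores real file formats, metadata and compaction.

```python
from typing import Dict, List, Set, Tuple

Row = Dict[str, int]


def cow_delete(data_file: List[Row], positions: Set[int]) -> Tuple[List[Row], int]:
    """Copy-on-write (toy model): rewrite every surviving row into a new file."""
    new_file = [row for pos, row in enumerate(data_file) if pos not in positions]
    return new_file, len(new_file)  # second value: number of rows written


def mor_delete(positions: Set[int]) -> Tuple[List[int], int]:
    """Merge-on-read (toy model): write only a delete file of row positions."""
    delete_file = sorted(positions)
    return delete_file, len(delete_file)  # second value: number of entries written


def mor_read(data_file: List[Row], delete_file: List[int]) -> List[Row]:
    """Merge-on-read (toy model): apply the delete file while reading."""
    deleted = set(delete_file)
    return [row for pos, row in enumerate(data_file) if pos not in deleted]


# Hypothetical data file with 1000 rows; delete the row at position 42.
data_file = [{"id": i} for i in range(1000)]
cow_file, cow_written = cow_delete(data_file, {42})
delete_file, mor_written = mor_delete({42})

print(cow_written, mor_written)  # 999 1
```

Both paths expose the same rows to a reader (`mor_read(data_file, delete_file)` equals `cow_file`); the difference is only where the work happens, at write time for CoW and at read time for MoR.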
Merge on Read provides better write performance: Merge on Read only needs to write new files with the data that has changed in the table which means writing significantly less information than Copy on Write. Write performance is increased, but every read now requires reading not just the requested data files but also all delete files which apply to those data files. ########## File path: site/docs/cow-and-mor.md ########## @@ -0,0 +1,195 @@ +<!-- + - Licensed to the Apache Software Foundation (ASF) under one or more + - contributor license agreements. See the NOTICE file distributed with + - this work for additional information regarding copyright ownership. + - The ASF licenses this file to You under the Apache License, Version 2.0 + - (the "License"); you may not use this file except in compliance with + - the License. You may obtain a copy of the License at + - + - http://www.apache.org/licenses/LICENSE-2.0 + - + - Unless required by applicable law or agreed to in writing, software + - distributed under the License is distributed on an "AS IS" BASIS, + - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + - See the License for the specific language governing permissions and + - limitations under the License. + --> + +# Copy-on-Write and Merge-on-Read + +This page explains the concept of copy-on-write and merge-on-read in the context of Iceberg to provide readers more clarity around Iceberg's table spec design. + +## Introduction + +In Iceberg, copy-on-write and merge-on-read are different ways to handle row-level update and delete operations. Here are their definitions: + +- **copy-on-write (CoW)**: an update/delete directly rewrites the entire affected data files. +- **merge-on-read (MoR)**: update/delete information is encoded in the form of delete files. The table reader can apply all delete information at read time. A compaction process takes care of merging delete files into data files asynchronously. 
+ +Clearly, CoW is more efficient in reading data, but MoR is more efficient in writing data. +Users can choose to use **BOTH** CoW and MoR against the same Iceberg table based on different situations. +A common example is that, for a time-partitioned table, newer partitions with more frequent updates are maintained in the MoR approach through a CDC streaming pipeline, +and older partitions are maintained in the CoW way with less frequent GDPR updates from batch ETL jobs. + +## Copy-on-write + +As the definition states, given a user's update/delete requirement, the CoW write process would search for all the affected data files and perform rewrite. +Spark supports CoW `DELETE`, `UPDATE` and `MERGE` operations through Spark extensions. More details can be found in [Spark Writes](../spark-writes) page. + +## Merge-on-read + +In the next few sections, we provide more details around the Iceberg MoR design. Review comment: Could just skip this sentence ########## File path: site/docs/cow-and-mor.md ########## @@ -0,0 +1,195 @@ +<!-- + - Licensed to the Apache Software Foundation (ASF) under one or more + - contributor license agreements. See the NOTICE file distributed with + - this work for additional information regarding copyright ownership. + - The ASF licenses this file to You under the Apache License, Version 2.0 + - (the "License"); you may not use this file except in compliance with + - the License. You may obtain a copy of the License at + - + - http://www.apache.org/licenses/LICENSE-2.0 + - + - Unless required by applicable law or agreed to in writing, software + - distributed under the License is distributed on an "AS IS" BASIS, + - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + - See the License for the specific language governing permissions and + - limitations under the License. 
+ --> + +# Copy-on-Write and Merge-on-Read + +This page explains the concept of copy-on-write and merge-on-read in the context of Iceberg to provide readers more clarity around Iceberg's table spec design. + +## Introduction + +In Iceberg, copy-on-write and merge-on-read are different ways to handle row-level update and delete operations. Here are their definitions: + +- **copy-on-write (CoW)**: an update/delete directly rewrites the entire affected data files. +- **merge-on-read (MoR)**: update/delete information is encoded in the form of delete files. The table reader can apply all delete information at read time. A compaction process takes care of merging delete files into data files asynchronously. + +Clearly, CoW is more efficient in reading data, but MoR is more efficient in writing data. +Users can choose to use **BOTH** CoW and MoR against the same Iceberg table based on different situations. +A common example is that, for a time-partitioned table, newer partitions with more frequent updates are maintained in the MoR approach through a CDC streaming pipeline, +and older partitions are maintained in the CoW way with less frequent GDPR updates from batch ETL jobs. + +## Copy-on-write + +As the definition states, given a user's update/delete requirement, the CoW write process would search for all the affected data files and perform rewrite. +Spark supports CoW `DELETE`, `UPDATE` and `MERGE` operations through Spark extensions. More details can be found in [Spark Writes](../spark-writes) page. + +## Merge-on-read + +In the next few sections, we provide more details around the Iceberg MoR design. + +### Row-Level Delete File Spec Review comment: Do we want the word "Row-Level" here, I feel like it becomes slightly confusing when talking about "equality" deletes vs "position" deletes just because one literally has row information and the other does not. Maybe just Delete File Spec? 
########## File path: site/docs/cow-and-mor.md ########## @@ -0,0 +1,195 @@ +<!-- + - Licensed to the Apache Software Foundation (ASF) under one or more + - contributor license agreements. See the NOTICE file distributed with + - this work for additional information regarding copyright ownership. + - The ASF licenses this file to You under the Apache License, Version 2.0 + - (the "License"); you may not use this file except in compliance with + - the License. You may obtain a copy of the License at + - + - http://www.apache.org/licenses/LICENSE-2.0 + - + - Unless required by applicable law or agreed to in writing, software + - distributed under the License is distributed on an "AS IS" BASIS, + - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + - See the License for the specific language governing permissions and + - limitations under the License. + --> + +# Copy-on-Write and Merge-on-Read + +This page explains the concept of copy-on-write and merge-on-read in the context of Iceberg to provide readers more clarity around Iceberg's table spec design. + +## Introduction + +In Iceberg, copy-on-write and merge-on-read are different ways to handle row-level update and delete operations. Here are their definitions: + +- **copy-on-write (CoW)**: an update/delete directly rewrites the entire affected data files. +- **merge-on-read (MoR)**: update/delete information is encoded in the form of delete files. The table reader can apply all delete information at read time. A compaction process takes care of merging delete files into data files asynchronously. + +Clearly, CoW is more efficient in reading data, but MoR is more efficient in writing data. +Users can choose to use **BOTH** CoW and MoR against the same Iceberg table based on different situations. 
+A common example is that, for a time-partitioned table, newer partitions with more frequent updates are maintained in the MoR approach through a CDC streaming pipeline, +and older partitions are maintained in the CoW way with less frequent GDPR updates from batch ETL jobs. + +## Copy-on-write + +As the definition states, given a user's update/delete requirement, the CoW write process would search for all the affected data files and perform rewrite. +Spark supports CoW `DELETE`, `UPDATE` and `MERGE` operations through Spark extensions. More details can be found in [Spark Writes](../spark-writes) page. + +## Merge-on-read + +In the next few sections, we provide more details around the Iceberg MoR design. + +### Row-Level Delete File Spec + +As documented in the [Spec](../spec/#row-level-deletes) page, Iceberg supports 2 different types of row-level delete files: **position deletes** and **equality deletes**. +If you are unfamiliar with these concepts, please read the related sections in the spec for more information before proceeding. + +Also note that because row-level delete files are valid Iceberg data files, each file must define the partition it belongs to. Review comment: this is a little confusing since we say "Must define the partition it belongs to" then follow up with "It can be unpartitioned" Maybe just instead say "each file reports the partition it was written for". -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
