[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3432: Doc: add a page to explain merge-on-read

GitBox Tue, 02 Nov 2021 14:52:49 -0700


RussellSpitzer commented on a change in pull request #3432:
URL: https://github.com/apache/iceberg/pull/3432#discussion_r741469738




##########
File path: site/docs/cow-and-mor.md
##########
@@ -0,0 +1,195 @@
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Copy-on-Write and Merge-on-Read
+
+This page explains the concept of copy-on-write and merge-on-read in the 
context of Iceberg to provide readers more clarity around Iceberg's table spec 
design.
+
+## Introduction
+
+In Iceberg, copy-on-write and merge-on-read are different ways to handle 
row-level update and delete operations. Here are their definitions:
+
+- **copy-on-write (CoW)**: an update/delete directly rewrites the entire 
affected data files.

Review comment:
       I would make this even more explicit
   
   "When rows within a data file are deleted or updated, all rows within the 
original file will be rewritten into new files containing all of the original 
data but with the affected rows removed or modified."

##########
File path: site/docs/cow-and-mor.md
##########
@@ -0,0 +1,195 @@
+
+# Copy-on-Write and Merge-on-Read
+
+This page explains the concept of copy-on-write and merge-on-read in the 
context of Iceberg to provide readers more clarity around Iceberg's table spec 
design.
+
+## Introduction
+
+In Iceberg, copy-on-write and merge-on-read are different ways to handle 
row-level update and delete operations. Here are their definitions:
+
+- **copy-on-write (CoW)**: an update/delete directly rewrites the entire 
affected data files.
+- **merge-on-read (MoR)**: update/delete information is encoded in the form of 
delete files. The table reader can apply all delete information at read time. A 
compaction process takes care of merging delete files into data files 
asynchronously. 

Review comment:
       Similar comment to the above: "When rows within a data file are 
modified, separate additional "delete" files are created containing information 
about which rows were modified in the original file. These delete files 
represent the delta between the original file and the actual state of the 
table. When the table is read in the future, the "delete" files are read and 
their information is merged with the original data files to create the modified 
version of the file."
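
To make the two definitions concrete, here is a minimal Python sketch of a delete under each approach. `ToyTable` and its methods are hypothetical names for illustration only; this is not the Iceberg API, and it models delete files as plain sets of row positions.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Hypothetical in-memory stand-in for a table; not the Iceberg API."""
    data_files: dict = field(default_factory=dict)    # file name -> list of rows
    delete_files: dict = field(default_factory=dict)  # file name -> set of deleted positions

    def delete_cow(self, name, predicate):
        # Copy-on-write: rewrite every surviving row into a new file
        # and drop the original file.
        rows = self.data_files.pop(name)
        new_name = name + ".rewritten"
        self.data_files[new_name] = [r for r in rows if not predicate(r)]
        return new_name

    def delete_mor(self, name, predicate):
        # Merge-on-read: leave the data file untouched and record only
        # the positions of the deleted rows.
        positions = {i for i, r in enumerate(self.data_files[name]) if predicate(r)}
        self.delete_files.setdefault(name, set()).update(positions)

    def read(self, name):
        # Readers merge the delete information with the data file at read time.
        deleted = self.delete_files.get(name, set())
        return [r for i, r in enumerate(self.data_files[name]) if i not in deleted]
```

Both paths return the same rows from `read`; they differ only in whether the work happens at write time (a new data file) or at read time (a merge with the recorded positions).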
   

##########
File path: site/docs/cow-and-mor.md
##########
@@ -0,0 +1,195 @@
+
+# Copy-on-Write and Merge-on-Read
+
+This page explains the concept of copy-on-write and merge-on-read in the 
context of Iceberg to provide readers more clarity around Iceberg's table spec 
design.
+
+## Introduction
+
+In Iceberg, copy-on-write and merge-on-read are different ways to handle 
row-level update and delete operations. Here are their definitions:
+
+- **copy-on-write (CoW)**: an update/delete directly rewrites the entire 
affected data files.

Review comment:
       I would make this even more explicit
   
   "When rows within a data file are deleted or updated, all rows within the 
original file will be rewritten into new files containing all of the original 
data but with the affected rows removed or modified."
   
   Feel free to take or leave :)

##########
File path: site/docs/cow-and-mor.md
##########
@@ -0,0 +1,195 @@
+
+# Copy-on-Write and Merge-on-Read
+
+This page explains the concept of copy-on-write and merge-on-read in the 
context of Iceberg to provide readers more clarity around Iceberg's table spec 
design.
+
+## Introduction
+
+In Iceberg, copy-on-write and merge-on-read are different ways to handle 
row-level update and delete operations. Here are their definitions:
+
+- **copy-on-write (CoW)**: an update/delete directly rewrites the entire 
affected data files.
+- **merge-on-read (MoR)**: update/delete information is encoded in the form of 
delete files. The table reader can apply all delete information at read time. A 
compaction process takes care of merging delete files into data files 
asynchronously. 
+
+Clearly, CoW is more efficient in reading data, but MoR is more efficient in 
writing data.

Review comment:
       I would just skip the word "Clearly" here; computers are hard, so it may 
not be clear :)
   
   Perhaps something like
   
   Copy on write provides better read performance:
   
   Copy on Write provides better performance on reading because no additional 
files need to be read and merged to get the current state of the table, but this 
comes at the cost of worse performance at write time. At write time, Copy on 
Write must rewrite an entire file even if only a single row is changed within 
that file. Data that was previously written and is unmodified must still be 
rewritten if any adjacent row was modified.
   
   Merge on Read provides better write performance:
   
   Merge on Read only needs to write new files with the data that has changed 
in the table, which means writing significantly less information than Copy on 
Write. Write performance is increased, but every read now requires reading not 
just the requested data files but also all delete files that apply to those 
data files.
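
A rough way to see the trade-off is to count rows written and rows read for a single delete. The function below is a toy cost model under stated assumptions (one data file, one delete entry per deleted row, no compaction); `delete_costs` is a hypothetical name and the numbers are row counts, not measurements.

```python
def delete_costs(rows_in_file, rows_deleted):
    """Toy row counts for deleting rows from one data file (not a benchmark)."""
    # Copy-on-write rewrites every surviving row; later scans read only the new file.
    cow = {
        "written": rows_in_file - rows_deleted,
        "read_per_scan": rows_in_file - rows_deleted,
    }
    # Merge-on-read writes one delete entry per deleted row; later scans read
    # the original data file plus the delete entries that apply to it.
    mor = {
        "written": rows_deleted,
        "read_per_scan": rows_in_file + rows_deleted,
    }
    return {"cow": cow, "mor": mor}
```

For example, deleting a single row from a file of 1,000,000 rows writes 999,999 rows under this model for Copy on Write and one delete entry for Merge on Read, while each later Merge on Read scan reads 1,000,001 entries instead of 999,999.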

##########
File path: site/docs/cow-and-mor.md
##########
@@ -0,0 +1,195 @@
+
+# Copy-on-Write and Merge-on-Read
+
+This page explains the concept of copy-on-write and merge-on-read in the 
context of Iceberg to provide readers more clarity around Iceberg's table spec 
design.
+
+## Introduction
+
+In Iceberg, copy-on-write and merge-on-read are different ways to handle 
row-level update and delete operations. Here are their definitions:
+
+- **copy-on-write (CoW)**: an update/delete directly rewrites the entire 
affected data files.
+- **merge-on-read (MoR)**: update/delete information is encoded in the form of 
delete files. The table reader can apply all delete information at read time. A 
compaction process takes care of merging delete files into data files 
asynchronously. 
+
+Clearly, CoW is more efficient in reading data, but MoR is more efficient in 
writing data.
+Users can choose to use **BOTH** CoW and MoR against the same Iceberg table 
based on different situations. 
+A common example is that, for a time-partitioned table, newer partitions with 
more frequent updates are maintained in the MoR approach through a CDC 
streaming pipeline,
+and older partitions are maintained in the CoW way with less frequent GDPR 
updates from batch ETL jobs.
+
+## Copy-on-write
+
+As the definition states, given a user's update/delete requirement, the CoW 
write process would search for all the affected data files and perform rewrite.
+Spark supports CoW `DELETE`, `UPDATE` and `MERGE` operations through Spark 
extensions. More details can be found in [Spark Writes](../spark-writes) page.
+
+## Merge-on-read
+
+In the next few sections, we provide more details around the Iceberg MoR 
design.

Review comment:
       Could just skip this sentence

##########
File path: site/docs/cow-and-mor.md
##########
@@ -0,0 +1,195 @@
+
+# Copy-on-Write and Merge-on-Read
+
+This page explains the concept of copy-on-write and merge-on-read in the 
context of Iceberg to provide readers more clarity around Iceberg's table spec 
design.
+
+## Introduction
+
+In Iceberg, copy-on-write and merge-on-read are different ways to handle 
row-level update and delete operations. Here are their definitions:
+
+- **copy-on-write (CoW)**: an update/delete directly rewrites the entire 
affected data files.
+- **merge-on-read (MoR)**: update/delete information is encoded in the form of 
delete files. The table reader can apply all delete information at read time. A 
compaction process takes care of merging delete files into data files 
asynchronously. 
+
+Clearly, CoW is more efficient in reading data, but MoR is more efficient in 
writing data.
+Users can choose to use **BOTH** CoW and MoR against the same Iceberg table 
based on different situations. 
+A common example is that, for a time-partitioned table, newer partitions with 
more frequent updates are maintained in the MoR approach through a CDC 
streaming pipeline,
+and older partitions are maintained in the CoW way with less frequent GDPR 
updates from batch ETL jobs.
+
+## Copy-on-write
+
+As the definition states, given a user's update/delete requirement, the CoW 
write process would search for all the affected data files and perform rewrite.
+Spark supports CoW `DELETE`, `UPDATE` and `MERGE` operations through Spark 
extensions. More details can be found in [Spark Writes](../spark-writes) page.
+
+## Merge-on-read
+
+In the next few sections, we provide more details around the Iceberg MoR 
design.
+
+### Row-Level Delete File Spec

Review comment:
       Do we want the word "Row-Level" here? I feel like it becomes slightly 
confusing when talking about "equality" deletes vs. "position" deletes, just 
because one literally has row information and the other does not. Maybe just 
"Delete File Spec"?
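
For readers less familiar with the distinction, the following Python sketch shows how the two kinds of deletes could be applied at read time. The function names and the in-memory shapes (tuples of file path and position, dicts of column values) are assumptions made for illustration; the sketch ignores sequence numbers and everything else the spec requires.

```python
def apply_position_deletes(file_path, rows, position_deletes):
    """Drop rows whose (file path, position) pair appears in the delete entries."""
    deleted = {pos for path, pos in position_deletes if path == file_path}
    return [row for i, row in enumerate(rows) if i not in deleted]


def apply_equality_deletes(rows, equality_deletes):
    """Drop rows whose column values match any of the delete entries."""
    def matches(row, entry):
        # An entry is a dict of column name -> value that a deleted row must equal.
        return all(row.get(col) == val for col, val in entry.items())

    return [row for row in rows if not any(matches(row, e) for e in equality_deletes)]
```

In this sketch a position delete points at a specific row in a specific file, while an equality delete applies to any row with matching values.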

##########
File path: site/docs/cow-and-mor.md
##########
@@ -0,0 +1,195 @@
+
+# Copy-on-Write and Merge-on-Read
+
+This page explains the concept of copy-on-write and merge-on-read in the 
context of Iceberg to provide readers more clarity around Iceberg's table spec 
design.
+
+## Introduction
+
+In Iceberg, copy-on-write and merge-on-read are different ways to handle 
row-level update and delete operations. Here are their definitions:
+
+- **copy-on-write (CoW)**: an update/delete directly rewrites the entire 
affected data files.
+- **merge-on-read (MoR)**: update/delete information is encoded in the form of 
delete files. The table reader can apply all delete information at read time. A 
compaction process takes care of merging delete files into data files 
asynchronously. 
+
+Clearly, CoW is more efficient in reading data, but MoR is more efficient in 
writing data.
+Users can choose to use **BOTH** CoW and MoR against the same Iceberg table 
based on different situations. 
+A common example is that, for a time-partitioned table, newer partitions with 
more frequent updates are maintained in the MoR approach through a CDC 
streaming pipeline,
+and older partitions are maintained in the CoW way with less frequent GDPR 
updates from batch ETL jobs.
+
+## Copy-on-write
+
+As the definition states, given a user's update/delete requirement, the CoW 
write process would search for all the affected data files and perform rewrite.
+Spark supports CoW `DELETE`, `UPDATE` and `MERGE` operations through Spark 
extensions. More details can be found in [Spark Writes](../spark-writes) page.
+
+## Merge-on-read
+
+In the next few sections, we provide more details around the Iceberg MoR 
design.
+
+### Row-Level Delete File Spec
+
+As documented in the [Spec](../spec/#row-level-deletes) page, Iceberg supports 
2 different types of row-level delete files: **position deletes** and 
**equality deletes**.
+If you are unfamiliar with these concepts, please read the related sections in 
the spec for more information before proceeding.
+
+Also note that because row-level delete files are valid Iceberg data files, 
each file must define the partition it belongs to.

Review comment:
       This is a little confusing, since we say "must define the partition it 
belongs to" and then follow up with "it can be unpartitioned".
   
   Maybe just instead say "each file reports the partition it was written for".




