stevenzwu commented on code in PR #15049:
URL: https://github.com/apache/iceberg/pull/15049#discussion_r3012302681


##########
core/src/main/java/org/apache/iceberg/Tracking.java:
##########
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.nio.ByteBuffer;
+import org.apache.iceberg.types.Types;
+
+/** Tracking information for a v4 manifest entry. */
+interface Tracking {
+  Types.NestedField STATUS =
+      Types.NestedField.required(
+          0,
+          "status",
+          Types.IntegerType.get(),
+          "Entry status: 0=existing, 1=added, 2=deleted, 3=replaced");
+  Types.NestedField SNAPSHOT_ID =
+      Types.NestedField.optional(
+          1,
+          "snapshot_id",
+          Types.LongType.get(),
+          "Snapshot ID where the file was added or deleted");
+  Types.NestedField SEQUENCE_NUMBER =
+      Types.NestedField.optional(
+          3, "sequence_number", Types.LongType.get(), "Data sequence number of the file");
+  Types.NestedField FILE_SEQUENCE_NUMBER =
+      Types.NestedField.optional(
+          4,
+          "file_sequence_number",
+          Types.LongType.get(),
+          "File sequence number indicating when the file was added");
+  Types.NestedField DV_SNAPSHOT_ID =
+      Types.NestedField.optional(
+          5,
+          "dv_snapshot_id",
+          Types.LongType.get(),
+          "Snapshot ID where the DV was added; null if there is no DV");
+  Types.NestedField FIRST_ROW_ID =
+      Types.NestedField.optional(
+          142, "first_row_id", Types.LongType.get(), "ID of the first row in the data file");
+  Types.NestedField DELETED_POSITIONS =

Review Comment:
   What is the purpose of the deleted and replaced positions? Are they for change detection?
   
   We talked about a `diff DV` for manifest DVs before. Are these two fields for that purpose?
   
   For data DVs (in Puffin files), I assume we won't compute and store diff DVs during write.



##########
core/src/main/java/org/apache/iceberg/Tracking.java:
##########
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.nio.ByteBuffer;
+import org.apache.iceberg.types.Types;
+
+/** Tracking information for a v4 manifest entry. */
+interface Tracking {
+  Types.NestedField STATUS =
+      Types.NestedField.required(
+          0,
+          "status",
+          Types.IntegerType.get(),
+          "Entry status: 0=existing, 1=added, 2=deleted, 3=replaced");
+  Types.NestedField SNAPSHOT_ID =
+      Types.NestedField.optional(
+          1,
+          "snapshot_id",
+          Types.LongType.get(),
+          "Snapshot ID where the file was added or deleted");
+  Types.NestedField SEQUENCE_NUMBER =
+      Types.NestedField.optional(
+          3, "sequence_number", Types.LongType.get(), "Data sequence number of the file");
+  Types.NestedField FILE_SEQUENCE_NUMBER =
+      Types.NestedField.optional(
+          4,
+          "file_sequence_number",
+          Types.LongType.get(),
+          "File sequence number indicating when the file was added");
+  Types.NestedField DV_SNAPSHOT_ID =

Review Comment:
   Should `DV_SNAPSHOT_ID` be part of the `DeletionVector` schema?
   
   If we add column files in the future, where should we track the column file snapshot ID: in the column file struct or in the tracking struct here?



##########
api/src/main/java/org/apache/iceberg/FileContent.java:
##########
@@ -18,11 +18,13 @@
  */
 package org.apache.iceberg;
 
-/** Content type stored in a file, one of DATA, POSITION_DELETES, or EQUALITY_DELETES. */
+/** Content type stored in a file. */
 public enum FileContent {
   DATA(0),
   POSITION_DELETES(1),
-  EQUALITY_DELETES(2);
+  EQUALITY_DELETES(2),
+  DATA_MANIFEST(3),
+  DELETE_MANIFEST(4);

Review Comment:
   We probably need `DELETE_MANIFEST` entries in the root manifest file (even without equality deletes), since a V3 to V4 upgrade doesn't require rewriting existing data and delete manifest files to colocate the DVs.



##########
core/src/main/java/org/apache/iceberg/TrackedFile.java:
##########
@@ -0,0 +1,173 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.nio.ByteBuffer;
+import java.util.Collections;
+import java.util.List;
+import java.util.Set;
+import org.apache.iceberg.stats.ContentStats;
+import org.apache.iceberg.types.Types;
+
+/** A file tracked by a v4 manifest. */
+interface TrackedFile {
+  Types.NestedField TRACKING =

Review Comment:
   > The fields aren't identical to DataFile: TrackedFile uses location instead of file_path, content_type includes manifest types, and stats are captured differently.
   
   This makes sense.
   
   > Tracking info (status, sequence numbers, snapshot IDs) applies to the entry as a whole, not just to the content file, so it naturally lives at the same level.
   
   Do sequence numbers and snapshot IDs apply to the entry as a whole? Taking the sequence number as an example, there are two sequence numbers in my mind.
   
   * The row-level sequence number, which maps to the data sequence number of the base file. Equality delete matching should use this sequence number.
   * The last updated sequence number, for row lineage purposes. It should be either the sequence number at which the latest column file was added (inheritance) or the value persisted in the column file.
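   
   To make the distinction concrete, here is a minimal sketch of how the two sequence numbers might be used differently. The class and method names are hypothetical and not part of this PR or any Iceberg API. The strict less-than rule for equality deletes follows the table spec; the column file inheritance rule is only an assumption, since column files are not specified yet.
   
   ```java
   /** Hypothetical illustration only; not part of the PR or of any Iceberg API. */
   final class SequenceNumberSketch {
     private SequenceNumberSketch() {}
   
     /**
      * An equality delete applies to a data file only when the data file's data
      * sequence number is strictly less than the delete file's data sequence number.
      */
     static boolean equalityDeleteApplies(long dataFileDataSeq, long deleteFileDataSeq) {
       return dataFileDataSeq < deleteFileDataSeq;
     }
   
     /**
      * Last updated sequence number of a row: the persisted value when present,
      * otherwise inherited from the file that last wrote the row. Treating the
      * latest column file as that file is an assumption.
      */
     static long lastUpdatedSequenceNumber(Long persisted, long inheritedSeq) {
       return persisted != null ? persisted : inheritedSeq;
     }
   }
   ```
   
   With this split, adding a column file would move the inherited last updated sequence number forward without changing which equality deletes match the base file.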



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

