anoopj commented on code in PR #15049:
URL: https://github.com/apache/iceberg/pull/15049#discussion_r3012527142
##########
core/src/main/java/org/apache/iceberg/TrackedFile.java:
##########
@@ -0,0 +1,173 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.nio.ByteBuffer;
+import java.util.Collections;
+import java.util.List;
+import java.util.Set;
+import org.apache.iceberg.stats.ContentStats;
+import org.apache.iceberg.types.Types;
+
+/** A file tracked by a v4 manifest. */
+interface TrackedFile {
+  Types.NestedField TRACKING =

Review Comment:
   Yes, there are two distinct sequence numbers in the `Tracking` struct:
   1. `sequence_number`: data sequence number of the base file. Yes, this is what is used for equality delete matching.
   2. `file_sequence_number`: indicates when the file was added. When column files are added, each column file would carry its own `file_sequence_number` within the column file struct.

##########
core/src/main/java/org/apache/iceberg/Tracking.java:
##########
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.
The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.nio.ByteBuffer;
+import org.apache.iceberg.types.Types;
+
+/** Tracking information for a v4 manifest entry. */
+interface Tracking {
+  Types.NestedField STATUS =
+      Types.NestedField.required(
+          0,
+          "status",
+          Types.IntegerType.get(),
+          "Entry status: 0=existing, 1=added, 2=deleted, 3=replaced");
+  Types.NestedField SNAPSHOT_ID =
+      Types.NestedField.optional(
+          1,
+          "snapshot_id",
+          Types.LongType.get(),
+          "Snapshot ID where the file was added or deleted");
+  Types.NestedField SEQUENCE_NUMBER =
+      Types.NestedField.optional(
+          3, "sequence_number", Types.LongType.get(), "Data sequence number of the file");
+  Types.NestedField FILE_SEQUENCE_NUMBER =
+      Types.NestedField.optional(
+          4,
+          "file_sequence_number",
+          Types.LongType.get(),
+          "File sequence number indicating when the file was added");
+  Types.NestedField DV_SNAPSHOT_ID =
+      Types.NestedField.optional(
+          5,
+          "dv_snapshot_id",
+          Types.LongType.get(),
+          "Snapshot ID where the DV was added; null if there is no DV");
+  Types.NestedField FIRST_ROW_ID =
+      Types.NestedField.optional(
+          142, "first_row_id", Types.LongType.get(), "ID of the first row in the data file");
+  Types.NestedField DELETED_POSITIONS =

Review Comment:
   Yes, these are the diff bitmaps for change detection in the tracking struct.
   They record what changed in the current snapshot:
   - `deleted_positions`: positions that were deleted (via a new DV) in this snapshot
   - `replaced_positions`: positions that were replaced in this snapshot

   Yes, we won't compute diff DVs for data DVs (because it will be prohibitively expensive).

##########
api/src/main/java/org/apache/iceberg/FileContent.java:
##########
@@ -18,11 +18,13 @@
  */
 package org.apache.iceberg;
 
-/** Content type stored in a file, one of DATA, POSITION_DELETES, or EQUALITY_DELETES. */
+/** Content type stored in a file. */
 public enum FileContent {
   DATA(0),
   POSITION_DELETES(1),
-  EQUALITY_DELETES(2);
+  EQUALITY_DELETES(2),
+  DATA_MANIFEST(3),
+  DELETE_MANIFEST(4);

Review Comment:
   Yes, agreed. When a table is upgraded from v3, existing delete manifests will need to be referenced from root as `DELETE_MANIFEST` entries.

##########
core/src/main/java/org/apache/iceberg/Tracking.java:
##########
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.nio.ByteBuffer;
+import org.apache.iceberg.types.Types;
+
+/** Tracking information for a v4 manifest entry.
*/
+interface Tracking {
+  Types.NestedField STATUS =
+      Types.NestedField.required(
+          0,
+          "status",
+          Types.IntegerType.get(),
+          "Entry status: 0=existing, 1=added, 2=deleted, 3=replaced");
+  Types.NestedField SNAPSHOT_ID =
+      Types.NestedField.optional(
+          1,
+          "snapshot_id",
+          Types.LongType.get(),
+          "Snapshot ID where the file was added or deleted");
+  Types.NestedField SEQUENCE_NUMBER =
+      Types.NestedField.optional(
+          3, "sequence_number", Types.LongType.get(), "Data sequence number of the file");
+  Types.NestedField FILE_SEQUENCE_NUMBER =
+      Types.NestedField.optional(
+          4,
+          "file_sequence_number",
+          Types.LongType.get(),
+          "File sequence number indicating when the file was added");
+  Types.NestedField DV_SNAPSHOT_ID =

Review Comment:
   Really great questions again.

   `DV_SNAPSHOT_ID` is in `Tracking` rather than `DeletionVector` because the DV struct represents the serialized content (location, offset, size, cardinality) while tracking holds lifecycle metadata (snapshot IDs, sequence numbers, status).

   For column files in the future, I'd expect each column file to carry its own snapshot ID within the column file struct, similar to how it would carry its own `file_sequence_number`. The tracking struct tracks the lifecycle of the entry as a whole, while per-component metadata (when was this specific column file added) would live with that component's struct.

   @rdblue may have a more definitive opinion on this. I am happy to align on it when column files are designed.
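
   As a rough illustration of the content/lifecycle split described in the last comment, here is a minimal sketch. The class names `DeletionVectorRef` and `TrackingInfo`, the helper `dvAddedIn`, and the example location string are all hypothetical; they are not the `Tracking` or `DeletionVector` interfaces from this PR, and the field list is only what the comments above mention.

```java
// Hypothetical sketch: these classes are NOT the Tracking / DeletionVector
// interfaces from the PR. Names and constructor shapes are assumptions.
final class TrackingSketch {

  /** Serialized DV content only: where the bitmap lives and how large it is. */
  static final class DeletionVectorRef {
    final String location;
    final long offset;
    final long sizeInBytes;
    final long cardinality;

    DeletionVectorRef(String location, long offset, long sizeInBytes, long cardinality) {
      this.location = location;
      this.offset = offset;
      this.sizeInBytes = sizeInBytes;
      this.cardinality = cardinality;
    }
  }

  /** Lifecycle metadata for the manifest entry as a whole. */
  static final class TrackingInfo {
    final int status; // 0=existing, 1=added, 2=deleted, 3=replaced
    final Long snapshotId; // snapshot where the file was added or deleted
    final Long sequenceNumber; // data sequence number
    final Long fileSequenceNumber; // when the file was added
    final Long dvSnapshotId; // snapshot where the DV was added; null if there is no DV

    TrackingInfo(
        int status,
        Long snapshotId,
        Long sequenceNumber,
        Long fileSequenceNumber,
        Long dvSnapshotId) {
      this.status = status;
      this.snapshotId = snapshotId;
      this.sequenceNumber = sequenceNumber;
      this.fileSequenceNumber = fileSequenceNumber;
      this.dvSnapshotId = dvSnapshotId;
    }
  }

  /** Returns true if the entry has a DV and that DV was added in the given snapshot. */
  static boolean dvAddedIn(TrackingInfo tracking, long snapshotId) {
    return tracking.dvSnapshotId != null && tracking.dvSnapshotId == snapshotId;
  }

  public static void main(String[] args) {
    // The location below is a made-up placeholder, not a real path.
    DeletionVectorRef dv = new DeletionVectorRef("example/dv.puffin", 4L, 128L, 10L);
    TrackingInfo tracking = new TrackingInfo(0, 40L, 7L, 7L, 42L);
    System.out.println("DV cardinality: " + dv.cardinality);
    System.out.println("DV added in snapshot 42: " + dvAddedIn(tracking, 42L));
  }
}
```

   The point of the sketch is only that a lifecycle question ("was the DV added in this snapshot?") can be answered from the tracking fields alone, without reading anything from the DV content struct.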
