amogh-jahagirdar commented on code in PR #13061:
URL: https://github.com/apache/iceberg/pull/13061#discussion_r2114565265
##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/ManifestFileBean.java:
##########
@@ -46,6 +47,7 @@ public static ManifestFileBean fromManifest(ManifestFile manifest) {
bean.setAddedSnapshotId(manifest.snapshotId());
bean.setContent(manifest.content().id());
bean.setSequenceNumber(manifest.sequenceNumber());
+ bean.setFirstRowId(manifest.firstRowId());
Review Comment:
Ok @RussellSpitzer, I looked into it a bit more here. The issue I pointed
out earlier regarding
https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/BaseSparkAction.java#L156
is that when we read the manifests metadata table into a dataframe, we cannot
deserialize its records into the `ManifestFileBean` structure: the encoder
expects every bean field to exist in the source records (the value may be null,
but the column must be present).
As a result, I think the main decision is whether we keep relying on
`ManifestFileBean` in the base Spark procedure I linked, or compose a slightly
slimmed-down structure with a more minimal field set and all the
getters/setters.
My take is that we should keep re-using it, and it's fine if some fields
don't have explicit getters, since that means they aren't always required from
the `Encoder` perspective.
Additionally, we'll need to add first_row_id to the manifests metadata table
anyway. At that point we _could_ add back the getter, because then we can
project that field when reading into this structure. But that still doesn't
seem quite right, because many of the procedures don't even need to project
that field.
TLDR, I think there are 3 options (I'm preferring 1):
1. Leave things as is: specific fields like first_row_id won't have explicit
getters, because deserializing into the bean via the `Encoder` creates the
expectation that the record actually has that field, which isn't always true.
2. Add a custom slimmed-down version of `ManifestFileBean` for the Spark case,
or the inverse: introduce a bean that composes the existing bean plus a
first_row_id for the distributed planning case.
3. Add back the getter when working on the manifests metadata table support
for first_row_id, and change the projection in the Spark procedure.
##########
spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestRowLevelOperationsWithLineage.java:
##########
@@ -81,6 +87,40 @@ record ->
createRecord(SCHEMA, 103, "d", 3L, 1L),
createRecord(SCHEMA, 104, "e", 4L, 1L));
+ @Parameters(
+ name =
+ "catalogName = {0}, implementation = {1}, config = {2},"
+ + " format = {3}, vectorized = {4}, distributionMode = {5},"
+ " fanout = {6}, branch = {7}, planningMode = {8}, formatVersion = {9}")
+ public static Object[][] parameters() {
+ return new Object[][] {
+ {
+ "testhadoop",
+ SparkCatalog.class.getName(),
+ ImmutableMap.of("type", "hadoop"),
+ FileFormat.PARQUET,
+ false,
+ WRITE_DISTRIBUTION_MODE_HASH,
+ true,
+ null,
+ LOCAL,
Review Comment:
@RussellSpitzer I looked into it a bit more, and I think slimming this down
would be quite a bit of work that I'm not sure is actually beneficial. For
this class we know we want to vary the following parameters for testing:
1.) Planning mode
2.) File format
3.) Vectorized
4.) Distribution mode (mostly to make sure that, regardless of the plan, we
never end up dropping the lineage fields, i.e. that the rewritten logical
rules just work)
What we should be able to hold constant is all the catalog setup, but that's
deep down in the inheritance path of this test. And I think we do want to keep
inheriting from the existing row-level operations tests, not only for the
parameters but also for some shared helper methods (table setup, view creation
for the merge, etc.).
All in all, I'm also not too concerned about test times blowing up here
(this test suite takes around 30 seconds at the moment).
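The four varying dimensions above can be sketched as a small parameter matrix
while the catalog configuration stays fixed. This is plain Java with
hypothetical value sets (the real suite draws them from Iceberg's enums and
the `@Parameters` annotation shown in the diff), just to illustrate that the
combination count stays modest:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch, not the actual test code: vary only planning mode, file format,
// vectorization, and distribution mode; hold the catalog constant.
public class ParameterMatrixSketch {
  public static List<Object[]> parameters() {
    String[] planningModes = {"LOCAL", "DISTRIBUTED"};        // hypothetical set
    String[] fileFormats = {"PARQUET", "ORC", "AVRO"};        // hypothetical set
    boolean[] vectorized = {true, false};
    String[] distributionModes = {"none", "hash", "range"};   // hypothetical set

    List<Object[]> params = new ArrayList<>();
    for (String planning : planningModes)
      for (String format : fileFormats)
        for (boolean vec : vectorized)
          for (String dist : distributionModes)
            // catalog name/implementation/config stay fixed for every row
            params.add(new Object[] {"testhadoop", planning, format, vec, dist});
    return params;
  }

  public static void main(String[] args) {
    // 2 planning modes * 3 formats * 2 vectorized * 3 distribution modes
    System.out.println(parameters().size()); // prints 36
  }
}
```

Even the full cross product of these four dimensions is small, which supports
the point that test times are unlikely to blow up.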
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]