amogh-jahagirdar commented on code in PR #13061:
URL: https://github.com/apache/iceberg/pull/13061#discussion_r2114565265
##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/ManifestFileBean.java:
##########
@@ -46,6 +47,7 @@ public static ManifestFileBean fromManifest(ManifestFile manifest) {
bean.setAddedSnapshotId(manifest.snapshotId());
bean.setContent(manifest.content().id());
bean.setSequenceNumber(manifest.sequenceNumber());
+ bean.setFirstRowId(manifest.firstRowId());
Review Comment:
Ok @RussellSpitzer, I looked into it a bit more here. The issue I pointed
out earlier regarding
https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/BaseSparkAction.java#L156
is that when we read the manifests metadata table into a dataframe, we cannot
deserialize its records into the `ManifestFileBean` structure: the encoder
expects every bean field to exist in the source records (the value may be null,
but the column must be present).
As a result, I think the main decision is whether we keep relying on
`ManifestFileBean` in the base Spark procedure I linked, or compose a slightly
slimmed-down structure with a more minimal field set and all the
getters/setters.
My take is that we should keep re-using it, and it's fine if some fields
don't have explicit getters, since that means they aren't always required from
the `Encoder` perspective.
Additionally, we'll need to add first_row_id to the manifests metadata table
anyway. At that point we _could_ add back the getter, because then we can
project that field when reading into this structure. But that still doesn't
seem quite right, because many of the procedures don't even need to project
that field.
TLDR, I think there are 3 options (I'm preferring 1):
1. Leave things as is: specific fields like first_row_id won't have explicit
getters, because deserializing into the bean via the `Encoder` creates the
expectation that the record actually has that field, which isn't always true.
2. Add a custom slimmed-down version of `ManifestFileBean` for the Spark case,
or the inverse: introduce a bean that composes the existing bean plus a
first_row_id for the distributed planning case.
3. Add back the getter when working on the manifests metadata table support
for first_row_id, and change the projection in the Spark procedure.
##########
spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestRowLevelOperationsWithLineage.java:
##########
@@ -81,6 +87,40 @@ record ->
createRecord(SCHEMA, 103, "d", 3L, 1L),
createRecord(SCHEMA, 104, "e", 4L, 1L));
+ @Parameters(
+ name =
+ "catalogName = {0}, implementation = {1}, config = {2},"
+ + " format = {3}, vectorized = {4}, distributionMode = {5},"
+ " fanout = {6}, branch = {7}, planningMode = {8}, formatVersion = {9}")
+ public static Object[][] parameters() {
+ return new Object[][] {
+ {
+ "testhadoop",
+ SparkCatalog.class.getName(),
+ ImmutableMap.of("type", "hadoop"),
+ FileFormat.PARQUET,
+ false,
+ WRITE_DISTRIBUTION_MODE_HASH,
+ true,
+ null,
+ LOCAL,
Review Comment:
@RussellSpitzer I looked into it a bit more, and I think slimming this down
would be quite a bit of work that I'm not sure is actually beneficial. For
this class we know we want to vary the following parameters for testing:
1.) Planning mode
2.) File format
3.) Vectorized
4.) Distribution mode (mostly to make sure that, regardless of the plan, we
never end up dropping the lineage fields, i.e. that the rewritten logical
rules just work)
What we should be able to hold constant is all the catalog setup, but that's
deep down in the inheritance path of this test. And I think we do want to keep
inheriting from the existing row-level operations tests, not only for the
parameters but also for some shared helper methods (table setup, view creation
for the merge, etc.).
All in all, I'm also not too concerned about test times blowing up here
(this test suite takes around 30 seconds at the moment).
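The four varying dimensions above can be sketched as a small parameter matrix
while the catalog configuration stays fixed. This is plain Java with
hypothetical value sets (the real suite draws them from Iceberg's enums and
the `@Parameters` annotation shown in the diff), just to illustrate that the
combination count stays modest:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch, not the actual test code: vary only planning mode, file format,
// vectorization, and distribution mode; hold the catalog constant.
public class ParameterMatrixSketch {
  public static List<Object[]> parameters() {
    String[] planningModes = {"LOCAL", "DISTRIBUTED"};        // hypothetical set
    String[] fileFormats = {"PARQUET", "ORC", "AVRO"};        // hypothetical set
    boolean[] vectorized = {true, false};
    String[] distributionModes = {"none", "hash", "range"};   // hypothetical set

    List<Object[]> params = new ArrayList<>();
    for (String planning : planningModes)
      for (String format : fileFormats)
        for (boolean vec : vectorized)
          for (String dist : distributionModes)
            // catalog name/implementation/config stay fixed for every row
            params.add(new Object[] {"testhadoop", planning, format, vec, dist});
    return params;
  }

  public static void main(String[] args) {
    // 2 planning modes * 3 formats * 2 vectorized * 3 distribution modes
    System.out.println(parameters().size()); // prints 36
  }
}
```

Even the full cross product of these four dimensions is small, which supports
the point that test times are unlikely to blow up.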
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]