Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/17857 )
Change subject: IMPALA-10914: Consistently schedule scan ranges for Iceberg tables
......................................................................


Patch Set 2:

(19 comments)

Thanks everyone for the comments!

http://gerrit.cloudera.org:8080/#/c/17857/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17857/1//COMMIT_MSG@11
PS1, Line 11: and the HDFS
            : block locations were not consistent across the reload
> +1.
When we retrieve the file status from HDFS we also get block location information, i.e. for each block we get a list of where its replicas can be found. The order of the elements in this list is random for each file status call. The order of the block locations matters during scheduling. In other words, even if the state of the HDFS blocks remains the same, Impala might schedule scan ranges differently because it sees the block locations in a different order.
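Just to make the ordering issue above concrete, here is a small self-contained sketch against the plain Hadoop FileSystem API (illustration only, not part of this patch; the class name is made up): it fetches the block locations of the same unchanged file twice and shows that only the order of the replica hosts differs, which is exactly what made scan range scheduling drift across reloads.

    import java.io.IOException;
    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Illustration only (not part of the patch): two consecutive block
    // location lookups for the same, unchanged file may return the replica
    // hosts of each block in a different order.
    public class BlockLocationOrderDemo {
      public static void main(String[] args) throws IOException {
        Path file = new Path(args[0]);  // some HDFS data file of the table
        FileSystem fs = file.getFileSystem(new Configuration());
        FileStatus status = fs.getFileStatus(file);

        BlockLocation[] first = fs.getFileBlockLocations(status, 0, status.getLen());
        BlockLocation[] second = fs.getFileBlockLocations(status, 0, status.getLen());

        for (int i = 0; i < first.length; ++i) {
          String[] hostsA = first[i].getHosts();
          String[] hostsB = second[i].getHosts();
          // The raw host arrays can come back in different orders...
          System.out.println("block " + i + ": " + Arrays.toString(hostsA)
              + " vs " + Arrays.toString(hostsB));
          // ...while the sorted copies are identical (assuming no
          // re-replication happened in between), i.e. only the order changed.
          Arrays.sort(hostsA);
          Arrays.sort(hostsB);
          assert Arrays.equals(hostsA, hostsB);
        }
      }
    }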
http://gerrit.cloudera.org:8080/#/c/17857/1//COMMIT_MSG@25
PS1, Line 25: Fixing the above revealed another bug. Before this patch we didn't
            : handle self-events of Iceberg tables.
> Can you add more information about this? For example what events are genera
I extended the following sentence.


http://gerrit.cloudera.org:8080/#/c/17857/1//COMMIT_MSG@31
PS1, Line 31: via the event notification mechanism. I fixed this by cr
> +1 if we have native version number support in Iceberg it is better to use
Iceberg tables can be stored in different Iceberg catalogs. One of them is HiveCatalog, which uses HMS to store information about Iceberg tables. Most importantly, it stores snapshot information in the table property 'metadata_location'. We could check whether that has changed. But if the Iceberg table is stored in HadoopTables/HadoopCatalog, then we would need to do expensive IO operations to determine the snapshot ID.

I think we could use the snapshot ID in IcebergTable.load() to avoid reloading a lot of information. But we still need to detect self-events, because the catalogVersion of the table gets updated and that can cause exceptions in local catalog mode (InconsistentMetadataFetchException with the message "Catalog object %s changed version between accesses.").

I'm looking forward to the overhaul of the self-events logic and how it will work with Iceberg tables in HiveCatalog (where we don't have control over the HMS RPCs).
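To illustrate the HiveCatalog shortcut mentioned above, here is a minimal sketch (illustration only; the class and helper names are made up and this is not what the patch implements) that compares the 'metadata_location' property of two versions of the HMS table object to decide whether the snapshot moved at all.

    import java.util.Map;
    import java.util.Objects;

    import org.apache.hadoop.hive.metastore.api.Table;

    // Hypothetical helper, for illustration only: HiveCatalog records the
    // current Iceberg metadata file in the 'metadata_location' table property,
    // so an unchanged value means the snapshot did not move and a full reload
    // of file metadata could be skipped.
    public final class IcebergMetadataLocationCheck {
      private static final String METADATA_LOCATION = "metadata_location";

      public static boolean snapshotChanged(Table before, Table after) {
        return !Objects.equals(prop(before), prop(after));
      }

      private static String prop(Table tbl) {
        Map<String, String> params = tbl.getParameters();
        return params == null ? null : params.get(METADATA_LOCATION);
      }
    }

As noted above, this shortcut only exists for HiveCatalog tables; with HadoopTables/HadoopCatalog the current metadata pointer lives on the file system, so determining the snapshot ID needs extra IO.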
http://gerrit.cloudera.org:8080/#/c/17857/1//COMMIT_MSG@34
PS1, Line 34: Iceberg has to embed all table modifications in a single ALTER TABLE
> It is a bit tricky, but self-event handling can be also tested, see https:/
Added e2e tests.


http://gerrit.cloudera.org:8080/#/c/17857/1/fe/src/main/java/org/apache/impala/catalog/IcebergTable.java
File fe/src/main/java/org/apache/impala/catalog/IcebergTable.java:

http://gerrit.cloudera.org:8080/#/c/17857/1/fe/src/main/java/org/apache/impala/catalog/IcebergTable.java@348
PS1, Line 348: hdfsTable_
             :   .load(false, msClient, msTable_, true, true, false, null, null,null, reason);
             : pathHashToFileDescMap_ = Utils.loadAllPartition(this);
> This is probably unrelated to this patch I was curious to understand what i
Yeah, it's on the radar: IMPALA-10254

Currently the metadata handling of Iceberg tables is quite ugly and redundant, especially for file metadata. We load all the file descriptors recursively here via file listings, then we also load the file descriptors one-by-one... I think as a first step we should at least reuse the file descriptors of the underlying hdfsTable_, because the recursive listing is probably more efficient than the one-by-one file status RPCs.

But the ideal would be to refactor Iceberg tables, hopefully get rid of the underlying HdfsTable, and most importantly create the file descriptors from the information in the Iceberg metadata files. The latter might not be achievable when HDFS is used to store the data files, because we need block location information. But on S3 we should only use Iceberg metadata. The only challenge there is to come up with a "modification time" for the files, which is needed by the remote data cache (hopefully we can use the snapshot time if nothing verifies it).


http://gerrit.cloudera.org:8080/#/c/17857/1/fe/src/main/java/org/apache/impala/catalog/IcebergTable.java@481
PS1, Line 481: TGetPartialCatalogObjectResponse resp =
             :   getHdfsTable().getPartialInfo(req, missingPartialInfos);
             : if (req.table_info_selector.want_iceberg_snapshot) {
             :   resp.table_info.setIceberg_snapshot(
             :     FeIcebergTable.Utils.createTIcebergSnapshot(this));
             : }
> are we sending some file descriptors twice here? Some from the hdfsTable an
Yes, we send all data file descriptors twice. Let me deal with it in a separate patch.


http://gerrit.cloudera.org:8080/#/c/17857/1/fe/src/main/java/org/apache/impala/catalog/iceberg/IcebergCatalogs.java
File fe/src/main/java/org/apache/impala/catalog/iceberg/IcebergCatalogs.java:

http://gerrit.cloudera.org:8080/#/c/17857/1/fe/src/main/java/org/apache/impala/catalog/iceberg/IcebergCatalogs.java@104
PS1, Line 104: properties.setProperty("external.table.purge", "TRUE");
> Is this change related to the ones mentioned in the commit message?
Yes, because before this change we issued an ALTER TABLE statement to set this property, which we then processed later during event processing.


http://gerrit.cloudera.org:8080/#/c/17857/1/fe/src/main/java/org/apache/impala/catalog/iceberg/IcebergCatalogs.java@151
PS1, Line 151: properties.setProperty(IcebergTable.ICEBERG_CATALOG,
             :   tableProps.get(IcebergTable.ICEBERG_CATALOG));
> Is this change related to the ones mentioned in the commit message?
Yes, because in the past we set it in a separate ALTER TABLE.


http://gerrit.cloudera.org:8080/#/c/17857/1/fe/src/main/java/org/apache/impala/catalog/local/LocalIcebergTable.java
File fe/src/main/java/org/apache/impala/catalog/local/LocalIcebergTable.java:

http://gerrit.cloudera.org:8080/#/c/17857/1/fe/src/main/java/org/apache/impala/catalog/local/LocalIcebergTable.java@110
PS1, Line 110: "Failed to load table: %s.%s", msTable.getDbName(), msTable.getTableName()), e);
> nit probably should include the snapshot id in the message, if it is not in
We would also retrieve the snapshot id in loadIcebergSnapshot().


http://gerrit.cloudera.org:8080/#/c/17857/1/fe/src/main/java/org/apache/impala/catalog/local/LocalIcebergTable.java@116
PS1, Line 116: pathHashToFileDescMap_
> Am I correct in understanding that pathHashToFileDescMap_ is always loaded
Yes, we shouldn't load the file descriptors twice, but currently things break if we don't do that. Let me deal with it in a separate patch, probably in the context of IMPALA-10254.


http://gerrit.cloudera.org:8080/#/c/17857/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
File fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java:

http://gerrit.cloudera.org:8080/#/c/17857/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@1227
PS1, Line 1227: a
> nit.
> an
Done


http://gerrit.cloudera.org:8080/#/c/17857/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@1268
PS1, Line 1268: addSummary(response, "Updated table.");
> Updated table after set table properties
I used the same message that we use at L1079.


http://gerrit.cloudera.org:8080/#/c/17857/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@1272
PS1, Line 1272: Updated table.
> Updated table after unset table properties
I used the same message that we use at L1090.


http://gerrit.cloudera.org:8080/#/c/17857/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@1295
PS1, Line 1295: if (!needsToUpdateHms) {
              :   loadTableMetadata(tbl, newCatalogVersion, true, true, null, "ALTER Iceberg TABLE " +
              :     params.getAlter_type().name());
              :   catalog_.addVersionsForInflightEvents(false, tbl, newCatalogVersion);
              :   addTableToCatalogUpdate(tbl, wantMinimalResult, response.result);
              :   return false;
              : }
> Should this block of code be executed within 1283-1286 too? That is, if txn
SET PARTITION SPEC doesn't use an Iceberg transaction, and we don't need to update HMS either.


http://gerrit.cloudera.org:8080/#/c/17857/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@2915
PS1, Line 2915: IcebergCatalogOpExecutor.addCatalogVersionToTxn(iceTxn,
> It is not clear to me why do we need to increment the catalog version here
dropTableStats() issues an ALTER TABLE statement, then iceTxn.commitTransaction() issues another ALTER TABLE statement, therefore we need to invoke catalog_.addVersionsForInflightEvents() twice (once at L2914 and once at L2773). It looked reasonable to me to use two different catalogVersions, but it seems like using the same catalogVersion twice also works.


http://gerrit.cloudera.org:8080/#/c/17857/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@2918
PS1, Line 2918: iceTxn.commitTransaction();
              : return newCatalogVersion;
> Can you move these statements into a finally block and avoid code duplicati
Actually we can remove these two lines, since we have the same two lines right after the if statement. I'm not sure about putting commitTransaction() into a finally block; I'd think we don't want to commit in case of exceptions.
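On the finally-block question above, here is a minimal sketch using only the public Iceberg Transaction API (illustration only; the class and method names are made up and this is not the code in CatalogOpExecutor) of why commitTransaction() belongs on the success path rather than in a finally block.

    import org.apache.iceberg.Table;
    import org.apache.iceberg.Transaction;

    // Illustration only: all changes are staged on the transaction and become
    // visible in one atomic step when commitTransaction() runs. If the commit
    // sat in a finally block, a half-prepared transaction would be published
    // even when one of the staging steps threw an exception.
    public final class IcebergTxnPattern {
      public static void setProperty(Table table, String key, String value) {
        Transaction txn = table.newTransaction();
        // Stage the change; readers of the table don't see it yet.
        txn.updateProperties().set(key, value).commit();
        // Further staged operations (schema changes, etc.) could go here and
        // may throw; in that case the transaction is simply abandoned and the
        // table is left untouched.

        // Commit only if everything above succeeded.
        txn.commitTransaction();
      }
    }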
http://gerrit.cloudera.org:8080/#/c/17857/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@6070
PS1, Line 6070:
> Can we add test coverage for this in the test_self_events?
Added a test to test_events_custom_configs.py.


http://gerrit.cloudera.org:8080/#/c/17857/1/fe/src/main/java/org/apache/impala/service/IcebergCatalogOpExecutor.java
File fe/src/main/java/org/apache/impala/service/IcebergCatalogOpExecutor.java:

http://gerrit.cloudera.org:8080/#/c/17857/1/fe/src/main/java/org/apache/impala/service/IcebergCatalogOpExecutor.java@113
PS1, Line 113: addColumn
> nit. addColumns()?
Done


http://gerrit.cloudera.org:8080/#/c/17857/1/fe/src/main/java/org/apache/impala/service/IcebergCatalogOpExecutor.java@339
PS1, Line 339: table properties usi
> nit. update property in 'txn'
Updated the comment.


--
To view, visit http://gerrit.cloudera.org:8080/17857
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibb8216b37d350469b573dad7fcefdd0ee0599ed5
Gerrit-Change-Number: 17857
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Vihang Karajgaonkar <vih...@cloudera.com>
Gerrit-Comment-Date: Thu, 23 Sep 2021 13:34:40 +0000
Gerrit-HasComments: Yes