Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/22559 )
Change subject: IMPALA-11402: Add limit on files fetched by a single getPartialCatalogObject request
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/22559/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/22559/1//COMMIT_MSG@28
PS1, Line 28: Choose 4000000 as the default value for this new flag to leave some
> Please clarify if the size of the file descriptor is constant in your assum

That's a good point. Most of the size of a file descriptor comes from the file name. Block locations are just host and disk indexes stored as integers, so their size is trivial.

I'm using file names like "part-00001-53ba34b7-2285-4f3a-8d99-492f87e1fedc-f724dc37-964d-4ea5-afde-7754fc758e39.txt", which is the format for files generated by Spark. I think that's already long enough. Files generated by Impala have names built from the query id and a numeric string, e.g. "cf4a7a47ca0b6b1c-6094b4b600000004_1614373804_data.0.parq", which are shorter. Files generated by Hive have names like "bucket_00000_0", which are even shorter. So I think 4M files is safe if the table doesn't have incremental stats. To be more accurate, we need to consider the size of the partition-level tblproperties, which store the incremental stats and customized key-values.

For performance, the time seems to be dominated by GC pauses at such a large scale. Excluding the GC pause time, here is the time spent on the catalogd side for each response size:

* 371.71MB: 1s487ms
* 744.51MB: 4s035ms
* 1.09GB: 6s643ms

A smaller response size seems better, but it requires more round trips between impalad and catalogd. This needs more testing.
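
To make the arithmetic behind the numbers above easier to check, here is a minimal Python sketch. It is not part of the patch. The helper names (`name_bytes`, `total_name_mb`, `ms_per_mb`) are hypothetical, the 1.09GB figure is converted assuming 1GB = 1024MB, and the size estimate counts only raw file-name bytes, ignoring any serialization overhead, block locations, and tblproperties.

```python
# Example file names quoted in the comment above.
FILE_NAMES = {
    "spark": "part-00001-53ba34b7-2285-4f3a-8d99-492f87e1fedc-"
             "f724dc37-964d-4ea5-afde-7754fc758e39.txt",
    "impala": "cf4a7a47ca0b6b1c-6094b4b600000004_1614373804_data.0.parq",
    "hive": "bucket_00000_0",
}

# Measurements quoted above: (response size in MB, catalogd time in ms, GC excluded).
# Assumption: 1.09GB is converted using 1GB = 1024MB.
MEASUREMENTS = [
    (371.71, 1487),
    (744.51, 4035),
    (1.09 * 1024, 6643),
]

NUM_FILES = 4_000_000  # the proposed default limit


def name_bytes(name: str) -> int:
    """Bytes needed for the file name alone (UTF-8)."""
    return len(name.encode("utf-8"))


def total_name_mb(name: str, num_files: int = NUM_FILES) -> float:
    """Raw name bytes for num_files files with names of this length, in MB."""
    return name_bytes(name) * num_files / (1024 * 1024)


def ms_per_mb(size_mb: float, time_ms: float) -> float:
    """Catalogd time per MB of response."""
    return time_ms / size_mb


if __name__ == "__main__":
    for writer, name in FILE_NAMES.items():
        print(f"{writer}: {name_bytes(name)} bytes/name, "
              f"{total_name_mb(name):.1f} MB of names for {NUM_FILES} files")
    for size_mb, time_ms in MEASUREMENTS:
        print(f"{size_mb:.2f} MB: {ms_per_mb(size_mb, time_ms):.2f} ms/MB")
```

On the three data points above, the per-MB cost comes out at roughly 4.0, 5.4 and 6.0 ms/MB, so the time grows faster than the response size. That is consistent with smaller responses being cheaper per byte, although three measurements are too few to draw a firm conclusion.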
--
To view, visit http://gerrit.cloudera.org:8080/22559
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibb13fec20de5a17e7fc33613ca5cdebb9ac1a1e5
Gerrit-Change-Number: 22559
Gerrit-PatchSet: 1
Gerrit-Owner: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Kurt Deschler <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Comment-Date: Fri, 28 Feb 2025 12:13:55 +0000
Gerrit-HasComments: Yes
