Arnab Karmakar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/23906 )

Change subject: IMPALA-13122: Add detailed file metadata statistics to table 
loading logs
......................................................................


Patch Set 2:

(3 comments)

Id really appreciate some guidance here, the newly added tests failing in 
Jenkins with:
FileNotFoundError: [Errno 2] No such file or directory: 
'/home/ubuntu/Impala/logs/ee_tests/catalogd.INFO'

The method assert_catalogd_log_contains() looks for the logs in 
${IMPALA_HOME}/logs/ee_tests/catalogd.INFO but the catalogd logs are getting 
stored in ${IMPALA_HOME}/logs/cluster/catalogd.INFO. I tried creating symlinks 
but apparently its not working.

http://gerrit.cloudera.org:8080/#/c/23906/2/fe/src/main/java/org/apache/impala/catalog/FeFsTable.java
File fe/src/main/java/org/apache/impala/catalog/FeFsTable.java:

http://gerrit.cloudera.org:8080/#/c/23906/2/fe/src/main/java/org/apache/impala/catalog/FeFsTable.java@124
PS2, Line 124:     // Set of unique host indices (for HDFS/Ozone only)
             :     public Set<Integer> uniqueHostIndices = new HashSet<>();
             :     // Set of unique disk IDs (for HDFS/Ozone only)
             :     public Set<Short> uniqueDiskIds = new HashSet<>();
> I am not sure about the usefulness of treating these as two separate sets -
Should we change this to track host:disk pairs instead, or add it as an 
additional metric alongside the current ones?


http://gerrit.cloudera.org:8080/#/c/23906/2/fe/src/main/java/org/apache/impala/catalog/FeFsTable.java@188
PS2, Line 188:       // Note: We cannot accurately update min/max values when 
removing stats
             :       // Remove hosts and disks from the sets if they belong to 
the removed files
> not sure if it worth it, but remove() could be implemented with counter map
Simple removeAll() is O(n), while counter maps may add memory overhead and code 
complexity. Im not sure about the tradeoff here, Ill let you decide which 
approach we should take.


http://gerrit.cloudera.org:8080/#/c/23906/2/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

http://gerrit.cloudera.org:8080/#/c/23906/2/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@802
PS2, Line 802:       if (aggregatedStats.numFiles > 0) {
             :         statsLog.append("Files: 
").append(aggregatedStats.numFiles)
             :             .append(", Blocks: 
").append(aggregatedStats.numBlocks)
             :             .append(", Total size: ")
             :             
.append(PrintUtils.printBytes(aggregatedStats.totalFileBytes))
             :             .append(", File sizes (min/avg/max): ")
             :             
.append(PrintUtils.printBytes(aggregatedStats.minFileBytes)).append("/")
             :             
.append(PrintUtils.printBytes(aggregatedStats.getAvgFileBytes())).append("/")
             :             
.append(PrintUtils.printBytes(aggregatedStats.maxFileBytes));
             :
             :         // Modification time statistics (formatted as dates)
             :         if (aggregatedStats.minModificationTime != 
Long.MAX_VALUE &&
             :             aggregatedStats.maxModificationTime > 0) {
             :           statsLog.append(", Modification times (min/max): ")
             :               
.append(dateFormatter.format(java.time.Instant.ofEpochMilli(
             :                   aggregatedStats.minModificationTime)))
             :               .append("/")
             :               
.append(dateFormatter.format(java.time.Instant.ofEpochMilli(
             :                   aggregatedStats.maxModificationTime)));
             :         }
             :
             :         // Access time statistics (formatted as dates)
             :         // Note: Access time may be 0 or not updated if disabled 
in HDFS for performance
             :         if (aggregatedStats.minAccessTime != Long.MAX_VALUE &&
             :             aggregatedStats.maxAccessTime > 0) {
             :           statsLog.append(", Access times (min/max): ")
             :               
.append(dateFormatter.format(java.time.Instant.ofEpochMilli(
             :                   aggregatedStats.minAccessTime)))
             :               .append("/")
             :               
.append(dateFormatter.format(java.time.Instant.ofEpochMilli(
             :                   aggregatedStats.maxAccessTime)));
             :         }
             :
             :         // HDFS/Ozone specific stats
             :         if (aggregatedStats.getNumUniqueHosts() > 0) {
             :           statsLog.append(", Hosts: 
").append(aggregatedStats.getNumUniqueHosts());
             :         }
             :         if (aggregatedStats.getNumUniqueDisks() > 0) {
             :           statsLog.append(", Disks: 
").append(aggregatedStats.getNumUniqueDisks());
             :         }
             :       }
> This looks like something that could be moved to a function within FileMeta
Nice suggestion, will do it in the follow-up patch.



--
To view, visit http://gerrit.cloudera.org:8080/23906
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6f4592f173c047e5064058402f83be6d1f5c9a79
Gerrit-Change-Number: 23906
Gerrit-PatchSet: 2
Gerrit-Owner: Arnab Karmakar <[email protected]>
Gerrit-Reviewer: Arnab Karmakar <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Jason Fehr <[email protected]>
Gerrit-Reviewer: Michael Smith <[email protected]>
Gerrit-Reviewer: Pranav Lodha <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Tue, 27 Jan 2026 16:29:44 +0000
Gerrit-HasComments: Yes

Reply via email to