[GitHub] [iceberg] kbendick commented on pull request #4395: Spark: Add custom metric for number of files read by a SparkScan

GitBox Mon, 28 Mar 2022 17:43:27 -0700


kbendick commented on pull request #4395:
URL: https://github.com/apache/iceberg/pull/4395#issuecomment-1081286111



   > > Note that in order for the Spark UI to show the value of the custom 
metric, the custom metric class must be available in the classpath of the Spark 
History Server. The simplest way to do this is to put the Iceberg Spark runtime 
JAR in the SHS classpath. Spark's `SQLAppStatusListener` loads the custom 
metric class by name, instantiates it, and calls its `aggregateTaskMetrics` 
method to get the value of the metric. If it is not able to load the custom 
metric class, it shows "N/A".
   > 
   > Yeah, I was very frustrated with this when I was messing with it. It makes 
things hard if your history server is not running the same Spark build as your 
apps :(
   
   This History Server classpath requirement would be a really good thing to 
call out in the docs somewhere if we merge this.
   
   >> In the current code we would get incorrect results for the number of data 
files if our split size is smaller than our file size. Imagine we have a 512 MB 
file and a 128 MB split size (and row group size): this will generate 4 separate 
tasks for the same data file, counting the file 4 times.
   
   > That is a good point. Even so, such a count is still a useful metric, in 
that it is a measure of the amount of work to do, although "number of files" is 
not the correct name for it. Of course, the actual number of files is a metric 
we're interested in as well.
   
   I also think there's value in this. Would it just be better to call it 
"number of splits read" or something? Provided we can give it a name that 
clearly delineates what is being measured, I think this has value.
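   
   To make the difference concrete, here is a rough sketch of the arithmetic 
from the example above (one 512 MB file, 128 MB splits). This is not Iceberg's 
actual split planning; `plan_splits` and the file path are hypothetical names 
made up for illustration. It only shows why a per-task counter reports splits 
rather than distinct files.

```python
MB = 1024 * 1024


def plan_splits(files, split_size):
    """Return one (path, offset, length) task per split of each file.

    Hypothetical helper: a simplified stand-in for split planning,
    not Iceberg's real implementation.
    """
    tasks = []
    for path, size in files:
        offset = 0
        while offset < size:
            length = min(split_size, size - offset)
            tasks.append((path, offset, length))
            offset += length
    return tasks


# One 512 MB data file (hypothetical path) and a 128 MB split size.
files = [("data/file-a.parquet", 512 * MB)]
tasks = plan_splits(files, 128 * MB)

# What a per-task counter would report: one increment per split.
splits_read = len(tasks)

# What "number of files" should mean: distinct data files touched.
distinct_files = len({path for path, _, _ in tasks})

print(splits_read, distinct_files)
```

   With these inputs the per-task count is 4 while the distinct file count is 
1, which is why a name like "number of splits read" describes the per-task 
metric more accurately.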
   
   Though ideally metrics could be exported from the driver as well (which 
would need to be added to Spark).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


