[ 
https://issues.apache.org/jira/browse/PARQUET-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17787369#comment-17787369
 ] 

ASF GitHub Bot commented on PARQUET-2374:
-----------------------------------------

parthchandra commented on PR #1187:
URL: https://github.com/apache/parquet-mr/pull/1187#issuecomment-1816986727

   @steveloughran I did look into leveraging Hadoop IO stats, but my first 
attempt did not work too well, and I thought a simpler initial implementation 
would be more useful. Once we move to Hadoop vectored IO, I'll take another stab 
at it.
   
   > What would be good if this stats was set up to
   > 
   > take maps of key-value rather than a fixed enum
   
   The fixed enum here is simply the Parquet file reader declaring the values 
it knows about. This implementation does not collect or aggregate anything; it 
simply records the times and counts and passes them on.
    
   > collect those min/mean/max as well as counts.
   
   The implementation of the Parquet metrics callback will do that. So if the 
execution engine is Spark, it can simply get the values and add them to its 
own metrics collection subsystem, which then computes the min/max/mean.
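   
   To make the split of responsibilities concrete, here is a minimal sketch of 
an engine-side callback that does the aggregation itself. The interface name 
`ReaderMetricsCallback`, its `setDuration` method, and the metric name used 
below are hypothetical stand-ins for illustration, not the actual API in this 
PR.

```java
import java.util.LongSummaryStatistics;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical callback shape (an assumption); the interface in the PR may differ.
interface ReaderMetricsCallback {
    void setDuration(String name, long nanos);
}

// Engine-side implementation: the reader only hands over raw values,
// and the engine decides how to aggregate them.
final class AggregatingCallback implements ReaderMetricsCallback {
    private final Map<String, LongSummaryStatistics> stats = new ConcurrentHashMap<>();

    @Override
    public void setDuration(String name, long nanos) {
        // LongSummaryStatistics is not thread-safe, so mutate it inside
        // compute(), which ConcurrentHashMap runs atomically per key.
        stats.compute(name, (key, existing) -> {
            LongSummaryStatistics s = (existing == null) ? new LongSummaryStatistics() : existing;
            s.accept(nanos);
            return s;
        });
    }

    // Returns count/min/max/average for one metric (empty stats if never reported).
    LongSummaryStatistics summary(String name) {
        return stats.getOrDefault(name, new LongSummaryStatistics());
    }
}
```

   With something like this, the reader would call `setDuration("readTime", 
elapsed)` once per measurement, and the engine would read `getMin()`, 
`getMax()` and `getAverage()` from `summary("readTime")` whenever it publishes 
its metrics.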
   
   > somehow provided a plugin point where we could add something to add any of 
the parquet reader/writer stats to the thread context -trying to collect stats 
from inside wrapped-many-times-over streams and iterators is way too complex. I 
know, i have a branch of parquet where I tried that...
   
   Hmm, that will take some work. I wanted to measure streaming decompression 
time (where the `decompress` call simply returns a stream which is decompressed 
as it is read), but found it required too many breaking changes to implement. 
But a standard system like `IOStatistics`, where such a stream is an 
`IOStatisticsSource`, would be perfect.
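   
   Roughly what I have in mind is a stream that accumulates its own read time 
and exposes it to whoever holds the stream. This is only a sketch: 
`TimedSource` and `TimedInputStream` are hypothetical names, and the interface 
is a simplified stand-in, not Hadoop's `IOStatisticsSource` API.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical stand-in for a statistics-source interface (not Hadoop's API).
interface TimedSource {
    long readNanos();
    long bytesRead();
}

// Wraps any stream (for example a decompressing one) and records the time
// spent inside read() calls. Not thread-safe; one reader thread is assumed.
final class TimedInputStream extends FilterInputStream implements TimedSource {
    private long nanos;
    private long bytes;

    TimedInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        long start = System.nanoTime();
        try {
            int b = in.read();
            if (b >= 0) {
                bytes++;
            }
            return b;
        } finally {
            nanos += System.nanoTime() - start;
        }
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        long start = System.nanoTime();
        try {
            int n = in.read(buf, off, len);
            if (n > 0) {
                bytes += n;
            }
            return n;
        } finally {
            nanos += System.nanoTime() - start;
        }
    }

    @Override
    public long readNanos() {
        return nanos;
    }

    @Override
    public long bytesRead() {
        return bytes;
    }
}
```

   The caller reads from the wrapper as usual and queries `readNanos()` after 
the stream is drained, so the time spent in lazy decompression is captured 
without changing the `decompress` signature.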
   




> Add metrics support for parquet file reader
> -------------------------------------------
>
>                 Key: PARQUET-2374
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2374
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.13.1
>            Reporter: Parth Chandra
>            Priority: Major
>
> ParquetFileReader is used by many engines - Hadoop, Spark among them. These 
> engines report various metrics to measure performance in different 
> environments and it is usually useful to be able to get low level metrics out 
> of the file reader and writers.
> It would be very useful to allow a simple interface to report the metrics. 
> Callers can then implement the interface to record the metrics in any 
> subsystem they choose.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
