Sahil Takiar created IMPALA-9819:
------------------------------------

             Summary: Separate data cache and HDFS scan node runtime profile metrics
                 Key: IMPALA-9819
                 URL: https://issues.apache.org/jira/browse/IMPALA-9819
             Project: IMPALA
          Issue Type: Improvement
            Reporter: Sahil Takiar
            Assignee: Joe McDonnell


When a query reads data from both a remote storage system (e.g. S3) and the 
data cache, the HDFS_SCAN_NODE runtime profiles are hard to reason about.

For example, consider the following runtime profile snippet:
{code:java}
HDFS_SCAN_NODE (id=0):(Total: 59s374ms, non-child: 0.000ns, % non-child: 0.00%)
         - AverageHdfsReadThreadConcurrency: 0.62 
         - AverageScannerThreadConcurrency: 0.91 
         - BytesRead: 587.97 MB (616533483)
         - BytesReadDataNodeCache: 0
         - BytesReadLocal: 0
         - BytesReadRemoteUnexpected: 0
         - BytesReadShortCircuit: 0
         - CachedFileHandlesHitCount: 323 (323)
         - CachedFileHandlesMissCount: 94 (94)
         - CollectionItemsRead: 0 (0)
         - DataCacheHitBytes: 212.00 MB (222294996)
         - DataCacheHitCount: 107 (107)
         - DataCacheMissBytes: 375.98 MB (394238486)
         - DataCacheMissCount: 310 (310)
         - DataCachePartialHitCount: 0 (0)
         - DecompressionTime: 2s428ms
         - MaterializeTupleTime: 19s444ms
         - MaxCompressedTextFileLength: 0
         - NumColumns: 3 (3)
         - NumDictFilteredRowGroups: 0 (0)
         - NumDisksAccessed: 1 (1)
         - NumPages: 53.30K (53300)
         - NumRowGroups: 83 (83)
         - NumRowGroupsWithPageIndex: 83 (83)
         - NumScannerThreadMemUnavailable: 0 (0)
         - NumScannerThreadReservationsDenied: 0 (0)
         - NumScannerThreadsStarted: 1 (1)
         - NumScannersWithNoReads: 0 (0)
         - NumStatsFilteredPages: 0 (0)
         - NumStatsFilteredRowGroups: 0 (0)
         - PeakMemoryUsage: 16.00 MB (16781312)
         - PeakScannerThreadConcurrency: 1 (1)
         - PerReadThreadRawHdfsThroughput: 15.11 MB/sec
         - RemoteScanRanges: 0 (0)
         - RowBatchBytesEnqueued: 670.68 MB (703260541)
         - RowBatchQueueGetWaitTime: 59s368ms
         - RowBatchQueuePeakMemoryUsage: 4.17 MB (4368285)
         - RowBatchQueuePutWaitTime: 0.000ns
         - RowBatchesEnqueued: 915 (915)
         - RowsRead: 413.47M (413466507)
         - RowsReturned: 722.27K (722275)
         - RowsReturnedRate: 12.17 K/sec
         - ScanRangesComplete: 83 (83)
         - ScannerIoWaitTime: 33s454ms
         - ScannerThreadWorklessLoops: 0 (0)
         - ScannerThreadsInvoluntaryContextSwitches: 1.94K (1940)
         - ScannerThreadsTotalWallClockTime: 1m
           - ScannerThreadsSysTime: 1s181ms
           - ScannerThreadsUserTime: 20s581ms
         - ScannerThreadsVoluntaryContextSwitches: 770 (770)
         - TotalRawHdfsOpenFileTime: 3s396ms
         - TotalRawHdfsReadTime: 38s940ms
         - TotalReadThroughput: 8.86 MB/sec
{code}
The query scanned part of the data from S3 and part of the data from the data 
cache.

The confusing part is that metrics such as PerReadThreadRawHdfsThroughput are 
measured across both S3 and data cache reads, so there is no straightforward 
way to determine the throughput for *just* S3 reads. Users might want this 
value to determine whether S3 was particularly slow for their query.
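
To illustrate the ambiguity, here is a minimal Python sketch that uses the counters from the profile above. It assumes that the blended throughput is roughly BytesRead divided by TotalRawHdfsReadTime (which comes out close to the reported 15.11 MB/sec), that "MB" means 2^20 bytes, and that DataCacheMissBytes were all read from S3. The `cache_time_fraction` parameter is hypothetical: the profile does not report how the read time splits between S3 and the data cache, which is exactly the missing information.

```python
MIB = 1024 * 1024

# Values copied from the profile snippet above.
bytes_read = 616533483        # BytesRead
cache_miss_bytes = 394238486  # DataCacheMissBytes (assumed to come from S3)
total_read_time_s = 38.940    # TotalRawHdfsReadTime


def blended_throughput_mib_s():
    """Throughput over all reads, regardless of where the bytes came from."""
    return bytes_read / total_read_time_s / MIB


def s3_throughput_mib_s(cache_time_fraction):
    """S3-only throughput under a guessed split of the read time.

    cache_time_fraction is a hypothetical input: the share of
    TotalRawHdfsReadTime spent reading from the data cache.
    """
    s3_time_s = total_read_time_s * (1.0 - cache_time_fraction)
    return cache_miss_bytes / s3_time_s / MIB


print(f"blended: {blended_throughput_mib_s():.2f} MiB/s")
for fraction in (0.0, 0.1, 0.3):
    print(f"cache time share {fraction:.0%}: "
          f"S3 ~ {s3_throughput_mib_s(fraction):.2f} MiB/s")
```

Each guess for the cache's share of the read time gives a different S3 throughput, so the single blended number cannot tell a user whether S3 itself was slow.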

It would be nice if the scan node metrics more clearly differentiated between 
reads from S3 and reads from the data cache. The aggregate metrics (*Total* 
metrics) are still useful, but it would help to also have fine-grained metrics 
that are specific to a data storage system (e.g. either the data cache or S3).
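
One possible shape for such metrics, sketched in Python purely for illustration: keep bytes and read time per source, and derive the aggregate by summing. The class names, the `record` method, and the source labels `"s3"` and `"data_cache"` are all hypothetical and do not correspond to existing Impala counters or code.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ReadCounters:
    """Bytes read and time spent reading for one data source (hypothetical)."""
    bytes_read: int = 0
    read_time_ns: int = 0

    def add(self, num_bytes: int, elapsed_ns: int) -> None:
        self.bytes_read += num_bytes
        self.read_time_ns += elapsed_ns

    def throughput_bytes_per_sec(self) -> float:
        if self.read_time_ns == 0:
            return 0.0
        return self.bytes_read / (self.read_time_ns / 1e9)


@dataclass
class ScanReadMetrics:
    """Per-source counters; the aggregate is derived rather than stored."""
    by_source: Dict[str, ReadCounters] = field(default_factory=dict)

    def record(self, source: str, num_bytes: int, elapsed_ns: int) -> None:
        self.by_source.setdefault(source, ReadCounters()).add(num_bytes, elapsed_ns)

    def total(self) -> ReadCounters:
        total = ReadCounters()
        for counters in self.by_source.values():
            total.add(counters.bytes_read, counters.read_time_ns)
        return total


# Made-up numbers, only to show how the pieces fit together.
metrics = ScanReadMetrics()
metrics.record("data_cache", 200, 1_000_000_000)
metrics.record("s3", 100, 4_000_000_000)
for source, counters in metrics.by_source.items():
    print(source, counters.throughput_bytes_per_sec())
print("total", metrics.total().throughput_bytes_per_sec())
```

With this layout the existing aggregate view is still available through `total()`, while the per-source entries answer the question of how fast S3 reads were on their own.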


