[ https://issues.apache.org/jira/browse/SPARK-24925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16556818#comment-16556818 ]

yucai commented on SPARK-24925:
-------------------------------

I think there are two issues here, both in FileScanRDD:

1. For ColumnarBatch reads, bytesRead is only updated once every 4096 * 1000 
rows, which leaves the metric out of date (see the sketch below).
2. When advancing to the next file, FileScanRDD always adds the whole file 
length to bytesRead, which is inaccurate: with pushdown enabled we actually 
read much less data.

For problem 1, I tried updating ColumnarBatch's bytesRead for each batch in 
https://github.com/apache/spark/pull/21791.

For problem 2, the comment on updateBytesReadWithFileSize says, "If we can't 
get the bytes read from the FS stats, fall back to the file size". Could we 
add the file size only when that fallback is actually needed?
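
Roughly what I have in mind for problem 2, as a sketch with hypothetical 
signatures rather than an actual patch:

{code:scala}
// Sketch of the proposed guard. `bytesReadCallback` is assumed to be None
// when the filesystem does not expose read statistics (hypothetical names).
def updateBytesReadWithFileSize(fileLength: Long,
                                bytesReadCallback: Option[() => Long],
                                incBytesRead: Long => Unit): Unit = {
  // Fall back to the raw file length only when FS stats are unavailable;
  // otherwise the callback already reflects how much was really read
  // (much less than the file length when pushdown kicks in).
  if (bytesReadCallback.isEmpty) {
    incBytesRead(fileLength)
  }
}
{code}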

> input bytesRead metrics fluctuate from time to time
> ---------------------------------------------------
>
>                 Key: SPARK-24925
>                 URL: https://issues.apache.org/jira/browse/SPARK-24925
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: yucai
>            Priority: Major
>         Attachments: bytesRead.gif
>
>
> The input bytesRead metric fluctuates from time to time; it is worse when 
> pushdown is enabled.
> Query
> {code:sql}
> CREATE TABLE dev AS
> SELECT
> ...
> FROM lstg_item cold, lstg_item_vrtn v
> WHERE cold.auct_end_dt = CAST(SUBSTR('2018-03-18 00:00:00',1,10) AS DATE)
> AND v.auct_end_dt = CAST(SUBSTR('2018-03-18 00:00:00',1,10) AS DATE)
> ...
> {code}
> Issue
> See the attached bytesRead.gif: input bytesRead fluctuates, showing 48GB, 
> 52GB, 51GB, 50GB, 54GB, 53GB ... 


