GitHub user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22752#discussion_r226243409
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
    @@ -449,7 +450,7 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
                   listing.write(info.copy(lastProcessed = newLastScanTime, fileSize = entry.getLen()))
                 }
     
    -            if (info.fileSize < entry.getLen()) {
    +            if (info.fileSize < entry.getLen() || checkAbsoluteLength(info, entry)) {
    --- End diff ---
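
    (For context, a minimal, hypothetical sketch of what the checkAbsoluteLength() probe in the diff could look like on HDFS; it is not the PR's actual implementation. The LogInfo stand-in, the extra FileSystem parameter, and the reliance on HdfsDataInputStream.getVisibleLength() are all assumptions for illustration.)

        import org.apache.hadoop.fs.{FileStatus, FileSystem}
        import org.apache.hadoop.hdfs.client.HdfsDataInputStream

        // Stand-in for the listing entry type in FsHistoryProvider; only the
        // recorded fileSize is needed here.
        case class LogInfo(fileSize: Long)

        // For a file still open for write, the length in its FileStatus can lag
        // behind the bytes actually readable. Opening the file and asking an
        // HDFS stream for its visible length gives a fresher answer.
        def checkAbsoluteLength(fs: FileSystem, info: LogInfo, entry: FileStatus): Boolean = {
          val in = fs.open(entry.getPath)
          try {
            in match {
              case hdfs: HdfsDataInputStream => hdfs.getVisibleLength > info.fileSize
              case _ => false  // non-HDFS stream: trust the FileStatus length
            }
          } finally {
            in.close()
          }
        }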
    
    ...there's no timetable for that getLength thing, but if HDFS already supports the API, I'm more motivated to implement it. It has benefits in cloud stores in general:
    1. It saves apps making an up-front HEAD/getFileStatus() call just to learn how long their data is; the GET should return it (see the sketch below).
    2. For S3 Select, you get back the filtered data, so you don't know how much you will see until the GET is issued.
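
    (To illustrate point 1, a short sketch of today's two-request pattern against an object store; the s3a path and bucket are hypothetical.)

        import org.apache.hadoop.conf.Configuration
        import org.apache.hadoop.fs.{FileSystem, Path}

        // Hypothetical bucket and key, for illustration only.
        val path = new Path("s3a://bucket/logs/app-1234")
        val fs = FileSystem.get(path.toUri, new Configuration())

        // Today: an up-front getFileStatus() (a HEAD against S3) just to learn
        // the length, then open() (the GET) for the bytes themselves. A
        // stream-level length API would let the GET's own Content-Length stand
        // in for the HEAD; for S3 Select the filtered length is only knowable
        // from the GET response anyway.
        val length = fs.getFileStatus(path).getLen
        val in = fs.open(path)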

