Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22752#discussion_r226243409

    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
    @@ -449,7 +450,7 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
               listing.write(info.copy(lastProcessed = newLastScanTime, fileSize = entry.getLen()))
             }

    -        if (info.fileSize < entry.getLen()) {
    +        if (info.fileSize < entry.getLen() || checkAbsoluteLength(info, entry)) {
    --- End diff --

    ...there's no timetable for that getLength thing, but if HDFS already supports the API, I'm more motivated to implement it. It has benefits in cloud stores in general:

    1. It saves apps an up-front HEAD/getFileStatus() call to learn how long their data is; the GET should return it.
    2. For S3 Select, you get back the filtered data, so you don't know how much you will see until the GET is issued.
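To make the trade-off concrete, here is a toy Java sketch (local `java.nio` files rather than the real Hadoop `FileSystem` API, and `LengthProbe`/`probe` are hypothetical names) contrasting the two ways a client can learn a file's length: an up-front metadata call, as with today's HEAD/getFileStatus(), versus counting bytes only after the read completes, which is the only option in the S3 Select case where the server returns filtered data of a size unknown in advance:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class LengthProbe {

    // Returns {lengthFromMetadata, lengthObservedFromRead}.
    // The first models an up-front HEAD/getFileStatus() round trip;
    // the second models learning the size only once the GET has
    // delivered all its bytes.
    static long[] probe(Path path) throws IOException {
        long statedLen = Files.size(path);                  // extra metadata call
        long observedLen = Files.readAllBytes(path).length; // known only after the read
        return new long[] { statedLen, observedLen };
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("probe", ".txt");
        Files.write(tmp, "hello spark".getBytes("UTF-8"));
        long[] lens = probe(tmp);
        System.out.println(lens[0] + " " + lens[1]);
        Files.delete(tmp);
    }
}
```

For a plain object store GET the two numbers always agree, which is why having the GET itself report the length would make the metadata call redundant; for a filtered read like S3 Select, only the second number is meaningful.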