Hi,
I was wondering what is wrong with FileSystem.getContentSummary
in CommandUtils.calculateLocationSize, as described in the comment [1]:
// This method is mainly based on
// org.apache.hadoop.hive.ql.stats.StatsUtils.getFileSizeForTable(HiveConf, Table)
// in Hive 0.13 (except that we do not use fs.getContentSummary).
// TODO: Generalize statistics collection.
// TODO: Why fs.getContentSummary returns wrong size on Jenkins?
// Can we use fs.getContentSummary in future?
// Seems fs.getContentSummary returns wrong table size on Jenkins. So we use
// countFileSize to count the table size.
Then I found out that there seems to be no issue at all,
since DetermineTableStats uses fs.getContentSummary just fine [2].
Why does CommandUtils.calculateLocationSize *not* use
fs.getContentSummary, given that DetermineTableStats uses it successfully?
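
For reference, here is a minimal sketch of how the two approaches could be compared on a single path. It is not the Spark code: the object name SizeComparison and the helper contentSummarySize are made up for illustration, and countFileSize is only a simplified stand-in for the manual file walk the comment mentions. It assumes the Hadoop client libraries are on the classpath and uses only FileSystem.getFileStatus, FileSystem.listStatus and FileSystem.getContentSummary.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper object, for illustration only (not part of Spark).
object SizeComparison {

  // Manual recursive walk: sums the lengths of all files under `path`.
  // Simplified stand-in for a countFileSize-style calculation.
  def countFileSize(fs: FileSystem, path: Path): Long = {
    val status = fs.getFileStatus(path)
    if (status.isDirectory) {
      fs.listStatus(path).map(child => countFileSize(fs, child.getPath)).sum
    } else {
      status.getLen
    }
  }

  // Size as reported by the FileSystem implementation itself.
  def contentSummarySize(fs: FileSystem, path: Path): Long =
    fs.getContentSummary(path).getLength

  def main(args: Array[String]): Unit = {
    if (args.isEmpty) {
      println("usage: SizeComparison <path>")
    } else {
      val path = new Path(args(0))
      val fs = path.getFileSystem(new Configuration())
      val walked = countFileSize(fs, path)
      val summarized = contentSummarySize(fs, path)
      println(s"manual walk:       $walked bytes")
      println(s"getContentSummary: $summarized bytes")
      println(s"difference:        ${summarized - walked} bytes")
    }
  }
}
```

Running this against a table location in the environment where the sizes were reported to differ would show whether the two numbers actually disagree there.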
[1] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala#L66-L73
[2] https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala?utf8=%E2%9C%93#L126
Regards,
Jacek Laskowski
----
https://about.me/JacekLaskowski