Github user bersprockets commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21950#discussion_r218608537
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala ---
    @@ -1051,11 +1052,27 @@ private[hive] object HiveClientImpl {
         // When table is external, `totalSize` is always zero, which will influence join strategy.
         // So when `totalSize` is zero, use `rawDataSize` instead. When `rawDataSize` is also zero,
         // return None.
    +    // If a table has a deserialization factor, the table owner expects the in-memory
    +    // representation of the table to be larger than the table's totalSize value. In that case,
    +    // multiply totalSize by the deserialization factor and use that number instead.
    +    // If the user has set spark.sql.statistics.ignoreRawDataSize to true (because of HIVE-20079,
    +    // for example), don't use rawDataSize.
         // In Hive, when statistics gathering is disabled, `rawDataSize` and `numRows` is always
         // zero after INSERT command. So they are used here only if they are larger than zero.
    -    if (totalSize.isDefined && totalSize.get > 0L) {
    -      Some(CatalogStatistics(sizeInBytes = totalSize.get, rowCount = rowCount.filter(_ > 0)))
    -    } else if (rawDataSize.isDefined && rawDataSize.get > 0) {
    +    val factor = try {
    +        properties.get("deserFactor").getOrElse("1.0").toDouble
    --- End diff --
    
    I need to eliminate this duplication: a similar lookup and calculation is done in PruneFileSourcePartitionsSuite. I should also check whether a Long is acceptable as the intermediate value for file sizes (probably, since a Long can hold up to 2^63 - 1 bytes, roughly 8 exabytes).
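
    A possible shape for that shared helper, just as a sketch: the object and method names below are made up, and the BigDecimal-based scaling is an assumption rather than what this PR does. It centralizes the `deserFactor` lookup so the HiveClientImpl stats conversion and PruneFileSourcePartitionsSuite could share it, and it keeps the intermediate value in BigDecimal/BigInt, which sidesteps the Long-range question:

    ```scala
    // Hypothetical shared helper; not part of this PR.
    private[spark] object DeserFactorUtils {
      // Table property key used in the diff above.
      val DESER_FACTOR_KEY = "deserFactor"

      /**
       * Scales a raw size by the table's deserialization factor, if one is set.
       * Falls back to the unscaled size when the property is absent or unparseable.
       */
      def applyDeserFactor(sizeInBytes: BigInt, properties: Map[String, String]): BigInt = {
        val factor =
          try {
            properties.get(DESER_FACTOR_KEY).map(_.toDouble).getOrElse(1.0)
          } catch {
            case _: NumberFormatException => 1.0
          }
        // BigDecimal keeps the intermediate exact even for very large sizes.
        (BigDecimal(sizeInBytes) * BigDecimal(factor)).toBigInt
      }
    }
    ```

    For reference on the Long question: Long.MaxValue is 2^63 - 1, roughly 9.2 x 10^18 bytes (about 8 EiB), so a Long intermediate would almost certainly be large enough for file sizes; the BigDecimal route above just avoids the question entirely.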

