Github user juliuszsompolski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20624#discussion_r170086812
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala ---
    @@ -387,6 +390,101 @@ case class CatalogStatistics(
       }
     }
     
    +/**
    + * This class of statistics for a column is used in [[CatalogTable]] to interact with metastore.
    + */
    +case class CatalogColumnStat(
    +  distinctCount: Option[BigInt] = None,
    +  min: Option[String] = None,
    +  max: Option[String] = None,
    +  nullCount: Option[BigInt] = None,
    +  avgLen: Option[Long] = None,
    +  maxLen: Option[Long] = None,
    +  histogram: Option[Histogram] = None) {
    +
    +  /**
    +   * Returns a map from string to string that can be used to serialize the column stats.
    +   * The key is the name of the column and name of the field (e.g. "colName.distinctCount"),
    +   * and the value is the string representation for the value.
    +   * min/max values are stored as Strings. They can be deserialized using
    +   * [[ColumnStat.fromExternalString]].
    +   *
    +   * As part of the protocol, the returned map always contains a key called "version".
    +   * In the case min/max values are null (None), they won't appear in the map.
    +   */
    +  def toMap(colName: String): Map[String, String] = {
    +    val map = new scala.collection.mutable.HashMap[String, String]
    +    map.put(s"${colName}.${CatalogColumnStat.KEY_VERSION}", "1")
    +    distinctCount.foreach { v =>
    +      map.put(s"${colName}.${CatalogColumnStat.KEY_DISTINCT_COUNT}", 
v.toString)
    +    }
    +    nullCount.foreach { v =>
    +      map.put(s"${colName}.${CatalogColumnStat.KEY_NULL_COUNT}", 
v.toString)
    +    }
    +    avgLen.foreach { v => map.put(s"${colName}.${CatalogColumnStat.KEY_AVG_LEN}", v.toString) }
    +    maxLen.foreach { v => map.put(s"${colName}.${CatalogColumnStat.KEY_MAX_LEN}", v.toString) }
    +    min.foreach { v => map.put(s"${colName}.${CatalogColumnStat.KEY_MIN_VALUE}", v) }
    +    max.foreach { v => map.put(s"${colName}.${CatalogColumnStat.KEY_MAX_VALUE}", v) }
    +    histogram.foreach { h =>
    +      map.put(s"${colName}.${CatalogColumnStat.KEY_HISTOGRAM}", 
HistogramSerializer.serialize(h))
    +    }
    +    map.toMap
    +  }
    +
    +  /** Convert [[CatalogColumnStat]] to [[ColumnStat]]. */
    +  def toPlanStat(
    +      colName: String,
    +      dataType: DataType): ColumnStat =
    +    ColumnStat(
    +      distinctCount = distinctCount,
    +      min = min.map(ColumnStat.fromExternalString(_, colName, dataType)),
    +      max = max.map(ColumnStat.fromExternalString(_, colName, dataType)),
    +      nullCount = nullCount,
    +      avgLen = avgLen,
    +      maxLen = maxLen,
    +      histogram = histogram)
    +}
    +
    +object CatalogColumnStat extends Logging {
    +
    +  // List of string keys used to serialize CatalogColumnStat
    +  val KEY_VERSION = "version"
    +  private val KEY_DISTINCT_COUNT = "distinctCount"
    +  private val KEY_MIN_VALUE = "min"
    +  private val KEY_MAX_VALUE = "max"
    +  private val KEY_NULL_COUNT = "nullCount"
    +  private val KEY_AVG_LEN = "avgLen"
    +  private val KEY_MAX_LEN = "maxLen"
    +  private val KEY_HISTOGRAM = "histogram"
    +
    +  /**
    +   * Creates a [[CatalogColumnStat]] object from the given map.
    +   * This is used to deserialize column stats from some external storage.
    +   * The serialization side is defined in [[CatalogColumnStat.toMap]].
    +   */
    +  def fromMap(
    +    table: String,
    +    colName: String,
    +    map: Map[String, String]): Option[CatalogColumnStat] = {
    +
    +    try {
    +      Some(CatalogColumnStat(
    +        distinctCount = map.get(s"${colName}.${KEY_DISTINCT_COUNT}").map(v => BigInt(v.toLong)),
    --- End diff --
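    
    For reference, the serialization described in the `toMap` doc above yields entries shaped roughly like this for a single column (illustrative, made-up values; fields that are None, e.g. min/max/histogram here, simply get no entry):
    
        // Illustrative values only: roughly what toMap("col1") returns when
        // min/max/histogram are None (those keys are simply absent).
        val col1Stats: Map[String, String] = Map(
          "col1.version"       -> "1",
          "col1.distinctCount" -> "100",
          "col1.nullCount"     -> "0",
          "col1.avgLen"        -> "8",
          "col1.maxLen"        -> "16")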
    
    Neither the keys nor the format of the stats in the metastore changed, so after this patch everything remains backwards compatible with stats created before it.
    
    What changed here is that the `map` passed in used to contain the stats for just one column, with the columnName prefix stripped; now I'm passing a `map` that holds the statistics for all columns, with keys prefixed by the column name.
    
    It reduces complexity in `statsFromProperties`, see https://github.com/apache/spark/pull/20624/files#diff-159191585e10542f013cb3a714f26075R1057
    It used to create a filtered map for every column, stripping the prefix together with the column name.
    Now it just passes the map of all columns' stat properties, and each column picks up what it needs, as in the sketch below.
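    
    To make that concrete, here is a minimal self-contained sketch (not the actual Spark classes; the object and column names are made up) of the prefixed key layout and a per-column lookup:
    
        // Minimal sketch with hypothetical names: shows the all-columns map with
        // keys prefixed by column name, and how one column picks out its own stats.
        object ColumnStatKeySketch {
          def main(args: Array[String]): Unit = {
            // One flat map holding the stats of *all* columns.
            val allColStats: Map[String, String] = Map(
              "id.version"       -> "1",
              "id.distinctCount" -> "100",
              "id.nullCount"     -> "0",
              "name.version"     -> "1",
              "name.avgLen"      -> "8",
              "name.maxLen"      -> "32")
    
            // A column reads only the keys carrying its own prefix; no per-column
            // filtered/stripped copy of the map is needed.
            def statFor(colName: String, key: String): Option[String] =
              allColStats.get(s"$colName.$key")
    
            println(statFor("id", "distinctCount"))   // Some(100)
            println(statFor("name", "maxLen"))        // Some(32)
            println(statFor("name", "distinctCount")) // None: that stat is simply absent
          }
        }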
    
    I'll add a bit of doc / comments about that.

