[ 
https://issues.apache.org/jira/browse/SPARK-18505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18505.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 2.1.0

> Simplify AnalyzeColumnCommand
> -----------------------------
>
>                 Key: SPARK-18505
>                 URL: https://issues.apache.org/jira/browse/SPARK-18505
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Reynold Xin
>            Assignee: Reynold Xin
>             Fix For: 2.1.0
>
>
> I'm spending more time at the design & code level for the cost-based 
> optimizer now, and have found a number of issues related to maintainability 
> and compatibility that I would like to address.
> This is a small pull request to clean up AnalyzeColumnCommand:
> 1. Removed warning on duplicated columns. Warnings in log messages are 
> useless since most users that run SQL don't see them.
> 2. Removed the nested updateStats function, by just inlining the function.
> 3. Renamed a few functions to better reflect what they do.
> 4. Removed the factory apply method for ColumnStatStruct. It is a bad pattern 
> to use an apply method that returns an instance of a class that is not of 
> the same type (ColumnStatStruct.apply used to return CreateNamedStruct).
> 5. Renamed ColumnStatStruct to just AnalyzeColumnCommand.
> 6. Added more documentation explaining some of the non-obvious return types 
> and code blocks.
> In follow-up pull requests, I'd like to address the following:
> 1. Get rid of the Map[String, ColumnStat] map, since internally we should be 
> using Attribute to reference columns, rather than strings.
> 2. Decouple the fields exposed by ColumnStat and internals of Spark SQL's 
> execution path. Currently the two are coupled because ColumnStat takes in an 
> InternalRow.
> 3. Correctness: Remove code path that stores statistics in the catalog using 
> the base64 encoding of the UnsafeRow format, which is not stable across Spark 
> versions.
> 4. Clearly document the data representation stored in the catalog for 
> statistics.
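
Item 4 of the cleanup list above objects to an apply method that returns a value of a different type. The sketch below illustrates why that pattern is confusing, using hypothetical stand-in types (NamedStruct, Stats, StatsExpressions, statsStructFor) that are assumptions for illustration only, not Spark's actual classes or the code from the pull request:

```scala
// Hypothetical stand-in for an expression node; NOT Spark's CreateNamedStruct.
final case class NamedStruct(fields: Seq[(String, String)])

// Discouraged shape: `Stats("age")` reads like constructing a Stats,
// but the call actually yields a NamedStruct.
object Stats {
  def apply(column: String): NamedStruct =
    NamedStruct(Seq(
      "ndv"      -> s"approx_count_distinct($column)",
      "numNulls" -> s"count_nulls($column)"))
}

// Clearer shape: a descriptively named method whose name and
// return type both say what the caller gets back.
object StatsExpressions {
  def statsStructFor(column: String): NamedStruct =
    NamedStruct(Seq(
      "ndv"      -> s"approx_count_distinct($column)",
      "numNulls" -> s"count_nulls($column)"))
}

object Demo {
  def main(args: Array[String]): Unit = {
    val surprising: NamedStruct = Stats("age")                       // not a Stats
    val explicit: NamedStruct   = StatsExpressions.statsStructFor("age")
    assert(surprising == explicit)
    println(explicit.fields.map(_._1).mkString(","))
  }
}
```

Both calls build the same value; the difference is only in what a reader infers at the call site, which is the maintainability concern the description raises.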


