[ https://issues.apache.org/jira/browse/SPARK-22271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16204281#comment-16204281 ]

Huaxin Gao edited comment on SPARK-22271 at 10/13/17 10:34 PM:
---------------------------------------------------------------

I looked at the code. In Average.scala, it has:
{code}
  override lazy val evaluateExpression = child.dataType match {
    case DecimalType.Fixed(p, s) =>
      // increase the precision and scale to prevent precision loss
      val dt = DecimalType.bounded(p + 14, s + 4)
      Cast(Cast(sum, dt) / Cast(count, dt), resultType)
    ......
  }
{code}
When using Shafique's test data, dt has precision 38 and scale 36, and count is 
299. Cast(count, dt) sets the scale to 36 and the precision to 39, which 
causes an overflow. 
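For illustration, here is a minimal sketch of that arithmetic (it assumes only spark-catalyst on the classpath; DecimalType(38, 36) is written out by hand to mirror what DecimalType.bounded(52, 36) produces):
{code}
import org.apache.spark.sql.types.{Decimal, DecimalType}

// Input column is decimal(38, 32), as in the attached parquet file.
val (p, s) = (38, 32)

// Average widens to (p + 14, s + 4) and DecimalType.bounded caps both at 38,
// so dt ends up as decimal(38, 36).
val dt = DecimalType(math.min(p + 14, 38), math.min(s + 4, 38))

// count = 299 needs 3 integer digits; with 36 fraction digits that would need
// precision 39 > 38, so the cast cannot represent the value and yields null.
val count = Decimal(299L)
println(count.changePrecision(dt.precision, dt.scale))  // false -> overflow
{code}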
I have a fix and will submit a PR soon. 


> Describe results in "null" for the value of "mean" of a numeric variable
> ------------------------------------------------------------------------
>
>                 Key: SPARK-22271
>                 URL: https://issues.apache.org/jira/browse/SPARK-22271
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>         Environment: 
>            Reporter: Shafique Jamal
>            Priority: Minor
>         Attachments: decimalNumbers.zip
>
>
> Please excuse me if this issue was addressed already - I was unable to find 
> it.
> Calling .describe().show() on my dataframe results in a value of null for the 
> row "mean":
> {noformat}
> val foo = spark.read.parquet("decimalNumbers.parquet")
> foo: org.apache.spark.sql.DataFrame = [numericvariable: decimal(38,32)]
> foo.select(col("numericvariable")).describe().show()
> +-------+--------------------+
> |summary|     numericvariable|
> +-------+--------------------+
> |  count|                 299|
> |   mean|                null|
> | stddev|  0.2376438793946738|
> |    min|0.037815489727642...|
> |    max|2.138189366554511...|
> +-------+--------------------+
> {noformat}
> But all of the values in this column seem OK (I can attach a parquet file). 
> When I round the column, however, all is fine:
> {noformat}
> foo.select(bround(col("numericvariable"), 31)).describe().show()
> +-------+---------------------------+
> |summary|bround(numericvariable, 31)|
> +-------+---------------------------+
> |  count|                        299|
> |   mean|       0.139522503183236...|
> | stddev|         0.2376438793946738|
> |    min|       0.037815489727642...|
> |    max|       2.138189366554511...|
> +-------+---------------------------+
> {noformat}
> Rounding to 32 decimal places also gives null, though.
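A short sketch of the same precision arithmetic (the helper countFits below is hypothetical, purely for illustration) shows why rounding to 31 decimal places works while 32 does not:
{code}
// Hypothetical helper: does count = 299 still fit once Average widens the
// input scale by 4 and bounds the result type at precision 38?
def countFits(inputScale: Int): Boolean = {
  val widenedScale = math.min(inputScale + 4, 38) // scale of DecimalType.bounded(p + 14, s + 4)
  3 + widenedScale <= 38                          // 299 needs 3 integer digits
}

println(countFits(31)) // true  -> bround(col, 31): mean is computed
println(countFits(32)) // false -> bround(col, 32): Cast(count, dt) overflows, mean is null
{code}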


