[ https://issues.apache.org/jira/browse/SPARK-24401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jorge Machado updated SPARK-24401:
----------------------------------
    Description:
Hi,

I think I found a really ugly bug in Spark when performing aggregations with Decimals.

To reproduce:

{code:java}
val fact_df = spark.read.parquet("attached file")
val first_agg = fact_df.groupBy("id1", "id2", "start_date").agg(mean("projection_factor").alias("projection_factor"))
first_agg.show
val second_agg = first_agg.groupBy("id1","id2").agg(max("projection_factor").alias("maxf"), min("projection_factor").alias("minf"))
second_agg.show
{code}

The first aggregation works fine; the second aggregation seems to be summing instead of taking the max value. I tried with Spark 2.2.0 and 2.3.0, same problem.

The dataset has circa 800 rows and projection_factor has values from 0 up to 100. The result should not be bigger than 5, but we get 265820543091454.... back as the result.

The following code is not 100% the same, but I think it shows there really is a bug:

{code:java}
BigDecimal[] objects = new BigDecimal[]{
        new BigDecimal(3.5714285714D),
        new BigDecimal(3.5714285714D),
        new BigDecimal(3.5714285714D),
        new BigDecimal(3.5714285714D)};
Row dataRow = new GenericRow(objects);
Row dataRow2 = new GenericRow(objects);
StructType structType = new StructType()
        .add("id1", DataTypes.createDecimalType(38,10), true)
        .add("id2", DataTypes.createDecimalType(38,10), true)
        .add("id3", DataTypes.createDecimalType(38,10), true)
        .add("id4", DataTypes.createDecimalType(38,10), true);
final Dataset<Row> dataFrame = sparkSession.createDataFrame(Arrays.asList(dataRow,dataRow2), structType);
System.out.println(dataFrame.schema());
dataFrame.show();
final Dataset<Row> df1 = dataFrame.groupBy("id1","id2")
        .agg( mean("id3").alias("projection_factor"));
df1.show();
final Dataset<Row> df2 = df1
        .groupBy("id1")
        .agg(max("projection_factor"));
df2.show();
{code}

The df2 should have:

{code:java}
+------------+----------------------+
|         id1|max(projection_factor)|
+------------+----------------------+
|3.5714285714|          3.5714285714|
+------------+----------------------+
{code}

Instead it returns:

{code:java}
+------------+----------------------+
|         id1|max(projection_factor)|
+------------+----------------------+
|3.5714285714|      0.00035714285714|
+------------+----------------------+
{code}

  was:
Hi,

I think I found a really ugly bug in Spark when performing aggregations with Decimals.

To reproduce:

{code:java}
val fact_df = spark.read.parquet("attached file")
val first_agg = fact_df.groupBy("id1", "id2", "start_date").agg(mean("projection_factor").alias("projection_factor"))
first_agg.show
val second_agg = first_agg.groupBy("id1","id2").agg(max("projection_factor").alias("maxf"), min("projection_factor").alias("minf"))
second_agg.show
{code}

The first aggregation works fine; the second aggregation seems to be summing instead of taking the max value. I tried with Spark 2.2.0 and 2.3.0, same problem.

The dataset has circa 800 rows and projection_factor has values from 0 up to 100. The result should not be bigger than 5, but we get 265820543091454.... back as the result.

The following code is not 100% the same, but I think it shows there really is a bug:

{code:java}
BigDecimal[] objects = new BigDecimal[]{
        new BigDecimal(3.5714285714D),
        new BigDecimal(3.5714285714D),
        new BigDecimal(3.5714285714D),
        new BigDecimal(3.5714285714D)};
Row dataRow = new GenericRow(objects);
Row dataRow2 = new GenericRow(objects);
StructType structType = new StructType()
        .add("id1", DataTypes.createDecimalType(38,10), true)
        .add("id2", DataTypes.createDecimalType(38,10), true)
        .add("id3", DataTypes.createDecimalType(38,10), true)
        .add("id4", DataTypes.createDecimalType(38,10), true);
final Dataset<Row> dataFrame = sparkSession.createDataFrame(Arrays.asList(dataRow,dataRow2), structType);
System.out.println(dataFrame.schema());
dataFrame.show();
final Dataset<Row> df1 = dataFrame.groupBy("id1","id2")
        .agg( mean("id3").alias("projection_factor"));
df1.show();
final Dataset<Row> df2 = df1
        .groupBy("id1")
        .agg(max("projection_factor"));
df2.show();
{code}


> Aggreate on Decimal Types does not work
> ---------------------------------------
>
>                 Key: SPARK-24401
>                 URL: https://issues.apache.org/jira/browse/SPARK-24401
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0, 2.3.0
>            Reporter: Jorge Machado
>            Priority: Major
>         Attachments: testDF.parquet
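
The wrong value in the report (0.00035714285714 where 3.5714285714 was expected) is exactly what results when the unscaled digits of a scale-10 decimal are read back at scale 14. The sketch below reproduces only that arithmetic with plain `java.math.BigDecimal`, without Spark. It is an observation about the numbers, not a confirmed root cause: the assumption that the mean of a `Decimal(38,10)` column is typed with scale 14 is mine and is not stated in the report, and the class and method names are hypothetical.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public class ScaleMismatchSketch {

    /**
     * Reads the unscaled digits of {@code value} back under a different scale.
     * This mimics a value being stored with one scale and decoded with another.
     */
    static BigDecimal reinterpret(BigDecimal value, int wrongScale) {
        BigInteger unscaled = value.unscaledValue();
        return new BigDecimal(unscaled, wrongScale);
    }

    public static void main(String[] args) {
        // Built from a String so the value has exactly scale 10,
        // matching the Decimal(38,10) columns in the report.
        BigDecimal expected = new BigDecimal("3.5714285714");

        // ASSUMPTION (not from the report): the aggregated column is
        // decoded with scale 14 instead of scale 10.
        BigDecimal observed = reinterpret(expected, 14);

        System.out.println("unscaled digits: " + expected.unscaledValue());
        System.out.println("read at scale 10: " + expected.toPlainString());
        System.out.println("read at scale 14: " + observed.toPlainString());
    }
}
```

If this reading is right, comparing `df1.schema()` with `df2.schema()` in the reproduction above would be a reasonable first check, since a scale difference between the two `projection_factor` columns would be visible there.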