Siddartha Naidu created SPARK-31430:
---------------------------------------

             Summary: Bug in the approximate quantile computation.
                 Key: SPARK-31430
                 URL: https://issues.apache.org/jira/browse/SPARK-31430
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Siddartha Naidu


I am seeing a bug where passing a lower relative error to the {{approxQuantile}} 
function leads to an incorrect result when the DataFrame has multiple partitions. 
With a relative error of 1e-6, it computes equal values for the 0.9 and 1.0 
quantiles. Coalescing the DataFrame back to 1 partition gives the correct 
results. This issue was not present in Spark 2.4.5; we noticed it when testing 
3.0.0-preview.

{{>>> df = spark.read.csv('file:///tmp/approx_quantile_data.csv', header=True, 
schema=T.StructType([T.StructField('Store',T.StringType(),True),T.StructField('seconds',T.LongType(),True)]))}}
{{>>> df = df.repartition(200, 'Store').localCheckpoint()}}
{{>>> df.approxQuantile('seconds', [0.8, 0.9, 1.0], 0.0001)}}
{{[1422576000.0, 1430352000.0, 1438300800.0]}}
{{>>> df.approxQuantile('seconds', [0.8, 0.9, 1.0], 0.00001)}}
{{[1422576000.0, 1430524800.0, 1438300800.0]}}
{color:#de350b}{{>>> df.approxQuantile('seconds', [0.8, 0.9, 1.0], 
0.000001)}}{color}
{color:#de350b}{{[1422576000.0, 1438300800.0, 1438300800.0]}}{color}
{{>>> df.coalesce(1).approxQuantile('seconds', [0.8, 0.9, 1.0], 0.000001)}}
{{[1422576000.0, 1430524800.0, 1438300800.0]}}
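For reference, {{approxQuantile}} documents a deterministic rank guarantee: for a target quantile {{p}} with relative error {{err}} over {{N}} rows, the rank of the returned value should lie in {{[floor((p - err) * N), ceil((p + err) * N)]}}. A pure-Python sketch of that contract (helper names here are illustrative, not Spark API), which the 0.9-quantile result above appears to violate:

```python
import math

def rank_bounds(p, err, n):
    # Allowed rank window for an approximate p-quantile with relative error err,
    # per the approxQuantile documentation.
    return math.floor((p - err) * n), math.ceil((p + err) * n)

def satisfies_guarantee(sorted_data, estimate, p, err):
    # rank = number of elements <= estimate in the sorted data.
    n = len(sorted_data)
    lo, hi = rank_bounds(p, err, n)
    rank = sum(1 for x in sorted_data if x <= estimate)
    return lo <= rank <= hi

data = list(range(1, 1001))  # already sorted, exact ranks are obvious
# The exact 0.9-quantile (rank 900) satisfies the bound:
print(satisfies_guarantee(data, 900, 0.9, 1e-6))   # True
# Returning the maximum for p=0.9, analogous to the bug above, does not:
print(satisfies_guarantee(data, 1000, 0.9, 1e-6))  # False
```

With err = 1e-6 and N = 1000 the allowed rank window for p = 0.9 is [899, 901], so an estimate equal to the maximum (rank 1000) is outside any reasonable tolerance, matching the symptom reported here.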



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
