[jira] [Commented] (SPARK-31430) Bug in the approximate quantile computation.
[ https://issues.apache.org/jira/browse/SPARK-31430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17211286#comment-17211286 ]

Sean R. Owen commented on SPARK-31430:
--------------------------------------

Sounds good, I usually mark as a Duplicate.

> Bug in the approximate quantile computation.
> --------------------------------------------
>
>                 Key: SPARK-31430
>                 URL: https://issues.apache.org/jira/browse/SPARK-31430
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Siddartha Naidu
>            Priority: Major
>         Attachments: approx_quantile_data.csv
>
>
> I am seeing a bug where passing lower relative error to the
> {{approxQuantile}} function is leading to incorrect result in the presence of
> partitions. Setting a relative error 1e-6 causes it to compute equal values
> for 0.9 and 1.0 quantiles. Coalescing it back to 1 partition gives correct
> results. This issue was not present in spark version 2.4.5, we noticed it
> when testing 3.0.0-preview.
> {{>>> df = spark.read.csv('file:///tmp/approx_quantile_data.csv',
> header=True,
> schema=T.StructType([T.StructField('Store',T.StringType(),True),T.StructField('seconds',T.LongType(),True)]))}}
> {{>>> df = df.repartition(200, 'Store').localCheckpoint()}}
> {{>>> df.approxQuantile('seconds', [0.8, 0.9, 1.0], 0.0001)}}
> {{[1422576000.0, 1430352000.0, 1438300800.0]}}
> {{>>> df.approxQuantile('seconds', [0.8, 0.9, 1.0], 0.1)}}
> {{[1422576000.0, 1430524800.0, 1438300800.0]}}
> {color:#de350b}{{>>> df.approxQuantile('seconds', [0.8, 0.9, 1.0],
> 0.01)}}{color}
> {color:#de350b}{{[1422576000.0, 1438300800.0, 1438300800.0]}}{color}
> {{>>> df.coalesce(1).approxQuantile('seconds', [0.8, 0.9, 1.0], 0.01)}}
> {{[1422576000.0, 1430524800.0, 1438300800.0]}}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
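
The symptom in the report (the 0.9 quantile coming back equal to the 1.0 quantile) can be checked against the rank guarantee documented for approxQuantile: for N rows, probability p and relative error err, the returned value x should satisfy floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). The following is a minimal sketch of such a check in plain Python, not code from the ticket or from Spark itself. The helper name within_rank_bound and the synthetic data are assumptions for illustration, since the attached CSV is not reproduced in this thread.

```python
import math
from bisect import bisect_left, bisect_right


def within_rank_bound(values, candidate, probability, relative_error):
    """Return True if `candidate` is an acceptable approximate quantile.

    Hypothetical helper (not part of Spark). Acceptable means `candidate`
    occurs in the data and at least one of its 1-based ranks r satisfies
    floor((p - err) * N) <= r <= ceil((p + err) * N).
    """
    ordered = sorted(values)
    n = len(ordered)
    lowest_rank = bisect_left(ordered, candidate) + 1   # first 1-based rank of candidate
    highest_rank = bisect_right(ordered, candidate)     # last 1-based rank of candidate
    if highest_rank < lowest_rank:
        return False  # candidate does not occur in the data at all
    min_rank = math.floor((probability - relative_error) * n)
    max_rank = math.ceil((probability + relative_error) * n)
    # The candidate's rank interval must overlap the allowed rank interval.
    return lowest_rank <= max_rank and highest_rank >= min_rank


if __name__ == "__main__":
    # Synthetic stand-in for the 'seconds' column: 1000 distinct values.
    data = list(range(1, 1001))
    # A value near the 900th rank is acceptable for p=0.9, err=0.01 ...
    print(within_rank_bound(data, 900, 0.9, 0.01))
    # ... while returning the maximum for p=0.9 (the reported symptom) is not.
    print(within_rank_bound(data, 1000, 0.9, 0.01))
```

With a real DataFrame, one could collect the column to the driver (only practical for small data) and pass each value returned by df.approxQuantile through this check, once on the repartitioned frame and once after coalesce(1), to see which results violate the bound.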
[jira] [Commented] (SPARK-31430) Bug in the approximate quantile computation.
[ https://issues.apache.org/jira/browse/SPARK-31430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17211282#comment-17211282 ]

Aoyuan Liao commented on SPARK-31430:
-------------------------------------

[~srowen] This is already fixed.
[jira] [Commented] (SPARK-31430) Bug in the approximate quantile computation.
[ https://issues.apache.org/jira/browse/SPARK-31430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17207920#comment-17207920 ]

Vladimir commented on SPARK-31430:
----------------------------------

Bug fixed in https://issues.apache.org/jira/browse/SPARK-32908
[jira] [Commented] (SPARK-31430) Bug in the approximate quantile computation.
[ https://issues.apache.org/jira/browse/SPARK-31430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17105363#comment-17105363 ]

Karim Magomedov commented on SPARK-31430:
-----------------------------------------

I'd like to work on this issue