[jira] [Commented] (SPARK-45599) Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset

2023-12-10 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17795162#comment-17795162
 ] 

Nicholas Chammas commented on SPARK-45599:
--

Per the [contributing guide|https://spark.apache.org/contributing.html], I 
suggest the {{correctness}} label instead of {{{}data-corruption{}}}.

> Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset
> --
>
> Key: SPARK-45599
> URL: https://issues.apache.org/jira/browse/SPARK-45599
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.6.3, 3.3.0, 3.2.3, 3.5.0
>Reporter: Robert Joseph Evans
>Priority: Critical
>  Labels: data-corruption
>
> I think this actually impacts all versions that have ever supported 
> percentile and it may impact other things because the bug is in OpenHashMap.
>  
> I am really surprised that we caught this bug because everything has to hit 
> just wrong to make it happen. in python/pyspark if you run
>  
> {code:python}
> from math import *
> from pyspark.sql.types import *
> data = [(1.779652973678931e+173,), (9.247723870123388e-295,), 
> (5.891823952773268e+98,), (inf,), (1.9042708096454302e+195,), 
> (-3.085825028509117e+74,), (-1.9569489404314425e+128,), 
> (2.0738138203216883e+201,), (inf,), (2.5212410617263588e-282,), 
> (-2.646144697462316e-35,), (-3.468683249247593e-196,), (nan,), (None,), 
> (nan,), (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
> (-5.682293414619055e+46,), (-4.585039307326895e+166,), 
> (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
> (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
> (-5.046677974902737e+132,), (-5.490780063080251e-09,), 
> (1.703824427218836e-55,), (-1.1961155424160076e+102,), 
> (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
> (5.120795466142678e-215,), (-9.01991342808203e+282,), 
> (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), 
> (3.4543959813437507e-304,), (-7.590734560275502e-63,), 
> (9.376528689861087e+117,), (-2.1696969883753554e-292,), 
> (7.227411393136537e+206,), (-2.428999624265911e-293,), 
> (5.741383583382542e-14,), (-1.4882040107841963e+286,), 
> (2.1973064836362255e-159,), (0.028096279323357867,), 
> (8.475809563703283e-64,), (3.002803065141241e-139,), 
> (-1.1041009815645263e+203,), (1.8461539468514548e-225,), 
> (-5.620339412794757e-251,), (3.5103766991437114e-60,), 
> (2.4925669515657655e+165,), (3.217759099462207e+108,), 
> (-8.796717685143486e+203,), (2.037360925124577e+292,), 
> (-6.542279108216022e+206,), (-7.951172614280046e-74,), 
> (6.226527569272003e+152,), (-5.673977270111637e-84,), 
> (-1.0186016078084965e-281,), (1.7976931348623157e+308,), 
> (4.205809391029644e+137,), (-9.871721037428167e+119,), (None,), 
> (-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,), 
> (1.7976931348623157e+308,), (4.3214483342777574e-117,), 
> (-7.973642629411105e-89,), (-1.1028137694801181e-297,), 
> (2.9000325280299273e-39,), (-1.077534929323113e-264,), 
> (-1.1847952892216515e+137,), (nan,), (7.849390806334983e+226,), 
> (-1.831402251805194e+65,), (-2.664533698035492e+203,), 
> (-2.2385155698231885e+285,), (-2.3016388448634844e-155,), 
> (-9.607772864590422e+217,), (3.437191836077251e+209,), 
> (1.9846569552093057e-137,), (-3.010452936419635e-233,), 
> (1.4309793775440402e-87,), (-2.9383643865423363e-103,), 
> (-4.696878567317712e-162,), (8.391630779050713e-135,), (nan,), 
> (-3.3885098786542755e-128,), (-4.5154178008513483e-122,), (nan,), (nan,), 
> (2.187766760184779e+306,), (7.679268835670585e+223,), 
> (6.3131466321042515e+153,), (1.779652973678931e+173,), 
> (9.247723870123388e-295,), (5.891823952773268e+98,), (inf,), 
> (1.9042708096454302e+195,), (-3.085825028509117e+74,), 
> (-1.9569489404314425e+128,), (2.0738138203216883e+201,), (inf,), 
> (2.5212410617263588e-282,), (-2.646144697462316e-35,), 
> (-3.468683249247593e-196,), (nan,), (None,), (nan,), 
> (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
> (-5.682293414619055e+46,), (-4.585039307326895e+166,), 
> (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
> (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
> (-5.046677974902737e+132,), (-5.490780063080251e-09,), 
> (1.703824427218836e-55,), (-1.1961155424160076e+102,), 
> (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
> (5.120795466142678e-215,), (-9.01991342808203e+282,), 
> (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), 
> (3.4543959813437507e-304,), (-7.590734560275502e-63,), 
> (9.376528689861087e+117,), (-2.1696969883753554e-292,), 
> (7.227411393136537e+206,), (-2.428999624265911e-293,), 
> (5.74

[jira] [Commented] (SPARK-45599) Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset

2024-01-12 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17806148#comment-17806148
 ] 

Nicholas Chammas commented on SPARK-45599:
--

I can confirm that this bug is still present on {{master}} at commit 
[a3266b411723310ec10fc1843ddababc15249db0|https://github.com/apache/spark/tree/a3266b411723310ec10fc1843ddababc15249db0].

With {{numSlices=4}} I get {{-5.924228780007003E136}} and with {{numSlices=1}} 
I get {{{}-4.739483957565084E136{}}}.

Updating the label on this issue. I will also ping some committers to bring 
this bug to their attention, as correctness bugs are taken very seriously.

> Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset
> --
>
> Key: SPARK-45599
> URL: https://issues.apache.org/jira/browse/SPARK-45599
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.6.3, 3.3.0, 3.2.3, 3.5.0
>Reporter: Robert Joseph Evans
>Priority: Critical
>  Labels: data-corruption
>
> I think this actually impacts all versions that have ever supported 
> percentile and it may impact other things because the bug is in OpenHashMap.
>  
> I am really surprised that we caught this bug because everything has to hit 
> just wrong to make it happen. in python/pyspark if you run
>  
> {code:python}
> from math import *
> from pyspark.sql.types import *
> data = [(1.779652973678931e+173,), (9.247723870123388e-295,), 
> (5.891823952773268e+98,), (inf,), (1.9042708096454302e+195,), 
> (-3.085825028509117e+74,), (-1.9569489404314425e+128,), 
> (2.0738138203216883e+201,), (inf,), (2.5212410617263588e-282,), 
> (-2.646144697462316e-35,), (-3.468683249247593e-196,), (nan,), (None,), 
> (nan,), (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
> (-5.682293414619055e+46,), (-4.585039307326895e+166,), 
> (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
> (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
> (-5.046677974902737e+132,), (-5.490780063080251e-09,), 
> (1.703824427218836e-55,), (-1.1961155424160076e+102,), 
> (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
> (5.120795466142678e-215,), (-9.01991342808203e+282,), 
> (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), 
> (3.4543959813437507e-304,), (-7.590734560275502e-63,), 
> (9.376528689861087e+117,), (-2.1696969883753554e-292,), 
> (7.227411393136537e+206,), (-2.428999624265911e-293,), 
> (5.741383583382542e-14,), (-1.4882040107841963e+286,), 
> (2.1973064836362255e-159,), (0.028096279323357867,), 
> (8.475809563703283e-64,), (3.002803065141241e-139,), 
> (-1.1041009815645263e+203,), (1.8461539468514548e-225,), 
> (-5.620339412794757e-251,), (3.5103766991437114e-60,), 
> (2.4925669515657655e+165,), (3.217759099462207e+108,), 
> (-8.796717685143486e+203,), (2.037360925124577e+292,), 
> (-6.542279108216022e+206,), (-7.951172614280046e-74,), 
> (6.226527569272003e+152,), (-5.673977270111637e-84,), 
> (-1.0186016078084965e-281,), (1.7976931348623157e+308,), 
> (4.205809391029644e+137,), (-9.871721037428167e+119,), (None,), 
> (-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,), 
> (1.7976931348623157e+308,), (4.3214483342777574e-117,), 
> (-7.973642629411105e-89,), (-1.1028137694801181e-297,), 
> (2.9000325280299273e-39,), (-1.077534929323113e-264,), 
> (-1.1847952892216515e+137,), (nan,), (7.849390806334983e+226,), 
> (-1.831402251805194e+65,), (-2.664533698035492e+203,), 
> (-2.2385155698231885e+285,), (-2.3016388448634844e-155,), 
> (-9.607772864590422e+217,), (3.437191836077251e+209,), 
> (1.9846569552093057e-137,), (-3.010452936419635e-233,), 
> (1.4309793775440402e-87,), (-2.9383643865423363e-103,), 
> (-4.696878567317712e-162,), (8.391630779050713e-135,), (nan,), 
> (-3.3885098786542755e-128,), (-4.5154178008513483e-122,), (nan,), (nan,), 
> (2.187766760184779e+306,), (7.679268835670585e+223,), 
> (6.3131466321042515e+153,), (1.779652973678931e+173,), 
> (9.247723870123388e-295,), (5.891823952773268e+98,), (inf,), 
> (1.9042708096454302e+195,), (-3.085825028509117e+74,), 
> (-1.9569489404314425e+128,), (2.0738138203216883e+201,), (inf,), 
> (2.5212410617263588e-282,), (-2.646144697462316e-35,), 
> (-3.468683249247593e-196,), (nan,), (None,), (nan,), 
> (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
> (-5.682293414619055e+46,), (-4.585039307326895e+166,), 
> (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
> (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
> (-5.046677974902737e+132,), (-5.490780063080251e-09,), 
> (1.703824427218836e-55,), (-1.1961155424160076e+102,), 
> (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
> 

[jira] [Commented] (SPARK-45599) Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset

2024-01-12 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17806150#comment-17806150
 ] 

Nicholas Chammas commented on SPARK-45599:
--

cc [~dongjoon] - This is an old correctness bug with a concise reproduction.

> Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset
> --
>
> Key: SPARK-45599
> URL: https://issues.apache.org/jira/browse/SPARK-45599
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.6.3, 3.3.0, 3.2.3, 3.5.0
>Reporter: Robert Joseph Evans
>Priority: Critical
>  Labels: correctness
>
> I think this actually impacts all versions that have ever supported 
> percentile and it may impact other things because the bug is in OpenHashMap.
>  
> I am really surprised that we caught this bug because everything has to hit 
> just wrong to make it happen. in python/pyspark if you run
>  
> {code:python}
> from math import *
> from pyspark.sql.types import *
> data = [(1.779652973678931e+173,), (9.247723870123388e-295,), 
> (5.891823952773268e+98,), (inf,), (1.9042708096454302e+195,), 
> (-3.085825028509117e+74,), (-1.9569489404314425e+128,), 
> (2.0738138203216883e+201,), (inf,), (2.5212410617263588e-282,), 
> (-2.646144697462316e-35,), (-3.468683249247593e-196,), (nan,), (None,), 
> (nan,), (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
> (-5.682293414619055e+46,), (-4.585039307326895e+166,), 
> (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
> (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
> (-5.046677974902737e+132,), (-5.490780063080251e-09,), 
> (1.703824427218836e-55,), (-1.1961155424160076e+102,), 
> (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
> (5.120795466142678e-215,), (-9.01991342808203e+282,), 
> (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), 
> (3.4543959813437507e-304,), (-7.590734560275502e-63,), 
> (9.376528689861087e+117,), (-2.1696969883753554e-292,), 
> (7.227411393136537e+206,), (-2.428999624265911e-293,), 
> (5.741383583382542e-14,), (-1.4882040107841963e+286,), 
> (2.1973064836362255e-159,), (0.028096279323357867,), 
> (8.475809563703283e-64,), (3.002803065141241e-139,), 
> (-1.1041009815645263e+203,), (1.8461539468514548e-225,), 
> (-5.620339412794757e-251,), (3.5103766991437114e-60,), 
> (2.4925669515657655e+165,), (3.217759099462207e+108,), 
> (-8.796717685143486e+203,), (2.037360925124577e+292,), 
> (-6.542279108216022e+206,), (-7.951172614280046e-74,), 
> (6.226527569272003e+152,), (-5.673977270111637e-84,), 
> (-1.0186016078084965e-281,), (1.7976931348623157e+308,), 
> (4.205809391029644e+137,), (-9.871721037428167e+119,), (None,), 
> (-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,), 
> (1.7976931348623157e+308,), (4.3214483342777574e-117,), 
> (-7.973642629411105e-89,), (-1.1028137694801181e-297,), 
> (2.9000325280299273e-39,), (-1.077534929323113e-264,), 
> (-1.1847952892216515e+137,), (nan,), (7.849390806334983e+226,), 
> (-1.831402251805194e+65,), (-2.664533698035492e+203,), 
> (-2.2385155698231885e+285,), (-2.3016388448634844e-155,), 
> (-9.607772864590422e+217,), (3.437191836077251e+209,), 
> (1.9846569552093057e-137,), (-3.010452936419635e-233,), 
> (1.4309793775440402e-87,), (-2.9383643865423363e-103,), 
> (-4.696878567317712e-162,), (8.391630779050713e-135,), (nan,), 
> (-3.3885098786542755e-128,), (-4.5154178008513483e-122,), (nan,), (nan,), 
> (2.187766760184779e+306,), (7.679268835670585e+223,), 
> (6.3131466321042515e+153,), (1.779652973678931e+173,), 
> (9.247723870123388e-295,), (5.891823952773268e+98,), (inf,), 
> (1.9042708096454302e+195,), (-3.085825028509117e+74,), 
> (-1.9569489404314425e+128,), (2.0738138203216883e+201,), (inf,), 
> (2.5212410617263588e-282,), (-2.646144697462316e-35,), 
> (-3.468683249247593e-196,), (nan,), (None,), (nan,), 
> (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
> (-5.682293414619055e+46,), (-4.585039307326895e+166,), 
> (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
> (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
> (-5.046677974902737e+132,), (-5.490780063080251e-09,), 
> (1.703824427218836e-55,), (-1.1961155424160076e+102,), 
> (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
> (5.120795466142678e-215,), (-9.01991342808203e+282,), 
> (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), 
> (3.4543959813437507e-304,), (-7.590734560275502e-63,), 
> (9.376528689861087e+117,), (-2.1696969883753554e-292,), 
> (7.227411393136537e+206,), (-2.428999624265911e-293,), 
> (5.741383583382542e-14,), (-1.4882040107841963e+286,), 
> (2.1973064836362255e

[jira] [Commented] (SPARK-45599) Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset

2024-01-15 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17806954#comment-17806954
 ] 

Nicholas Chammas commented on SPARK-45599:
--

Using [Hypothesis|https://github.com/HypothesisWorks/hypothesis], I've managed 
to shrink the provided test case from 373 elements down to 14:

{code:python}
from math import nan
from pyspark.sql import SparkSession

HYPOTHESIS_EXAMPLE = [
(0.0,),
(2.0,),
(153.0,),
(168.0,),
(3252411229536261.0,),
(7.205759403792794e+16,),
(1.7976931348623157e+308,),
(0.25,),
(nan,),
(nan,),
(-0.0,),
(-128.0,),
(nan,),
(nan,),
]

spark = (
SparkSession.builder
.config("spark.log.level", "ERROR")
.getOrCreate()
)


def compare_percentiles(data, slices):
rdd = spark.sparkContext.parallelize(data, numSlices=1)
df = spark.createDataFrame(rdd, "val double")
result1 = df.selectExpr('percentile(val, 0.1)').collect()[0][0]

rdd = spark.sparkContext.parallelize(data, numSlices=slices)
df = spark.createDataFrame(rdd, "val double")
result2 = df.selectExpr('percentile(val, 0.1)').collect()[0][0]

assert result1 == result2, f"{result1}, {result2}"


if __name__ == "__main__":
compare_percentiles(HYPOTHESIS_EXAMPLE, 2)
{code}

Running this test fails as follows:

{code:python}
Traceback (most recent call last):  
  File ".../SPARK-45599.py", line 41, in 
compare_percentiles(HYPOTHESIS_EXAMPLE, 2)
  File ".../SPARK-45599.py", line 37, in compare_percentiles
assert result1 == result2, f"{result1}, {result2}"
   ^^
AssertionError: 0.050044, -0.0
{code}

> Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset
> --
>
> Key: SPARK-45599
> URL: https://issues.apache.org/jira/browse/SPARK-45599
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.6.3, 3.3.0, 3.2.3, 3.5.0
>Reporter: Robert Joseph Evans
>Priority: Critical
>  Labels: correctness
>
> I think this actually impacts all versions that have ever supported 
> percentile and it may impact other things because the bug is in OpenHashMap.
>  
> I am really surprised that we caught this bug because everything has to hit 
> just wrong to make it happen. in python/pyspark if you run
>  
> {code:python}
> from math import *
> from pyspark.sql.types import *
> data = [(1.779652973678931e+173,), (9.247723870123388e-295,), 
> (5.891823952773268e+98,), (inf,), (1.9042708096454302e+195,), 
> (-3.085825028509117e+74,), (-1.9569489404314425e+128,), 
> (2.0738138203216883e+201,), (inf,), (2.5212410617263588e-282,), 
> (-2.646144697462316e-35,), (-3.468683249247593e-196,), (nan,), (None,), 
> (nan,), (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
> (-5.682293414619055e+46,), (-4.585039307326895e+166,), 
> (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
> (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
> (-5.046677974902737e+132,), (-5.490780063080251e-09,), 
> (1.703824427218836e-55,), (-1.1961155424160076e+102,), 
> (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
> (5.120795466142678e-215,), (-9.01991342808203e+282,), 
> (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), 
> (3.4543959813437507e-304,), (-7.590734560275502e-63,), 
> (9.376528689861087e+117,), (-2.1696969883753554e-292,), 
> (7.227411393136537e+206,), (-2.428999624265911e-293,), 
> (5.741383583382542e-14,), (-1.4882040107841963e+286,), 
> (2.1973064836362255e-159,), (0.028096279323357867,), 
> (8.475809563703283e-64,), (3.002803065141241e-139,), 
> (-1.1041009815645263e+203,), (1.8461539468514548e-225,), 
> (-5.620339412794757e-251,), (3.5103766991437114e-60,), 
> (2.4925669515657655e+165,), (3.217759099462207e+108,), 
> (-8.796717685143486e+203,), (2.037360925124577e+292,), 
> (-6.542279108216022e+206,), (-7.951172614280046e-74,), 
> (6.226527569272003e+152,), (-5.673977270111637e-84,), 
> (-1.0186016078084965e-281,), (1.7976931348623157e+308,), 
> (4.205809391029644e+137,), (-9.871721037428167e+119,), (None,), 
> (-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,), 
> (1.7976931348623157e+308,), (4.3214483342777574e-117,), 
> (-7.973642629411105e-89,), (-1.1028137694801181e-297,), 
> (2.9000325280299273e-39,), (-1.077534929323113e-264,), 
> (-1.1847952892216515e+137,), (nan,), (7.849390806334983e+226,), 
> (-1.831402251805194e+65,), (-2.664533698035492e+203,), 
> (-2.2385155698231885e+285,), (-2.3016388448634844e-155,), 
> (-9.607772864590422e+217,), (3.437191836077251e+209,), 
> (1.9846569552093057e-137,), (-3.010452936419635e-233,), 
> (1.43097