[ https://issues.apache.org/jira/browse/SPARK-45599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wenchen Fan resolved SPARK-45599. --------------------------------- Fix Version/s: 3.5.1 4.0.0 Resolution: Fixed Issue resolved by pull request 45036 [https://github.com/apache/spark/pull/45036] > Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset > ------------------------------------------------------------------------------ > > Key: SPARK-45599 > URL: https://issues.apache.org/jira/browse/SPARK-45599 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 1.4.1, 1.6.3, 3.3.0, 3.2.3, 3.5.0 > Reporter: Robert Joseph Evans > Assignee: Nicholas Chammas > Priority: Critical > Labels: correctness, pull-request-available > Fix For: 3.5.1, 4.0.0 > > > I think this actually impacts all versions that have ever supported > percentile and it may impact other things because the bug is in OpenHashMap. > > I am really surprised that we caught this bug because everything has to hit > just wrong to make it happen. in python/pyspark if you run > > {code:python} > from math import * > from pyspark.sql.types import * > data = [(1.779652973678931e+173,), (9.247723870123388e-295,), > (5.891823952773268e+98,), (inf,), (1.9042708096454302e+195,), > (-3.085825028509117e+74,), (-1.9569489404314425e+128,), > (2.0738138203216883e+201,), (inf,), (2.5212410617263588e-282,), > (-2.646144697462316e-35,), (-3.468683249247593e-196,), (nan,), (None,), > (nan,), (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), > (-5.682293414619055e+46,), (-4.585039307326895e+166,), > (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), > (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), > (-5.046677974902737e+132,), (-5.490780063080251e-09,), > (1.703824427218836e-55,), (-1.1961155424160076e+102,), > (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), > (5.120795466142678e-215,), (-9.01991342808203e+282,), > (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), > (3.4543959813437507e-304,), (-7.590734560275502e-63,), > (9.376528689861087e+117,), (-2.1696969883753554e-292,), > (7.227411393136537e+206,), (-2.428999624265911e-293,), > (5.741383583382542e-14,), (-1.4882040107841963e+286,), > (2.1973064836362255e-159,), (0.028096279323357867,), > (8.475809563703283e-64,), (3.002803065141241e-139,), > (-1.1041009815645263e+203,), (1.8461539468514548e-225,), > (-5.620339412794757e-251,), (3.5103766991437114e-60,), > (2.4925669515657655e+165,), (3.217759099462207e+108,), > (-8.796717685143486e+203,), (2.037360925124577e+292,), > (-6.542279108216022e+206,), (-7.951172614280046e-74,), > (6.226527569272003e+152,), (-5.673977270111637e-84,), > (-1.0186016078084965e-281,), (1.7976931348623157e+308,), > (4.205809391029644e+137,), (-9.871721037428167e+119,), (None,), > (-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,), > (1.7976931348623157e+308,), (4.3214483342777574e-117,), > (-7.973642629411105e-89,), (-1.1028137694801181e-297,), > (2.9000325280299273e-39,), (-1.077534929323113e-264,), > (-1.1847952892216515e+137,), (nan,), (7.849390806334983e+226,), > (-1.831402251805194e+65,), (-2.664533698035492e+203,), > (-2.2385155698231885e+285,), (-2.3016388448634844e-155,), > (-9.607772864590422e+217,), (3.437191836077251e+209,), > (1.9846569552093057e-137,), (-3.010452936419635e-233,), > (1.4309793775440402e-87,), (-2.9383643865423363e-103,), > (-4.696878567317712e-162,), (8.391630779050713e-135,), (nan,), > (-3.3885098786542755e-128,), (-4.5154178008513483e-122,), (nan,), (nan,), > (2.187766760184779e+306,), (7.679268835670585e+223,), > (6.3131466321042515e+153,), (1.779652973678931e+173,), > (9.247723870123388e-295,), (5.891823952773268e+98,), (inf,), > (1.9042708096454302e+195,), (-3.085825028509117e+74,), > (-1.9569489404314425e+128,), (2.0738138203216883e+201,), (inf,), > (2.5212410617263588e-282,), (-2.646144697462316e-35,), > (-3.468683249247593e-196,), (nan,), (None,), (nan,), > (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), > (-5.682293414619055e+46,), (-4.585039307326895e+166,), > (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), > (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), > (-5.046677974902737e+132,), (-5.490780063080251e-09,), > (1.703824427218836e-55,), (-1.1961155424160076e+102,), > (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), > (5.120795466142678e-215,), (-9.01991342808203e+282,), > (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), > (3.4543959813437507e-304,), (-7.590734560275502e-63,), > (9.376528689861087e+117,), (-2.1696969883753554e-292,), > (7.227411393136537e+206,), (-2.428999624265911e-293,), > (5.741383583382542e-14,), (-1.4882040107841963e+286,), > (2.1973064836362255e-159,), (0.028096279323357867,), > (8.475809563703283e-64,), (3.002803065141241e-139,), > (-1.1041009815645263e+203,), (1.8461539468514548e-225,), > (-5.620339412794757e-251,), (3.5103766991437114e-60,), > (2.4925669515657655e+165,), (3.217759099462207e+108,), > (-8.796717685143486e+203,), (2.037360925124577e+292,), > (-6.542279108216022e+206,), (-7.951172614280046e-74,), > (6.226527569272003e+152,), (-5.673977270111637e-84,), > (-1.0186016078084965e-281,), (1.7976931348623157e+308,), > (4.205809391029644e+137,), (-9.871721037428167e+119,), (None,), > (-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,), > (1.7976931348623157e+308,), (4.3214483342777574e-117,), > (-7.973642629411105e-89,), (-1.1028137694801181e-297,), > (2.9000325280299273e-39,), (-1.077534929323113e-264,), > (-1.1847952892216515e+137,), (nan,), (7.849390806334983e+226,), > (-1.831402251805194e+65,), (-2.664533698035492e+203,), > (-2.2385155698231885e+285,), (-2.3016388448634844e-155,), > (-9.607772864590422e+217,), (3.437191836077251e+209,), > (1.9846569552093057e-137,), (-3.010452936419635e-233,), > (1.4309793775440402e-87,), (-2.9383643865423363e-103,), > (-4.696878567317712e-162,), (8.391630779050713e-135,), (nan,), > (-3.3885098786542755e-128,), (-4.5154178008513483e-122,), (nan,), (nan,), > (2.187766760184779e+306,), (7.679268835670585e+223,), > (6.3131466321042515e+153,), (1.779652973678931e+173,), > (9.247723870123388e-295,), (5.891823952773268e+98,), (inf,), > (1.9042708096454302e+195,), (-3.085825028509117e+74,), > (-1.9569489404314425e+128,), (2.0738138203216883e+201,), (inf,), > (2.5212410617263588e-282,), (-2.646144697462316e-35,), > (-3.468683249247593e-196,), (nan,), (None,), (nan,), > (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), > (-5.682293414619055e+46,), (-4.585039307326895e+166,), > (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), > (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), > (-5.046677974902737e+132,), (-5.490780063080251e-09,), > (1.703824427218836e-55,), (-1.1961155424160076e+102,), > (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), > (5.120795466142678e-215,), (-9.01991342808203e+282,), > (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), > (3.4543959813437507e-304,), (-7.590734560275502e-63,), > (9.376528689861087e+117,), (-2.1696969883753554e-292,), > (7.227411393136537e+206,), (-2.428999624265911e-293,), > (5.741383583382542e-14,), (-1.4882040107841963e+286,), > (2.1973064836362255e-159,), (0.028096279323357867,), > (8.475809563703283e-64,), (3.002803065141241e-139,), > (-1.1041009815645263e+203,), (1.8461539468514548e-225,), > (-5.620339412794757e-251,), (3.5103766991437114e-60,), > (2.4925669515657655e+165,), (3.217759099462207e+108,), > (-8.796717685143486e+203,), (2.037360925124577e+292,), > (-6.542279108216022e+206,), (-7.951172614280046e-74,), > (6.226527569272003e+152,), (-5.673977270111637e-84,), > (-1.0186016078084965e-281,), (1.7976931348623157e+308,), > (4.205809391029644e+137,), (-9.871721037428167e+119,), (None,), > (-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,), > (1.7976931348623157e+308,), (4.3214483342777574e-117,), > (-7.973642629411105e-89,), (-1.1028137694801181e-297,), > (2.9000325280299273e-39,), (-1.077534929323113e-264,), > (-1.1847952892216515e+137,), (nan,), (7.849390806334983e+226,), > (-1.831402251805194e+65,), (-2.664533698035492e+203,), > (-2.2385155698231885e+285,), (-2.3016388448634844e-155,), > (-9.607772864590422e+217,), (3.437191836077251e+209,), > (1.9846569552093057e-137,), (-3.010452936419635e-233,), > (1.4309793775440402e-87,), (-2.9383643865423363e-103,), > (-4.696878567317712e-162,), (8.391630779050713e-135,), (nan,), > (-3.3885098786542755e-128,), (-4.5154178008513483e-122,), (nan,), (nan,), > (2.187766760184779e+306,), (7.679268835670585e+223,), > (6.3131466321042515e+153,), (1.779652973678931e+173,), > (9.247723870123388e-295,), (5.891823952773268e+98,), (inf,), > (1.9042708096454302e+195,), (-3.085825028509117e+74,), > (-1.9569489404314425e+128,), (2.0738138203216883e+201,), (inf,), > (2.5212410617263588e-282,), (-2.646144697462316e-35,), > (-3.468683249247593e-196,), (nan,), (None,), (nan,), > (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), > (-5.682293414619055e+46,), (-4.585039307326895e+166,), > (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), > (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), > (-5.046677974902737e+132,), (-5.490780063080251e-09,), > (1.703824427218836e-55,), (-1.1961155424160076e+102,), > (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), > (5.120795466142678e-215,), (-9.01991342808203e+282,), > (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), > (3.4543959813437507e-304,), (-7.590734560275502e-63,), > (9.376528689861087e+117,), (-2.1696969883753554e-292,), > (7.227411393136537e+206,), (-2.428999624265911e-293,), > (5.741383583382542e-14,), (-1.4882040107841963e+286,), > (2.1973064836362255e-159,), (0.028096279323357867,), > (8.475809563703283e-64,), (3.002803065141241e-139,), > (-1.1041009815645263e+203,), (1.8461539468514548e-225,), > (-5.620339412794757e-251,), (3.5103766991437114e-60,), > (2.4925669515657655e+165,), (3.217759099462207e+108,), > (-8.796717685143486e+203,), (2.037360925124577e+292,), > (-6.542279108216022e+206,), (-7.951172614280046e-74,), > (6.226527569272003e+152,), (-5.673977270111637e-84,), > (-1.0186016078084965e-281,), (1.7976931348623157e+308,), > (4.205809391029644e+137,), (-9.871721037428167e+119,), (None,), > (-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,), > (1.7976931348623157e+308,), (4.3214483342777574e-117,)] > df = spark.createDataFrame(SparkContext.getOrCreate().parallelize(data, > numSlices=4), StructType([StructField('val',DoubleType(),True)])) > df.selectExpr('percentile(val, 0.1)').show(truncate=False){code} > You will get back a result of {{-5.924228780007003E136}} but the correct > answer is {{-4.739483957565084E136}} which can be verified by changing the > number of slices used to import the data into spark. > > What is happening is that we are getting super unlucky. In the 4th input > partition the data is read in from 7.849390806334983E226 to the end. This > works fine and we get an OpenHashMap with an entry for both 0.0 and -0.0 > > {code:java} > OpenHashMap((1.0075153091760986E-236,1), (0.0,1), > (-2.646144697462316E-35,1), (-7.951172614280046E-74,1), > (-3.468683249247593E-196,1), (5.741383583382542E-14,1), > (-2.664533698035492E203,1), (7.227411393136537E206,1), > (-3.588518231990927E12,1), (1.9042708096454302E195,1), > (-1.9569489404314425E128,1), (7.849390806334983E226,1), > (2.187766760184779E306,1), (2.4925669515657655E165,1), > (-5.620339412794757E-251,1), (-0.0,1), (-1.1041009815645263E203,1), > (-2.3016388448634844E-155,1), (2.1973064836362255E-159,1), > (-1.831402251805194E65,1), (1.4403274475565667E41,1), > (-3.085825028509117E74,1), (-6.542279108216022E206,1), > (-9.871721037428167E119,1), (8.475809563703283E-64,1), > (-5.673977270111637E-84,1), (5.120795466142678E-215,1), > (-5.046677974902737E132,1), (-4.5154178008513483E-122,1), > (1.9846569552093057E-137,1), (-3.3885098786542755E-128,1), > (7.679268835670585E223,1), (4.920675036053339E9,1), (-1.0,1), > (-4.585039307326895E166,1), (-9.01991342808203E282,1), > (5.4470705929955455E-86,1), (9.247723870123388E-295,1), > (-1.8891559842111865E63,1), (-4.696878567317712E-162,1), > (-1.4882040107841963E286,1), (-5.936844510098297E-82,1), > (6.226527569272003E152,1), (-1.1961155424160076E102,1), > (-1.6663254121185628E-256,1), (4.4501477170144023E-308,1), > (-9.607772864590422E217,1), (-3.010452936419635E-233,1), > (4.051866849943636E-254,1), (1.4309793775440402E-87,1), > (2.5212410617263588E-282,1), (3.4543959813437507E-304,1), > (0.028096279323357867,1), (-7.590734560275502E-63,1), > (5.211702553315461E-259,1), (-1.0186016078084965E-281,1), > (3.437191836077251E209,1), (NaN,1), (NaN,1), (NaN,1), > (8.391630779050713E-135,1), (-5.490780063080251E-9,1), > (-2.9383643865423363E-103,1), (2.0738138203216883E201,1), > (1.8461539468514548E-225,1), (1.822129180806602E-245,1), (NaN,1), > (4.205809391029644E137,1), (2.037360925124577E292,1), > (-2.428999624265911E-293,1), (3.002803065141241E-139,1), > (6.3131466321042515E153,1), (3.5103766991437114E-60,1), > (3.217759099462207E108,1), (-8.796717685143486E203,1), > (-2.2385155698231885E285,1), (2.176024662699802E-210,1), > (1.703824427218836E-55,1), (9.376528689861087E117,1), > (1.7976931348623157E308,2), (NaN,1), (5.891823952773268E98,1), > (-2.1696969883753554E-292,1), (4.3214483342777574E-117,1), (Infinity,2), > (1.779652973678931E173,1), (-5.234708055733116E12,1), > (-5.682293414619055E46,1)) > {code} > But when we go to deserialize the map after the shuffle we get a different > result out with the entry for -0.0 gone. That is because the deserialize code > uses update that will overwrite the count if the keys are the same. But -0.0 > and 0.0 are not the same. Unless they happen to hash to the same position > when they are being added in. In that case they end up being equal to each > other and the count for 0.0 is replaced with the count for -0.0 and we lose > one row in the data. > > Because the keys are stored in OpenHashSet that is where the bug actually is. > [https://github.com/apache/spark/blob/7e82e1bc43e0297c3036d802b3a151d2b93db2f6/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala#L139-L159] > > I see a few ways to fix this. > # Update OpenHashSet/OpenHashMap to do the right thing for floats and > doubles around -0.0 and 0.0 > # normalize nans and zeros before doing percentiles > # reinterpret the bits for float/double as an int/long before putting them > into the map and do the reverse when we pull them out. That would also have > the advantage of making NaN == NaN which would reduce the size of the map in > those cases for percentile. > # Update the deserialize code for percentile to not call update, but instead > to call {{counts.changeValue(key, count, _ + count)}} > I am not sure if something similar can happen in other places, but I know for > hash aggregate/etc we normalize the floating point values because of things > like this. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org