[ https://issues.apache.org/jira/browse/SPARK-33380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-33380:
------------------------------------

Assignee: (was: Apache Spark)

> Incorrect output from example script pi.py
> ------------------------------------------
>
> Key: SPARK-33380
> URL: https://issues.apache.org/jira/browse/SPARK-33380
> Project: Spark
> Issue Type: Bug
> Components: Examples
> Affects Versions: 2.4.6
> Reporter: Milind V Damle
> Priority: Minor
>
> I have Apache Spark v2.4.6 installed on my mini cluster of 1 driver and
> 2 worker nodes. To test the installation, I ran the
> $SPARK_HOME/examples/src/main/python/pi.py script included with Spark 2.4.6.
> Three runs produced the following output:
>
> m4-nn:~:spark-submit --master spark://10.0.0.20:7077 /usr/local/spark/examples/src/main/python/pi.py
> Pi is roughly 3.149880
> m4-nn:~:spark-submit --master spark://10.0.0.20:7077 /usr/local/spark/examples/src/main/python/pi.py
> Pi is roughly 3.137760
> m4-nn:~:spark-submit --master spark://10.0.0.20:7077 /usr/local/spark/examples/src/main/python/pi.py
> Pi is roughly 3.155640
>
> I noted that the computed value of Pi varies with each run.
>
> Next, I ran the same script three more times with a higher number of
> partitions (16) and noted the following output:
>
> m4-nn:~:spark-submit --master spark://10.0.0.20:7077 /usr/local/spark/examples/src/main/python/pi.py 16
> Pi is roughly 3.141100
> m4-nn:~:spark-submit --master spark://10.0.0.20:7077 /usr/local/spark/examples/src/main/python/pi.py 16
> Pi is roughly 3.137720
> m4-nn:~:spark-submit --master spark://10.0.0.20:7077 /usr/local/spark/examples/src/main/python/pi.py 16
> Pi is roughly 3.145660
>
> Again, I noted that the computed value of Pi varies with each run.
>
> IMO, there are two issues with this example script:
> 1. The output (the value of pi) is non-deterministic because the script
> uses random.random().
> 2. Specifying the number of partitions (accepted as a command-line
> argument) has no observable positive impact on the accuracy or precision.
>
> It may be argued that the intent of these example scripts is simply to
> demonstrate how to use Spark and to offer a quick way to verify an
> installation. However, we can achieve that objective without compromising
> the accuracy or determinism of the computed value. Unless the user examines
> the script and understands that the use of random.random() (to generate
> random points within the top right quadrant of the circle) is the reason
> behind the non-determinism, it seems confusing at first that the value
> varies per run and is also inaccurate. Someone may incorrectly infer that
> this is a limitation of the framework!
>
> To mitigate this, I wrote an alternate version that computes pi using a
> partial sum of terms from an infinite series. This script is deterministic
> and can produce more accurate output if the user configures it to use more
> terms. To me, that behavior feels intuitive and logical. I will be happy to
> share it if that is appropriate.
>
> Best regards,
> Milind

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
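For context, the sampling scheme the report describes (counting random points that fall inside a quadrant, as pi.py does with random.random()) can be sketched in plain, non-distributed Python. This is a minimal illustration of the technique, not the Spark example itself; the function name and seed parameter are illustrative additions:

```python
import random


def estimate_pi(num_samples, seed=None):
    """Monte Carlo estimate of pi: sample points in the unit square and
    count how many fall inside the quarter circle of radius 1."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x = rng.random()
        y = rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # (area of quarter circle) / (area of unit square) = pi / 4
    return 4.0 * inside / num_samples


if __name__ == "__main__":
    # Without a fixed seed, each run prints a slightly different value --
    # exactly the run-to-run variation described in the report.
    print("Pi is roughly %f" % estimate_pi(100_000))
```

Fixing the seed makes repeated calls reproducible, which is one conventional way to keep a Monte Carlo demo deterministic without changing the algorithm.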
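The reporter's deterministic alternative is not included in the message, so the following is only a sketch of the shape such a script could take, using the Leibniz series as an assumed choice of infinite series (any convergent series for pi would serve):

```python
def pi_partial_sum(num_terms):
    """Approximate pi with a partial sum of the Leibniz series:
        pi = 4 * (1 - 1/3 + 1/5 - 1/7 + ...).
    Deterministic: the same num_terms always yields the same value,
    and more terms yield a more accurate result."""
    total = 0.0
    for k in range(num_terms):
        total += (-1.0) ** k / (2 * k + 1)
    return 4.0 * total


if __name__ == "__main__":
    # Accuracy improves predictably as the number of terms grows.
    for n in (1_000, 100_000):
        print("%d terms: Pi is roughly %f" % (n, pi_partial_sum(n)))
```

In a distributed version, each partition could sum a disjoint range of k and the partial results could be added, preserving determinism; the truncation error of this particular series shrinks roughly as 1/num_terms.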