Milind V Damle created SPARK-33380:
--------------------------------------

             Summary: Incorrect output from example script pi.py
                 Key: SPARK-33380
                 URL: https://issues.apache.org/jira/browse/SPARK-33380
             Project: Spark
          Issue Type: Bug
          Components: Examples
    Affects Versions: 2.4.6
            Reporter: Milind V Damle
I have Apache Spark v2.4.6 installed on my mini cluster of 1 driver and 2 worker nodes. To test the installation, I ran the $SPARK_HOME/examples/src/main/python/pi.py script included with Spark 2.4.6. Three runs produced the following output:

m4-nn:~:spark-submit --master spark://10.0.0.20:7077 /usr/local/spark/examples/src/main/python/pi.py
Pi is roughly 3.149880
m4-nn:~:spark-submit --master spark://10.0.0.20:7077 /usr/local/spark/examples/src/main/python/pi.py
Pi is roughly 3.137760
m4-nn:~:spark-submit --master spark://10.0.0.20:7077 /usr/local/spark/examples/src/main/python/pi.py
Pi is roughly 3.155640

I noted that the computed value of Pi varies with each run. Next, I ran the same script 3 more times with a higher number of partitions (16) and observed the following output:

m4-nn:~:spark-submit --master spark://10.0.0.20:7077 /usr/local/spark/examples/src/main/python/pi.py 16
Pi is roughly 3.141100
m4-nn:~:spark-submit --master spark://10.0.0.20:7077 /usr/local/spark/examples/src/main/python/pi.py 16
Pi is roughly 3.137720
m4-nn:~:spark-submit --master spark://10.0.0.20:7077 /usr/local/spark/examples/src/main/python/pi.py 16
Pi is roughly 3.145660

Again, the computed value of Pi varied with each run. IMO, there are 2 issues with this example script:

1. The output (the value of pi) is non-deterministic because the script uses random.random().
2. Specifying the number of partitions (accepted as a command-line argument) has no observable positive impact on accuracy or precision.

It may be argued that the intent of these example scripts is simply to demonstrate how to use Spark and to offer a means to quickly verify an installation. However, we can achieve that objective without compromising the accuracy or determinism of the computed value.
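For context, the core of pi.py is a Monte Carlo estimate: sample random points in the unit square and count how many land inside the quarter circle. The following is a standalone sketch of that idea without Spark (the function name and the optional seed parameter are mine, not part of the example script); a fixed seed makes the run-to-run variation disappear:

```python
import random

def estimate_pi(num_samples, seed=None):
    """Monte Carlo estimate of pi: sample points uniformly in the unit
    square and count how many fall inside the quarter circle of radius 1.
    The ratio inside/total approximates (pi/4) / 1."""
    rng = random.Random(seed)  # seeding (unlike pi.py) makes the result reproducible
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_samples

# Two runs with the same seed agree exactly; unseeded runs vary,
# which is the behavior observed above.
print(estimate_pi(100_000, seed=42))
```

With more samples the estimate converges only at a rate of O(1/sqrt(n)), which is why raising the partition count (which in pi.py also scales the sample count) shows no dramatic accuracy gain.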
Unless the user examines the script and understands that the use of random.random() (to generate random points within the top-right quadrant of the circle) is the reason behind the non-determinism, it is confusing at first that the value varies per run and is also inaccurate. Someone may (incorrectly) infer that this is a limitation of the framework!

To mitigate this, I wrote an alternate version that computes pi using a partial sum of terms from an infinite series. This script is deterministic and can produce more accurate output if the user configures it to use more terms. To me, that behavior feels intuitive and logical. I will be happy to share it if that is appropriate.

Best regards,
Milind

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
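[Editor's note: the reporter's alternate script was not attached. As an illustration of the deterministic series approach described above, one possibility is a partial sum of the Leibniz series pi/4 = 1 - 1/3 + 1/5 - 1/7 + ...; this sketch is hypothetical and is not the reporter's actual script:]

```python
def pi_leibniz(num_terms):
    """Deterministic approximation of pi via a partial sum of the
    Leibniz series: pi/4 = 1 - 1/3 + 1/5 - 1/7 + ...
    More terms yield a more accurate result, and the same input
    always produces the same output."""
    total = 0.0
    for k in range(num_terms):
        total += (-1) ** k / (2 * k + 1)
    return 4.0 * total

print(pi_leibniz(1_000_000))  # close to 3.14159; identical on every run
```

The Leibniz series converges slowly (error roughly 1/(2n) after n terms), but it demonstrates the intuitive property the reporter asks for: accuracy improves monotonically and predictably with the configured number of terms.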