Milind V Damle created SPARK-33380:
--------------------------------------

             Summary: Incorrect output from example script pi.py
                 Key: SPARK-33380
                 URL: https://issues.apache.org/jira/browse/SPARK-33380
             Project: Spark
          Issue Type: Bug
          Components: Examples
    Affects Versions: 2.4.6
            Reporter: Milind V Damle


 
I have Apache Spark v2.4.6 installed on my mini cluster of 1 driver and 2 
worker nodes. To test the installation, I ran the 
$SPARK_HOME/examples/src/main/python/pi.py script included with Spark-2.4.6. 
Three runs produced the following output:
 
m4-nn:~:spark-submit --master spark://10.0.0.20:7077 /usr/local/spark/examples/src/main/python/pi.py
Pi is roughly 3.149880
m4-nn:~:spark-submit --master spark://10.0.0.20:7077 /usr/local/spark/examples/src/main/python/pi.py
Pi is roughly 3.137760
m4-nn:~:spark-submit --master spark://10.0.0.20:7077 /usr/local/spark/examples/src/main/python/pi.py
Pi is roughly 3.155640
 
I noted that the computed value of Pi varies with each run.
Next, I ran the same script three more times with a higher number of partitions
(16) and noted the following output:

m4-nn:~:spark-submit --master spark://10.0.0.20:7077 /usr/local/spark/examples/src/main/python/pi.py 16
Pi is roughly 3.141100
m4-nn:~:spark-submit --master spark://10.0.0.20:7077 /usr/local/spark/examples/src/main/python/pi.py 16
Pi is roughly 3.137720
m4-nn:~:spark-submit --master spark://10.0.0.20:7077 /usr/local/spark/examples/src/main/python/pi.py 16
Pi is roughly 3.145660
 
Again, I noted that the computed value of Pi varies with each run. 
 
IMO, there are 2 issues with this example script:
1. The output (value of pi) is non-deterministic because the script uses 
random.random(). 
2. Specifying the number of partitions (accepted as a command-line argument) 
has no observable positive impact on the accuracy or precision. 
 
It may be argued that the intent of these example scripts is simply to
demonstrate how to use Spark as well as offer a means to quickly verify an
installation. However, we can achieve that objective without compromising on
the accuracy or determinism of the computed value. Unless the user examines the
script and understands that the use of random.random() (to generate random points
within the top right quadrant of the circle) is the reason behind the
non-determinism, it seems confusing at first that the value varies per run and
also that it is inaccurate. Someone may (incorrectly) infer that this is a
limitation of the framework!
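 
For reference, the Monte Carlo core of pi.py (paraphrased here from memory, not
verbatim) samples random points and counts how many fall inside the unit circle,
so every run draws a different sample and prints a different estimate:

    # Paraphrase of the core of examples/src/main/python/pi.py (from memory, not verbatim)
    import sys
    from random import random
    from operator import add
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("PythonPi").getOrCreate()
    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions  # sample count scales with partitions

    def inside(_):
        # Each call draws a fresh random point, so the estimate differs on every run
        x, y = random() * 2 - 1, random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(inside).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))
    spark.stop()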
 
To mitigate this, I wrote an alternate version that computes pi using a partial
sum of terms from an infinite series. This script is deterministic and can
produce more accurate output if the user configures it to use more terms. To
me, that behavior feels intuitive and logical. I will be happy to share it if
it is appropriate; a rough sketch of the idea appears below.
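 
To illustrate the idea, here is a minimal sketch of a deterministic variant. It
uses the Leibniz series; the series choice, names, and term count below are
illustrative only and not necessarily what my script does:

    # Illustrative sketch: deterministic pi via a partial sum of the Leibniz series
    # pi/4 = 1 - 1/3 + 1/5 - 1/7 + ...
    import sys
    from operator import add
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("PythonPiSeries").getOrCreate()
    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 1000000 * partitions  # more terms -> a more accurate, still deterministic result

    def term(k):
        # k-th term of the Leibniz series
        return (-1.0) ** k / (2 * k + 1)

    total = spark.sparkContext.parallelize(range(n), partitions).map(term).reduce(add)
    print("Pi is roughly %.8f" % (4.0 * total))
    spark.stop()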
 
Best regards,
Milind
 


