vinoth created SPARK-12389:
------------------------------

             Summary: In Cluster RDD Action results are not consistent
                 Key: SPARK-12389
                 URL: https://issues.apache.org/jira/browse/SPARK-12389
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 1.5.2
         Environment: CentOS 6.5 machines

One master and three worker nodes in VMs

Master : 192.168.56.102

Worker : 192.168.56.103,192.168.56.104,192.168.56.105
            Reporter: vinoth


I wanted to see how an RDD recreates lost partitions without replication, and to test how cluster-wide execution works in Spark.

I have an external file on Linux. I load the file, let Spark parallelize it across the cluster, apply a transformation that splits it into words, and run an action on it, both in local mode and cluster-wide.

Below is the file content:
=======================
hai hello
hai hello
vinoth test
test vinoth
test hai  
=======================
The transformation and action I tried in the shell are:

data = sc.textFile("/tmp/test.txt")             # load the file
datamap = data.flatMap(lambda x: x.split(' '))  # split each line into words
datamap.count()                                 # count the words

That's it. I keep running datamap.count() over and over, and the result it 
produces is not consistent.
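
For example, I repeat the action in the same shell session like this (a minimal 
sketch of what I run; the number of repetitions is arbitrary):

for i in range(5):
    print(datamap.count())   # should print 10 every time, but the value varies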

If you split the file and count the words, the result should be 10. The count is 
indeed consistent if I run the pyspark shell without the master option.

If I run it with the master option, the results are not consistent: sometimes it 
produces 10 and sometimes 9.
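
For reference, this is roughly how I start the shell in the two cases (a sketch; 
the master URL below assumes my standalone master listens on the default port 7077):

pyspark                                        # local mode: count is always 10
pyspark --master spark://192.168.56.102:7077   # cluster mode: count is 10 or 9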

Between runs in the shell, I manually take down one worker node, 192.168.56.104. 
Even more surprisingly, the result then shows as "11".
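
To take the worker down, I just stop the worker JVM on that node (a sketch; any 
way of killing the org.apache.spark.deploy.worker.Worker process on 
192.168.56.104 has the same effect):

jps | grep Worker    # find the pid of the Worker process
kill <pid>           # <pid> is the number printed by jps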

I have attached the results I got in cluster mode as well as in local mode.

My apologies for taking your time with this issue if this is the normal behavior 
in Spark.



