Hi,

For some time now I have been getting different answers when I run the same line of code more than once. I ran some experiments to see what is happening; the code and the results are below. I suspect the problem is related to reading large datasets from S3.

rd1 = sc.textFile('s3n://blabla')
*rd1.persist()*
rd2 = rd_imp.filter(lambda x: filter1(x)).map(lambda x: map1(x))

Note: both filter1() and map1() are deterministic

rd2.count() ==> 294928559
rd2.count() ==> 294928559

So far so good, I get the same counts. Now when I unpersist rd1, that's when I start getting problems!

*rd1.unpersist()*
rd2 = rd_imp.filter(lambda x: filter1(x)).map(lambda x: map1(x))
rd2.count() ==> 294928559
rd2.count() ==> 294509501
rd2.count() ==> 294679795
...

I would appreciate it if you could help me!

Thanks,
Maria

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Getting-different-answers-running-same-line-of-code-tp7920.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.