The easiest explanation would be that some other process is continuously modifying the files. You could copy them into a new directory and run against that copy to rule this out.
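Concretely, the "work on a frozen copy" idea looks like this locally (paths here are hypothetical; against S3 you would copy the objects instead, e.g. with `hadoop distcp` or `s3cmd`, and then point `sc.textFile()` at the copy):

```shell
# Hypothetical local paths, standing in for an S3 bucket.
mkdir -p /tmp/snap_demo/src /tmp/snap_demo/copy
printf 'line1\nline2\n' > /tmp/snap_demo/src/part-00000

# Snapshot the inputs before computing anything from them.
cp /tmp/snap_demo/src/* /tmp/snap_demo/copy/

# The copy is now frozen: a concurrent writer touching src/ can no
# longer change the answer you compute from copy/.
wc -l < /tmp/snap_demo/copy/part-00000
```

If counts over the copy are stable while counts over the original drift, a concurrent writer is the likely culprit.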
What do you see if you print "rd1.count()" multiple times? Have you tried the experiment on a smaller set of files? I don't know why a file would cause this problem, but maybe you can find it that way. Are you using a wildcard ("s3n://blabla/*.txt") or direct filenames? Maybe S3 forgets about the existence of some files some of the time. There could be a limit on the number of files returned by a directory query, and if the order is not fixed, different files would get cut off each time. (Sorry about the wild, uneducated guesses.)

On Thu, Jun 19, 2014 at 5:54 PM, mrm <ma...@skimlinks.com> wrote:
> Hi,
>
> I have had this issue for some time already, where I get different answers
> when I run the same line of code twice. I have run some experiments to see
> what is happening, please help me! Here is the code and the answers that I
> get. I suspect I have a problem when reading large datasets from S3.
>
> rd1 = sc.textFile('s3n://blabla')
> *rd1.persist()*
> rd2 = rd_imp.filter(lambda x: filter1(x)).map(lambda x: map1(x))
>
> Note: both filter1() and map1() are deterministic
>
> rd2.count() ==> 294928559
> rd2.count() ==> 294928559
>
> So far so good, I get the same counts. Now when I unpersist rd1, that's
> when I start getting problems!
>
> *rd1.unpersist()*
> rd2 = rd_imp.filter(lambda x: filter1(x)).map(lambda x: map1(x))
> rd2.count() ==> 294928559
> rd2.count() ==> 294509501
> rd2.count() ==> 294679795
> ...
>
> I would appreciate it if you could help me!
>
> Thanks,
> Maria
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Getting-different-answers-running-same-line-of-code-tp7920.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
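Since the counts are stable while rd1 is persisted and drift once it is unpersisted, the variance plausibly comes from re-reading the source each time. Here is a toy, Spark-free simulation of the truncated-listing guess above (all names are hypothetical): materializing one snapshot pins the count, while re-running a flaky listing does not.

```python
import random

# Toy stand-in for a storage listing that sometimes drops files,
# mimicking an inconsistent or truncated directory query.
ALL_FILES = [f"part-{i:05d}" for i in range(100)]

def flaky_listing(rng):
    """Return the full file list, but occasionally cut it short."""
    if rng.random() < 0.5:
        return ALL_FILES[: rng.randrange(90, 100)]  # some files lost
    return ALL_FILES

rng = random.Random(42)

# Without persisting: every count re-runs the listing, so results drift.
uncached_counts = {len(flaky_listing(rng)) for _ in range(20)}

# With persisting: the listing is materialized once and reused.
snapshot = flaky_listing(rng)
cached_counts = {len(snapshot) for _ in range(20)}

print(len(uncached_counts) > 1)  # → True: counts vary across runs
print(len(cached_counts) == 1)   # → True: the snapshot gives one answer
```

This matches the symptom exactly: persist() effectively freezes one read of the data, so the flakiness only shows up after unpersist() forces Spark back to the source.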