See https://issues.apache.org/jira/browse/SPARK-2579
It also was mentioned on the mailing list a while ago, and I have heard tell of this from customers. I am trying to get to the bottom of it too. What version are you using, to start? I am wondering if it was fixed in 1.0.x, since I was not able to reproduce it in my example.

On Fri, Aug 1, 2014 at 12:37 AM, nit <nitinp...@gmail.com> wrote:
> *First Question:*
>
> On Amazon S3 I have a directory with 1024 files, where each file's size is
> ~9 MB, and each line in a file has two entries separated by '\t'.
>
> Here is my program, which calculates the total number of distinct entries
> in the dataset:
>
> --
> val inputId = sc.textFile(inputPath, noParts).flatMap { line =>
>   val lineArray = line.split("\\t")
>   Iterator(lineArray(0).toLong, lineArray(1).toLong)
> }.distinct(noParts)
> println("######input-cnt = %s".format(inputId.count))
> --
> where inputPath =
> "s3n://my-AWS_ACCESS_KEY_ID:myAWS_ACCESS_KEY_SECRET@bucket-id/directory"
>
> When I run this program multiple times on EC2, "input-cnt" is not
> consistent across runs. FYI, I uploaded the data to S3 two days ago, so I
> assume by now the data is properly replicated (eventual consistency).
>
> *Is this a known issue with S3? What is the solution?*
>
> Note: When I ran the same experiment on my YARN cluster, where inputPath
> is an HDFS path, I got the results as expected.
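For what it's worth, the parse-and-count logic itself is easy to sanity-check locally, independent of S3 or Spark. Here is a minimal sketch of the same transformation on a plain Scala collection (the sample lines are made up for illustration); if this gives a stable count while the cluster job does not, the inconsistency is on the input side (S3 listing/reads), not in the counting logic:

```scala
// Same logic as the Spark job, but on a local Seq instead of an RDD:
// split each tab-separated line, collect both columns as Longs, count distinct.
val lines = Seq("1\t2", "2\t3", "1\t3")  // hypothetical sample data

val ids = lines.flatMap { line =>
  val lineArray = line.split("\\t")     // "\\t" is the regex for a tab
  Seq(lineArray(0).toLong, lineArray(1).toLong)
}.distinct

println(s"######input-cnt = ${ids.size}")  // 3 distinct ids: 1, 2, 3
```

A deterministic local run like this rules out the flatMap/distinct pipeline as the source of the varying counts.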