[ https://issues.apache.org/jira/browse/SPARK-11008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955139#comment-14955139 ]
Prasad Chalasani commented on SPARK-11008:
------------------------------------------

Yes, that's exactly what I was seeing. You're seeing it with HDFS too, so I don't think S3 eventual consistency has anything to do with it. I'm starting to wonder if this could be a Spark issue after all.

> Spark window function returns inconsistent/wrong results
> --------------------------------------------------------
>
>                 Key: SPARK-11008
>                 URL: https://issues.apache.org/jira/browse/SPARK-11008
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 1.4.0, 1.5.0
>         Environment: Amazon Linux AMI (Amazon Linux version 2015.09)
>            Reporter: Prasad Chalasani
>            Priority: Minor
>
> Summary: applying a window function to a DataFrame, followed by count(),
> gives widely varying results in repeated runs: none exceed the correct value,
> but of course all but one are wrong. On large data sets I sometimes get as
> little as HALF of the correct value.
>
> A minimal reproducible example:
>
> (1) start spark-shell
> (2) run these:
>
> val data = 1.to(100).map(x => (x, 1))
> import sqlContext.implicits._
> val tbl = sc.parallelize(data).toDF("id", "time")
> tbl.write.parquet("s3n://path/to/mybucket/id-time-tiny.pqt")
>
> (3) exit the shell (this is important)
> (4) start spark-shell again
> (5) run these:
>
> import org.apache.spark.sql.expressions.Window
> val df = sqlContext.read.parquet("s3n://path/to/mybucket/id-time-tiny.pqt")
> val win = Window.partitionBy("id").orderBy("time")
> df.select($"id", rank().over(win).alias("rnk")).filter("rnk = 1").select("id").count()
>
> I get 98, but the correct result is 100.
>
> If I re-run the code in step 5 in the same shell, the result gets "fixed"
> and I always get 100.
>
> Note this is only a minimal example to reproduce the error. In my real
> application the data is much larger and the window function is not trivial
> as above (i.e. there are multiple "time" values per "id", etc.), and I see
> results sometimes as small as HALF of the correct value (e.g. 120,000 while
> the correct value is 250,000). So this is a serious problem.
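For reference, the invariant the reported count violates can be sketched in plain Python, without Spark. The helper `count_rank_one` below is purely illustrative (it is not part of the reproduction above): it counts rows whose `rank()` over `(partition by id order by time)` equals 1, which for the 100-row example must be exactly 100, one per distinct id.

```python
from collections import defaultdict

def count_rank_one(rows):
    """Count rows whose rank() over (partition by id order by time) is 1.

    rank() assigns 1 to every row tied for the smallest time in its
    partition, so the result can never be below the number of distinct
    ids -- any smaller count (like the 98 reported) means rows were lost.
    """
    by_id = defaultdict(list)
    for id_, time in rows:
        by_id[id_].append(time)
    # One rank-1 row per tie for the minimum time in each partition.
    return sum(times.count(min(times)) for times in by_id.values())

# Same shape as the example data: 100 (id, time) pairs, one per id.
data = [(x, 1) for x in range(1, 101)]
print(count_rank_one(data))  # prints 100, the number of distinct ids
```

This is why the reporter calls any value below 100 wrong: the windowed filter `rnk = 1` can only shrink below the distinct-id count if the underlying scan silently drops rows.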