[ https://issues.apache.org/jira/browse/SPARK-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14976317#comment-14976317 ]
Yuval Tanny commented on SPARK-11303:
-------------------------------------

Is the fix going to be merged into 1.5 (and 1.5.2)? Thanks

> sample (without replacement) + filter returns wrong results in DataFrame
> ------------------------------------------------------------------------
>
>                 Key: SPARK-11303
>                 URL: https://issues.apache.org/jira/browse/SPARK-11303
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.1
>         Environment: pyspark local mode, linux.
>            Reporter: Yuval Tanny
>             Fix For: 1.6.0
>
> When sampling and then filtering a DataFrame from Python, we get inconsistent
> results when the sampled DataFrame is not cached. This bug does not appear in
> Spark 1.4.1.
> {code}
> d = sqlContext.createDataFrame(sc.parallelize([[1]] * 50 + [[2]] * 50), ['t'])
> d_sampled = d.sample(False, 0.1, 1)
> print d_sampled.count()
> print d_sampled.filter('t = 1').count()
> print d_sampled.filter('t != 1').count()
> d_sampled.cache()
> print d_sampled.count()
> print d_sampled.filter('t = 1').count()
> print d_sampled.filter('t != 1').count()
> {code}
> output:
> {code}
> 14
> 7
> 8
> 14
> 7
> 7
> {code}
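The symptom in the reproduction (7 + 8 != 14 before caching, consistent after) is what you would expect if the lazy sample were re-evaluated with different effective random state on each action. The sketch below is a plain-Python analogy of that behavior, not Spark code: the `sample` helper and its `seed_offset` parameter are hypothetical stand-ins for a lazily re-executed sampler, used only to show why materializing the sample once (the analog of `.cache()`) restores consistency.

```python
import random

# Toy dataset mirroring the reproduction: fifty 1s and fifty 2s.
data = [1] * 50 + [2] * 50

def sample(seed_offset):
    # Hypothetical lazy sampler: each "action" re-runs the sampling.
    # Bug analog: the effective RNG state can differ between evaluations,
    # so each action may see a different sampled subset.
    rng = random.Random(1 + seed_offset)
    return [x for x in data if rng.random() < 0.1]

# Three separate "actions", each re-sampling with different effective state:
total = len(sample(0))
ones = sum(1 for x in sample(1) if x == 1)
twos = sum(1 for x in sample(2) if x != 1)
# ones + twos is not guaranteed to equal total, matching the 7 + 8 != 14 output.

# Workaround analog of d_sampled.cache(): materialize once, then filter.
cached = sample(0)
cached_total = len(cached)
cached_ones = sum(1 for x in cached if x == 1)
cached_twos = sum(1 for x in cached if x != 1)
# Filtering a single materialized sample always partitions it consistently.
assert cached_ones + cached_twos == cached_total
```

In real PySpark the same idea applies: calling `d_sampled.cache()` (and triggering an action) pins down one sampled result, so subsequent `filter(...).count()` calls all see the same rows.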