[ https://issues.apache.org/jira/browse/SPARK-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14973993#comment-14973993 ]
Yanbo Liang edited comment on SPARK-11303 at 10/26/15 10:29 AM:
----------------------------------------------------------------

It looks like this bug is caused by a mutable row copy problem similar to SPARK-4963. However, adding *copy* to *sample* still does not resolve the issue.
I found that *map(_.copy())* was removed by https://github.com/apache/spark/pull/8040/files. [~rxin] Could you tell us the motivation for removing *map(_.copy())* for withReplacement = false in that PR?

was (Author: yanboliang):
It looks like this bug is caused by a mutable row copy problem similar to SPARK-4963. However, adding *copy* to *sample* still does not resolve the issue.
I found that *map(_.copy())* was removed by https://github.com/apache/spark/pull/8040/files. [~rxin] Could you tell us the motivation for removing *map(_.copy())* in that PR?

> sample (without replacement) + filter returns wrong results in DataFrame
> ------------------------------------------------------------------------
>
>                 Key: SPARK-11303
>                 URL: https://issues.apache.org/jira/browse/SPARK-11303
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.1
>         Environment: pyspark local mode, linux.
>            Reporter: Yuval Tanny
>
> When sampling and then filtering a DataFrame from Python, we get inconsistent
> results when the sampled DataFrame is not cached. This bug doesn't appear in
> Spark 1.4.1.
> d = sqlContext.createDataFrame(sc.parallelize([[1]] * 50 + [[2]] * 50),['t'])
> d_sampled = d.sample(False, 0.1, 1)
> print d_sampled.count()
> print d_sampled.filter('t = 1').count()
> print d_sampled.filter('t != 1').count()
> d_sampled.cache()
> print d_sampled.count()
> print d_sampled.filter('t = 1').count()
> print d_sampled.filter('t != 1').count()
> output:
> 14
> 7
> 8
> 14
> 7
> 7
> Thanks!
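
For readers unfamiliar with the "mutable row copy" class of problem mentioned in the comment, here is a minimal sketch in plain Python. It is an illustration only and not Spark's implementation: `MutableRow` and `scan` are hypothetical names invented for this example, and the comment above notes that adding a copy to sample did not by itself resolve SPARK-11303. The sketch shows how buffering references to a reused row object yields wrong values unless each row is copied first.

```python
class MutableRow:
    """Hypothetical stand-in for a reusable row buffer (not a Spark class)."""

    def __init__(self):
        self.value = None

    def copy(self):
        # Return an independent snapshot of the current contents.
        fresh = MutableRow()
        fresh.value = self.value
        return fresh


def scan(values):
    """Yield the SAME MutableRow object for every value, overwriting it in place."""
    row = MutableRow()
    for v in values:
        row.value = v
        yield row


data = [1, 1, 2, 2]

# Buffering without copying: every list entry aliases the one shared row,
# so all entries show whatever value was written last.
aliased_values = [r.value for r in list(scan(data))]

# Copying each row as it is produced preserves the value seen at that moment.
copied_values = [r.value for r in [row.copy() for row in scan(data)]]

print(aliased_values)
print(copied_values)
```

Any operator that holds on to rows (buffering, sorting, caching) needs such a copy when its upstream reuses a single row object, while purely streaming operators can skip the copy to save allocations.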