Dan Brown created SPARK-10939:
---------------------------------

             Summary: Misaligned data with RDD.zip after repartition
                 Key: SPARK-10939
                 URL: https://issues.apache.org/jira/browse/SPARK-10939
             Project: Spark
          Issue Type: Bug
    Affects Versions: 1.5.0, 1.4.1, 1.3.0
         Environment: - OSX 10.10.4, java 1.7.0_51, hadoop 2.6.0-cdh5.4.5
- Ubuntu 12.04, java 1.7.0_80, hadoop 2.6.0-cdh5.4.5
            Reporter: Dan Brown


Split out from https://issues.apache.org/jira/browse/SPARK-10685:

Here's a weird behavior where {{RDD.zip}} after a {{repartition}} produces 
"misaligned" data: column values in the same row don't match, as if the two 
collections had been shuffled independently before being zipped. It's 
difficult to reproduce because it's nondeterministic, doesn't occur in local 
mode, and requires ≥2 workers (≥3 in one case). I was able to repro it using 
pyspark 1.3.0 (cdh5.4.5), 1.4.1 (bin-without-hadoop), and 1.5.0 
(bin-without-hadoop).
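
To make "misaligned" concrete, here is a minimal plain-Python sketch of the symptom only. It is not a claim about what Spark does internally: the {{Pair}} type, the helper names, and the rotation by 3 are all hypothetical stand-ins chosen for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.sql.Row so the sketch runs without Spark.
Pair = namedtuple("Pair", ["a", "b"])

def zip_columns(left, right):
    """Pair elements positionally, which is what a zip is expected to do."""
    return [Pair(a=x, b=y) for x, y in zip(left, right)]

def misaligned(rows):
    """Return the rows whose two columns disagree."""
    return [r for r in rows if r.a != r.b]

values = list(range(10))

# Same order on both sides: every row lines up.
aligned = zip_columns(values, values)

# Assumed reordering of one side (a rotation by 3) before zipping.
rotated = values[3:] + values[:3]
broken = zip_columns(values, rotated)

print(misaligned(aligned))     # no mismatches
print(misaligned(broken)[:3])  # first few mismatched rows
```

The check in {{misaligned}} is the same {{r.a != r.b}} filter used in the repro.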

Also, the {{DataFrame.zip}} issue linked below is related in spirit, since we 
were trying to build {{DataFrame.zip}} ourselves when we ran into this 
problem. Let me put in my vote for reopening that issue and supporting 
{{DataFrame.zip}} in the standard lib.

- https://issues.apache.org/jira/browse/SPARK-7460

h3. Repro

Fail: RDD.zip after repartition
{code}
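from pyspark.sql import Row  # sqlCtx is the SQLContext from the pyspark shell

df  = sqlCtx.createDataFrame(Row(a=a) for a in xrange(10000))
df  = df.repartition(100)
# Zip the RDD with a mapped copy of itself; a and b should be equal in every row
rdd = df.rdd.zip(df.map(lambda r: Row(b=r.a))).map(lambda (x,y): Row(a=x.a, b=y.b))
[r for r in rdd.collect() if r.a != r.b][:3] # Should be []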
{code}

Sample outputs (nondeterministic):
{code}
[]
[Row(a=50, b=6947), Row(a=150, b=7047), Row(a=250, b=7147)]
[]
[]
[Row(a=44, b=644), Row(a=144, b=744), Row(a=244, b=844)]
[]
{code}
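
One thing visible in the sample outputs: within each failing run, the first three mismatched rows share a constant offset between {{b}} and {{a}}. A small sketch that computes it from the rows printed above, using a namedtuple as a stand-in for {{Row}} (an assumption so it runs without Spark). I'm not drawing a conclusion about the cause from this.

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row so the sketch runs without Spark.
Row = namedtuple("Row", ["a", "b"])

# Mismatched rows copied from the two failing sample outputs.
failing_runs = [
    [Row(a=50, b=6947), Row(a=150, b=7047), Row(a=250, b=7147)],
    [Row(a=44, b=644), Row(a=144, b=744), Row(a=244, b=844)],
]

# Offset between the two columns for each mismatched row, per run.
offsets = [[r.b - r.a for r in run] for run in failing_runs]
print(offsets)
```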

Test setup:

- local\[8]: {{MASTER=local\[8]}}
- dist\[N]: 1 driver + 1 master + N workers

{code}
"Fail" tests pass?  cluster mode  spark version
----------------------------------------------------
yes                 local[8]      1.3.0-cdh5.4.5
no                  dist[4]       1.3.0-cdh5.4.5
yes                 local[8]      1.4.1
yes                 dist[1]       1.4.1
no                  dist[2]       1.4.1
no                  dist[4]       1.4.1
yes                 local[8]      1.5.0
yes                 dist[1]       1.5.0
no                  dist[2]       1.5.0
no                  dist[4]       1.5.0
{code}
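
The matrix can also be checked mechanically against the "≥2 workers" observation. A minimal sketch with the rows transcribed from the table; the tuple layout and the {{parse_mode}} helper are my own, not part of any Spark API.

```python
import re

# Test matrix transcribed from the table: (passed, cluster mode, Spark version).
results = [
    (True,  "local[8]", "1.3.0-cdh5.4.5"),
    (False, "dist[4]",  "1.3.0-cdh5.4.5"),
    (True,  "local[8]", "1.4.1"),
    (True,  "dist[1]",  "1.4.1"),
    (False, "dist[2]",  "1.4.1"),
    (False, "dist[4]",  "1.4.1"),
    (True,  "local[8]", "1.5.0"),
    (True,  "dist[1]",  "1.5.0"),
    (False, "dist[2]",  "1.5.0"),
    (False, "dist[4]",  "1.5.0"),
]

def parse_mode(mode):
    """Split a mode string such as 'dist[4]' into ('dist', 4)."""
    kind, n = re.match(r"(\w+)\[(\d+)\]", mode).groups()
    return kind, int(n)

# Every failing configuration, with its mode parsed.
failures = [(parse_mode(mode), version) for ok, mode, version in results if not ok]

# True if every failure happened on a distributed cluster with at least 2 workers.
fails_need_two_workers = all(kind == "dist" and n >= 2 for (kind, n), _ in failures)
print(fails_need_two_workers)
```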



