Re: groupBy gives non deterministic results

2014-09-10 Thread redocpot
Hi, 

I am using Spark 1.0.0. The bug is fixed in 1.0.1.

Hao



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/groupBy-gives-non-deterministic-results-tp13698p13864.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: groupBy gives non deterministic results

2014-09-10 Thread Ye Xianjin
Great. And you should ask questions on the user@spark.apache.org mailing list. I
believe many people don't subscribe to the incubator mailing list now.

-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)






Re: groupBy gives non deterministic results

2014-09-10 Thread redocpot
Ah, thank you. I did not notice that.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/groupBy-gives-non-deterministic-results-tp13698p13871.html



Re: groupBy gives non deterministic results

2014-09-10 Thread Ye Xianjin
|  Do the two mailing lists share messages ?
I don't think so. I didn't receive this message from the user list. I am not 
at Databricks, so I can't answer your other questions. Maybe Davies Liu 
dav...@databricks.com can answer you?

-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Wednesday, September 10, 2014 at 9:05 PM, redocpot wrote:

 Hi, Xianjin
 
 I checked user@spark.apache.org, and found my post there:
 http://mail-archives.apache.org/mod_mbox/spark-user/201409.mbox/browser
 
 I am using nabble to send this mail, which indicates that the mail will be
 sent from my email address to the u...@spark.incubator.apache.org mailing list.
 
 Do the two mailing lists share messages?
 
 Do we have a nabble interface for the user@spark.apache.org mailing list?
 
 Thank you.
 
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/groupBy-gives-non-deterministic-results-tp13698p13876.html




Re: groupBy gives non deterministic results

2014-09-10 Thread Davies Liu
I think the mails to spark.incubator.apache.org will be forwarded to
spark.apache.org.

Here is the header of the first mail:

from: redocpot julien19890...@gmail.com
to: u...@spark.incubator.apache.org
date: Mon, Sep 8, 2014 at 7:29 AM
subject: groupBy gives non deterministic results
mailing list: user.spark.apache.org
mailed-by: spark.apache.org

I only subscribe to spark.apache.org, and I do see all the mails from him.




Re: groupBy gives non deterministic results

2014-09-10 Thread Ye Xianjin
Well, that's weird. I don't see this thread in my mailbox as being sent to the 
user list. Maybe that's because I also subscribe to the incubator mailing list? 
I do see mails sent to the incubator list that no one replies to. I thought it 
was because people don't subscribe to the incubator list now.

-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)






Re: groupBy gives non deterministic results

2014-09-09 Thread Davies Liu
What's the type of the key?

If the hash of the key differs across slaves, then you could get these confusing
results. We have seen similar results in Python, because the hash of None
is different across machines.

Davies
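
The effect Davies describes can be reproduced outside Spark with Python's
per-process string-hash randomization (a minimal sketch, not PySpark's actual
code; the helper name is made up). If each "worker" process computes
hash(key) % numPartitions with a different hash seed, the workers disagree on
which partition a key belongs to; pinning PYTHONHASHSEED restores agreement:

```python
import subprocess
import sys

def partition_in_new_process(key, num_partitions=8, seed=None):
    # Compute hash(key) % num_partitions in a fresh interpreter, optionally
    # pinning PYTHONHASHSEED (hypothetical helper, for illustration only).
    env = {} if seed is None else {"PYTHONHASHSEED": seed}
    out = subprocess.run(
        [sys.executable, "-c", "print(hash(%r) %% %d)" % (key, num_partitions)],
        env=env, capture_output=True, text=True, check=True)
    return int(out.stdout)

# With the seed pinned, every process agrees on the key's partition:
pinned = {partition_in_new_process("some_key", seed="0") for _ in range(5)}
print(len(pinned))  # 1: all runs place the key in the same partition

# Without pinning, str hashes are randomized per process (Python >= 3.3), so
# different runs will usually place the same key in different partitions.
unpinned = {partition_in_new_process("some_key") for _ in range(5)}
```

The same reasoning applies to any key type whose hash is not stable across
worker processes or machines: partitioning, and therefore groupBy results,
stops being deterministic.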




Re: groupBy gives non deterministic results

2014-09-09 Thread Ye Xianjin
Can you provide a small sample or test data that reproduces this problem? And 
what's your environment setup? Single node or cluster?

Sent from my iPhone




Re: groupBy gives non deterministic results

2014-09-09 Thread redocpot
Thank you for your replies.

More details here:

The prog is executed on local mode (single node). Default env params are
used.

The test code and the result are in this gist:
https://gist.github.com/coderh/0147467f0b185462048c

Here are the first 10 lines of the data: 3 fields per row, the delimiter is ";"

3801959;11775022;118
3801960;14543202;118
3801984;11781380;20
3801984;13255417;20
3802003;11777557;91
3802055;11781159;26
3802076;11782793;102
3802086;17881551;102
3802087;19064728;99
3802105;12760994;99
...

There are 27 partitions (small files). Total size is about 100 MB.

We find that this problem is most probably caused by bug SPARK-2043:
https://issues.apache.org/jira/browse/SPARK-2043

Could someone give more details on this bug ?

The pull request says:

The current implementation reads one key with the next hash code as it
finishes reading the keys with the current hash code, which may cause it to
miss some matches of the next key. This can cause operations like join to
give the wrong result when reduce tasks spill to disk and there are hash
collisions, as values won't be matched together. This PR fixes it by not
reading in that next key, using a peeking iterator instead.

I don't understand why reading a key with the next hash code would cause it
to miss some matches of the next key. If someone could show me some code to
dig into, it would be highly appreciated. =)
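
For intuition, here is a toy Python model of the spill-merge the PR describes
(hypothetical code, not Spark's real ExternalAppendOnlyMap): each spilled
stream yields (hashCode, key) pairs sorted by hash code, and the reader
gathers one hash-code group at a time. Detecting the end of a group by simply
calling next() consumes the first pair of the following group, so that key's
values are never matched:

```python
def merge_groups(pairs, peek):
    # Collect groups of keys sharing a hash code from a stream sorted by
    # hash code. With peek=False, the pair that terminates each group is
    # consumed and dropped (the pre-fix behaviour); with peek=True it is
    # pushed back and starts the next group (the SPARK-2043 fix).
    it = iter(pairs)
    groups = {}
    pushed_back = None
    while True:
        pair = pushed_back if pushed_back is not None else next(it, None)
        pushed_back = None
        if pair is None:
            return groups
        h, group = pair[0], [pair[1]]
        for nxt in it:
            if nxt[0] != h:
                if peek:
                    pushed_back = nxt  # fixed: keep it for the next group
                break                  # buggy: nxt is silently lost here
            group.append(nxt[1])
        groups[h] = group

stream = [(1, "a"), (1, "b"), (2, "c"), (2, "d"), (3, "e")]
print(merge_groups(stream, peek=True))   # all three groups complete
print(merge_groups(stream, peek=False))  # "c" and "e" are dropped
```

When reduce tasks spill and hash codes collide across files, each dropped pair
means some values of a key are never co-grouped, which is how a join or
groupBy can return slightly different counts on each run.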

Thank you.

Hao.

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/groupBy-gives-non-deterministic-results-tp13698p13797.html



Re: groupBy gives non deterministic results

2014-09-09 Thread Davies Liu
Which version of Spark are you using?

This bug has been fixed in 0.9.2, 1.0.2, and 1.1; could you upgrade to one of
these versions to verify it?

Davies




groupBy gives non deterministic results

2014-09-08 Thread redocpot
Hi,

I have a key-value RDD (called rdd below). After a groupBy, I tried to count
rows, but the result is not unique, somehow non-deterministic.

Here is the test code:

  val step1 = ligneReceipt_cleTable.persist
  val step2 = step1.groupByKey
  
  val s1size = step1.count
  val s2size = step2.count

  val t = step2 // rdd after groupBy

  val t1 = t.count
  val t2 = t.count
  val t3 = t.count
  val t4 = t.count
  val t5 = t.count
  val t6 = t.count
  val t7 = t.count
  val t8 = t.count

  println("s1size = " + s1size)
  println("s2size = " + s2size)
  println("1 = " + t1)
  println("2 = " + t2)
  println("3 = " + t3)
  println("4 = " + t4)
  println("5 = " + t5)
  println("6 = " + t6)
  println("7 = " + t7)
  println("8 = " + t8)

Here are the results:

s1size = 5338864
s2size = 5268001
1 = 5268002
2 = 5268001
3 = 5268001
4 = 5268002
5 = 5268001
6 = 5268002
7 = 5268002
8 = 5268001

Even if the difference is just one row, that's annoying.

Any idea?

Thank you.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/groupBy-gives-non-deterministic-results-tp13698.html



Re: groupBy gives non deterministic results

2014-09-08 Thread redocpot
Update:

Just tested with HashPartitioner(8), counting the rows in each partition:

List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394),
*(5,657591*), (*6,658327*), (*7,658434*)), 
List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394),
*(5,657594)*, (6,658326), (*7,658434*)), 
List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394),
*(5,657592)*, (6,658326), (*7,658435*)), 
List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394),
*(5,657591)*, (6,658326), (7,658434)), 
List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394),
*(5,657592)*, (6,658326), (7,658435)), 
List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394),
*(5,657592)*, (6,658326), (7,658435)), 
List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394),
*(5,657592)*, (6,658326), (7,658435)), 
List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394),
*(5,657591)*, (6,658326), (7,658435))

The result is not identical for each execution.
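
The shifting slots can be reproduced with a pure-Python model of hash
partitioning (a sketch under the assumption partition = hash(key) mod
numPartitions; this is not Spark code). If even one key's hash differs between
two executions, one partition's count goes down and another's goes up, exactly
the off-by-a-few pattern above:

```python
from collections import Counter

def partition_counts(keys, hash_fn, num_partitions=8):
    # Count how many keys land in each partition under hash partitioning.
    return Counter(hash_fn(k) % num_partitions for k in keys)

keys = list(range(100))

# A consistent hash function gives identical counts on every run:
print(partition_counts(keys, hash) == partition_counts(keys, hash))  # True

# Simulate one key hashing differently on one execution (as hash(None) once
# did across machines in Python): two partitions' counts shift by one.
inconsistent = lambda k: 7 if k == 42 else hash(k)
print(partition_counts(keys, hash) == partition_counts(keys, inconsistent))  # False
```

The totals still match between runs; only the per-partition split moves, which
is why the overall count flips by a row or two rather than by a large amount.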



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/groupBy-gives-non-deterministic-results-tp13698p13702.html