Re: groupBy gives non deterministic results
Hi,

I am using Spark 1.0.0. The bug is fixed in 1.0.1.

Hao

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/groupBy-gives-non-deterministic-results-tp13698p13864.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: groupBy gives non deterministic results
Great. You should ask questions on the user@spark.apache.org mailing list; I believe many people no longer subscribe to the incubator list.

--
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)

On Wednesday, September 10, 2014 at 6:03 PM, redocpot wrote:

| Hi, I am using Spark 1.0.0. The bug is fixed in 1.0.1. Hao
Re: groupBy gives non deterministic results
Ah, thank you. I did not notice that.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/groupBy-gives-non-deterministic-results-tp13698p13871.html
Re: groupBy gives non deterministic results
| Do the two mailing lists share messages ?

I don't think so; I didn't receive this message from the user list. I am not at Databricks, so I can't answer your other questions. Maybe Davies Liu (dav...@databricks.com) can answer you?

--
Ye Xianjin

On Wednesday, September 10, 2014 at 9:05 PM, redocpot wrote:

Hi, Xianjin

I checked user@spark.apache.org and found my post there: http://mail-archives.apache.org/mod_mbox/spark-user/201409.mbox/browser

I am using Nabble to send this mail, which indicates that the mail will be sent from my email address to the u...@spark.incubator.apache.org mailing list.

Do the two mailing lists share messages? Do we have a Nabble interface for the user@spark.apache.org mailing list?

Thank you.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/groupBy-gives-non-deterministic-results-tp13698p13876.html
Re: groupBy gives non deterministic results
I think mails to spark.incubator.apache.org are forwarded to spark.apache.org. Here is the header of the first mail:

from: redocpot julien19890...@gmail.com
to: u...@spark.incubator.apache.org
date: Mon, Sep 8, 2014 at 7:29 AM
subject: groupBy gives non deterministic results
mailing list: user.spark.apache.org
mailed-by: spark.apache.org

I only subscribe to spark.apache.org, and I do see all the mails from him.

On Wed, Sep 10, 2014 at 6:29 AM, Ye Xianjin advance...@gmail.com wrote:

| I don't think so; I didn't receive this message from the user list. I am not at Databricks, so I can't answer your other questions. Maybe Davies Liu (dav...@databricks.com) can answer you?
Re: groupBy gives non deterministic results
Well, that's weird. I don't see this thread in my mailbox as being sent to the user list. Maybe because I also subscribe to the incubator mailing list? I do see mails sent to the incubator list that no one replies to; I thought it was because people don't subscribe to the incubator list any more.

--
Ye Xianjin

On Thursday, September 11, 2014 at 12:12 AM, Davies Liu wrote:

| I think mails to spark.incubator.apache.org are forwarded to spark.apache.org. I only subscribe to spark.apache.org, and I do see all the mails from him.
Re: groupBy gives non deterministic results
What's the type of the key? If the hash of a key differs across slaves, you can get confusing results like this. We hit a similar issue in Python, because the hash of None differs across machines.

Davies

On Mon, Sep 8, 2014 at 8:16 AM, redocpot julien19890...@gmail.com wrote:

| Update: just tested with HashPartitioner(8) and counted each partition; the per-partition counts are not identical for each execution.
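Davies' point can be sketched outside of Spark (a minimal illustration, not PySpark's actual partitioner code; `hash_partition` is a made-up helper): a hash partitioner routes each record to partition `hash(key) % numPartitions`, so a key whose hash is not identical in every worker process can land on different partitions in different runs.

```python
# Minimal sketch of hash partitioning (not PySpark's real implementation;
# hash_partition is a hypothetical helper for illustration). A record goes
# to partition hash(key) % num_partitions, so any key whose hash is not
# stable across worker processes -- e.g. None on older CPython builds,
# where the default hash is derived from the object's address, or str keys
# under PYTHONHASHSEED hash randomization -- can be routed inconsistently.

def hash_partition(key, num_partitions):
    # Python's % already yields a non-negative result for a positive
    # modulus, so no extra non-negative adjustment is needed here.
    return hash(key) % num_partitions

# Integer keys hash to themselves in CPython, so their routing is stable:
keys = [3801959, 3801960, 3801984, 3802003]
parts = [hash_partition(k, 8) for k in keys]
```

When routing is unstable, the same key's values can be split across two partitions, and a subsequent groupByKey then produces one extra (or one fewer) group, which matches the one-row drift in the counts above.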
Re: groupBy gives non deterministic results
Can you provide a small sample or test data that reproduces this problem? And what's your env setup, single node or cluster?

Sent from my iPhone

On September 8, 2014, at 22:29, redocpot julien19890...@gmail.com wrote:

| Hi, I have a key-value RDD. After a groupBy, I tried to count rows, but the result is not unique, somehow non-deterministic: counting the grouped RDD eight times gave 5268001 on some runs and 5268002 on others. Even if the difference is just one row, that's annoying. Any idea? Thank you.
Re: groupBy gives non deterministic results
Thank you for your replies. More details here:

The program is executed in local mode (single node), with default env params.

The test code and the result are in this gist: https://gist.github.com/coderh/0147467f0b185462048c

Here are the first 10 lines of the data (3 fields per row, the delimiter is ";"):

3801959;11775022;118
3801960;14543202;118
3801984;11781380;20
3801984;13255417;20
3802003;11777557;91
3802055;11781159;26
3802076;11782793;102
3802086;17881551;102
3802087;19064728;99
3802105;12760994;99
...

There are 27 partitions (small files); total size is about 100 MB.

We find that this problem is most probably caused by the bug SPARK-2043: https://issues.apache.org/jira/browse/SPARK-2043

Could someone give more details on this bug? The pull request says:

| The current implementation reads one key with the next hash code as it finishes reading the keys with the current hash code, which may cause it to miss some matches of the next key. This can cause operations like join to give the wrong result when reduce tasks spill to disk and there are hash collisions, as values won't be matched together. This PR fixes it by not reading in that next key, using a peeking iterator instead.

I don't understand why reading a key with the next hash code would cause it to miss some matches of the next key. If someone could show me some code to dig in, it would be highly appreciated. =)

Thank you.

Hao.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/groupBy-gives-non-deterministic-results-tp13698p13797.html
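The failure mode the PR describes can be illustrated with a deliberately simplified sketch. This is not Spark's actual ExternalAppendOnlyMap code: `Key` and `merge_groups` are invented for the illustration, and where the real bug left a value unmatched with its key's other values, this toy version drops the entry outright.

```python
# Simplified sketch of the SPARK-2043 failure mode (NOT Spark's actual
# spill-merge code). Spilled entries arrive sorted by the hash of their
# key, and the merger consumes one hash group at a time. The buggy variant
# detects the end of the current group by reading the first entry of the
# NEXT group -- and then forgets it. The fix is peek-style lookahead.

class Key:
    """Key with a controllable hash, to force hash collisions."""
    def __init__(self, name, h):
        self.name, self._h = name, h
    def __hash__(self):
        return self._h
    def __eq__(self, other):
        return isinstance(other, Key) and self.name == other.name

def merge_groups(entries, keep_lookahead):
    """Group a hash-sorted stream of (key, value) pairs by key name.
    keep_lookahead=False reproduces the bug; True models the peeking fix."""
    it = iter(entries)
    result = {}
    pending = next(it, None)
    while pending is not None:
        h = hash(pending[0])
        group = [pending]
        pending = None
        for entry in it:
            if hash(entry[0]) != h:   # first entry of the next hash group
                if keep_lookahead:
                    pending = entry   # fixed: remember it for the next pass
                break                 # buggy: the entry is silently dropped
            group.append(entry)
        for k, v in group:
            result.setdefault(k.name, []).append(v)
    return result

# Keys "a" and "b" share hash 1; key "c" starts the next hash group.
entries = [(Key("a", 1), 1), (Key("b", 1), 2), (Key("c", 2), 3)]
```

With `keep_lookahead=False`, key "c" never makes it into the result, while `keep_lookahead=True` preserves all three groups; a spill-time hash collision hitting this path would explain counts that drift by a row or two between runs.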
Re: groupBy gives non deterministic results
Which version of Spark are you using? This bug was fixed in 0.9.2, 1.0.2 and 1.1; could you upgrade to one of these versions to verify it?

Davies

On Tue, Sep 9, 2014 at 7:03 AM, redocpot julien19890...@gmail.com wrote:

| Thank you for your replies. We find that this problem is most probably caused by the bug SPARK-2043: https://issues.apache.org/jira/browse/SPARK-2043
groupBy gives non deterministic results
Hi,

I have a key-value RDD. After a groupBy, I tried to count rows, but the result is not unique, somehow non-deterministic. Here is the test code:

    val step1 = ligneReceipt_cleTable.persist
    val step2 = step1.groupByKey

    val s1size = step1.count
    val s2size = step2.count

    val t = step2 // rdd after groupBy

    val t1 = t.count
    val t2 = t.count
    val t3 = t.count
    val t4 = t.count
    val t5 = t.count
    val t6 = t.count
    val t7 = t.count
    val t8 = t.count

    println("s1size = " + s1size)
    println("s2size = " + s2size)
    println("1 = " + t1)
    println("2 = " + t2)
    println("3 = " + t3)
    println("4 = " + t4)
    println("5 = " + t5)
    println("6 = " + t6)
    println("7 = " + t7)
    println("8 = " + t8)

Here are the results:

    s1size = 5338864
    s2size = 5268001
    1 = 5268002
    2 = 5268001
    3 = 5268001
    4 = 5268002
    5 = 5268001
    6 = 5268002
    7 = 5268002
    8 = 5268001

Even if the difference is just one row, that's annoying. Any idea?

Thank you.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/groupBy-gives-non-deterministic-results-tp13698.html
Re: groupBy gives non deterministic results
Update: just tested with HashPartitioner(8) and counted each partition. One List per execution; the *starred* entries are the ones that vary:

List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394), *(5,657591)*, *(6,658327)*, *(7,658434)*)
List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394), *(5,657594)*, (6,658326), *(7,658434)*)
List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394), *(5,657592)*, (6,658326), *(7,658435)*)
List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394), *(5,657591)*, (6,658326), (7,658434))
List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394), *(5,657592)*, (6,658326), (7,658435))
List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394), *(5,657592)*, (6,658326), (7,658435))
List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394), *(5,657592)*, (6,658326), (7,658435))
List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394), *(5,657591)*, (6,658326), (7,658435))

The result is not identical for each execution.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/groupBy-gives-non-deterministic-results-tp13698p13702.html
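As a sanity check on this experiment, here is a sketch (under assumptions, not Spark code; `partition_sizes` is an invented helper) of what per-partition counts look like when hashing is deterministic: with a stable hash function, the counts are a pure function of the key set, so any run-to-run variation points at the grouping/spill machinery rather than at the partitioner itself.

```python
# Sketch, not Spark code: per-partition counts under hash partitioning.
from collections import Counter

def partition_sizes(keys, num_partitions):
    # Count how many keys fall into each partition; this is a pure
    # function of the key set whenever hash() is stable.
    return Counter(hash(k) % num_partitions for k in keys)

# Integer keys have stable hashes in CPython, so repeated runs must agree:
keys = list(range(100000))
first = partition_sizes(keys, 8)
second = partition_sizes(keys, 8)
```

Here `first == second` always holds; in the thread's experiment, a stable partitioner combined with varying counts is what pointed toward SPARK-2043's spill-time hash-collision bug rather than the partitioner.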