I think Xuefeng Wu's suggestion is likely correct. This difference is more
likely explained by the compression library changing versions than sort vs
hash shuffle (which should not affect output size significantly). Others
have reported that switching to lz4 fixed their issue.
We should document
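
To test the codec theory, one could rerun the same job with only the compression codec changed and compare the shuffle write sizes. Below is a minimal sketch of a helper that builds the spark-submit `--conf` arguments for such a run. It assumes the property names `spark.io.compression.codec` and `spark.shuffle.manager` as given in the Spark configuration docs; the helper name `spark_conf_flags` and the chosen values are only illustrative, not something from this thread.

```python
def spark_conf_flags(settings):
    """Turn a dict of Spark properties into spark-submit --conf arguments."""
    flags = []
    for key in sorted(settings):  # sorted so the output order is stable
        flags += ["--conf", f"{key}={settings[key]}"]
    return flags


# Hypothetical settings for one comparison run (values are assumptions):
# switch the codec to lz4 and pin the shuffle manager so only one thing varies.
candidate = {
    "spark.io.compression.codec": "lz4",
    "spark.shuffle.manager": "hash",
}

print(" ".join(spark_conf_flags(candidate)))
```

The printed flags can be pasted after `spark-submit`. Changing one property per run makes it easier to tell whether the codec or the shuffle manager accounts for the size difference.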
:14 (GMT+09:00)
*Title* : Re: Shuffle write increases in spark 1.2
If you have a small reproduction for this issue, can you open a ticket at
https://issues.apache.org/jira/browse/SPARK ?
On December 29, 2014 at 7:10:02 PM, Kevin Jung (itsjb.j...@samsung.com)
wrote:
Hi all,
The size
I double-checked the 1.2 feature list and found that the new sort-based
shuffle manager has nothing to do with HashPartitioner. Sorry for the
misinformation.
On the other hand, this may explain the increase in shuffle spill as a side
effect of the new shuffle manager, let me revert
Same problem here: shuffle write increased from 10G to over 64G. Since I'm
running on Amazon EC2, this always causes the temporary folder to consume all
the disk space. Still looking for a solution.
BTW, the 64G shuffle write is encountered when shuffling a pairRDD with
HashPartitioner, so it's not
Hello,
as the original message never got accepted to the mailing list, I quote it
here in full:
Kevin Jung wrote
Hi all,
The size of shuffle write shown in the Spark web UI is much different when I
execute the same Spark job on the same input data (100GB) in both Spark 1.1
and Spark 1.2.
At the
Hello,
as the original message from Kevin Jung never got accepted to the
mailing list, I quote it here in full:
Kevin Jung wrote
Hi all,
The size of shuffle write shown in the Spark web UI is much different when I
execute the same Spark job on the same input data (100GB) in both Spark 1.1
and Spark
/browse/SPARK-5081
--- *Original Message* ---
*Sender* : Josh Rosen <rosenvi...@gmail.com>
*Date* : 2015-01-05 06:14 (GMT+09:00)
*Title* : Re: Shuffle write increases in spark 1.2
If you have a small reproduction for this issue, can you open a ticket at
https://issues.apache.org/jira
Sure, here is a ticket. https://issues.apache.org/jira/browse/SPARK-5081
--- Original Message ---
Sender : Josh Rosen <rosenvi...@gmail.com>
Date : 2015-01-05 06:14 (GMT+09:00)
Title : Re: Shuffle write increases in spark 1.2
If you have a small reproduction for this issue
Hi all,
The size of shuffle write shown in the Spark web UI is much different when I
execute the same Spark job on the same input data (100GB) in both Spark 1.1
and Spark 1.2.
At the same sortBy stage, the size of shuffle write is 39.7GB in Spark 1.1
but 91.0GB in Spark 1.2.
I set spark.shuffle.manager