[GitHub] spark issue #15297: [SPARK-9862]Handling data skew

2016-12-11 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue:

https://github.com/apache/spark/pull/15297
  
retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15297: [SPARK-9862]Handling data skew

2016-10-25 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue:

https://github.com/apache/spark/pull/15297
  
I ran some performance tests comparing the skew join algorithm with the 
normal (non-skew) join algorithm.
I generated two tables, with 1/5 data skew in table S and 1/1 data skew in 
table R. Both tables are skewed on the same key.

spark.sql.adaptive.skewjoin.threshold   600
spark.sql.adaptive.shuffle.targetPostShuffleInputSize   500
record: S 1000 rows; R 1 rows
sql:
select count(*) from R,S where rid=sid and sname>'wang9' and rname > 'zhang9';

skew algorithm : 167.695s
normal algorithm: 303.922s

R2_txt is 1 rows without data skew.
sql: select count(*) from R2_txt,S where rid=sid and sname>'wang' and rname > 'zhang9';
skew algorithm : 38.717s
normal algorithm: 114.21s
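
For reference, the speedups implied by the four timings above can be worked out with a small sketch. The object and method names are hypothetical; only the timings come from the measurements above. The ratios come out to about 1.81x for the skewed R and about 2.95x for R2_txt.

```scala
object SkewJoinSpeedup {
  // Speedup = time of the normal algorithm divided by time of the skew algorithm.
  def speedup(normalSeconds: Double, skewSeconds: Double): Double =
    normalSeconds / skewSeconds

  def main(args: Array[String]): Unit = {
    // Timings copied from the two test runs reported above.
    val skewedR = speedup(normalSeconds = 303.922, skewSeconds = 167.695)
    val unskewedR2 = speedup(normalSeconds = 114.21, skewSeconds = 38.717)
    println(f"R (skewed) join S:  $skewedR%.2fx")
    println(f"R2_txt join S:      $unskewedR2%.2fx")
  }
}
```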






[GitHub] spark issue #15297: [SPARK-9862]Handling data skew

2016-10-24 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue:

https://github.com/apache/spark/pull/15297
  
The skewed join implementation works for both DataFrame and SQL statements.
You will get 210 output files.





[GitHub] spark issue #15297: [SPARK-9862]Handling data skew

2016-10-22 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue:

https://github.com/apache/spark/pull/15297
  
@tgravescs: In the join case, take something like: select count(*) from A 
join B. If the parameter spark.sql.shuffle.partitions=200, then we get 200 
tasks, each outputting a partial count. The output is not written to HDFS but 
cached in Spark. Summing the outputs of the 200 tasks gives the correct value. 
If the data is skewed, we get 210 tasks outputting partial counts. The rest 
is handled by the next processing step.
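
A minimal sketch of why the extra tasks do not change the result, assuming each task emits a partial count that a later step sums. The names `splitTask` and `totalCount`, the row counts, and the choice of splitting one skewed task into 11 pieces (to go from 200 to 210 tasks) are hypothetical illustrations, not taken from the PR.

```scala
object PartialCountSketch {
  // The next step simply sums the partial counts produced by the tasks.
  def totalCount(partialCounts: Seq[Long]): Long = partialCounts.sum

  // Hypothetical helper: spread one task's rows over `pieces` sub-tasks.
  def splitTask(rows: Long, pieces: Int): Seq[Long] = {
    val base = rows / pieces
    val remainder = (rows % pieces).toInt
    Seq.tabulate(pieces)(i => if (i < remainder) base + 1 else base)
  }

  def main(args: Array[String]): Unit = {
    val regular: Seq[Long] = Seq.fill(199)(10L) // 199 ordinary tasks
    val skewed = 1005L                          // one skewed task
    val unsplit = skewed +: regular             // 200 partial counts
    val split = splitTask(skewed, 11) ++ regular // 210 partial counts
    println(s"${unsplit.size} tasks, total = ${totalCount(unsplit)}")
    println(s"${split.size} tasks, total = ${totalCount(split)}")
  }
}
```

Both lines print the same total, because splitting a task only changes how the count is partitioned, not its sum.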





[GitHub] spark issue #15297: [SPARK-9862]Handling data skew

2016-10-20 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue:

https://github.com/apache/spark/pull/15297
  
@tgravescs:
Thank you for your response. When a single reduce task handles a huge amount 
of data, it is slow and unstable, so we split one reduce task into multiple 
reduce tasks. A single reduce task does something like A join B. We split it 
into multiple tasks: task 1 does A1 join B, task 2 does A2 join B, and so on. 
A1 is the part of A that is read from a range of map outputs. For Spark SQL, 
A1 is treated as a separate partition during processing, so multiple executors 
can run the tasks, which spreads the processing pressure.
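
The idea can be sketched on plain collections, without Spark. This is an illustration only: `join`, `splitJoin`, the `Row` type, and the `mapsPerTask` parameter are hypothetical names, and the real change works on shuffle blocks rather than in-memory sequences.

```scala
object SplitJoinSketch {
  type Row = (Int, String) // (join key, payload)

  // Plain inner hash join on the key.
  def join(a: Seq[Row], b: Seq[Row]): Seq[(Int, String, String)] = {
    val bByKey: Map[Int, Seq[String]] =
      b.groupBy(_._1).map { case (k, rows) => k -> rows.map(_._2) }
    for {
      (k, av) <- a
      bv <- bByKey.getOrElse(k, Seq.empty[String])
    } yield (k, av, bv)
  }

  // aByMap holds A's rows for one reduce partition, grouped by the map task
  // that produced them. Each contiguous range of maps (A1, A2, ...) becomes
  // its own task and is joined against all of B.
  def splitJoin(aByMap: Vector[Seq[Row]], b: Seq[Row],
                mapsPerTask: Int): Vector[Seq[(Int, String, String)]] =
    aByMap.grouped(mapsPerTask).map(range => join(range.flatten, b)).toVector

  def main(args: Array[String]): Unit = {
    val aByMap = Vector(
      Seq((1, "a0"), (1, "a1")), Seq((1, "a2")),
      Seq((1, "a3"), (2, "a4")), Seq((1, "a5")))
    val b = Seq((1, "b0"), (2, "b1"))
    val tasks = splitJoin(aByMap, b, mapsPerTask = 2)
    // The union of the per-task results equals the unsplit join.
    println(tasks.flatten == join(aByMap.flatten, b))
  }
}
```

Note that B is read in full by every sub-task, so the split trades extra reads of B for parallelism on the skewed side.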





[GitHub] spark issue #15297: [WIP][SPARK-9862]Handling data skew

2016-10-18 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue:

https://github.com/apache/spark/pull/15297
  
Following the review comments, I rewrote the code to read a range of map 
outputs, like this:
class BlockStoreShuffleReader[K, C](
    handle: BaseShuffleHandle[K, _, C],
    startPartition: Int,
    endPartition: Int,
    context: TaskContext,
    serializerManager: SerializerManager = SparkEnv.get.serializerManager,
    blockManager: BlockManager = SparkEnv.get.blockManager,
    mapOutputTracker: MapOutputTracker = SparkEnv.get.mapOutputTracker,
    startMapId: Option[Int] = None,
    endMapId: Option[Int] = None)

To decide how many ranges to read from the maps, we use the 
spark.sql.adaptive.skewjoin.threshold value. If the output size is less 
than the skew threshold, it can be handled in one task. Otherwise we split it 
into multiple tasks, each handling a data size slightly less than the skew 
threshold.
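
A minimal sketch of this splitting rule, not the PR's actual code. The name `planRanges`, the greedy packing of contiguous map outputs, and the end-exclusive `(startMapId, endMapId)` convention are assumptions made for illustration. A single map output that is itself at or above the threshold gets a range of its own, since this sketch does not split below map granularity.

```scala
import scala.collection.mutable.ArrayBuffer

object MapRangePlanner {
  // Returns (startMapId, endMapId) pairs with endMapId exclusive, so that the
  // total size in each range stays below `threshold` whenever possible.
  def planRanges(mapOutputSizes: Seq[Long], threshold: Long): Seq[(Int, Int)] = {
    val ranges = ArrayBuffer.empty[(Int, Int)]
    var start = 0
    var acc = 0L
    for ((size, mapId) <- mapOutputSizes.zipWithIndex) {
      // Close the current range if adding this map would reach the threshold.
      if (mapId > start && acc + size >= threshold) {
        ranges += ((start, mapId))
        start = mapId
        acc = 0L
      }
      acc += size
    }
    if (mapOutputSizes.nonEmpty) ranges += ((start, mapOutputSizes.length))
    ranges.toSeq
  }

  def main(args: Array[String]): Unit = {
    // Below the threshold: everything is handled by one task.
    println(planRanges(Seq(100L, 200L), threshold = 600L))
    // Above the threshold: several tasks, each under 600.
    println(planRanges(Seq(200L, 300L, 250L, 100L, 400L, 50L), threshold = 600L))
  }
}
```

Each resulting pair could then be passed as the `startMapId` and `endMapId` of one reader, one reader per task.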






[GitHub] spark issue #15297: [WIP][SPARK-9862]Handling data skew

2016-10-14 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue:

https://github.com/apache/spark/pull/15297
  
I will rewrite the ShuffleReader interface so that it can read a range of 
map outputs rather than only a single map's data.
It will be finished soon.





[GitHub] spark issue #15297: [WIP][SPARK-9862]Handling data skew

2016-10-14 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue:

https://github.com/apache/spark/pull/15297
  
I will rewrite the read ShuffleReader  interface , for read a range of maps 
but not only read a map data.
it will be finished soon


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15297: [WIP][SPARK-9862]Handling data skew

2016-10-14 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue:

https://github.com/apache/spark/pull/15297
  
I will rewrite the read ShuffleReader  interface , for read a range of maps 
but not only read a map data.
it will be finished soon


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15297: [WIP][SPARK-9862]Handling data skew

2016-10-14 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue:

https://github.com/apache/spark/pull/15297
  
I will rewrite the read ShuffleReader  interface , for read a range of maps 
but not only read a map data.
it will be finished soon


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15297: [WIP][SPARK-9862]Handling data skew

2016-10-14 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue:

https://github.com/apache/spark/pull/15297
  
I will rewrite the read ShuffleReader  interface , for read a range of maps 
but not only read a map data.
it will be finished soon


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15297: [WIP][SPARK-9862]Handling data skew

2016-10-14 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue:

https://github.com/apache/spark/pull/15297
  
I will rewrite the read ShuffleReader  interface , for read a range of maps 
but not only read a map data.
it will be finished soon


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15297: [WIP][SPARK-9862]Handling data skew

2016-10-14 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue:

https://github.com/apache/spark/pull/15297
  
I will rewrite the read ShuffleReader  interface , for read a range of maps 
but not only read a map data.
it will be finished soon


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15297: [WIP][SPARK-9862]Handling data skew

2016-10-14 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue:

https://github.com/apache/spark/pull/15297
  
I will rewrite the read ShuffleReader  interface , for read a range of maps 
but not only read a map data.
it will be finished soon


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15297: [WIP][SPARK-9862]Handling data skew

2016-10-14 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue:

https://github.com/apache/spark/pull/15297
  
I will rewrite the read ShuffleReader  interface , for read a range of maps 
but not only read a map data.
it will be finished soon


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15297: [WIP][SPARK-9862]Handling data skew

2016-10-14 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue:

https://github.com/apache/spark/pull/15297
  
I will rewrite the read ShuffleReader  interface , for read a range of maps 
but not only read a map data.
it will be finished soon


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15297: [WIP][SPARK-9862]Handling data skew

2016-10-14 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue:

https://github.com/apache/spark/pull/15297
  
I will rewrite the read ShuffleReader  interface , for read a range of maps 
but not only read a map data.
it will be finished soon


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15297: [WIP][SPARK-9862]Handling data skew

2016-10-14 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue:

https://github.com/apache/spark/pull/15297
  
I will rewrite the read ShuffleReader  interface , for read a range of maps 
but not only read a map data.
it will be finished soon


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15297: [WIP][SPARK-9862]Handling data skew

2016-10-14 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue:

https://github.com/apache/spark/pull/15297
  
I will rewrite the read ShuffleReader  interface , for read a range of maps 
but not only read a map data.
it will be finished soon


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15297: [WIP][SPARK-9862]Handling data skew

2016-10-14 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue:

https://github.com/apache/spark/pull/15297
  
I will rewrite the read ShuffleReader  interface , for read a range of maps 
but not only read a map data.
it will be finished soon


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15297: [WIP][SPARK-9862]Handling data skew

2016-10-14 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue:

https://github.com/apache/spark/pull/15297
  
I will rewrite the read ShuffleReader  interface , for read a range of maps 
but not only read a map data.
it will be finished soon


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15297: [WIP][SPARK-9862]Handling data skew

2016-10-14 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue:

https://github.com/apache/spark/pull/15297
  
I will rewrite the read ShuffleReader  interface , for read a range of maps 
but not only read a map data.
it will be finished soon


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15297: [WIP][SPARK-9862]Handling data skew

2016-10-14 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue:

https://github.com/apache/spark/pull/15297
  
I will rewrite the read ShuffleReader  interface , for read a range of maps 
but not only read a map data.
it will be finished soon


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15297: [WIP][SPARK-9862]Handling data skew

2016-10-14 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue:

https://github.com/apache/spark/pull/15297
  
I will rewrite the read ShuffleReader  interface , for read a range of maps 
but not only read a map data.
it will be finished soon


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15297: [WIP][SPARK-9862]Handling data skew

2016-10-14 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue:

https://github.com/apache/spark/pull/15297
  
I will rewrite the read ShuffleReader  interface , for read a range of maps 
but not only read a map data.
it will be finished soon


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15297: [WIP][SPARK-9862]Handling data skew

2016-10-14 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue:

https://github.com/apache/spark/pull/15297
  
I will rewrite the read ShuffleReader  interface , for read a range of maps 
but not only read a map data.
it will be finished soon


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15297: [WIP][SPARK-9862]Handling data skew

2016-10-14 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue:

https://github.com/apache/spark/pull/15297
  
I will rewrite the read ShuffleReader  interface , for read a range of maps 
but not only read a map data.
it will be finished soon


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15297: [WIP][SPARK-9862]Handling data skew

2016-10-14 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue:

https://github.com/apache/spark/pull/15297
  
I will rewrite the read ShuffleReader  interface , for read a range of maps 
but not only read a map data.
it will be finished soon


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15297: [WIP][SPARK-9862]Handling data skew

2016-10-14 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue:

https://github.com/apache/spark/pull/15297
  
I will rewrite the read ShuffleReader  interface , for read a range of maps 
but not only read a map data.
it will be finished soon


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15297: [WIP][SPARK-9862]Handling data skew

2016-10-14 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue:

https://github.com/apache/spark/pull/15297
  
I will rewrite the read ShuffleReader  interface , for read a range of maps 
but not only read a map data.
it will be finished soon


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15297: [WIP][SPARK-9862]Handling data skew

2016-10-14 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue:

https://github.com/apache/spark/pull/15297
  
I will rewrite the read ShuffleReader  interface , for read a range of maps 
but not only read a map data.
it will be finished soon


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15297: [WIP][SPARK-9862]Handling data skew

2016-10-14 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue:

https://github.com/apache/spark/pull/15297
  
I will rewrite the read ShuffleReader  interface , for read a range of maps 
but not only read a map data.
it will be finished soon


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15297: [WIP][SPARK-9862]Handling data skew

2016-10-14 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue:

https://github.com/apache/spark/pull/15297
  
I will rewrite the read ShuffleReader  interface , for read a range of maps 
but not only read a map data.
it will be finished soon


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15297: [WIP][SPARK-9862]Handling data skew

2016-10-14 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue:

https://github.com/apache/spark/pull/15297
  
I will rewrite the read ShuffleReader  interface , for read a range of maps 
but not only read a map data.
it will be finished soon


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15297: [WIP][SPARK-9862]Handling data skew

2016-10-14 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue:

https://github.com/apache/spark/pull/15297
  
I will rewrite the read ShuffleReader  interface , for read a range of maps 
but not only read a map data.
it will be finished soon


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15297: [WIP][SPARK-9862]Handling data skew

2016-10-14 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue:

https://github.com/apache/spark/pull/15297
  
I will rewrite the read ShuffleReader  interface , for read a range of maps 
but not only read a map data.
it will be finished soon


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15297: [WIP][SPARK-9862]Handling data skew

2016-10-14 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue:

https://github.com/apache/spark/pull/15297
  
I will rewrite the read ShuffleReader  interface , for read a range of maps 
but not only read a map data.
it will be finished soon


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15297: [WIP][SPARK-9862]Handling data skew

2016-10-14 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue:

https://github.com/apache/spark/pull/15297
  
I will rewrite the read ShuffleReader  interface , for read a range of maps 
but not only read a map data.
it will be finished soon


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15297: [WIP][SPARK-9862]Handling data skew

2016-10-11 Thread YuhuWang2002
Github user YuhuWang2002 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15297#discussion_r82779299
  
--- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ---
@@ -138,13 +138,16 @@ private[spark] abstract class MapOutputTracker(conf: SparkConf) extends Logging
* and the second item is a sequence of (shuffle block id, shuffle block size) tuples
* describing the shuffle blocks that are stored at that block manager.
*/
-  def getMapSizesByExecutorId(shuffleId: Int, startPartition: Int, endPartition: Int)
+  def getMapSizesByExecutorId(shuffleId: Int, startPartition: Int, endPartition: Int,
+  mapid: Int = -1)
--- End diff --

That's a good idea. A Seq[Int] parameter can fetch the data of several maps
in one call, which reduces the number of tasks.
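
To make the Seq[Int] idea concrete, here is a minimal sketch in plain Scala.
It does not use Spark's real classes: ExecutorId, BlockRef, MapOutput and
blocksByExecutor are hypothetical stand-ins for BlockManagerId, the shuffle
block id, MapStatus and getMapSizesByExecutorId, and treating an empty
sequence as "all maps" is an assumption, not what the patch does.

```scala
// Hypothetical stand-ins for Spark's internal types; not the real classes.
final case class ExecutorId(id: String)
final case class BlockRef(shuffleId: Int, mapIdx: Int, reduceId: Int)
// One entry per map task: where its output lives and the size of each reduce partition.
final case class MapOutput(location: ExecutorId, sizes: Array[Long])

object MapSelectionSketch {
  // Returns, per executor, the non-empty blocks in [startPartition, endPartition)
  // produced by the selected map tasks. An empty mapIdxs means "all maps"
  // (an assumption made for this sketch).
  def blocksByExecutor(
      shuffleId: Int,
      startPartition: Int,
      endPartition: Int,
      outputs: IndexedSeq[MapOutput],
      mapIdxs: Seq[Int] = Seq.empty): Map[ExecutorId, Seq[(BlockRef, Long)]] = {
    val selected: Seq[Int] = if (mapIdxs.isEmpty) outputs.indices else mapIdxs
    val blocks = for {
      mapIdx <- selected
      reduceId <- startPartition until endPartition
      size = outputs(mapIdx).sizes(reduceId)
      if size > 0 // skip empty blocks, there is nothing to fetch
    } yield (outputs(mapIdx).location, (BlockRef(shuffleId, mapIdx, reduceId), size))
    // Group the selected blocks by the executor that stores them.
    blocks.groupBy(_._1).map { case (loc, bs) => loc -> bs.map(_._2) }
  }
}
```

With a single Int, a reader that needs maps 1 and 2 would need one call (and
possibly one task) per map; with Seq(1, 2) the same blocks come back from one
call.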





[GitHub] spark pull request #15297: [WIP][SPARK-9862]Handling data skew

2016-10-11 Thread YuhuWang2002
Github user YuhuWang2002 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15297#discussion_r82779060
  
--- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ---
@@ -687,18 +691,21 @@ private[spark] object MapOutputTracker extends Logging {
   shuffleId: Int,
   startPartition: Int,
   endPartition: Int,
-  statuses: Array[MapStatus]): Seq[(BlockManagerId, Seq[(BlockId, Long)])] = {
+  statuses: Array[MapStatus],
+  mapIdx: Int = -1): Seq[(BlockManagerId, Seq[(BlockId, Long)])] = {
--- End diff --

That conflicts with mapId.





[GitHub] spark issue #15297: [WIP][SPARK-9862]Handling data skew

2016-10-10 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue:

https://github.com/apache/spark/pull/15297
  
retest this please





[GitHub] spark issue #15297: [WIP][SPARK-9862]Handling data skew

2016-10-08 Thread YuhuWang2002
Github user YuhuWang2002 commented on the issue:

https://github.com/apache/spark/pull/15297
  
test this please





[GitHub] spark pull request #15297: [WIP][SPARK-9862]Handling data skew

2016-09-29 Thread YuhuWang2002
GitHub user YuhuWang2002 opened a pull request:

https://github.com/apache/spark/pull/15297

[WIP][SPARK-9862]Handling data skew

## What changes were proposed in this pull request?

As described in https://issues.apache.org/jira/browse/SPARK-9862, this PR
handles data skew in joins.


## How was this patch tested?

Unit tests in ExchangeCoordinatorSuite.

It can also be tested manually by generating skewed data.
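
For the manual test, skewed input could be produced along these lines. This is
only a sketch in plain Scala under stated assumptions: SkewDataSketch, its
parameters, the hot key 0 and the "wang" name prefix are all made up for
illustration and are not part of the patch.

```scala
import scala.util.Random

object SkewDataSketch {
  // Generates `rows` (key, name) records. Roughly `skewFraction` of them share
  // `hotKey`; the rest get keys drawn uniformly from [1, distinctKeys].
  // All names and defaults here are hypothetical.
  def generate(
      rows: Int,
      skewFraction: Double,
      distinctKeys: Int,
      hotKey: Int = 0,
      seed: Long = 42L): Seq[(Int, String)] = {
    val rnd = new Random(seed) // fixed seed so a run can be repeated
    Seq.fill(rows) {
      val key =
        if (rnd.nextDouble() < skewFraction) hotKey
        else 1 + rnd.nextInt(distinctKeys)
      (key, s"wang$key")
    }
  }
}
```

Calling SkewDataSketch.generate(10000, 0.2, 1000) gives a table in which about
one fifth of the rows carry the same join key. Generating a second table that
contains the same hot key makes both sides of the join skewed on that key.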

Author: wangyuhu<wangyuhu2...@126.com>





You can merge this pull request into a Git repository by running:

$ git pull https://github.com/YuhuWang2002/spark-1 skewjoin

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15297.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15297


commit ef1baae1768dc8fd676f617bf7c1e85d72179ba8
Author: wangyuhu <wangy...@huawei.com>
Date:   2016-09-28T02:46:49Z

[SPARK-9862] handling data skew , add skew join feature

commit c561ea718fd65adc0f1187097b9da88fc0054192
Author: wangyuhu <wangy...@huawei.com>
Date:   2016-09-28T08:41:46Z

code style fix

commit 9025e24b6552b39bd3ab20632b702b60edc2ad10
Author: wangyuhu <wangy...@huawei.com>
Date:   2016-09-29T10:56:10Z

add comment

commit 0ba86a2284e684a733f989ed6e595575f511c8bd
Author: wangyuhu <wangy...@huawei.com>
Date:   2016-09-29T11:38:26Z

modify UT code



