[jira] [Commented] (HIVE-7334) Create SparkShuffler, shuffling data between map-side data processing and reduce-side processing

2014-07-31 Thread Xuefu Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-7334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14081430#comment-14081430
 ] 

Xuefu Zhang commented on HIVE-7334:
---

[~lirui] Please feel free to create smaller JIRAs to enable sorting in Hive on 
Spark. Here are some ideas:

1. Complete SortByShuffler
2. Add logic in SparkCompiler to generate SparkEdgeProperty with the right
sorting property.
3. Add logic in SparkPlanGenerator to generate a plan with the right shuffle
type.
4. Test Hive's sorting-related queries to make sure they work. File JIRAs for
any problems found.
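
Items 2 and 3 can be pictured with a small, self-contained sketch. Every name in it (ShuffleSelectionSketch, EdgeProperty, ShuffleType, compileEdge, chooseShuffle) is hypothetical and is not taken from the patch or from Hive's actual SparkEdgeProperty. It only assumes that the compiler records whether sorting is required on the edge, and that the plan generator defaults to a grouping shuffle otherwise, as proposed earlier in this thread.

```java
public class ShuffleSelectionSketch {

    /** Hypothetical shuffle kinds; the real names in the patch may differ. */
    enum ShuffleType { GROUP_BY, SORT_BY }

    /** Hypothetical stand-in for an edge property that only carries a sorting flag. */
    static final class EdgeProperty {
        final boolean sortingRequired;

        EdgeProperty(boolean sortingRequired) {
            this.sortingRequired = sortingRequired;
        }
    }

    /** Compiler side (assumed): mark the edge when the reduce side needs ordered keys. */
    static EdgeProperty compileEdge(boolean queryNeedsOrdering) {
        return new EdgeProperty(queryNeedsOrdering);
    }

    /** Plan-generator side (assumed): sort only when the edge asks for it. */
    static ShuffleType chooseShuffle(EdgeProperty edge) {
        return edge.sortingRequired ? ShuffleType.SORT_BY : ShuffleType.GROUP_BY;
    }

    public static void main(String[] args) {
        // An ordering query would take the sorting shuffle.
        System.out.println(chooseShuffle(compileEdge(true)));
        // A plain aggregation would take the grouping shuffle.
        System.out.println(chooseShuffle(compileEdge(false)));
    }
}
```

Keeping the decision in one place on the plan-generator side means the compiler only has to set the flag correctly; whether the real code splits the responsibility this way is an assumption.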

Also, please take a look at the link [~rxin] pointed out above to see if we can 
benefit in any way.

 Create SparkShuffler, shuffling data between map-side data processing and 
 reduce-side processing
 

 Key: HIVE-7334
 URL: https://issues.apache.org/jira/browse/HIVE-7334
 Project: Hive
  Issue Type: Sub-task
Reporter: Xuefu Zhang
Assignee: Rui Li
 Attachments: HIVE-7334.patch


 Please refer to the design spec.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (HIVE-7334) Create SparkShuffler, shuffling data between map-side data processing and reduce-side processing

2014-07-29 Thread Rui Li (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-7334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078764#comment-14078764
 ] 

Rui Li commented on HIVE-7334:
--

Just some initial ground work. Submitted for review :)


[jira] [Commented] (HIVE-7334) Create SparkShuffler, shuffling data between map-side data processing and reduce-side processing

2014-07-29 Thread Xuefu Zhang (JIRA)

[
https://issues.apache.org/jira/browse/HIVE-7334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078830#comment-14078830
]

Xuefu Zhang commented on HIVE-7334:
---

[~lirui] Thanks for the patch. I took a brief look and found that you might
need to rebase your patch on the latest branch. At the top level, here is the
plan for sortBy, groupBy, and HiveReduceFunction. Also, please note that there
is some overlap between your work and [~robustchao]'s HIVE-7526. I'd like to
make this clear so that we don't step on each other's toes.

1. We will use groupBy unless sorting is required. For this, we need to change
the HiveReduceFunction API. (Chao)
2. Since sortBy and groupBy generate different types of data sets, we will need
to cluster rows from sortBy to match the input of HiveReduceFunction. We will
create a subclass of SparkTran for row clustering. The clustering should be
simpler than the existing one in HiveReduceFunction, as we assume that the keys
are ordered. Thus, we only accumulate consecutive rows with the same key. (Chao)
3. We have ShuffleTran for shuffling. Currently it only uses partitionByKey().
We will change it to groupBy. (Chao)
4. We will add logic in SparkCompiler/SparkPlanGenerator to determine which
shuffle to use: either groupBy + ReduceTran or sortBy + RowClusteringTran
+ ReduceTran. (Rui)
5. Make sure Hive's order by, sort by, distribute by, and cluster by work.
(Rui)
6. It seems that we don't need partitionByKey.
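
The row clustering described in item 2 can be sketched as follows. This is only an illustration in plain Java with no Spark dependency: RowClusteringSketch, cluster, and row are hypothetical names, not the SparkTran subclass itself. It assumes the input iterator is already sorted by key, so a group ends as soon as the key changes.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Objects;

public class RowClusteringSketch {

    /** Hypothetical helper that builds one (key, value) row. */
    static Map.Entry<String, Integer> row(String key, int value) {
        return new SimpleEntry<>(key, value);
    }

    /**
     * Turns a key-sorted stream of (key, value) rows into (key, values) groups.
     * Only consecutive rows are compared, so no hash table is needed.
     */
    static <K, V> Iterator<Map.Entry<K, List<V>>> cluster(final Iterator<Map.Entry<K, V>> sorted) {
        return new Iterator<Map.Entry<K, List<V>>>() {
            // The next row that has not yet been placed in a group, or null at the end.
            private Map.Entry<K, V> pending = sorted.hasNext() ? sorted.next() : null;

            @Override
            public boolean hasNext() {
                return pending != null;
            }

            @Override
            public Map.Entry<K, List<V>> next() {
                if (pending == null) {
                    throw new NoSuchElementException();
                }
                K key = pending.getKey();
                List<V> values = new ArrayList<>();
                // Accumulate rows until the key changes or the input runs out.
                while (pending != null && Objects.equals(pending.getKey(), key)) {
                    values.add(pending.getValue());
                    pending = sorted.hasNext() ? sorted.next() : null;
                }
                return new SimpleEntry<>(key, values);
            }
        };
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> rows = new ArrayList<>();
        rows.add(row("a", 1));
        rows.add(row("a", 2));
        rows.add(row("b", 3));
        rows.add(row("c", 4));
        rows.add(row("c", 5));

        Iterator<Map.Entry<String, List<Integer>>> groups = cluster(rows.iterator());
        while (groups.hasNext()) {
            System.out.println(groups.next());
        }
    }
}
```

Because only the current group is held in memory, this is cheaper than clustering unsorted input, which is the simplification item 2 refers to. If the input were not sorted, the same key could show up in more than one group.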

Please work together with Chao to move this forward.

In addition, I'd like you to find out what it takes to support the shuffling
required for Hive's reduce-side join. If anything is missing in Spark, please
create corresponding JIRAs.

Let me know if you have any questions.


[jira] [Commented] (HIVE-7334) Create SparkShuffler, shuffling data between map-side data processing and reduce-side processing

2014-07-29 Thread Rui Li (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-7334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078936#comment-14078936
 ] 

Rui Li commented on HIVE-7334:
--

Thanks [~xuefuz], this is much clearer.


[jira] [Commented] (HIVE-7334) Create SparkShuffler, shuffling data between map-side data processing and reduce-side processing

2014-07-29 Thread Reynold Xin (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-7334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078939#comment-14078939
 ] 

Reynold Xin commented on HIVE-7334:
---

BTW definitely look at https://github.com/apache/spark/pull/1499

