[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2018-03-06 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388915#comment-16388915 ] Rui Li commented on HIVE-15104: --- [~stakiar], thanks for trying this out. bq. The HiveKryoReg

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2018-03-06 Thread Sahil Takiar (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388551#comment-16388551 ] Sahil Takiar commented on HIVE-15104: - Hey [~lirui] I found some time to do some inter

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-10-31 Thread Lefty Leverenz (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16233602#comment-16233602 ] Lefty Leverenz commented on HIVE-15104: --- Good doc, thanks [~lirui]. I removed the T

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-10-29 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16224326#comment-16224326 ] Rui Li commented on HIVE-15104: --- Thanks [~leftylev] for the reminder. I've updated the wiki.

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-10-25 Thread Lefty Leverenz (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219861#comment-16219861 ] Lefty Leverenz commented on HIVE-15104: --- Doc note: This adds *hive.spark.optimize.s

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-10-24 Thread Hive QA (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16217849#comment-16217849 ] Hive QA commented on HIVE-15104: Here are the results of testing the latest attachment:

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-10-24 Thread Xuefu Zhang (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16217431#comment-16217431 ] Xuefu Zhang commented on HIVE-15104: +1 > Hive on Spark generate more shuffle data th

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-10-18 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208904#comment-16208904 ] Rui Li commented on HIVE-15104: --- The sub-query failures are tracked by HIVE-17823. Others ar

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-10-17 Thread Hive QA (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16207563#comment-16207563 ] Hive QA commented on HIVE-15104: Here are the results of testing the latest attachment:

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-10-16 Thread Hive QA (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16207028#comment-16207028 ] Hive QA commented on HIVE-15104: Here are the results of testing the latest attachment:

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-10-16 Thread Hive QA (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16205863#comment-16205863 ] Hive QA commented on HIVE-15104: Here are the results of testing the latest attachment:

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-10-13 Thread Xuefu Zhang (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16203867#comment-16203867 ] Xuefu Zhang commented on HIVE-15104: I think it's fairly safe to assume that hive-exec

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-10-13 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16203332#comment-16203332 ] Rui Li commented on HIVE-15104: --- [~xuefuz], we need to locate the jar on Hive side, before w

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-10-12 Thread Xuefu Zhang (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16202897#comment-16202897 ] Xuefu Zhang commented on HIVE-15104: Hi [~lirui], to locate the jar, can we assume tha

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-10-12 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16201695#comment-16201695 ] Rui Li commented on HIVE-15104: --- One correction: the {{NoClassDefFoundError}} is for {{com.

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-10-11 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16201405#comment-16201405 ] Rui Li commented on HIVE-15104: --- Hi [~xuefuz], sorry for taking so long to update. I tried o

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-08-31 Thread Xuefu Zhang (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149777#comment-16149777 ] Xuefu Zhang commented on HIVE-15104: Hi [~lirui], I think creating a trivial package i

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-08-30 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148485#comment-16148485 ] Rui Li commented on HIVE-15104: --- [~xuefuz], I'll try if that's feasible. Do you think it's O

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-08-30 Thread Xuefu Zhang (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148351#comment-16148351 ] Xuefu Zhang commented on HIVE-15104: I see. It might be possible to put this class in

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-08-30 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148340#comment-16148340 ] Rui Li commented on HIVE-15104: --- [~xuefuz], my previous [comment|https://issues.apache.org/

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-08-30 Thread Xuefu Zhang (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148174#comment-16148174 ] Xuefu Zhang commented on HIVE-15104: The patch looks good to me. My only concern is ab

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-08-25 Thread Hive QA (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16141431#comment-16141431 ] Hive QA commented on HIVE-15104: Here are the results of testing the latest attachment:

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-08-20 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16134639#comment-16134639 ] Rui Li commented on HIVE-15104: --- Thanks [~xuefuz] and take your time. I guess we can also ru

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-08-18 Thread Xuefu Zhang (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16133290#comment-16133290 ] Xuefu Zhang commented on HIVE-15104: [~lirui], I found it difficult to backport HIVE-

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-08-16 Thread Xuefu Zhang (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16129895#comment-16129895 ] Xuefu Zhang commented on HIVE-15104: Hi [~lirui], thanks for continuing the work here.

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-08-16 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16129871#comment-16129871 ] Rui Li commented on HIVE-15104: --- Hi [~xuefuz], with HIVE-17114 and HIVE-17321 the benchmark

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-07-13 Thread Hive QA (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085465#comment-16085465 ] Hive QA commented on HIVE-15104: Here are the results of testing the latest attachment:

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-07-13 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085384#comment-16085384 ] Rui Li commented on HIVE-15104: --- Hi [~xuefuz], I can't reproduce the perf degradation on my

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-07-13 Thread Xuefu Zhang (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085324#comment-16085324 ] Xuefu Zhang commented on HIVE-15104: [~lirui], I'm wondering if there is anything new

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-07-12 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085276#comment-16085276 ] Rui Li commented on HIVE-15104: --- I also ran another round of TPC-DS. The overall shuffle dat

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-06-16 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16052006#comment-16052006 ] Rui Li commented on HIVE-15104: --- The approach here can cause problems when we cache RDDs, e.g

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-05-20 Thread Hive QA (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018507#comment-16018507 ] Hive QA commented on HIVE-15104: Here are the results of testing the latest attachment:

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-05-19 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017621#comment-16017621 ] Rui Li commented on HIVE-15104: --- Patch v3 compiles the registrators at runtime, so that we d

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-05-19 Thread Hive QA (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017527#comment-16017527 ] Hive QA commented on HIVE-15104: Here are the results of testing the latest attachment:

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-05-15 Thread Hive QA (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16010433#comment-16010433 ] Hive QA commented on HIVE-15104: Here are the results of testing the latest attachment:

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-05-12 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16009026#comment-16009026 ] Rui Li commented on HIVE-15104: --- [~xuefuz], kryo was relocated in HIVE-5915. So it's not int

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-05-12 Thread Hive QA (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008502#comment-16008502 ] Hive QA commented on HIVE-15104: Here are the results of testing the latest attachment:

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-05-12 Thread Xuefu Zhang (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008107#comment-16008107 ] Xuefu Zhang commented on HIVE-15104: [~lirui], great progress! Thanks for keeping up t

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-05-05 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15998437#comment-15998437 ] Rui Li commented on HIVE-15104: --- Tried disabling relocation locally. It does solve the Abstr

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-05-05 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15998177#comment-15998177 ] Rui Li commented on HIVE-15104: --- I looked at the shuffle writers of Spark and none of them s

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-04-07 Thread Aihua Xu (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960801#comment-15960801 ] Aihua Xu commented on HIVE-15104: - [~lirui] I didn't have time to work on that. Feel free

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2017-04-06 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960146#comment-15960146 ] Rui Li commented on HIVE-15104: --- Hi [~aihuaxu], are you still working on this? If not, do yo

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2016-11-07 Thread Aihua Xu (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15644296#comment-15644296 ] Aihua Xu commented on HIVE-15104: - I will take a look at Spark to see if it's needed after

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2016-11-03 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15634914#comment-15634914 ] Rui Li commented on HIVE-15104: --- [~xuefuz], [~aihuaxu], both MR and Spark need HiveKey.hashC

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2016-11-03 Thread Xuefu Zhang (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15634112#comment-15634112 ] Xuefu Zhang commented on HIVE-15104: I checked the source code and it seems that both

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2016-11-03 Thread Xuefu Zhang (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15633820#comment-15633820 ] Xuefu Zhang commented on HIVE-15104: [~lirui], thanks for sharing your findings. Can y

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2016-11-03 Thread Aihua Xu (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15632623#comment-15632623 ] Aihua Xu commented on HIVE-15104: - [~lirui] So what you are saying is, it depends on how s

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2016-11-03 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15632244#comment-15632244 ] Rui Li commented on HIVE-15104: --- [~xuefuz], here's what I've found so far. Firstly, MR uses Hive

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2016-11-02 Thread Xuefu Zhang (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15631376#comment-15631376 ] Xuefu Zhang commented on HIVE-15104: This is rather interesting. I know I originally r

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2016-11-02 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15631053#comment-15631053 ] Rui Li commented on HIVE-15104: --- We need to use HiveKey because it holds the proper hash cod

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2016-11-02 Thread Aihua Xu (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15630435#comment-15630435 ] Aihua Xu commented on HIVE-15104: - This is changed by HIVE-8017. [~lirui] Do you recall wh

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2016-11-02 Thread Rui Li (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15628605#comment-15628605 ] Rui Li commented on HIVE-15104: --- Seems MR can just serialize the key as BytesWritable instea

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2016-11-01 Thread wangwenli (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15627812#comment-15627812 ] wangwenli commented on HIVE-15104: -- try select count(distinct col1), count (distinct col2

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2016-11-01 Thread Aihua Xu (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15626085#comment-15626085 ] Aihua Xu commented on HIVE-15104: - [~wenli] Can you give an example that I can run and com