[jira] [Commented] (HIVE-15528) Expose Spark job error in SparkTask
[ https://issues.apache.org/jira/browse/HIVE-15528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15792519#comment-15792519 ] Hive QA commented on HIVE-15528: Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12845276/HIVE-15528.000.patch {color:red}ERROR:{color} -1 due to no test(s) being added or modified. {color:red}ERROR:{color} -1 due to 6 failed/errored test(s), 10879 tests executed *Failed tests:* {noformat} TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) (batchId=233) TestMiniLlapLocalCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=139) [skewjoinopt15.q,vector_coalesce.q,orc_ppd_decimal.q,cbo_rp_lineage2.q,insert_into_with_schema.q,join_emit_interval.q,load_dyn_part3.q,auto_sortmerge_join_14.q,vector_null_projection.q,vector_cast_constant.q,mapjoin2.q,bucket_map_join_tez2.q,correlationoptimizer4.q,schema_evol_orc_acidvec_part_update.q,vectorization_12.q,vector_number_compare_projection.q,orc_merge_incompat3.q,vector_leftsemi_mapjoin.q,update_all_non_partitioned.q,multi_column_in_single.q,schema_evol_orc_nonvec_table.q,cbo_rp_semijoin.q,tez_insert_overwrite_local_directory_1.q,schema_evol_text_vecrow_table.q,vector_count.q,auto_sortmerge_join_15.q,vector_if_expr.q,delete_whole_partition.q,vector_decimal_6.q,sample1.q] org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_schema_evol_3a] (batchId=135) org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_2] (batchId=93) org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_5] (batchId=92) org.apache.hadoop.hive.cli.TestSparkNegativeCliDriver.org.apache.hadoop.hive.cli.TestSparkNegativeCliDriver (batchId=228) {noformat} Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/2758/testReport Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/2758/console Test logs: 
http://104.198.109.242/logs/PreCommit-HIVE-Build-2758/ Messages: {noformat} Executing org.apache.hive.ptest.execution.TestCheckPhase Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 6 tests failed {noformat} This message is automatically generated. ATTACHMENT ID: 12845276 - PreCommit-HIVE-Build > Expose Spark job error in SparkTask > --- > > Key: HIVE-15528 > URL: https://issues.apache.org/jira/browse/HIVE-15528 > Project: Hive > Issue Type: Improvement > Components: Spark >Affects Versions: 2.2.0 >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Minor > Attachments: HIVE-15528.000.patch > > > Expose the Spark job error in SparkTask by propagating the Spark job error to > the task exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
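The description above is terse. A minimal hedged sketch of the idea (the class and field names below are invented for illustration and are not taken from HIVE-15528.000.patch): keep the Spark job's failure cause attached to the exception the task reports, instead of reducing it to a return code.

```java
// Sketch only: "SparkTaskSketch", "jobError" and "taskException" are invented
// names illustrating the idea of HIVE-15528, not the actual patch code.
public class SparkTaskSketch {
    static Throwable jobError;      // stands in for the remote Spark job's failure cause
    static Throwable taskException; // what the task surfaces to its caller

    // Returns 0 on success, non-zero on failure, keeping the Spark error
    // attached to the task exception instead of dropping it.
    static int execute() {
        try {
            runJob();
            return 0;
        } catch (Exception e) {
            taskException = e; // propagate instead of only reporting a return code
            return 1;
        }
    }

    static void runJob() throws Exception {
        if (jobError != null) {
            // Wrap the Spark failure so the original cause survives the propagation
            throw new Exception("Spark job failed", jobError);
        }
    }
}
```

With this shape, a caller inspecting the failed task can reach the underlying Spark error via `getCause()` rather than only seeing a generic non-zero exit.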
[jira] [Commented] (HIVE-15529) LLAP: TaskSchedulerService can get stuck when scheduleTask returns DELAYED_RESOURCES
[ https://issues.apache.org/jira/browse/HIVE-15529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15793297#comment-15793297 ] Pengcheng Xiong commented on HIVE-15529: [~rajesh.balamohan], this sounds related to HIVE-15467? > LLAP: TaskSchedulerService can get stuck when scheduleTask returns > DELAYED_RESOURCES > > > Key: HIVE-15529 > URL: https://issues.apache.org/jira/browse/HIVE-15529 > Project: Hive > Issue Type: Bug > Components: llap >Reporter: Rajesh Balamohan >Priority: Critical > > An easier way to simulate the issue: > 1. Start the Hive CLI with "--hiveconf hive.execution.mode=llap" > 2. Run a SQL script file (e.g. a SQL script containing TPC-DS queries) > 3. In the middle of the run, press "Ctrl+C", which interrupts the current > job. This should not exit the Hive CLI yet. > 4. After some time, launch the same SQL script in the same CLI. It gets > stuck indefinitely (waiting for the splits to be computed). > Even when the CLI is quit, the AM runs forever until explicitly killed. > The issue seems to be around {{LlapTaskSchedulerService::schedulePendingTasks}} > dealing with the loop when it encounters {{DELAYED_RESOURCES}} on task > scheduling. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
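As a toy illustration of the suspected failure mode (all names below are invented; this is not the actual LlapTaskSchedulerService code): a scheduling loop that treats DELAYED_RESOURCES as "retry" can spin or wait forever if nothing else, such as a node re-enable timeout, ever makes progress. Bounding the attempts, as in this sketch, at least lets the stuck condition surface.

```java
import java.util.List;

// Invented names throughout; a sketch of the loop pitfall described above,
// not the real LLAP scheduler.
public class SchedulerLoopSketch {
    enum ScheduleResult { SCHEDULED, DELAYED_RESOURCES }

    interface Node {
        ScheduleResult schedule(String task);
    }

    // Tries to place a task; returns false instead of hanging when every node
    // keeps answering DELAYED_RESOURCES for maxAttempts rounds. In the real
    // scheduler, a delayed-task queue or node re-enable timer must make
    // progress between rounds -- if it never fires, the loop never exits.
    static boolean schedulePendingTask(String task, List<Node> nodes, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            for (Node n : nodes) {
                if (n.schedule(task) == ScheduleResult.SCHEDULED) {
                    return true;
                }
            }
            // All nodes delayed this round; retry (bounded here, unbounded in the bug).
        }
        return false;
    }
}
```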
[jira] [Updated] (HIVE-15507) Nested column pruning: fix issue when selecting struct field from array/map element
[ https://issues.apache.org/jira/browse/HIVE-15507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated HIVE-15507: Resolution: Fixed Fix Version/s: 2.2.0 Target Version/s: 2.2.0 Status: Resolved (was: Patch Available) Committed to the master branch. Thanks [~Ferd] for the review! > Nested column pruning: fix issue when selecting struct field from array/map > element > --- > > Key: HIVE-15507 > URL: https://issues.apache.org/jira/browse/HIVE-15507 > Project: Hive > Issue Type: Sub-task > Components: Logical Optimizer, Physical Optimizer, > Serializers/Deserializers >Affects Versions: 2.2.0 >Reporter: Chao Sun >Assignee: Chao Sun > Fix For: 2.2.0 > > Attachments: 15507.1.patch > > > When running the following query: > {code} > SELECT count(col), arr[0].f > FROM tbl > GROUP BY arr[0].f > {code} > where {{arr}} is an array of structs with field {{f}}, nested column pruning > will fail. This is because we currently process {{GenericUDFIndex}} in the > same way as any other UDF. In this case, it will generate the path {{arr.f}}, > which will not match the struct type info when doing the pruning. > The same applies to maps. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
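A toy model of the mismatch described above (all class and method names here are invented; this is not Hive's actual type or pruning code): a path like {{arr.f}} generated from {{arr[0].f}} fails a naive struct-field lookup because {{arr}} is an array type, so resolution has to step through the array's element type first.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration only -- invented names, not Hive's TypeInfo machinery.
public class PruningPathDemo {
    // A struct type has named fields; an array type wraps an element type;
    // a node with neither is treated as a primitive.
    static class TypeNode {
        Map<String, TypeNode> fields; // non-null for structs
        TypeNode element;             // non-null for arrays
    }

    static TypeNode struct(String name, TypeNode field) {
        TypeNode t = new TypeNode();
        t.fields = new HashMap<>();
        t.fields.put(name, field);
        return t;
    }

    static TypeNode array(TypeNode e) {
        TypeNode t = new TypeNode();
        t.element = e;
        return t;
    }

    // Naive resolution treats every path segment as a struct field lookup,
    // so the generated path "arr.f" fails when "arr" is an array of structs.
    static boolean naiveResolve(TypeNode root, String path) {
        TypeNode cur = root;
        for (String seg : path.split("\\.")) {
            if (cur.fields == null || !cur.fields.containsKey(seg)) {
                return false;
            }
            cur = cur.fields.get(seg);
        }
        return true;
    }

    // Element-aware resolution unwraps array element types before each field
    // lookup -- the kind of adjustment the pruning fix needs for arr[0].f.
    static boolean elementAwareResolve(TypeNode root, String path) {
        TypeNode cur = root;
        for (String seg : path.split("\\.")) {
            while (cur.element != null) {
                cur = cur.element; // step into the array's element type
            }
            if (cur.fields == null || !cur.fields.containsKey(seg)) {
                return false;
            }
            cur = cur.fields.get(seg);
        }
        return true;
    }
}
```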
[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark
[ https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15793570#comment-15793570 ] Chao Sun commented on HIVE-15527: - Patch looks good. Any idea why the qfile result is different? > Memory usage is unbound in SortByShuffler for Spark > --- > > Key: HIVE-15527 > URL: https://issues.apache.org/jira/browse/HIVE-15527 > Project: Hive > Issue Type: Improvement > Components: Spark >Affects Versions: 1.1.0 >Reporter: Xuefu Zhang >Assignee: Xuefu Zhang > Attachments: HIVE-15527.1.patch, HIVE-15527.2.patch, > HIVE-15527.3.patch, HIVE-15527.patch > > > In SortByShuffler.java, an ArrayList is used to back the iterator for values > that have the same key in the shuffled result produced by the Spark > transformation sortByKey. It's possible that memory can be exhausted because > of a large key group.
> {code}
> @Override
> public Tuple2<HiveKey, Iterable<BytesWritable>> next() {
>   // TODO: implement this by accumulating rows with the same key into a list.
>   // Note that this list needs to be improved to prevent excessive memory
>   // usage, but this can be done in a later phase.
>   while (it.hasNext()) {
>     Tuple2<HiveKey, BytesWritable> pair = it.next();
>     if (curKey != null && !curKey.equals(pair._1())) {
>       HiveKey key = curKey;
>       List<BytesWritable> values = curValues;
>       curKey = pair._1();
>       curValues = new ArrayList<BytesWritable>();
>       curValues.add(pair._2());
>       return new Tuple2<HiveKey, Iterable<BytesWritable>>(key, values);
>     }
>     curKey = pair._1();
>     curValues.add(pair._2());
>   }
>   if (curKey == null) {
>     throw new NoSuchElementException();
>   }
>   // if we get here, this should be the last element we have
>   HiveKey key = curKey;
>   curKey = null;
>   return new Tuple2<HiveKey, Iterable<BytesWritable>>(key, curValues);
> }
> {code}
> Since the output from sortByKey is already sorted on key, it's possible to > back the value iterable using the same input iterator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
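A hedged sketch of the idea in the last sentence of the description (generic names, plain `Map.Entry` pairs, and the class name are all invented; this is not the patch itself): because the input is already sorted by key, each key's values can be served lazily from the same input iterator with a one-element lookahead, so no per-key ArrayList is needed. The caller must consume one group's values before asking for the next key.

```java
import java.util.AbstractMap;
import java.util.Iterator;
import java.util.Map;

// Illustrative only: streams per-key value groups from a key-sorted iterator
// without buffering each group in memory.
public class SortedGroupIterator<K, V> {
    private final Iterator<Map.Entry<K, V>> in;
    private Map.Entry<K, V> lookahead; // next pair not yet handed out

    public SortedGroupIterator(Iterator<Map.Entry<K, V>> in) {
        this.in = in;
        this.lookahead = in.hasNext() ? in.next() : null;
    }

    public boolean hasNextKey() {
        return lookahead != null;
    }

    /** Returns the next key with an iterator over its values, backed by the shared input. */
    public Map.Entry<K, Iterator<V>> nextKey() {
        final K key = lookahead.getKey();
        Iterator<V> values = new Iterator<V>() {
            public boolean hasNext() {
                // The group ends when the lookahead pair carries a different key.
                return lookahead != null && lookahead.getKey().equals(key);
            }
            public V next() {
                V v = lookahead.getValue();
                lookahead = in.hasNext() ? in.next() : null; // advance the shared iterator
                return v;
            }
        };
        return new AbstractMap.SimpleEntry<>(key, values);
    }
}
```

The trade-off versus the ArrayList version: memory per group drops to a single lookahead element, but the group's values become single-pass and must be read in order, which matches how a sorted shuffle output is normally consumed.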
[jira] [Commented] (HIVE-15525) Hooking ChangeManager to "drop table", "drop partition"
[ https://issues.apache.org/jira/browse/HIVE-15525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15793805#comment-15793805 ] Thejas M Nair commented on HIVE-15525: -- [~daijy] can you please include a reviewboard link or pull request? > Hooking ChangeManager to "drop table", "drop partition" > --- > > Key: HIVE-15525 > URL: https://issues.apache.org/jira/browse/HIVE-15525 > Project: Hive > Issue Type: Sub-task > Components: repl >Reporter: Daniel Dai >Assignee: Daniel Dai > Attachments: HIVE-15525.1.patch > > > When Hive runs "drop table"/"drop partition", we will move the data files into > cmroot in case the replication destination needs them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-15481) Support multiple and nested subqueries
[ https://issues.apache.org/jira/browse/HIVE-15481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vineet Garg updated HIVE-15481: --- Status: Open (was: Patch Available) > Support multiple and nested subqueries > -- > > Key: HIVE-15481 > URL: https://issues.apache.org/jira/browse/HIVE-15481 > Project: Hive > Issue Type: Sub-task > Components: Query Planning >Reporter: Vineet Garg >Assignee: Vineet Garg > Attachments: HIVE-15481.1.patch, HIVE-15481.2.patch, > HIVE-15481.3.patch > > > This is a continuation of the work done in HIVE-15192. As listed under > [Restrictions | > https://issues.apache.org/jira/secure/attachment/12614003/SubQuerySpec.pdf ], > it is currently not possible to execute queries that have either more than > one subquery or a nested subquery. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-15481) Support multiple and nested subqueries
[ https://issues.apache.org/jira/browse/HIVE-15481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vineet Garg updated HIVE-15481: --- Attachment: HIVE-15481.4.patch > Support multiple and nested subqueries > -- > > Key: HIVE-15481 > URL: https://issues.apache.org/jira/browse/HIVE-15481 > Project: Hive > Issue Type: Sub-task > Components: Query Planning >Reporter: Vineet Garg >Assignee: Vineet Garg > Attachments: HIVE-15481.1.patch, HIVE-15481.2.patch, > HIVE-15481.3.patch, HIVE-15481.4.patch > > > This is a continuation of the work done in HIVE-15192. As listed under > [Restrictions | > https://issues.apache.org/jira/secure/attachment/12614003/SubQuerySpec.pdf ], > it is currently not possible to execute queries that have either more than > one subquery or a nested subquery. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-15481) Support multiple and nested subqueries
[ https://issues.apache.org/jira/browse/HIVE-15481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vineet Garg updated HIVE-15481: --- Status: Patch Available (was: Open) Added back the restriction to disable multiple subqueries with OR since it produces wrong results (CALCITE-1546). > Support multiple and nested subqueries > -- > > Key: HIVE-15481 > URL: https://issues.apache.org/jira/browse/HIVE-15481 > Project: Hive > Issue Type: Sub-task > Components: Query Planning >Reporter: Vineet Garg >Assignee: Vineet Garg > Attachments: HIVE-15481.1.patch, HIVE-15481.2.patch, > HIVE-15481.3.patch, HIVE-15481.4.patch > > > This is a continuation of the work done in HIVE-15192. As listed under > [Restrictions | > https://issues.apache.org/jira/secure/attachment/12614003/SubQuerySpec.pdf ], > it is currently not possible to execute queries that have either more than > one subquery or a nested subquery. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-15530) Optimize the column stats update logic in table alteration
[ https://issues.apache.org/jira/browse/HIVE-15530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yibing Shi updated HIVE-15530: -- Description: Currently when a table is altered, if any of the below conditions is true, HMS would try to update column statistics for the table: # database name is changed # table name is changed # old columns and new columns are not the same As a result, when a column is added to a table, Hive also tries to update column statistics, which is not necessary. We can loosen the last condition by checking whether any of the existing columns has changed. If none has, we don't have to update the stats info. was: Currently when a table is altered, if any of the below conditions is false, HMS would try to update column statistics for the table: # database name is changed # table name is changed # old columns and new columns are not the same As a result, when a column is added to a table, Hive also tries to update column statistics, which is not necessary. We can loosen the last condition by checking whether any of the existing columns has changed. If none has, we don't have to update the stats info. > Optimize the column stats update logic in table alteration > -- > > Key: HIVE-15530 > URL: https://issues.apache.org/jira/browse/HIVE-15530 > Project: Hive > Issue Type: Bug > Components: Hive >Reporter: Yibing Shi > > Currently when a table is altered, if any of the below conditions is true, HMS > would try to update column statistics for the table: > # database name is changed > # table name is changed > # old columns and new columns are not the same > As a result, when a column is added to a table, Hive also tries to update > column statistics, which is not necessary. We can loosen the last condition by > checking whether any of the existing columns has changed. If none has, we > don't have to update the stats info. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
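A hedged sketch of the relaxed condition described above (the class name, method name, and (name, type) pair layout are invented for illustration, not taken from HIVE-15530.1.patch): stats only need updating when an existing column was dropped, renamed, or retyped; merely appending columns leaves the old columns' statistics valid.

```java
import java.util.List;

// Illustrative only -- not the actual HMS alter-table code path.
public class ColumnStatsCheck {
    // Each column is represented as a {name, type} pair for this sketch.
    public static boolean needsStatsUpdate(List<String[]> oldCols, List<String[]> newCols) {
        if (newCols.size() < oldCols.size()) {
            return true; // an existing column was dropped
        }
        for (int i = 0; i < oldCols.size(); i++) {
            String[] o = oldCols.get(i);
            String[] n = newCols.get(i);
            if (!o[0].equalsIgnoreCase(n[0]) || !o[1].equalsIgnoreCase(n[1])) {
                return true; // an existing column was renamed or retyped
            }
        }
        return false; // old columns are an unchanged prefix: keep the stats
    }
}
```

This prefix comparison assumes added columns appear after the existing ones, which is how ALTER TABLE ADD COLUMNS behaves.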
[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark
[ https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15793903#comment-15793903 ] Rui Li commented on HIVE-15527: --- Since the HiveKVResultCache here only stores values for the same key, I think we can avoid Ser/De of the HiveKey to improve performance? > Memory usage is unbound in SortByShuffler for Spark > --- > > Key: HIVE-15527 > URL: https://issues.apache.org/jira/browse/HIVE-15527 > Project: Hive > Issue Type: Improvement > Components: Spark >Affects Versions: 1.1.0 >Reporter: Xuefu Zhang >Assignee: Xuefu Zhang > Attachments: HIVE-15527.1.patch, HIVE-15527.2.patch, > HIVE-15527.3.patch, HIVE-15527.patch > > > In SortByShuffler.java, an ArrayList is used to back the iterator for values > that have the same key in the shuffled result produced by the Spark > transformation sortByKey. It's possible that memory can be exhausted because > of a large key group.
> {code}
> @Override
> public Tuple2<HiveKey, Iterable<BytesWritable>> next() {
>   // TODO: implement this by accumulating rows with the same key into a list.
>   // Note that this list needs to be improved to prevent excessive memory
>   // usage, but this can be done in a later phase.
>   while (it.hasNext()) {
>     Tuple2<HiveKey, BytesWritable> pair = it.next();
>     if (curKey != null && !curKey.equals(pair._1())) {
>       HiveKey key = curKey;
>       List<BytesWritable> values = curValues;
>       curKey = pair._1();
>       curValues = new ArrayList<BytesWritable>();
>       curValues.add(pair._2());
>       return new Tuple2<HiveKey, Iterable<BytesWritable>>(key, values);
>     }
>     curKey = pair._1();
>     curValues.add(pair._2());
>   }
>   if (curKey == null) {
>     throw new NoSuchElementException();
>   }
>   // if we get here, this should be the last element we have
>   HiveKey key = curKey;
>   curKey = null;
>   return new Tuple2<HiveKey, Iterable<BytesWritable>>(key, curValues);
> }
> {code}
> Since the output from sortByKey is already sorted on key, it's possible to > back the value iterable using the same input iterator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-15481) Support multiple and nested subqueries
[ https://issues.apache.org/jira/browse/HIVE-15481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15793920#comment-15793920 ] Hive QA commented on HIVE-15481: Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12845313/HIVE-15481.4.patch {color:green}SUCCESS:{color} +1 due to 14 test(s) being added or modified. {color:red}ERROR:{color} -1 due to 10 failed/errored test(s), 10920 tests executed *Failed tests:* {noformat} TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) (batchId=233) org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[case_sensitivity] (batchId=61) org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input_testxpath] (batchId=28) org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_coalesce] (batchId=75) org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_basic] (batchId=134) org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_schema_evol_3a] (batchId=135) org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_2] (batchId=93) org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_4] (batchId=93) org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_5] (batchId=92) org.apache.hadoop.hive.ql.security.TestStorageBasedMetastoreAuthorizationDrops.testDropTable (batchId=208) {noformat} Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/2759/testReport Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/2759/console Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-2759/ Messages: {noformat} Executing org.apache.hive.ptest.execution.TestCheckPhase Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 10 tests failed {noformat} This message 
is automatically generated. ATTACHMENT ID: 12845313 - PreCommit-HIVE-Build > Support multiple and nested subqueries > -- > > Key: HIVE-15481 > URL: https://issues.apache.org/jira/browse/HIVE-15481 > Project: Hive > Issue Type: Sub-task > Components: Query Planning >Reporter: Vineet Garg >Assignee: Vineet Garg > Attachments: HIVE-15481.1.patch, HIVE-15481.2.patch, > HIVE-15481.3.patch, HIVE-15481.4.patch > > > This is a continuation of the work done in HIVE-15192. As listed under > [Restrictions | > https://issues.apache.org/jira/secure/attachment/12614003/SubQuerySpec.pdf ], > it is currently not possible to execute queries that have either more than > one subquery or a nested subquery. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark
[ https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15793967#comment-15793967 ] liyunzhang_intel commented on HIVE-15527: - [~xuefuz] and [~lirui]: HiveKVResultCache will write key-value pairs if the buffer is full, which limits the memory usage. But is there anything to show that the ArrayList uses a lot of memory? Did you test this with a memory analysis tool? > Memory usage is unbound in SortByShuffler for Spark > --- > > Key: HIVE-15527 > URL: https://issues.apache.org/jira/browse/HIVE-15527 > Project: Hive > Issue Type: Improvement > Components: Spark >Affects Versions: 1.1.0 >Reporter: Xuefu Zhang >Assignee: Xuefu Zhang > Attachments: HIVE-15527.1.patch, HIVE-15527.2.patch, > HIVE-15527.3.patch, HIVE-15527.patch > > > In SortByShuffler.java, an ArrayList is used to back the iterator for values > that have the same key in the shuffled result produced by the Spark > transformation sortByKey. It's possible that memory can be exhausted because > of a large key group.
> {code}
> @Override
> public Tuple2<HiveKey, Iterable<BytesWritable>> next() {
>   // TODO: implement this by accumulating rows with the same key into a list.
>   // Note that this list needs to be improved to prevent excessive memory
>   // usage, but this can be done in a later phase.
>   while (it.hasNext()) {
>     Tuple2<HiveKey, BytesWritable> pair = it.next();
>     if (curKey != null && !curKey.equals(pair._1())) {
>       HiveKey key = curKey;
>       List<BytesWritable> values = curValues;
>       curKey = pair._1();
>       curValues = new ArrayList<BytesWritable>();
>       curValues.add(pair._2());
>       return new Tuple2<HiveKey, Iterable<BytesWritable>>(key, values);
>     }
>     curKey = pair._1();
>     curValues.add(pair._2());
>   }
>   if (curKey == null) {
>     throw new NoSuchElementException();
>   }
>   // if we get here, this should be the last element we have
>   HiveKey key = curKey;
>   curKey = null;
>   return new Tuple2<HiveKey, Iterable<BytesWritable>>(key, curValues);
> }
> {code}
> Since the output from sortByKey is already sorted on key, it's possible to > back the value iterable using the same input iterator.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark
[ https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15793967#comment-15793967 ] liyunzhang_intel edited comment on HIVE-15527 at 1/3/17 2:53 AM: - [~xuefuz] and [~lirui]: HiveKVResultCache will write key-value pairs to disk if the buffer is full, which limits the memory usage. But is there anything to show that the ArrayList uses a lot of memory? Did you test this with a memory analysis tool? was (Author: kellyzly): [~xuefuz] and [~lirui]: HiveKVResultCache will write key-value pairs if the buffer is full, which limits the memory usage. But is there anything to show that the ArrayList uses a lot of memory? Did you test this with a memory analysis tool? > Memory usage is unbound in SortByShuffler for Spark > --- > > Key: HIVE-15527 > URL: https://issues.apache.org/jira/browse/HIVE-15527 > Project: Hive > Issue Type: Improvement > Components: Spark >Affects Versions: 1.1.0 >Reporter: Xuefu Zhang >Assignee: Xuefu Zhang > Attachments: HIVE-15527.1.patch, HIVE-15527.2.patch, > HIVE-15527.3.patch, HIVE-15527.patch > > > In SortByShuffler.java, an ArrayList is used to back the iterator for values > that have the same key in the shuffled result produced by the Spark > transformation sortByKey. It's possible that memory can be exhausted because > of a large key group.
> {code}
> @Override
> public Tuple2<HiveKey, Iterable<BytesWritable>> next() {
>   // TODO: implement this by accumulating rows with the same key into a list.
>   // Note that this list needs to be improved to prevent excessive memory
>   // usage, but this can be done in a later phase.
>   while (it.hasNext()) {
>     Tuple2<HiveKey, BytesWritable> pair = it.next();
>     if (curKey != null && !curKey.equals(pair._1())) {
>       HiveKey key = curKey;
>       List<BytesWritable> values = curValues;
>       curKey = pair._1();
>       curValues = new ArrayList<BytesWritable>();
>       curValues.add(pair._2());
>       return new Tuple2<HiveKey, Iterable<BytesWritable>>(key, values);
>     }
>     curKey = pair._1();
>     curValues.add(pair._2());
>   }
>   if (curKey == null) {
>     throw new NoSuchElementException();
>   }
>   // if we get here, this should be the last element we have
>   HiveKey key = curKey;
>   curKey = null;
>   return new Tuple2<HiveKey, Iterable<BytesWritable>>(key, curValues);
> }
> {code}
> Since the output from sortByKey is already sorted on key, it's possible to > back the value iterable using the same input iterator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-15529) LLAP: TaskSchedulerService can get stuck when scheduleTask returns DELAYED_RESOURCES
[ https://issues.apache.org/jira/browse/HIVE-15529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15794069#comment-15794069 ] Rajesh Balamohan commented on HIVE-15529: - [~pxiong] - Yes, on task failure the node gets into a disabled state. Will debug more on this. > LLAP: TaskSchedulerService can get stuck when scheduleTask returns > DELAYED_RESOURCES > > > Key: HIVE-15529 > URL: https://issues.apache.org/jira/browse/HIVE-15529 > Project: Hive > Issue Type: Bug > Components: llap >Reporter: Rajesh Balamohan >Priority: Critical > > An easier way to simulate the issue: > 1. Start the Hive CLI with "--hiveconf hive.execution.mode=llap" > 2. Run a SQL script file (e.g. a SQL script containing TPC-DS queries) > 3. In the middle of the run, press "Ctrl+C", which interrupts the current > job. This should not exit the Hive CLI yet. > 4. After some time, launch the same SQL script in the same CLI. It gets > stuck indefinitely (waiting for the splits to be computed). > Even when the CLI is quit, the AM runs forever until explicitly killed. > The issue seems to be around {{LlapTaskSchedulerService::schedulePendingTasks}} > dealing with the loop when it encounters {{DELAYED_RESOURCES}} on task > scheduling. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-15313) Add export spark.yarn.archive or spark.yarn.jars variable in Hive on Spark document
[ https://issues.apache.org/jira/browse/HIVE-15313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15794156#comment-15794156 ] Ferdinand Xu commented on HIVE-15313: - [~lirui], any progress or plan on HIVE-15302? I think we can resolve this ticket first since HIVE-15302 doesn't block HIVE-15313. Any suggestions? [~xuefuz] [~lirui] > Add export spark.yarn.archive or spark.yarn.jars variable in Hive on Spark > document > --- > > Key: HIVE-15313 > URL: https://issues.apache.org/jira/browse/HIVE-15313 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel >Priority: Minor > Attachments: performance.improvement.after.set.spark.yarn.archive.PNG > > > According to the [wiki|https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started], queries were run in HOS16 and HOS20 in YARN mode. > The following table shows the difference in query time between HOS16 and HOS20. > ||Version||Total time||Time for Jobs||Time for preparing jobs|| > |Spark16|51|39|12| > |Spark20|54|40|14| > HOS20 spends more time (2 secs) on preparing jobs than HOS16. After reviewing the Spark source code, I found that the following causes this: > [Client#distribute|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L546]. In spark20, if Spark cannot find spark.yarn.archive or spark.yarn.jars in the Spark configuration file, it will first copy all jars in $SPARK_HOME/jars to a tmp directory and upload the tmp directory to the distributed cache. Compare with [spark16|https://github.com/apache/spark/blob/branch-1.6/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L1145]: in spark16, it searches for spark-assembly*.jar and uploads it to the distributed cache. > In spark20, it spends 2 more seconds copying all jars in $SPARK_HOME/jars to a tmp directory if we don't set "spark.yarn.archive" or "spark.yarn.jars".
> We can accelerate the startup of Hive on Spark 2.0 by setting "spark.yarn.archive" or "spark.yarn.jars":
> set "spark.yarn.archive":
> {code}
> cd $SPARK_HOME/jars
> zip spark-archive.zip ./*.jar  # important: enter the jars folder, then zip
> hadoop fs -copyFromLocal spark-archive.zip
> echo "spark.yarn.archive=hdfs:///xxx:8020/spark-archive.zip" >> conf/spark-defaults.conf
> {code}
> set "spark.yarn.jars":
> {code}
> hadoop fs -mkdir spark-2.0.0-bin-hadoop
> hadoop fs -copyFromLocal $SPARK_HOME/jars/* spark-2.0.0-bin-hadoop
> echo "spark.yarn.jars=hdfs:///xxx:8020/spark-2.0.0-bin-hadoop/*" >> conf/spark-defaults.conf
> {code}
> I suggest adding this part to the wiki.
> performance.improvement.after.set.spark.yarn.archive.PNG shows the detailed performance improvement after setting spark.yarn.archive in small queries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-15313) Add export spark.yarn.archive or spark.yarn.jars variable in Hive on Spark document
[ https://issues.apache.org/jira/browse/HIVE-15313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdinand Xu updated HIVE-15313: Assignee: liyunzhang_intel > Add export spark.yarn.archive or spark.yarn.jars variable in Hive on Spark > document > --- > > Key: HIVE-15313 > URL: https://issues.apache.org/jira/browse/HIVE-15313 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel >Assignee: liyunzhang_intel >Priority: Minor > Attachments: performance.improvement.after.set.spark.yarn.archive.PNG > > > According to the [wiki|https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started], queries were run in HOS16 and HOS20 in YARN mode. > The following table shows the difference in query time between HOS16 and HOS20. > ||Version||Total time||Time for Jobs||Time for preparing jobs|| > |Spark16|51|39|12| > |Spark20|54|40|14| > HOS20 spends more time (2 secs) on preparing jobs than HOS16. After reviewing the Spark source code, I found that the following causes this: > [Client#distribute|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L546]. In spark20, if Spark cannot find spark.yarn.archive or spark.yarn.jars in the Spark configuration file, it will first copy all jars in $SPARK_HOME/jars to a tmp directory and upload the tmp directory to the distributed cache. Compare with [spark16|https://github.com/apache/spark/blob/branch-1.6/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L1145]: in spark16, it searches for spark-assembly*.jar and uploads it to the distributed cache. > In spark20, it spends 2 more seconds copying all jars in $SPARK_HOME/jars to a tmp directory if we don't set "spark.yarn.archive" or "spark.yarn.jars".
> We can accelerate the startup of Hive on Spark 2.0 by setting "spark.yarn.archive" or "spark.yarn.jars":
> set "spark.yarn.archive":
> {code}
> cd $SPARK_HOME/jars
> zip spark-archive.zip ./*.jar  # important: enter the jars folder, then zip
> hadoop fs -copyFromLocal spark-archive.zip
> echo "spark.yarn.archive=hdfs:///xxx:8020/spark-archive.zip" >> conf/spark-defaults.conf
> {code}
> set "spark.yarn.jars":
> {code}
> hadoop fs -mkdir spark-2.0.0-bin-hadoop
> hadoop fs -copyFromLocal $SPARK_HOME/jars/* spark-2.0.0-bin-hadoop
> echo "spark.yarn.jars=hdfs:///xxx:8020/spark-2.0.0-bin-hadoop/*" >> conf/spark-defaults.conf
> {code}
> I suggest adding this part to the wiki.
> performance.improvement.after.set.spark.yarn.archive.PNG shows the detailed performance improvement after setting spark.yarn.archive in small queries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-15530) Optimize the column stats update logic in table alteration
[ https://issues.apache.org/jira/browse/HIVE-15530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yibing Shi updated HIVE-15530: -- Attachment: HIVE-15530.1.patch > Optimize the column stats update logic in table alteration > -- > > Key: HIVE-15530 > URL: https://issues.apache.org/jira/browse/HIVE-15530 > Project: Hive > Issue Type: Bug > Components: Hive >Reporter: Yibing Shi > Attachments: HIVE-15530.1.patch > > > Currently when a table is altered, if any of the below conditions is true, HMS > would try to update column statistics for the table: > # database name is changed > # table name is changed > # old columns and new columns are not the same > As a result, when a column is added to a table, Hive also tries to update > column statistics, which is not necessary. We can loosen the last condition by > checking whether any of the existing columns has changed. If none has, we > don't have to update the stats info. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (HIVE-15530) Optimize the column stats update logic in table alteration
[ https://issues.apache.org/jira/browse/HIVE-15530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yibing Shi reassigned HIVE-15530: - Assignee: Yibing Shi > Optimize the column stats update logic in table alteration > -- > > Key: HIVE-15530 > URL: https://issues.apache.org/jira/browse/HIVE-15530 > Project: Hive > Issue Type: Bug > Components: Hive >Reporter: Yibing Shi >Assignee: Yibing Shi > Attachments: HIVE-15530.1.patch > > > Currently when a table is altered, if any of the below conditions is true, HMS > would try to update column statistics for the table: > # database name is changed > # table name is changed > # old columns and new columns are not the same > As a result, when a column is added to a table, Hive also tries to update > column statistics, which is not necessary. We can loosen the last condition by > checking whether any of the existing columns has changed. If none has, we > don't have to update the stats info. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-15530) Optimize the column stats update logic in table alteration
[ https://issues.apache.org/jira/browse/HIVE-15530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yibing Shi updated HIVE-15530: -- Status: Patch Available (was: Open) > Optimize the column stats update logic in table alteration > -- > > Key: HIVE-15530 > URL: https://issues.apache.org/jira/browse/HIVE-15530 > Project: Hive > Issue Type: Bug > Components: Hive >Reporter: Yibing Shi > Attachments: HIVE-15530.1.patch > > > Currently, when a table is altered, HMS tries to update the table's column > statistics if any of the conditions below is true: > # the database name is changed > # the table name is changed > # the old columns and new columns are not the same > As a result, when a column is added to a table, Hive also tries to update > column statistics, which is not necessary. We can loosen the last condition by > checking whether any of the existing columns has changed. If none has, we don't > have to update the stats info. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-15530) Optimize the column stats update logic in table alteration
[ https://issues.apache.org/jira/browse/HIVE-15530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15794176#comment-15794176 ] Pengcheng Xiong commented on HIVE-15530: [~Yibing], could you add a test case for this? Thanks. > Optimize the column stats update logic in table alteration > -- > > Key: HIVE-15530 > URL: https://issues.apache.org/jira/browse/HIVE-15530 > Project: Hive > Issue Type: Bug > Components: Hive >Reporter: Yibing Shi >Assignee: Yibing Shi > Attachments: HIVE-15530.1.patch > > > Currently, when a table is altered, HMS tries to update the table's column > statistics if any of the conditions below is true: > # the database name is changed > # the table name is changed > # the old columns and new columns are not the same > As a result, when a column is added to a table, Hive also tries to update > column statistics, which is not necessary. We can loosen the last condition by > checking whether any of the existing columns has changed. If none has, we don't > have to update the stats info. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
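The relaxation proposed in the issue description can be sketched as follows. This is a simplified, hypothetical illustration only: Hive's real check compares FieldSchema objects inside HMS, and the names StatsCheck and needsStatsUpdate are invented for the example; columns are modeled as "name:type" strings.

```java
import java.util.*;

/**
 * Hedged sketch of the proposed check: treat an alteration as
 * stats-preserving when the old column list is an unchanged prefix of the
 * new one, i.e. columns were only appended at the end.
 */
final class StatsCheck {
    static boolean needsStatsUpdate(List<String> oldCols, List<String> newCols) {
        // Appending columns leaves the existing column statistics valid, so
        // no update is needed; changing or removing an existing column is
        // what invalidates the stats.
        boolean onlyAppended = newCols.size() >= oldCols.size()
                && newCols.subList(0, oldCols.size()).equals(oldCols);
        return !onlyAppended;
    }
}
```

With such a check, an `ALTER TABLE ... ADD COLUMNS` call would skip the stats update entirely, while a type change or column removal would still trigger it.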
[jira] [Resolved] (HIVE-8373) OOM for a simple query with spark.master=local [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdinand Xu resolved HIVE-8373. Resolution: Fixed Fix Version/s: 2.2.0 WIKI has been updated. Thanks [~kellyzly] for the contributions. > OOM for a simple query with spark.master=local [Spark Branch] > - > > Key: HIVE-8373 > URL: https://issues.apache.org/jira/browse/HIVE-8373 > Project: Hive > Issue Type: Bug > Components: Spark >Reporter: Xuefu Zhang >Assignee: liyunzhang_intel > Fix For: 2.2.0 > > > I have a straightforward query to run in Spark local mode, but get an OOM > even though the data volume is tiny: > {code} > Exception in thread "Spark Context Cleaner" > Exception: java.lang.OutOfMemoryError thrown from the > UncaughtExceptionHandler in thread "Spark Context Cleaner" > Exception in thread "Executor task launch worker-1" > Exception: java.lang.OutOfMemoryError thrown from the > UncaughtExceptionHandler in thread "Executor task launch worker-1" > Exception in thread "Keep-Alive-Timer" > Exception: java.lang.OutOfMemoryError thrown from the > UncaughtExceptionHandler in thread "Keep-Alive-Timer" > Exception in thread "Driver Heartbeater" > Exception: java.lang.OutOfMemoryError thrown from the > UncaughtExceptionHandler in thread "Driver Heartbeater" > {code} > The query is: > {code} > select product_name, avg(item_price) as avg_price from product join item on > item.product_pk=product.product_pk group by product_name order by avg_price; > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-15527) Memory usage is unbound in SortByShuffler for Spark
[ https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15794194#comment-15794194 ] Xuefu Zhang commented on HIVE-15527: [~csun], [~lirui], and [~kellyzly], thanks for your feedback. The patch here is more like a POC, so improvement is needed for production. Here are a few thoughts: 1) I'm not sure what caused the result diff, though there might be a bug in HiveKVResultCache that it manifests. The diff seems invalid when compared to the MR result. Also, there seems to be some randomness in generating the diff. 2) As to performance, Rui's concern is valid. What I tried to demonstrate is that we need something similar to HiveKVResultCache but only for values. 3) Similar to 2), we need a good cache size to avoid file I/O for regular group sizes. Currently HiveKVResultCache caches only 1024 rows, which seems rather small. 4) The performance impact needs to be evaluated. 5) The idea here could be used to solve the same problem for Spark's groupByKey() in Hive. We could use Spark's reduceByKey() instead and do in-group value caching in Hive like we do here. I'm not sure if I have the bandwidth to move this forward at full speed. Please feel free to take this (and other issues) forward. Thanks. > Memory usage is unbound in SortByShuffler for Spark > --- > > Key: HIVE-15527 > URL: https://issues.apache.org/jira/browse/HIVE-15527 > Project: Hive > Issue Type: Improvement > Components: Spark >Affects Versions: 1.1.0 >Reporter: Xuefu Zhang >Assignee: Xuefu Zhang > Attachments: HIVE-15527.1.patch, HIVE-15527.2.patch, > HIVE-15527.3.patch, HIVE-15527.patch > > > In SortByShuffler.java, an ArrayList is used to back the iterator for values > that have the same key in the shuffled result produced by the Spark > transformation sortByKey. It's possible that memory can be exhausted because > of a large key group. 
> {code}
> @Override
> public Tuple2<HiveKey, List<BytesWritable>> next() {
>   // TODO: implement this by accumulating rows with the same key into a list.
>   // Note that this list needs to be improved to prevent excessive memory
>   // usage, but this can be done in a later phase.
>   while (it.hasNext()) {
>     Tuple2<HiveKey, BytesWritable> pair = it.next();
>     if (curKey != null && !curKey.equals(pair._1())) {
>       HiveKey key = curKey;
>       List<BytesWritable> values = curValues;
>       curKey = pair._1();
>       curValues = new ArrayList<BytesWritable>();
>       curValues.add(pair._2());
>       return new Tuple2<HiveKey, List<BytesWritable>>(key, values);
>     }
>     curKey = pair._1();
>     curValues.add(pair._2());
>   }
>   if (curKey == null) {
>     throw new NoSuchElementException();
>   }
>   // if we get here, this should be the last element we have
>   HiveKey key = curKey;
>   curKey = null;
>   return new Tuple2<HiveKey, List<BytesWritable>>(key, curValues);
> }
> {code}
> Since the output from sortByKey is already sorted on key, it's possible to
> back the value iterable using the same input iterator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
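The closing suggestion above, backing the value iterable with the same input iterator, can be sketched as follows. This is a simplified, hypothetical illustration, not Hive's actual code: it uses String keys and values instead of HiveKey/BytesWritable, and the class and method names are invented. The key caveat of this design is that each group's values must be fully consumed before advancing to the next group, since all groups stream from the one shared iterator.

```java
import java.util.*;

/**
 * Sketch: group a key-sorted iterator into (key, values) pairs without
 * materializing each group in an ArrayList, keeping memory bounded.
 */
class LazyGrouper {
    private final Iterator<String[]> input; // each element: {key, value}, sorted by key
    private String[] pending;               // next pair not yet handed out

    LazyGrouper(Iterator<String[]> input) {
        this.input = input;
        this.pending = input.hasNext() ? input.next() : null;
    }

    boolean hasNextGroup() {
        return pending != null;
    }

    /** Key of the current group; call values() to stream its values. */
    String nextKey() {
        return pending[0];
    }

    /** Streams the current key's values directly from the shared iterator. */
    Iterator<String> values() {
        final String key = pending[0];
        return new Iterator<String>() {
            @Override
            public boolean hasNext() {
                // The group ends when the shared iterator is exhausted or
                // the next pending pair carries a different key.
                return pending != null && pending[0].equals(key);
            }

            @Override
            public String next() {
                String v = pending[1];
                pending = input.hasNext() ? input.next() : null;
                return v;
            }
        };
    }
}
```

Because values are pulled lazily from the shared iterator, a group of any size uses constant memory, at the cost of forbidding re-iteration of a group once consumed.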
[jira] [Updated] (HIVE-15526) Some tests need SORT_QUERY_RESULTS
[ https://issues.apache.org/jira/browse/HIVE-15526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Li updated HIVE-15526: -- Attachment: HIVE-15526.1.patch > Some tests need SORT_QUERY_RESULTS > -- > > Key: HIVE-15526 > URL: https://issues.apache.org/jira/browse/HIVE-15526 > Project: Hive > Issue Type: Test >Reporter: Rui Li >Assignee: Rui Li >Priority: Minor > Attachments: HIVE-15526.1.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-15526) Some tests need SORT_QUERY_RESULTS
[ https://issues.apache.org/jira/browse/HIVE-15526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Li updated HIVE-15526: -- Status: Patch Available (was: Open) > Some tests need SORT_QUERY_RESULTS > -- > > Key: HIVE-15526 > URL: https://issues.apache.org/jira/browse/HIVE-15526 > Project: Hive > Issue Type: Test >Reporter: Rui Li >Assignee: Rui Li >Priority: Minor > Attachments: HIVE-15526.1.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-15313) Add export spark.yarn.archive or spark.yarn.jars variable in Hive on Spark document
[ https://issues.apache.org/jira/browse/HIVE-15313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15794215#comment-15794215 ] Rui Li commented on HIVE-15313: --- [~Ferd], yeah I'm OK to update the wiki for performance first. I'll update again once the minimum set is determined. > Add export spark.yarn.archive or spark.yarn.jars variable in Hive on Spark > document > --- > > Key: HIVE-15313 > URL: https://issues.apache.org/jira/browse/HIVE-15313 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel >Assignee: liyunzhang_intel >Priority: Minor > Attachments: performance.improvement.after.set.spark.yarn.archive.PNG > > > Following the > [wiki|https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started], > we ran queries in HOS16 and HOS20 in yarn mode. > The following table shows the difference in query time between HOS16 and HOS20. > ||Version||Total time||Time for Jobs||Time for preparing jobs|| > |Spark16|51|39|12| > |Spark20|54|40|14| > HOS20 spends more time (2 secs) on preparing jobs than HOS16. After reviewing > the Spark source code, we found that the following point causes this: > code:[Client#distribute|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L546]. > In Spark 2.0, if Spark cannot find spark.yarn.archive or spark.yarn.jars in the > Spark configuration file, it first copies all jars in $SPARK_HOME/jars to > a tmp directory and uploads that directory to the distributed cache. Compare > [spark16|https://github.com/apache/spark/blob/branch-1.6/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L1145]: > in Spark 1.6, it searches for spark-assembly*.jar and uploads it to the distributed cache. > In Spark 2.0, it spends 2 more seconds copying all jars in $SPARK_HOME/jars to a > tmp directory if we don't set "spark.yarn.archive" or "spark.yarn.jars". 
> We can accelerate the startup of Hive on Spark 2.0 by setting
> "spark.yarn.archive" or "spark.yarn.jars":
> set "spark.yarn.archive":
> {code}
> cd $SPARK_HOME/jars
> zip spark-archive.zip ./*.jar  # this is important: enter the jars folder, then zip
> hadoop fs -copyFromLocal spark-archive.zip
> echo "spark.yarn.archive=hdfs:///xxx:8020/spark-archive.zip" >> conf/spark-defaults.conf
> {code}
> set "spark.yarn.jars":
> {code}
> hadoop fs -mkdir spark-2.0.0-bin-hadoop
> hadoop fs -copyFromLocal $SPARK_HOME/jars/* spark-2.0.0-bin-hadoop
> echo "spark.yarn.jars=hdfs:///xxx:8020/spark-2.0.0-bin-hadoop/*" >> conf/spark-defaults.conf
> {code}
> We suggest adding this part to the wiki.
> performance.improvement.after.set.spark.yarn.archive.PNG shows the detailed
> performance improvement after setting spark.yarn.archive in small queries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-15525) Hooking ChangeManager to "drop table", "drop partition"
[ https://issues.apache.org/jira/browse/HIVE-15525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated HIVE-15525: -- Attachment: HIVE-15525.2.patch > Hooking ChangeManager to "drop table", "drop partition" > --- > > Key: HIVE-15525 > URL: https://issues.apache.org/jira/browse/HIVE-15525 > Project: Hive > Issue Type: Sub-task > Components: repl >Reporter: Daniel Dai >Assignee: Daniel Dai > Attachments: HIVE-15525.1.patch, HIVE-15525.2.patch > > > When Hive executes "drop table"/"drop partition", we will move the data files > into cmroot in case the replication destination needs them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HIVE-15313) Add export spark.yarn.archive or spark.yarn.jars variable in Hive on Spark document
[ https://issues.apache.org/jira/browse/HIVE-15313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdinand Xu resolved HIVE-15313. - Resolution: Fixed Fix Version/s: 2.2.0 Updated the Configuring Hive section (https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started#HiveonSpark:GettingStarted-ConfiguringHive). Thanks [~kellyzly], [~xuefuz] and [~lirui] for the review and contribution. > Add export spark.yarn.archive or spark.yarn.jars variable in Hive on Spark > document > --- > > Key: HIVE-15313 > URL: https://issues.apache.org/jira/browse/HIVE-15313 > Project: Hive > Issue Type: Bug >Reporter: liyunzhang_intel >Assignee: liyunzhang_intel >Priority: Minor > Fix For: 2.2.0 > > Attachments: performance.improvement.after.set.spark.yarn.archive.PNG > > > Following the > [wiki|https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started], > we ran queries in HOS16 and HOS20 in yarn mode. > The following table shows the difference in query time between HOS16 and HOS20. > ||Version||Total time||Time for Jobs||Time for preparing jobs|| > |Spark16|51|39|12| > |Spark20|54|40|14| > HOS20 spends more time (2 secs) on preparing jobs than HOS16. After reviewing > the Spark source code, we found that the following point causes this: > code:[Client#distribute|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L546]. > In Spark 2.0, if Spark cannot find spark.yarn.archive or spark.yarn.jars in the > Spark configuration file, it first copies all jars in $SPARK_HOME/jars to > a tmp directory and uploads that directory to the distributed cache. Compare > [spark16|https://github.com/apache/spark/blob/branch-1.6/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L1145]: > in Spark 1.6, it searches for spark-assembly*.jar and uploads it to the distributed cache. 
> In Spark 2.0, it spends 2 more seconds copying all jars in $SPARK_HOME/jars to a
> tmp directory if we don't set "spark.yarn.archive" or "spark.yarn.jars".
> We can accelerate the startup of Hive on Spark 2.0 by setting
> "spark.yarn.archive" or "spark.yarn.jars":
> set "spark.yarn.archive":
> {code}
> cd $SPARK_HOME/jars
> zip spark-archive.zip ./*.jar  # this is important: enter the jars folder, then zip
> hadoop fs -copyFromLocal spark-archive.zip
> echo "spark.yarn.archive=hdfs:///xxx:8020/spark-archive.zip" >> conf/spark-defaults.conf
> {code}
> set "spark.yarn.jars":
> {code}
> hadoop fs -mkdir spark-2.0.0-bin-hadoop
> hadoop fs -copyFromLocal $SPARK_HOME/jars/* spark-2.0.0-bin-hadoop
> echo "spark.yarn.jars=hdfs:///xxx:8020/spark-2.0.0-bin-hadoop/*" >> conf/spark-defaults.conf
> {code}
> We suggest adding this part to the wiki.
> performance.improvement.after.set.spark.yarn.archive.PNG shows the detailed
> performance improvement after setting spark.yarn.archive in small queries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-15530) Optimize the column stats update logic in table alteration
[ https://issues.apache.org/jira/browse/HIVE-15530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15794303#comment-15794303 ] Hive QA commented on HIVE-15530: Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12845326/HIVE-15530.1.patch {color:red}ERROR:{color} -1 due to no test(s) being added or modified. {color:red}ERROR:{color} -1 due to 9 failed/errored test(s), 10883 tests executed *Failed tests:* {noformat} TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) (batchId=233) TestMiniLlapLocalCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=139) [skewjoinopt15.q,vector_coalesce.q,orc_ppd_decimal.q,cbo_rp_lineage2.q,insert_into_with_schema.q,join_emit_interval.q,load_dyn_part3.q,auto_sortmerge_join_14.q,vector_null_projection.q,vector_cast_constant.q,mapjoin2.q,bucket_map_join_tez2.q,correlationoptimizer4.q,schema_evol_orc_acidvec_part_update.q,vectorization_12.q,vector_number_compare_projection.q,orc_merge_incompat3.q,vector_leftsemi_mapjoin.q,update_all_non_partitioned.q,multi_column_in_single.q,schema_evol_orc_nonvec_table.q,cbo_rp_semijoin.q,tez_insert_overwrite_local_directory_1.q,schema_evol_text_vecrow_table.q,vector_count.q,auto_sortmerge_join_15.q,vector_if_expr.q,delete_whole_partition.q,vector_decimal_6.q,sample1.q] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[case_sensitivity] (batchId=61) org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input_testxpath] (batchId=28) org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_coalesce] (batchId=75) org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_basic] (batchId=134) org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_schema_evol_3a] (batchId=135) org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_2] (batchId=93) 
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_5] (batchId=92) {noformat} Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/2760/testReport Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/2760/console Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-2760/ Messages: {noformat} Executing org.apache.hive.ptest.execution.TestCheckPhase Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 9 tests failed {noformat} This message is automatically generated. ATTACHMENT ID: 12845326 - PreCommit-HIVE-Build > Optimize the column stats update logic in table alteration > -- > > Key: HIVE-15530 > URL: https://issues.apache.org/jira/browse/HIVE-15530 > Project: Hive > Issue Type: Bug > Components: Hive >Reporter: Yibing Shi >Assignee: Yibing Shi > Attachments: HIVE-15530.1.patch > > > Currently, when a table is altered, HMS tries to update the table's column > statistics if any of the conditions below is true: > # the database name is changed > # the table name is changed > # the old columns and new columns are not the same > As a result, when a column is added to a table, Hive also tries to update > column statistics, which is not necessary. We can loosen the last condition by > checking whether any of the existing columns has changed. If none has, we don't > have to update the stats info. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-15526) Some tests need SORT_QUERY_RESULTS
[ https://issues.apache.org/jira/browse/HIVE-15526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15794388#comment-15794388 ] Hive QA commented on HIVE-15526: Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12845328/HIVE-15526.1.patch {color:green}SUCCESS:{color} +1 due to 2 test(s) being added or modified. {color:red}ERROR:{color} -1 due to 7 failed/errored test(s), 10913 tests executed *Failed tests:* {noformat} TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) (batchId=233) org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[case_sensitivity] (batchId=61) org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input_testxpath] (batchId=28) org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_coalesce] (batchId=75) org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_basic] (batchId=134) org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_schema_evol_3a] (batchId=135) org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_2] (batchId=93) {noformat} Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/2761/testReport Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/2761/console Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-2761/ Messages: {noformat} Executing org.apache.hive.ptest.execution.TestCheckPhase Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 7 tests failed {noformat} This message is automatically generated. 
ATTACHMENT ID: 12845328 - PreCommit-HIVE-Build > Some tests need SORT_QUERY_RESULTS > -- > > Key: HIVE-15526 > URL: https://issues.apache.org/jira/browse/HIVE-15526 > Project: Hive > Issue Type: Test >Reporter: Rui Li >Assignee: Rui Li >Priority: Minor > Attachments: HIVE-15526.1.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-15526) Some tests need SORT_QUERY_RESULTS
[ https://issues.apache.org/jira/browse/HIVE-15526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Li updated HIVE-15526: -- Description: {{temp_table_gb1.q}} and {{vector_between_in.q}} > Some tests need SORT_QUERY_RESULTS > -- > > Key: HIVE-15526 > URL: https://issues.apache.org/jira/browse/HIVE-15526 > Project: Hive > Issue Type: Test >Reporter: Rui Li >Assignee: Rui Li >Priority: Minor > Attachments: HIVE-15526.1.patch > > > {{temp_table_gb1.q}} and {{vector_between_in.q}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
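For context, SORT_QUERY_RESULTS is a directive placed as a comment at the top of a .q test file; the test driver sorts the query output before diffing it against the expected results, so tests whose queries don't guarantee an ordering stop producing spurious diffs. A minimal illustration (the query itself is made up and not from the patch):

```sql
-- SORT_QUERY_RESULTS

-- Without the directive above, a GROUP BY without an ORDER BY may emit rows
-- in any order and cause flaky diffs against the golden output file.
SELECT key, count(*) FROM src GROUP BY key;
```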
[jira] [Commented] (HIVE-15526) Some tests need SORT_QUERY_RESULTS
[ https://issues.apache.org/jira/browse/HIVE-15526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15794403#comment-15794403 ] Rui Li commented on HIVE-15526: --- Failures not related. [~xuefuz], could you have a look? The change is trivial. Thanks. > Some tests need SORT_QUERY_RESULTS > -- > > Key: HIVE-15526 > URL: https://issues.apache.org/jira/browse/HIVE-15526 > Project: Hive > Issue Type: Test >Reporter: Rui Li >Assignee: Rui Li >Priority: Minor > Attachments: HIVE-15526.1.patch > > > {{temp_table_gb1.q}} and {{vector_between_in.q}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)