Hoping you can give our product a wonderful name

2017-09-08 Thread Jone Zhang
We have built an ML platform based on open source frameworks such as
Hadoop, Spark, and TensorFlow. Now we need to give our product a wonderful
name, and we are eager for everyone's advice.

Any answers will be greatly appreciated.
Thanks.


How can I merge multiple rows into one row in SparkSQL or HiveSQL?

2017-05-15 Thread Jone Zhang
For example
Data1(has 1 billion records)
user_id1  feature1
user_id1  feature2

Data2(has 1 billion records)
user_id1  feature3

Data3(has 1 billion records)
user_id1  feature4
user_id1  feature5
...
user_id1  feature100

I want to get the result as follow
user_id1  feature1 feature2 feature3 feature4 feature5...feature100

Is there a more efficient way except join?

Thanks!
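
One hedged alternative to a 100-way join, assuming the three inputs really are (user_id, feature) pairs as sketched above: UNION ALL the tables once and aggregate the features per user. This is only a sketch with illustrative table and column names, and collect_list does not guarantee the order of the features:

select user_id,
       concat_ws(' ', collect_list(feature)) as features
from (
  select user_id, feature from data1
  union all
  select user_id, feature from data2
  union all
  select user_id, feature from data3
) t
group by user_id;

It still shuffles the combined rows once, but it avoids joining the billion-row tables against each other.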


Fwd: Why spark.sql.autoBroadcastJoinThreshold is not taking effect

2017-05-11 Thread Jone Zhang
Solved it by removing the lazy keyword:
2. HiveContext.sql("cache table feature as select * from src where ..."),
whose result size is only 100K

---------- Forwarded message ----------
From: Jone Zhang 
Date: 2017-05-10 19:10 GMT+08:00
Subject: Why spark.sql.autoBroadcastJoinThreshold is not taking effect
To: "user @spark/'user @spark'/spark users/user@spark" <
u...@spark.apache.org>


Now I use Spark 1.6.0 in Java.
I wish the following SQL to be executed as a broadcast join:
*select * from sample join feature*

These are my steps:
1. set spark.sql.autoBroadcastJoinThreshold=100M
2. HiveContext.sql("cache lazy table feature as select * from src where
..."), whose result size is only 100K
3. HiveContext.sql("select * from sample join feature")
Why is the join a SortMergeJoin?

Grateful for any ideas!
Thanks.
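
For reference, a sketch of the sequence implied by the fix above (dropping the lazy keyword, so the cached table is materialized and its size is known when the join is planned). The threshold value is illustrative and is normally given in bytes, and the where-clause predicate stays elided exactly as in the original message:

set spark.sql.autoBroadcastJoinThreshold=104857600;  -- roughly 100 MB, expressed in bytes (illustrative)
cache table feature as select * from src where ...;  -- no "lazy", so the ~100K result is cached immediately
select * from sample join feature;                   -- EXPLAIN should now show a broadcast join

Whether the broadcast actually happens can be confirmed by running EXPLAIN on the final statement.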


How to create an auto-increment key for a table in Hive?

2017-04-11 Thread Jone Zhang
The Hive table is written to by many people.
How can I create an auto-increment key for such a table in Hive?

For example
create table test(id, value)
load data v1 v2 into table test
load data v3 v4 into table test

select * from test
1 v1
2 v2
3 v3
4 v4
...


Thanks
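
Hive has no auto-increment column type, so a hedged workaround is to assign ids at load time by offsetting row_number() with the current maximum id. This is only a sketch with illustrative names (staging holds the rows to load), and it is not safe when several writers load at the same time, which is exactly the scenario described above, so the loads would still need to be serialized:

insert into table test
select m.max_id + row_number() over (order by s.value) as id,
       s.value
from staging s
cross join (select coalesce(max(id), 0) as max_id from test) m;

The order by inside the window only fixes an arbitrary numbering order; any column will do.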


Re: Two results are inconsistent when I use Hive on Spark

2016-01-26 Thread Jone Zhang
*Some properties in hive-site.xml are:*


<property>
  <name>hive.ignore.mapjoin.hint</name>
  <value>false</value>
</property>
<property>
  <name>hive.auto.convert.join</name>
  <value>true</value>
</property>
<property>
  <name>hive.auto.convert.join.noconditionaltask</name>
  <value>true</value>
</property>


*If more information is required, please let us know.*

*Thanks.*
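
A hedged diagnostic, not a confirmed fix: re-running the same count with the map-join hint ignored and automatic map-join conversion disabled can show whether the map-join path is what makes the result flip between the two values:

set hive.execution.engine=spark;
set hive.ignore.mapjoin.hint=true;   -- ignore the /*+mapjoin(t3,t4,t5)*/ hint
set hive.auto.convert.join=false;    -- keep the joins as common (shuffle) joins
-- then re-run the original count(1) query and compare the result with the two values above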

2016-01-27 15:20 GMT+08:00 Jone Zhang :

> *I have run a query many times, and it returns one of two results with no
> regular pattern.*
> *One is 36834699 and the other is 18464706.*
>
> *The query is *
> set spark.yarn.queue=soft.high;
> set hive.execution.engine=spark;
> select /*+mapjoin(t3,t4,t5)*/
>   count(1)
> from
>   (
>   select
> coalesce(t11.qua,t12.qua,t13.qua) qua,
> coalesce(t11.scene,t12.lanmu_id,t13.lanmu_id) scene,
> coalesce(t11.app_id,t12.appid,t13.app_id) app_id,
> expos_pv,
> expos_uv,
> dload_pv,
> dload_uv,
> dload_cnt,
> dload_user,
> evil_dload_cnt,
> evil_dload_user,
> update_dcnt,
> update_duser,
> hand_suc_incnt,
> hand_suc_inuser,
> day_hand_suc_incnt,
> day_hand_suc_inuser
>   from
> (select * from t_ed_soft_assist_useraction_stat where ds=20160126)t11
> full outer join
> (select * from t_md_soft_lanmu_app_dload_detail where ds=20160126)t12
> on t11.qua=t12.qua and t11.app_id=t12.appid and t11.scene=t12.lanmu_id
> full outer join
> (select * from t_md_soft_client_install_lanmu  where ds=20160126)t13
> on t11.qua=t13.qua and t11.app_id=t13.app_id and t11.scene=t13.lanmu_id
>   )t1
>   left outer join t_rd_qua t3 on t3.ds=20160126 and t1.qua=t3.qua
>   left outer join t_rd_soft_appnew_last t4 on t4.ds=20160126 and
> t1.app_id=t4.app_id
>   left outer join t_rd_soft_page_conf t5 on t5.ds=20160126 and
> t1.scene=t5.pageid and t3.client_type_id=t5.ismtt;
>
>
> *Explain query is*
> STAGE DEPENDENCIES:
>   Stage-2 is a root stage
>   Stage-1 depends on stages: Stage-2
>   Stage-0 depends on stages: Stage-1
>
> STAGE PLANS:
>   Stage: Stage-2
> Spark
>   DagName: mqq_20160127151826_e8197f40-18d7-430c-9fc8-993facb74534:2
>   Vertices:
> Map 6
> Map Operator Tree:
> TableScan
>   alias: t3
>   Statistics: Num rows: 1051 Data size: 113569 Basic
> stats: COMPLETE Column stats: NONE
>   Spark HashTable Sink Operator
> keys:
>   0 _col0 (type: string)
>   1 qua (type: string)
> Local Work:
>   Map Reduce Local Work
> Map 7
> Map Operator Tree:
> TableScan
>   alias: t4
>   Statistics: Num rows: 2542751 Data size: 220433659 Basic
> stats: COMPLETE Column stats: NONE
>   Spark HashTable Sink Operator
> keys:
>   0 UDFToDouble(_col2) (type: double)
>   1 UDFToDouble(app_id) (type: double)
> Local Work:
>   Map Reduce Local Work
> Map 8
> Map Operator Tree:
> TableScan
>   alias: t5
>   Statistics: Num rows: 143 Data size: 28605 Basic stats:
> COMPLETE Column stats: NONE
>   Spark HashTable Sink Operator
> keys:
>   0 _col1 (type: string), UDFToDouble(_col20) (type:
> double)
>   1 pageid (type: string), UDFToDouble(ismtt) (type:
> double)
> Local Work:
>   Map Reduce Local Work
>
>   Stage: Stage-1
> Spark
>   Edges:
> Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 5), Map 4
> (PARTITION-LEVEL SORT, 5), Map 5 (PARTITION-LEVEL SORT, 5)
> Reducer 3 <- Reducer 2 (GROUP, 1)
>   DagName: mqq_20160127151826_e8197f40-18d7-430c-9fc8-993facb74534:1
>   Vertices:
> Map 1
> Map Operator Tree:
> TableScan
>   alias: t_ed_soft_assist_useraction_stat
>   Statistics: Num rows: 16368107 Data size: 651461220
> Basic stats: COMPLETE Column stats: NONE
>   Select Operator
> expressions: qua (type: string), scene (type: string),
> app_id (type: string)
> outputColumnNames: _col0, _col1, _col2
> Statistics: Num rows: 16368107 Data size: 651461220
> Basic stats: COMPLETE Column stats: NONE
> Reduce Output Operator
>   key expressions: _col0 (type: string),
> UDFToDouble(_col2) (type: double), _col1 (type: string)
>   sort order: +++
>   Map-reduce partition columns: _col0 (type: string),
> UDFToDoub

Two results are inconsistent when I use Hive on Spark

2016-01-26 Thread Jone Zhang
*I have run a query many times, and it returns one of two results with no
regular pattern.*
*One is 36834699 and the other is 18464706.*

*The query is *
set spark.yarn.queue=soft.high;
set hive.execution.engine=spark;
select /*+mapjoin(t3,t4,t5)*/
  count(1)
from
  (
  select
coalesce(t11.qua,t12.qua,t13.qua) qua,
coalesce(t11.scene,t12.lanmu_id,t13.lanmu_id) scene,
coalesce(t11.app_id,t12.appid,t13.app_id) app_id,
expos_pv,
expos_uv,
dload_pv,
dload_uv,
dload_cnt,
dload_user,
evil_dload_cnt,
evil_dload_user,
update_dcnt,
update_duser,
hand_suc_incnt,
hand_suc_inuser,
day_hand_suc_incnt,
day_hand_suc_inuser
  from
(select * from t_ed_soft_assist_useraction_stat where ds=20160126)t11
full outer join
(select * from t_md_soft_lanmu_app_dload_detail where ds=20160126)t12
on t11.qua=t12.qua and t11.app_id=t12.appid and t11.scene=t12.lanmu_id
full outer join
(select * from t_md_soft_client_install_lanmu  where ds=20160126)t13
on t11.qua=t13.qua and t11.app_id=t13.app_id and t11.scene=t13.lanmu_id
  )t1
  left outer join t_rd_qua t3 on t3.ds=20160126 and t1.qua=t3.qua
  left outer join t_rd_soft_appnew_last t4 on t4.ds=20160126 and
t1.app_id=t4.app_id
  left outer join t_rd_soft_page_conf t5 on t5.ds=20160126 and
t1.scene=t5.pageid and t3.client_type_id=t5.ismtt;


*Explain query is*
STAGE DEPENDENCIES:
  Stage-2 is a root stage
  Stage-1 depends on stages: Stage-2
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-2
Spark
  DagName: mqq_20160127151826_e8197f40-18d7-430c-9fc8-993facb74534:2
  Vertices:
Map 6
Map Operator Tree:
TableScan
  alias: t3
  Statistics: Num rows: 1051 Data size: 113569 Basic stats:
COMPLETE Column stats: NONE
  Spark HashTable Sink Operator
keys:
  0 _col0 (type: string)
  1 qua (type: string)
Local Work:
  Map Reduce Local Work
Map 7
Map Operator Tree:
TableScan
  alias: t4
  Statistics: Num rows: 2542751 Data size: 220433659 Basic
stats: COMPLETE Column stats: NONE
  Spark HashTable Sink Operator
keys:
  0 UDFToDouble(_col2) (type: double)
  1 UDFToDouble(app_id) (type: double)
Local Work:
  Map Reduce Local Work
Map 8
Map Operator Tree:
TableScan
  alias: t5
  Statistics: Num rows: 143 Data size: 28605 Basic stats:
COMPLETE Column stats: NONE
  Spark HashTable Sink Operator
keys:
  0 _col1 (type: string), UDFToDouble(_col20) (type:
double)
  1 pageid (type: string), UDFToDouble(ismtt) (type:
double)
Local Work:
  Map Reduce Local Work

  Stage: Stage-1
Spark
  Edges:
Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 5), Map 4
(PARTITION-LEVEL SORT, 5), Map 5 (PARTITION-LEVEL SORT, 5)
Reducer 3 <- Reducer 2 (GROUP, 1)
  DagName: mqq_20160127151826_e8197f40-18d7-430c-9fc8-993facb74534:1
  Vertices:
Map 1
Map Operator Tree:
TableScan
  alias: t_ed_soft_assist_useraction_stat
  Statistics: Num rows: 16368107 Data size: 651461220 Basic
stats: COMPLETE Column stats: NONE
  Select Operator
expressions: qua (type: string), scene (type: string),
app_id (type: string)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 16368107 Data size: 651461220
Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
  key expressions: _col0 (type: string),
UDFToDouble(_col2) (type: double), _col1 (type: string)
  sort order: +++
  Map-reduce partition columns: _col0 (type: string),
UDFToDouble(_col2) (type: double), _col1 (type: string)
  Statistics: Num rows: 16368107 Data size: 651461220
Basic stats: COMPLETE Column stats: NONE
  value expressions: _col2 (type: string)
Map 4
Map Operator Tree:
TableScan
  alias: t_md_soft_lanmu_app_dload_detail
  Statistics: Num rows: 2503976 Data size: 203324640 Basic
stats: COMPLETE Column stats: NONE
  Select Operator
expressions: qua (type: string), appid (type: bigint),
lanmu_id (type: string)
outputColumnNames: _col2, _col3, _col4
Statistics: Num rows: 2503976 Data size: 203324640
Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
  key expressions: _col2 (type: string),
UDFToD

Re: How to ensure that the record values of Hive on MapReduce and Hive on Spark are completely consistent?

2016-01-07 Thread Jone Zhang
2016-01-08 11:37 GMT+08:00 Jone Zhang :

> We made a comparison of the number of records between Hive on MapReduce
> and Hive on Spark, and they are in good agreement.
> But how can we ensure that the record values of Hive on MapReduce and Hive on
> Spark are completely consistent?
> Do you have any suggestions?
>
> Best wishes.
> Thanks.
>


How to ensure that the record values of Hive on MapReduce and Hive on Spark are completely consistent?

2016-01-07 Thread Jone Zhang
We made a comparison of the number of records between Hive on MapReduce and
Hive on Spark, and they are in good agreement.
But how can we ensure that the record values of Hive on MapReduce and Hive on
Spark are completely consistent?
Do you have any suggestions?

Best wishes.
Thanks.
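
A hedged sketch of going one step beyond comparing counts: write the same query's output from each engine into its own table and diff the rows, or compare a cheap aggregate fingerprint. The table and column names (result_mr, result_spark, k, v) are illustrative:

-- rows that exist in one engine's output but not the other's
select coalesce(a.k, b.k) as k, a.v as v_mr, b.v as v_spark
from result_mr a
full outer join result_spark b
  on a.k = b.k and a.v = b.v
where a.k is null or b.k is null;

-- or a cheap per-engine fingerprint
select count(*) as cnt, sum(hash(k, v)) as fingerprint from result_mr;
select count(*) as cnt, sum(hash(k, v)) as fingerprint from result_spark;

The join approach assumes k identifies a row and v is never null; duplicate keys or null values would need a count per (k, v) group instead.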


Re: It seems that the result of Hive on Spark is wrong, and the results of Hive and Hive on Spark are not the same

2015-12-22 Thread Jone Zhang
Hive 1.2.1 on Spark 1.4.1

2015-12-22 19:31 GMT+08:00 Jone Zhang :

> *select  * from staff;*
> 1 jone 22 1
> 2 lucy 21 1
> 3 hmm 22 2
> 4 james 24 3
> 5 xiaoliu 23 3
>
> *select id,date_ from trade union all select id,"test" from trade ;*
> 1 201510210908
> 2 201509080234
> 2 201509080235
> 1 test
> 2 test
> 2 test
>
> *set hive.execution.engine=spark;*
> *set spark.master=local;*
> *select /*+mapjoin(t)*/ * from staff s join *
> *(select id,date_ from trade union all select id,"test" from trade ) t on
> s.id=t.id;*
> 1 jone 22 1 1 201510210908
> 2 lucy 21 1 2 201509080234
> 2 lucy 21 1 2 201509080235
>
> *set hive.execution.engine=mr;*
> *select /*+mapjoin(t)*/ * from staff s join *
> *(select id,date_ from trade union all select id,"test" from trade ) t on
> s.id=t.id;*
> FAILED: SemanticException [Error 10227]: Not all clauses are supported
> with mapjoin hint. Please remove mapjoin hint.
>
> *I have two questions:*
> *1. Why does the result of Hive on Spark not include the following records?*
> 1 jone 22 1 1 test
> 2 lucy 21 1 2 test
> 2 lucy 21 1 2 test
>
> *2. Why are there two different ways of dealing with the same query?*
>
>
> *explain 1:*
> *set hive.execution.engine=spark;*
> *set spark.master=local;*
> *explain *
> *select id,date_ from trade union all select id,"test" from trade;*
> OK
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
>
> STAGE PLANS:
>   Stage: Stage-1
> Spark
>   DagName:
> jonezhang_20151222191643_5301d90a-caf0-4934-8092-d165c87a4190:1
>   Vertices:
> Map 1
> Map Operator Tree:
> TableScan
>   alias: trade
>   Statistics: Num rows: 6 Data size: 48 Basic stats:
> COMPLETE Column stats: NONE
>   Select Operator
> expressions: id (type: int), date_ (type: string)
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 6 Data size: 48 Basic stats:
> COMPLETE Column stats: NONE
> File Output Operator
>   compressed: false
>   Statistics: Num rows: 12 Data size: 96 Basic stats:
> COMPLETE Column stats: NONE
>   table:
>   input format:
> org.apache.hadoop.mapred.TextInputFormat
>   output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>   serde:
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> Map 2
> Map Operator Tree:
> TableScan
>   alias: trade
>   Statistics: Num rows: 6 Data size: 48 Basic stats:
> COMPLETE Column stats: NONE
>   Select Operator
> expressions: id (type: int), 'test' (type: string)
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 6 Data size: 48 Basic stats:
> COMPLETE Column stats: NONE
> File Output Operator
>   compressed: false
>   Statistics: Num rows: 12 Data size: 96 Basic stats:
> COMPLETE Column stats: NONE
>   table:
>   input format:
> org.apache.hadoop.mapred.TextInputFormat
>   output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>   serde:
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>
>   Stage: Stage-0
> Fetch Operator
>   limit: -1
>   Processor Tree:
> ListSink
>
>
> *explain 2:*
> *set hive.execution.engine=spark;*
> *set spark.master=local;*
> *explain *
> *select /*+mapjoin(t)*/ * from staff s join *
> *(select id,date_ from trade union all select id,"209" from trade
> ) t on s.id=t.id;*
> OK
> STAGE DEPENDENCIES:
>   Stage-2 is a root stage
>   Stage-1 depends on stages: Stage-2
>   Stage-0 depends on stages: Stage-1
>
> STAGE PLANS:
>   Stage: Stage-2
> Spark
>   DagName:
> jonezhang_20151222191716_be7eac84-b5b6-4478-b88f-9f59e2b1b1a8:3
>   Vertices:
> Map 1
> Map Operator Tree:
> TableScan
>   alias: trade
>   Statistics: Num rows: 6 Data size: 48 Basic stats:
> COMPLETE Column stats: NONE
>   Filter Operator
> p

It seems that the result of Hive on Spark is wrong, and the results of Hive and Hive on Spark are not the same

2015-12-22 Thread Jone Zhang
*select  * from staff;*
1 jone 22 1
2 lucy 21 1
3 hmm 22 2
4 james 24 3
5 xiaoliu 23 3

*select id,date_ from trade union all select id,"test" from trade ;*
1 201510210908
2 201509080234
2 201509080235
1 test
2 test
2 test

*set hive.execution.engine=spark;*
*set spark.master=local;*
*select /*+mapjoin(t)*/ * from staff s join *
*(select id,date_ from trade union all select id,"test" from trade ) t on
s.id=t.id;*
1 jone 22 1 1 201510210908
2 lucy 21 1 2 201509080234
2 lucy 21 1 2 201509080235

*set hive.execution.engine=mr;*
*select /*+mapjoin(t)*/ * from staff s join *
*(select id,date_ from trade union all select id,"test" from trade ) t on
s.id=t.id;*
FAILED: SemanticException [Error 10227]: Not all clauses are supported with
mapjoin hint. Please remove mapjoin hint.

*I have two questions:*
*1. Why does the result of Hive on Spark not include the following records?*
1 jone 22 1 1 test
2 lucy 21 1 2 test
2 lucy 21 1 2 test

*2. Why are there two different ways of dealing with the same query?*


*explain 1:*
*set hive.execution.engine=spark;*
*set spark.master=local;*
*explain *
*select id,date_ from trade union all select id,"test" from trade;*
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
Spark
  DagName:
jonezhang_20151222191643_5301d90a-caf0-4934-8092-d165c87a4190:1
  Vertices:
Map 1
Map Operator Tree:
TableScan
  alias: trade
  Statistics: Num rows: 6 Data size: 48 Basic stats:
COMPLETE Column stats: NONE
  Select Operator
expressions: id (type: int), date_ (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 6 Data size: 48 Basic stats:
COMPLETE Column stats: NONE
File Output Operator
  compressed: false
  Statistics: Num rows: 12 Data size: 96 Basic stats:
COMPLETE Column stats: NONE
  table:
  input format:
org.apache.hadoop.mapred.TextInputFormat
  output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
  serde:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Map 2
Map Operator Tree:
TableScan
  alias: trade
  Statistics: Num rows: 6 Data size: 48 Basic stats:
COMPLETE Column stats: NONE
  Select Operator
expressions: id (type: int), 'test' (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 6 Data size: 48 Basic stats:
COMPLETE Column stats: NONE
File Output Operator
  compressed: false
  Statistics: Num rows: 12 Data size: 96 Basic stats:
COMPLETE Column stats: NONE
  table:
  input format:
org.apache.hadoop.mapred.TextInputFormat
  output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
  serde:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
Fetch Operator
  limit: -1
  Processor Tree:
ListSink


*explain 2:*
*set hive.execution.engine=spark;*
*set spark.master=local;*
*explain *
*select /*+mapjoin(t)*/ * from staff s join *
*(select id,date_ from trade union all select id,"209" from trade )
t on s.id=t.id;*
OK
STAGE DEPENDENCIES:
  Stage-2 is a root stage
  Stage-1 depends on stages: Stage-2
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-2
Spark
  DagName:
jonezhang_20151222191716_be7eac84-b5b6-4478-b88f-9f59e2b1b1a8:3
  Vertices:
Map 1
Map Operator Tree:
TableScan
  alias: trade
  Statistics: Num rows: 6 Data size: 48 Basic stats:
COMPLETE Column stats: NONE
  Filter Operator
predicate: id is not null (type: boolean)
Statistics: Num rows: 3 Data size: 24 Basic stats:
COMPLETE Column stats: NONE
Select Operator
  expressions: id (type: int), date_ (type: string)
  outputColumnNames: _col0, _col1
  Statistics: Num rows: 3 Data size: 24 Basic stats:
COMPLETE Column stats: NONE
  Spark HashTable Sink Operator
keys:
  0 id (type: int)
  1 _col0 (type: int)
Local Work:
  Map Reduce Local Work

  Stage: Stage-1
Spark
  DagName:
jonezhang_20151222191716_be7eac84-b5b6-4478-b88f-9f59e2b1b1a8:2
  Vertices:
Map 2
Map Operator Tree:
TableScan
  alias: s
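
A hedged workaround for question 1, assuming the dropped rows come from converting the hinted join over the UNION ALL subquery into a map join: materialize the union into an intermediate table first, so the map join only ever sees a plain table. This is an untested sketch and tmp_trade_union is an illustrative name; the row count should still be verified against the MapReduce result:

set hive.execution.engine=spark;

create table tmp_trade_union as
select id, date_ from trade
union all
select id, "test" from trade;

select /*+mapjoin(t)*/ *
from staff s
join tmp_trade_union t on s.id = t.id;

Dropping the hint (or setting hive.ignore.mapjoin.hint=true) and letting each engine pick the join itself is another way to cross-check the result.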

Re: "java.lang.RuntimeException: Reduce operator initialization failed" when running hive on spark

2015-12-20 Thread Jone Zhang
I also encountered the same problem.

The error log in the Spark UI is as follows

Job aborted due to stage failure: Task 465 in stage 12.0 failed 4
times, most recent failure: Lost task 465.3 in stage 12.0 (TID 6732,
10.148.147.52): java.lang.RuntimeException:
org.apache.hadoop.hive.ql.metadata.HiveException: Error while
processing row (tag=0)
{"key":{"_col0":"TMAF_610_F_2179","_col1":"200402","_col2":"203901","_col3":"08_001_00;-1","_col4":"08","_col5":"001","_col6":"1001013133098807296","_col7":100,"_col8":"welfarecenter&&10451659"},"value":{"_col0":1}}
at 
org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:293)
at 
org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:49)
at 
org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:28)
at 
org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:95)
at 
scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:216)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Error
while processing row (tag=0)
{"key":{"_col0":"TMAF_610_F_2179","_col1":"200402","_col2":"203901","_col3":"08_001_00;-1","_col4":"08","_col5":"001","_col6":"1001013133098807296","_col7":100,"_col8":"welfarecenter&&10451659"},"value":{"_col0":1}}
at 
org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processKeyValues(SparkReduceRecordHandler.java:332)
at 
org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:281)
... 13 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException:
Unexpected exception: null
at 
org.apache.hadoop.hive.ql.exec.MapJoinOperator.process(MapJoinOperator.java:426)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
at 
org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:88)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
at 
org.apache.hadoop.hive.ql.exec.GroupByOperator.forward(GroupByOperator.java:1016)
at 
org.apache.hadoop.hive.ql.exec.GroupByOperator.processAggr(GroupByOperator.java:821)
at 
org.apache.hadoop.hive.ql.exec.GroupByOperator.processKey(GroupByOperator.java:695)
at 
org.apache.hadoop.hive.ql.exec.GroupByOperator.process(GroupByOperator.java:761)
at 
org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processKeyValues(SparkReduceRecordHandler.java:323)
... 14 more
Caused by: java.lang.NullPointerException
at 
org.apache.hadoop.hive.ql.exec.MapJoinOperator.getRefKey(MapJoinOperator.java:327)
at 
org.apache.hadoop.hive.ql.exec.MapJoinOperator.process(MapJoinOperator.java:339)
... 22 more

Driver stacktrace:


2015-12-21 10:29 GMT+08:00 luo butter :

> hi,
>
> When I used hive on spark it thrown* below exceptions* when processing
> map side join:
>
> *java.lang.RuntimeException: Reduce operator initialization failed*
> at
> org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.init(SparkReduceRecordHandler.java:224)
> at
> org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunction.call(HiveReduceFunction.java:46)
> at
> org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunction.call(HiveReduceFunction.java:28)
> at
> org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:186)
> at
> org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:186)
> at
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
> at
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apac

Hive on Spark throws java.lang.NullPointerException

2015-12-17 Thread Jone Zhang
*My query is *
set hive.execution.engine=spark;
select
t3.pcid,channel,version,ip,hour,app_id,app_name,app_apk,app_version,app_type,dwl_tool,dwl_status,err_type,dwl_store,dwl_maxspeed,dwl_minspeed,dwl_avgspeed,last_time,dwl_num,
(case when t4.cnt is null then 0 else 1 end) as is_evil
from
(select /*+mapjoin(t2)*/
pcid,channel,version,ip,hour,
(case when t2.app_id is null then t1.app_id else t2.app_id end) as app_id,
t2.name as app_name,
app_apk,
app_version,app_type,dwl_tool,dwl_status,err_type,dwl_store,dwl_maxspeed,dwl_minspeed,dwl_avgspeed,last_time,dwl_num
from
t_ed_soft_downloadlog_molo t1 left outer join t_rd_soft_app_pkg_name t2 on
(lower(t1.app_apk) = lower(t2.package_id) and t1.ds = 20151217 and t2.ds =
20151217)
where
t1.ds = 20151217) t3
left outer join
(
select pcid,count(1) cnt  from t_ed_soft_evillog_molo where ds=20151217
 group by pcid
) t4
on t3.pcid=t4.pcid;


*And the error log is *
2015-12-18 08:10:18,685 INFO  [main]: spark.SparkMapJoinOptimizer
(SparkMapJoinOptimizer.java:process(79)) - Check if it can be converted to
map join
2015-12-18 08:10:18,686 ERROR [main]: ql.Driver
(SessionState.java:printError(966)) - FAILED: NullPointerException null
java.lang.NullPointerException
at
org.apache.hadoop.hive.ql.optimizer.spark.SparkMapJoinOptimizer.getConnectedParentMapJoinSize(SparkMapJoinOptimizer.java:312)
at
org.apache.hadoop.hive.ql.optimizer.spark.SparkMapJoinOptimizer.getConnectedMapJoinSize(SparkMapJoinOptimizer.java:292)
at
org.apache.hadoop.hive.ql.optimizer.spark.SparkMapJoinOptimizer.getMapJoinConversionInfo(SparkMapJoinOptimizer.java:271)
at
org.apache.hadoop.hive.ql.optimizer.spark.SparkMapJoinOptimizer.process(SparkMapJoinOptimizer.java:80)
at
org.apache.hadoop.hive.ql.optimizer.spark.SparkJoinOptimizer.process(SparkJoinOptimizer.java:58)
at
org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:92)
at
org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:97)
at
org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:81)
at
org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:135)
at
org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:112)
at
org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.optimizeOperatorPlan(SparkCompiler.java:128)
at
org.apache.hadoop.hive.ql.parse.TaskCompiler.compile(TaskCompiler.java:102)
at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10238)
at
org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:210)
at
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:233)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:425)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:308)
at
org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1123)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1171)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1060)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1050)
at
org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:208)
at
org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:160)
at
org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:447)
at
org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:357)
at
org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:795)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:767)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:704)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)


*Some properties in hive-site.xml are:*

<property>
  <name>hive.ignore.mapjoin.hint</name>
  <value>false</value>
</property>
<property>
  <name>hive.auto.convert.join</name>
  <value>true</value>
</property>
<property>
  <name>hive.auto.convert.join.noconditionaltask</name>
  <value>true</value>
</property>



*The code relevant to the error is*
long mjSize = ctx.getMjOpSizes().get(op);
*I think it should be checked whether ctx.getMjOpSizes().get(op) is null.*

*Of course, the exact stricter logic is up to you to decide.*


*Thanks.*
*Best Wishes.*
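
Until a null check along those lines lands in SparkMapJoinOptimizer, a hedged workaround is to keep the planner off the nested map-join path, for example by materializing the aggregated side into an intermediate table and joining against that, or by temporarily setting hive.auto.convert.join=false for this statement. An untested sketch with an illustrative table name:

create table tmp_evil_cnt as
select pcid, count(1) as cnt
from t_ed_soft_evillog_molo
where ds = 20151217
group by pcid;

-- then join t3 against tmp_evil_cnt instead of the inline grouped subquery,
-- keeping the rest of the original statement unchanged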


Re: Hive on Spark applications will be submitted multiple times when the queue resources are not enough.

2015-12-09 Thread Jone Zhang
*It seems that the number of submissions depends on the number of stages of the query.*
*This query includes three stages.*

*If the queue resources are still not enough after three applications have been
submitted, the Hive client will close:*
*"Failed to execute spark task, with exception
'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create spark
client.)'*
*FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.spark.SparkTask"*
*At that point, the port (e.g. 34682) on the Hive client (e.g. 10.179.12.140)
that is used to communicate with the RSC will be lost.*

*When the queue resources become free after a while, the AMs of the three
applications will fail fast because of "15/12/10 12:28:43 INFO client.RemoteDriver:
Connecting to: 10.179.12.140:34682...java.net.ConnectException: Connection
refused: /10.179.12.140:34682"*

*So the application will fail if the queue resources are not enough at the point
the query is submitted, even if the resources become free after a while.*
*Do you have any more ideas about this question?*

*Attaching the query:*
set hive.execution.engine=spark;
set spark.yarn.queue=tms;
set spark.app.name=t_ad_tms_heartbeat_ok_3;
insert overwrite table t_ad_tms_heartbeat_ok partition(ds=20151208)
SELECT
NVL(a.qimei, b.qimei) AS qimei,
NVL(b.first_ip,a.user_ip) AS first_ip,
NVL(a.user_ip, b.last_ip) AS last_ip,
NVL(b.first_date, a.ds) AS first_date,
NVL(a.ds, b.last_date) AS last_date,
NVL(b.first_chid, a.chid) AS first_chid,
NVL(a.chid, b.last_chid) AS last_chid,
NVL(b.first_lc, a.lc) AS first_lc,
NVL(a.lc, b.last_lc) AS last_lc,
NVL(a.guid, b.guid) AS guid,
NVL(a.sn, b.sn) AS sn,
NVL(a.vn, b.vn) AS vn,
NVL(a.vc, b.vc) AS vc,
NVL(a.mo, b.mo) AS mo,
NVL(a.rl, b.rl) AS rl,
NVL(a.os, b.os) AS os,
NVL(a.rv, b.rv) AS rv,
NVL(a.qv, b.qv) AS qv,
NVL(a.imei, b.imei) AS imei,
NVL(a.romid, b.romid) AS romid,
NVL(a.bn, b.bn) AS bn,
NVL(a.account_type, b.account_type) AS account_type,
NVL(a.account, b.account) AS account
FROM
(SELECT
ds,user_ip,guid,sn,vn,vc,mo,rl,chid,lcid,os,rv,qv,imei,qimei,lc,romid,bn,account_type,account
FROM    t_od_tms_heartbeat_ok
WHERE   ds = 20151208) a
FULL OUTER JOIN
(SELECT
qimei,first_ip,last_ip,first_date,last_date,first_chid,last_chid,first_lc,last_lc,guid,sn,vn,vc,mo,rl,os,rv,qv,imei,romid,bn,account_type,account
FROM    t_ad_tms_heartbeat_ok
WHERE   last_date > 20150611
AND ds = 20151207) b
ON   a.qimei=b.qimei;

*Thanks.*
*Best wishes.*
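
For reference, a hedged sketch of two knobs that bound this retry behaviour; the timeout is the one already mentioned in this thread, while the YARN attempt cap is a standard Spark-on-YARN setting and whether it helps this particular case is untested:

set hive.spark.client.server.connect.timeout=5min;  -- how long the Hive client waits for the remote driver
set spark.yarn.maxAppAttempts=1;                    -- how many AM attempts YARN allows per application (illustrative value)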

2015-12-09 19:51 GMT+08:00 Jone Zhang :

> But in some cases all of the applications will fail, which is caused
>> by "SparkContext did not initialize after waiting for 15 ms".
>> See attachment (hive.spark.client.server.connect.timeout is set to 5min).
>
>
> *The error log is different from the original mail*
>
> Container: container_1448873753366_113453_01_01 on 10.247.169.134_8041
>
> 
> LogType: stderr
> LogLength: 3302
> Log Contents:
> Please use CMSClassUnloadingEnabled in place of CMSPermGenSweepingEnabled
> in the future
> Please use CMSClassUnloadingEnabled in place of CMSPermGenSweepingEnabled
> in the future
> 15/12/09 02:11:48 INFO yarn.ApplicationMaster: Registered signal handlers
> for [TERM, HUP, INT]
> 15/12/09 02:11:48 INFO yarn.ApplicationMaster: ApplicationAttemptId:
> appattempt_1448873753366_113453_01
> 15/12/09 02:11:49 INFO spark.SecurityManager: Changing view acls to: mqq
> 15/12/09 02:11:49 INFO spark.SecurityManager: Changing modify acls to: mqq
> 15/12/09 02:11:49 INFO spark.SecurityManager: SecurityManager:
> authentication disabled; ui acls disabled; users with view permissions:
> Set(mqq); users with modify permissions: Set(mqq)
> 15/12/09 02:11:49 INFO yarn.ApplicationMaster: Starting the user
> application in a separate Thread
> 15/12/09 02:11:49 INFO yarn.ApplicationMaster: Waiting for spark context
> initialization
> 15/12/09 02:11:49 INFO yarn.ApplicationMaster: Waiting for spark context
> initialization ...
> 15/12/09 02:11:49 INFO client.RemoteDriver: Connecting to:
> 10.179.12.140:58013
> 15/12/09 02:11:49 ERROR yarn.ApplicationMaster: User class threw
> exception: java.util.concurrent.ExecutionException:
> java.net.ConnectException: Connection refused: /10.179.12.140:58013
> java.util.concurrent.ExecutionException: java.net.ConnectException:
> Connection refused: /10.179.12.140:58013

Re: Hive on Spark applications will be submitted multiple times when the queue resources are not enough.

2015-12-09 Thread Jone Zhang
>
> But in some cases all of the applications will fail, which is caused
> by "SparkContext did not initialize after waiting for 15 ms".
> See attachment (hive.spark.client.server.connect.timeout is set to 5min).


*The error log is different from the original mail*

Container: container_1448873753366_113453_01_01 on 10.247.169.134_8041

LogType: stderr
LogLength: 3302
Log Contents:
Please use CMSClassUnloadingEnabled in place of CMSPermGenSweepingEnabled
in the future
Please use CMSClassUnloadingEnabled in place of CMSPermGenSweepingEnabled
in the future
15/12/09 02:11:48 INFO yarn.ApplicationMaster: Registered signal handlers
for [TERM, HUP, INT]
15/12/09 02:11:48 INFO yarn.ApplicationMaster: ApplicationAttemptId:
appattempt_1448873753366_113453_01
15/12/09 02:11:49 INFO spark.SecurityManager: Changing view acls to: mqq
15/12/09 02:11:49 INFO spark.SecurityManager: Changing modify acls to: mqq
15/12/09 02:11:49 INFO spark.SecurityManager: SecurityManager:
authentication disabled; ui acls disabled; users with view permissions:
Set(mqq); users with modify permissions: Set(mqq)
15/12/09 02:11:49 INFO yarn.ApplicationMaster: Starting the user
application in a separate Thread
15/12/09 02:11:49 INFO yarn.ApplicationMaster: Waiting for spark context
initialization
15/12/09 02:11:49 INFO yarn.ApplicationMaster: Waiting for spark context
initialization ...
15/12/09 02:11:49 INFO client.RemoteDriver: Connecting to:
10.179.12.140:58013
15/12/09 02:11:49 ERROR yarn.ApplicationMaster: User class threw exception:
java.util.concurrent.ExecutionException: java.net.ConnectException:
Connection refused: /10.179.12.140:58013
java.util.concurrent.ExecutionException: java.net.ConnectException:
Connection refused: /10.179.12.140:58013
at
io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37)
at
org.apache.hive.spark.client.RemoteDriver.(RemoteDriver.java:156)
at
org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:556)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:483)
Caused by: java.net.ConnectException: Connection refused: /
10.179.12.140:58013
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208)
at
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287)
at
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
at java.lang.Thread.run(Thread.java:745)
15/12/09 02:11:49 INFO yarn.ApplicationMaster: Final app status: FAILED,
exitCode: 15, (reason: User class threw exception:
java.util.concurrent.ExecutionException: java.net.ConnectException:
Connection refused: /10.179.12.140:58013)
15/12/09 02:11:59 ERROR yarn.ApplicationMaster: SparkContext did not
initialize after waiting for 15 ms. Please check earlier log output for
errors. Failing the application.
15/12/09 02:11:59 INFO util.Utils: Shutdown hook called

2015-12-09 19:22 GMT+08:00 Jone Zhang :

> Hive version is 1.2.1
> Spark version is 1.4.1
> Hadoop version is 2.5.1
>
> The application_1448873753366_121062 will success in the above mail.
>
> But in some cases all of the applications will fail, which is caused by
> "SparkContext did not initialize after waiting for 15 ms".
> See attachment (hive.spark.client.server.connect.timeout is set to 5min).
>
> Thanks.
> Best wishes.
>
> 2015-12-09 17:56 GMT+08:00 Jone Zhang :
>
>> *Hi, Xuefu:*
>>
>> *See attachment 1*
>> *When the queue resources is not enough.*
>> *The application application_1448873753366_121022 will pending.*
>> *Two minutes later, the application application_1448873753366_121055 will
>> be submited and pending.*
>> *And then application_1448873753366_121062.*
>>
>> *See attachment 2*
>> *When the queue resources is free.*
>> *The application  application_1448873753366_121062 begin to running.*
>> *Application_144887375

Re: Hive on Spark applications will be submitted multiple times when the queue resources are not enough.

2015-12-09 Thread Jone Zhang
Hive version is 1.2.1
Spark version is 1.4.1
Hadoop version is 2.5.1

The application_1448873753366_121062 will success in the above mail.

But in some cases all of the applications will fail, which is caused by
"SparkContext did not initialize after waiting for 15 ms".
See attachment (hive.spark.client.server.connect.timeout is set to 5min).

Thanks.
Best wishes.

2015-12-09 17:56 GMT+08:00 Jone Zhang :

> *Hi, Xuefu:*
>
> *See attachment 1*
> *When the queue resources are not enough:*
> *The application application_1448873753366_121022 will be pending.*
> *Two minutes later, the application application_1448873753366_121055 will
> be submitted and will also be pending.*
> *And then application_1448873753366_121062.*
>
> *See attachment 2*
> *When the queue resources are free:*
> *The application application_1448873753366_121062 begins running.*
> *Application_1448873753366_121022 and application_1448873753366_121055
> will fail fast.*
>
> *Logs of Application_1448873753366_121022 as follows(same as *
> *application_1448873753366_121055**):*
> Container: container_1448873753366_121022_03_01 on 10.226.136.122_8041
>
> 
> LogType: stderr
> LogLength: 4664
> Log Contents:
> Please use CMSClassUnloadingEnabled in place of CMSPermGenSweepingEnabled
> in the future
> Please use CMSClassUnloadingEnabled in place of CMSPermGenSweepingEnabled
> in the future
> 15/12/09 16:29:45 INFO yarn.ApplicationMaster: Registered signal handlers
> for [TERM, HUP, INT]
> 15/12/09 16:29:46 INFO yarn.ApplicationMaster: ApplicationAttemptId:
> appattempt_1448873753366_121022_03
> 15/12/09 16:29:47 INFO spark.SecurityManager: Changing view acls to: mqq
> 15/12/09 16:29:47 INFO spark.SecurityManager: Changing modify acls to: mqq
> 15/12/09 16:29:47 INFO spark.SecurityManager: SecurityManager:
> authentication disabled; ui acls disabled; users with view permissions:
> Set(mqq); users with modify permissions: Set(mqq)
> 15/12/09 16:29:47 INFO yarn.ApplicationMaster: Starting the user
> application in a separate Thread
> 15/12/09 16:29:47 INFO yarn.ApplicationMaster: Waiting for spark context
> initialization
> 15/12/09 16:29:47 INFO yarn.ApplicationMaster: Waiting for spark context
> initialization ...
> 15/12/09 16:29:47 INFO client.RemoteDriver: Connecting to:
> 10.179.12.140:38842
> 15/12/09 16:29:48 WARN rpc.Rpc: Invalid log level null, reverting to
> default.
> 15/12/09 16:29:48 ERROR yarn.ApplicationMaster: User class threw
> exception: java.util.concurrent.ExecutionException:
> javax.security.sasl.SaslException: Client closed before SASL negotiation
> finished.
> java.util.concurrent.ExecutionException:
> javax.security.sasl.SaslException: Client closed before SASL negotiation
> finished.
> at
> io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37)
> at
> org.apache.hive.spark.client.RemoteDriver.(RemoteDriver.java:156)
> at
> org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:556)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:483)
> Caused by: javax.security.sasl.SaslException: Client closed before SASL
> negotiation finished.
> at
> org.apache.hive.spark.client.rpc.Rpc$SaslClientHandler.dispose(Rpc.java:449)
> at
> org.apache.hive.spark.client.rpc.SaslHandler.channelInactive(SaslHandler.java:90)
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:233)
> at
> io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:219)
> at
> io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
> at
> org.apache.hive.spark.client.rpc.KryoMessageCodec.channelInactive(KryoMessageCodec.java:127)
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:233)
> at
> io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:219)
> at
> io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:233)
> at
> io.n

Hive on Spark applications will be submitted multiple times when the queue resources are not enough.

2015-12-09 Thread Jone Zhang
*Hi, Xuefu:*

*See attachment 1*
*When the queue resources are not enough:*
*The application application_1448873753366_121022 will be pending.*
*Two minutes later, the application application_1448873753366_121055 will
be submitted and will also be pending.*
*And then application_1448873753366_121062.*

*See attachment 2*
*When the queue resources are free:*
*The application application_1448873753366_121062 begins running.*
*Application_1448873753366_121022 and application_1448873753366_121055
will fail fast.*

*Logs of Application_1448873753366_121022 are as follows (same as
application_1448873753366_121055):*
Container: container_1448873753366_121022_03_01 on 10.226.136.122_8041

LogType: stderr
LogLength: 4664
Log Contents:
Please use CMSClassUnloadingEnabled in place of CMSPermGenSweepingEnabled
in the future
Please use CMSClassUnloadingEnabled in place of CMSPermGenSweepingEnabled
in the future
15/12/09 16:29:45 INFO yarn.ApplicationMaster: Registered signal handlers
for [TERM, HUP, INT]
15/12/09 16:29:46 INFO yarn.ApplicationMaster: ApplicationAttemptId:
appattempt_1448873753366_121022_03
15/12/09 16:29:47 INFO spark.SecurityManager: Changing view acls to: mqq
15/12/09 16:29:47 INFO spark.SecurityManager: Changing modify acls to: mqq
15/12/09 16:29:47 INFO spark.SecurityManager: SecurityManager:
authentication disabled; ui acls disabled; users with view permissions:
Set(mqq); users with modify permissions: Set(mqq)
15/12/09 16:29:47 INFO yarn.ApplicationMaster: Starting the user
application in a separate Thread
15/12/09 16:29:47 INFO yarn.ApplicationMaster: Waiting for spark context
initialization
15/12/09 16:29:47 INFO yarn.ApplicationMaster: Waiting for spark context
initialization ...
15/12/09 16:29:47 INFO client.RemoteDriver: Connecting to:
10.179.12.140:38842
15/12/09 16:29:48 WARN rpc.Rpc: Invalid log level null, reverting to
default.
15/12/09 16:29:48 ERROR yarn.ApplicationMaster: User class threw exception:
java.util.concurrent.ExecutionException: javax.security.sasl.SaslException:
Client closed before SASL negotiation finished.
java.util.concurrent.ExecutionException: javax.security.sasl.SaslException:
Client closed before SASL negotiation finished.
at
io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37)
at
org.apache.hive.spark.client.RemoteDriver.(RemoteDriver.java:156)
at
org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:556)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:483)
Caused by: javax.security.sasl.SaslException: Client closed before SASL
negotiation finished.
at
org.apache.hive.spark.client.rpc.Rpc$SaslClientHandler.dispose(Rpc.java:449)
at
org.apache.hive.spark.client.rpc.SaslHandler.channelInactive(SaslHandler.java:90)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:233)
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:219)
at
io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at
org.apache.hive.spark.client.rpc.KryoMessageCodec.channelInactive(KryoMessageCodec.java:127)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:233)
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:219)
at
io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:233)
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:219)
at
io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:769)
at
io.netty.channel.AbstractChannel$AbstractUnsafe$5.run(AbstractChannel.java:567)
at
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:380)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
at
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
at java.lang.Thread.run(Thread.java:745)
15/12/09 16:29:48 INFO yarn.ApplicationMaster: Final app status: FAILED,
exitCode: 15, (reason: User class threw exception:
java.util.concurrent.ExecutionException: javax.security.sasl.SaslException:
Client closed before SASL n

Re: Fw: Managed to make Hive run on Spark engine

2015-12-08 Thread Jone Zhang
You can search last month's mailing list for "Do you have more
suggestions on when to use Hive on MapReduce or Hive on Spark?"
I hope it is of some help to you.

Best wishes.

2015-12-08 6:18 GMT+08:00 Ashok Kumar :

>
> This is great news sir. It shows perseverance pays at last.
>
> Can you inform us when the write-up is ready so I can set it up as well
> please.
>
> I know a bit about the advantages of having Hive using Spark engine.
> However, the general question I have is when one should use Hive on spark
> as opposed to Hive on MapReduce engine?
>
> Thanks again
>
>
>
>
> On Monday, 7 December 2015, 15:50, Mich Talebzadeh 
> wrote:
>
>
> For those interested
>
> *From:* Mich Talebzadeh [mailto:m...@peridale.co.uk]
> *Sent:* 06 December 2015 20:33
> *To:* user@hive.apache.org
> *Subject:* Managed to make Hive run on Spark engine
>
> Thanks to all, especially to Xuefu, for the contributions. Finally it works, which
> means don't give up until it works :)
>
> hduser@rhes564::/usr/lib/hive/lib> hive
> Logging initialized using configuration in
> jar:file:/usr/lib/hive/lib/hive-common-1.2.1.jar!/hive-log4j.properties
> *hive> set spark.home= /usr/lib/spark-1.3.1-bin-hadoop2.6;*
> *hive> set hive.execution.engine=spark;*
> *hive> set spark.master=spark://50.140.197.217:7077;*
> *hive> set spark.eventLog.enabled=true;*
> *hive> set spark.eventLog.dir= /usr/lib/spark-1.3.1-bin-hadoop2.6/logs;*
> *hive> set spark.executor.memory=512m;*
> *hive> set spark.serializer=org.apache.spark.serializer.KryoSerializer;*
> *hive> set hive.spark.client.server.connect.timeout=22ms;*
> *hive> set
> spark.io.compression.codec=org.apache.spark.io.LZFCompressionCodec;*
> hive> use asehadoop;
> OK
> Time taken: 0.638 seconds
> hive> *select count(1) from t;*
> Query ID = hduser_20151206200528_4b85889f-e4ca-41d2-9bd2-1082104be42b
> Total jobs = 1
> Launching Job 1 out of 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=<number>
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=<number>
> In order to set a constant number of reducers:
>   set mapreduce.job.reduces=<number>
> Starting Spark Job = c8fee86c-0286-4276-aaa1-2a5eb4e4958a
>
> Query Hive on Spark job[0] stages:
> 0
> 1
>
> Status: Running (Hive on Spark job[0])
> Job Progress Format
> CurrentTime StageId_StageAttemptId:
> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
> [StageCost]
> 2015-12-06 20:05:36,299 Stage-0_0: 0(+1)/1  Stage-1_0: 0/1
> 2015-12-06 20:05:39,344 Stage-0_0: 1/1 Finished Stage-1_0: 0(+1)/1
> 2015-12-06 20:05:40,350 Stage-0_0: 1/1 Finished Stage-1_0: 1/1 Finished
> Status: Finished successfully in 8.10 seconds
> OK
>
> The versions used for this project
>
>
> OS version Linux version 2.6.18-92.el5xen (
> brewbuil...@ls20-bc2-13.build.redhat.com) (gcc version 4.1.2 20071124
> (Red Hat 4.1.2-41)) #1 SMP Tue Apr 29 13:31:30 EDT 2008
>
> Hadoop 2.6.0
> Hive 1.2.1
> spark-1.3.1-bin-hadoop2.6 (downloaded from prebuild 
> spark-1.3.1-bin-hadoop2.6.gz
> for starting spark standalone cluster)
> The Jar file used in $HIVE_HOME/lib to link Hive to Spark was
> spark-assembly-1.3.1-hadoop2.4.0.jar
>(built from the source downloaded as zipped file spark-1.3.1.gz and
> built with command line make-distribution.sh --name
> "hadoop2-without-hive" --tgz
> "-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided")
>
> Pretty picky on parameters, CLASSPATH, IP addresses or hostname etc to
> make it work
>
> I will create a full guide on how to build and make Hive to run with Spark
> as its engine (as opposed to MR).
>
> HTH
>
> Mich Talebzadeh
>
> *Sybase ASE 15 Gold Medal Award 2008*
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
> *Publications due shortly:*
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
> http://talebzadehmich.wordpress.com/
>


Re: Managed to make Hive run on Spark engine

2015-12-07 Thread Jone Zhang
More and more people are beginning to use hive on spark.
Congratulations!

2015-12-07 9:12 GMT+08:00 Link Qian :

> congrats!
>
> Link Qian
>
> --
> Date: Sun, 6 Dec 2015 15:44:58 -0500
> Subject: Re: Managed to make Hive run on Spark engine
> From: leftylever...@gmail.com
> To: user@hive.apache.org
>
>
> Congratulations!
>
> -- Lefty
>
> On Sun, Dec 6, 2015 at 3:32 PM, Mich Talebzadeh 
> wrote:
>
> Thanks to all, especially to Xuefu, for the contributions. Finally it works, which
> means don't give up until it works :)
>
>
>
> hduser@rhes564::/usr/lib/hive/lib> hive
>
> Logging initialized using configuration in
> jar:file:/usr/lib/hive/lib/hive-common-1.2.1.jar!/hive-log4j.properties
>
> *hive> set spark.home= /usr/lib/spark-1.3.1-bin-hadoop2.6;*
>
> *hive> set hive.execution.engine=spark;*
>
> *hive> set spark.master=spark://50.140.197.217:7077;*
>
> *hive> set spark.eventLog.enabled=true;*
>
> *hive> set spark.eventLog.dir= /usr/lib/spark-1.3.1-bin-hadoop2.6/logs;*
>
> *hive> set spark.executor.memory=512m;*
>
> *hive> set spark.serializer=org.apache.spark.serializer.KryoSerializer;*
>
> *hive> set hive.spark.client.server.connect.timeout=22ms;*
>
> *hive> set
> spark.io.compression.codec=org.apache.spark.io.LZFCompressionCodec;*
>
> hive> use asehadoop;
>
> OK
>
> Time taken: 0.638 seconds
>
> hive> *select count(1) from t;*
>
> Query ID = hduser_20151206200528_4b85889f-e4ca-41d2-9bd2-1082104be42b
>
> Total jobs = 1
>
> Launching Job 1 out of 1
>
> In order to change the average load for a reducer (in bytes):
>
>   set hive.exec.reducers.bytes.per.reducer=<number>
>
> In order to limit the maximum number of reducers:
>
>   set hive.exec.reducers.max=<number>
>
> In order to set a constant number of reducers:
>
>   set mapreduce.job.reduces=<number>
>
> Starting Spark Job = c8fee86c-0286-4276-aaa1-2a5eb4e4958a
>
>
>
> Query Hive on Spark job[0] stages:
>
> 0
>
> 1
>
>
>
> Status: Running (Hive on Spark job[0])
>
> Job Progress Format
>
> CurrentTime StageId_StageAttemptId:
> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
> [StageCost]
>
> 2015-12-06 20:05:36,299 Stage-0_0: 0(+1)/1  Stage-1_0: 0/1
>
> 2015-12-06 20:05:39,344 Stage-0_0: 1/1 Finished Stage-1_0: 0(+1)/1
>
> 2015-12-06 20:05:40,350 Stage-0_0: 1/1 Finished Stage-1_0: 1/1 Finished
>
> Status: Finished successfully in 8.10 seconds
>
> OK
>
>
>
> The versions used for this project
>
>
>
>
>
> OS version Linux version 2.6.18-92.el5xen (
> brewbuil...@ls20-bc2-13.build.redhat.com) (gcc version 4.1.2 20071124
> (Red Hat 4.1.2-41)) #1 SMP Tue Apr 29 13:31:30 EDT 2008
>
>
>
> Hadoop 2.6.0
>
> Hive 1.2.1
>
> spark-1.3.1-bin-hadoop2.6 (downloaded from prebuild 
> spark-1.3.1-bin-hadoop2.6.gz
> for starting spark standalone cluster)
>
> The Jar file used in $HIVE_HOME/lib to link Hive to Spark was
> spark-assembly-1.3.1-hadoop2.4.0.jar
>
>(built from the source downloaded as zipped file spark-1.3.1.gz and
> built with command line make-distribution.sh --name
> "hadoop2-without-hive" --tgz
> "-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided")
>
>
>
> Pretty picky on parameters, CLASSPATH, IP addresses or hostname etc to
> make it work
>
>
>
> I will create a full guide on how to build and make Hive to run with Spark
> as its engine (as opposed to MR).
>
>
>
> HTH
>
>
>
> Mich Talebzadeh
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>


Re: Why there are two different stages for the same query when I use Hive on Spark.

2015-12-03 Thread Jone Zhang
  Stage-0 depends on stages: Stage-1
  Stage-2 depends on stages: Stage-0

STAGE PLANS:
  Stage: Stage-1
Spark
  Edges:
Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 100), Map 3
(PARTITION-LEVEL SORT, 100)
  DagName: mqq_20151204103243_3eab6e6c-941e-476a-897f-cae97657063e:3
  Vertices:
Map 1
Map Operator Tree:
TableScan
  alias: t_sd_ucm_cominfo_finalresult
  Statistics: Num rows: 108009 Data size: 2873665 Basic
stats: COMPLETE Column stats: NONE
  Select Operator
expressions: uin (type: string), clientip (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 108009 Data size: 2873665 Basic
stats: COMPLETE Column stats: NONE
Reduce Output Operator
  key expressions: _col0 (type: string)
  sort order: +
  Map-reduce partition columns: _col0 (type: string)
  Statistics: Num rows: 108009 Data size: 2873665 Basic
stats: COMPLETE Column stats: NONE
  value expressions: _col1 (type: string)
Map 3
Map Operator Tree:
TableScan
  alias: t_sd_ucm_cominfo_finalresult
  Statistics: Num rows: 590130 Data size: 118026051 Basic
stats: COMPLETE Column stats: NONE
  Select Operator
expressions: uin (type: string), clientip (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 590130 Data size: 118026051 Basic
stats: COMPLETE Column stats: NONE
Reduce Output Operator
  key expressions: _col0 (type: string)
  sort order: +
  Map-reduce partition columns: _col0 (type: string)
  Statistics: Num rows: 590130 Data size: 118026051
Basic stats: COMPLETE Column stats: NONE
  value expressions: _col1 (type: string)
Reducer 2
Reduce Operator Tree:
  Join Operator
condition map:
 Left Outer Join0 to 1
keys:
  0 _col0 (type: string)
  1 _col0 (type: string)
outputColumnNames: _col0, _col1, _col3
Statistics: Num rows: 649143 Data size: 129828658 Basic
stats: COMPLETE Column stats: NONE
Filter Operator
  predicate: _col3 is null (type: boolean)
  Statistics: Num rows: 324571 Data size: 64914228 Basic
stats: COMPLETE Column stats: NONE
  Select Operator
expressions: _col0 (type: string), _col1 (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 324571 Data size: 64914228 Basic
stats: COMPLETE Column stats: NONE
File Output Operator
  compressed: false
  Statistics: Num rows: 324571 Data size: 64914228
Basic stats: COMPLETE Column stats: NONE
  table:
  input format:
org.apache.hadoop.mapred.TextInputFormat
  output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
  serde:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  name: u_wsd.t_sd_ucm_cominfo_incremental

  Stage: Stage-0
Move Operator
  tables:
  partition:
ds 20151201
  replace: true
  table:
  input format: org.apache.hadoop.mapred.TextInputFormat
  output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  name: u_wsd.t_sd_ucm_cominfo_incremental

  Stage: Stage-2
Stats-Aggr Operator

*Thanks.*

2015-12-03 22:17 GMT+08:00 Xuefu Zhang :

> Can you also attach explain query result? What's your data format?
>
> --Xuefu
>
> On Thu, Dec 3, 2015 at 12:09 AM, Jone Zhang 
> wrote:
>
>> Hive1.2.1 on Spark1.4.1
>>
>> *The first query is:*
>> set mapred.reduce.tasks=100;
>> use u_wsd;
>> insert overwrite table t_sd_ucm_cominfo_incremental partition (ds=
>> 20151202)
>> select t1.uin,t1.clientip from
>> (select uin,clientip from t_sd_ucm_cominfo_FinalResult where ds=20151202)
>> t1
>> left outer join (select uin,clientip from t_sd_ucm_cominfo_FinalResult
>> where ds=20151201) t2
>> on t1.uin=t2.uin
>> where t2.clientip is NULL;
>>
>> *The second query is:*
>> set mapred.reduce.tasks=100;
>> use u_wsd;
>> insert overwrite table t_sd_ucm_cominfo_incremental partition (ds=
>> 20151201)
>> select t1.uin,t1.clientip from
>> (select 

Why there are two different stages on the same query when i use hive on spark.

2015-12-03 Thread Jone Zhang
Hive1.2.1 on Spark1.4.1

*The first query is:*
set mapred.reduce.tasks=100;
use u_wsd;
insert overwrite table t_sd_ucm_cominfo_incremental partition (ds=20151202)
select t1.uin,t1.clientip from
(select uin,clientip from t_sd_ucm_cominfo_FinalResult where ds=20151202)
t1
left outer join (select uin,clientip from t_sd_ucm_cominfo_FinalResult
where ds=20151201) t2
on t1.uin=t2.uin
where t2.clientip is NULL;

*The second query is:*
set mapred.reduce.tasks=100;
use u_wsd;
insert overwrite table t_sd_ucm_cominfo_incremental partition (ds=20151201)
select t1.uin,t1.clientip from
(select uin,clientip from t_sd_ucm_cominfo_FinalResult where ds=20151201)
t1
left outer join (select uin,clientip from t_sd_ucm_cominfo_FinalResult
where ds=20151130) t2
on t1.uin=t2.uin
where t2.clientip is NULL;

*The attachment shows the stages of the two queries.*
*Here is the partition info*
104.3 M
 /user/hive/warehouse/u_wsd.db/t_sd_ucm_cominfo_finalresult/ds=20151202
110.0 M
 /user/hive/warehouse/u_wsd.db/t_sd_ucm_cominfo_finalresult/ds=20151201
112.6 M
 /user/hive/warehouse/u_wsd.db/t_sd_ucm_cominfo_finalresult/ds=20151130



*Why are there two different stage plans?*
*Stage-1 in the first query is very slow.*

*Thanks.*
*Best wishes.*
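
One way to investigate the difference is to compare the EXPLAIN output of the
two queries; below is a minimal sketch using the first query above (EXPLAIN is
standard HiveQL and works with either engine):

set hive.execution.engine=spark;
explain
select t1.uin, t1.clientip
from (select uin, clientip from t_sd_ucm_cominfo_FinalResult where ds=20151202) t1
left outer join (select uin, clientip from t_sd_ucm_cominfo_FinalResult where ds=20151201) t2
  on t1.uin = t2.uin
where t2.clientip is NULL;

If the two plans show different join strategies (for example a map join for one
day's data and a sort-merge join for the other), that could account for the
different stage graphs.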


Re: Java heap space occured when the amount of data is very large with the same key on join sql

2015-11-28 Thread Jone Zhang
Add a little:
The Hive version is 1.2.1
The Spark version is 1.4.1
The Hadoop version is 2.5.1

2015-11-26 20:36 GMT+08:00 Jone Zhang :

> Here is an error message:
>
> java.lang.OutOfMemoryError: Java heap space
> at java.util.Arrays.copyOf(Arrays.java:2245)
> at java.util.Arrays.copyOf(Arrays.java:2219)
> at java.util.ArrayList.grow(ArrayList.java:242)
> at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:216)
> at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:208)
> at java.util.ArrayList.add(ArrayList.java:440)
> at
> org.apache.hadoop.hive.ql.exec.spark.SortByShuffler$ShuffleFunction$1.next(SortByShuffler.java:95)
> at
> org.apache.hadoop.hive.ql.exec.spark.SortByShuffler$ShuffleFunction$1.next(SortByShuffler.java:70)
> at
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:95)
> at
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
> at
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:216)
> at
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
>
> And the note from the SortByShuffler.java
>   // TODO: implement this by accumulating rows with the same
> key into a list.
>   // Note that this list needs to improved to prevent
> excessive memory usage, but this
>   // can be done in later phase.
>
>
> The join sql run success when i use hive on mapreduce.
> So how do mapreduce deal with it?
> And Is there plan to improved to prevent excessive memory usage?
>
> Best wishes!
> Thanks!
>


Java heap space occured when the amount of data is very large with the same key on join sql

2015-11-26 Thread Jone Zhang
Here is an error message:

java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2245)
at java.util.Arrays.copyOf(Arrays.java:2219)
at java.util.ArrayList.grow(ArrayList.java:242)
at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:216)
at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:208)
at java.util.ArrayList.add(ArrayList.java:440)
at
org.apache.hadoop.hive.ql.exec.spark.SortByShuffler$ShuffleFunction$1.next(SortByShuffler.java:95)
at
org.apache.hadoop.hive.ql.exec.spark.SortByShuffler$ShuffleFunction$1.next(SortByShuffler.java:70)
at
org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:95)
at
scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
at
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:216)
at
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


And the note from the SortByShuffler.java
  // TODO: implement this by accumulating rows with the same
key into a list.
  // Note that this list needs to improved to prevent excessive
memory usage, but this
  // can be done in later phase.


The join SQL runs successfully when I use Hive on MapReduce.
So how does MapReduce deal with it?
And is there a plan to improve this to prevent excessive memory usage?

Best wishes!
Thanks!
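
A hedged sketch of settings that are sometimes used when a single join key
carries a very large number of rows: hive.optimize.skewjoin and
hive.skewjoin.key are standard Hive settings, but whether they take effect
under the Spark engine depends on the Hive version, and the memory values
below are illustrative assumptions, not recommendations.

set hive.execution.engine=spark;
-- let Hive handle heavily skewed keys in a separate follow-up job
set hive.optimize.skewjoin=true;
set hive.skewjoin.key=100000;
-- give each executor more headroom for the rows buffered per key
set spark.executor.memory=8g;
set spark.yarn.executor.memoryOverhead=2048;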


Re: Hive version with Spark

2015-11-19 Thread Jone Zhang
*-Phive is enough.*
*-Phive will use Hive 1.2.1 by default on Spark 1.5.0+.*

2015-11-19 4:50 GMT+08:00 Udit Mehta :

> As per this link :
> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started,
> you need to build Spark without Hive.
>
> On Wed, Nov 18, 2015 at 8:50 AM, Sofia 
> wrote:
>
>> Hello
>>
>> After various failed tries to use my Hive (1.2.1) with my Spark (Spark
>> 1.4.1 built for Hadoop 2.2.0) I decided to try to build again Spark with
>> Hive.
>> I would like to know what is the latest Hive version that can be used to
>> build Spark at this point.
>>
>> When downloading Spark 1.5 source and trying:
>>
>> *mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-1.2.1
>> -Phive-thriftserver  -DskipTests clean package*
>>
>> I get :
>>
>> *The requested profile "hive-1.2.1" could not be activated because it
>> does not exist.*
>>
>> Thank you
>> Sofia
>>
>
>


Re: Building Spark to use for Hive on Spark

2015-11-19 Thread Jone Zhang
I should add that Spark 1.5.0+ uses Hive 1.2.1 by default when you build with -Phive.

So this page should read as below:
"Note that you must have a version of Spark which does *not* include the
Hive jars if you use Spark 1.4.1 or earlier. You can choose Spark 1.5.0+,
whose build includes the Hive jars."


2015-11-19 5:12 GMT+08:00 Gopal Vijayaraghavan :

>
>
> > I wanted to know  why is it necessary to remove the Hive jars from the
> >Spark build as mentioned on this
>
> Because SparkSQL was originally based on Hive & still uses Hive AST to
> parse SQL.
>
> The org.apache.spark.sql.hive package contains the parser which has
> hard-references to the hive's internal AST, which is unfortunately
> auto-generated code (HiveParser.TOK_TABNAME etc).
>
> Everytime Hive makes a release, those constants change in value and that
> is private API because of the lack of backwards-compat, which is violated
> by SparkSQL.
>
> So Hive-on-Spark forces mismatched versions of Hive classes, because it's
> a circular dependency of Hive(v1) -> Spark -> Hive(v2) due to the basic
> laws of causality.
>
> Spark cannot depend on a version of Hive that is unreleased and
> Hive-on-Spark release cannot depend on a version of Spark that is
> unreleased.
>
> Cheers,
> Gopal
>
>
>


Do you have more suggestions on when to use Hive on MapReduce or Hive on Spark?

2015-11-04 Thread Jone Zhang
Hi, Xuefu
 We plan to move from Hive on MapReduce to Hive on Spark selectively.
Because the hardware configuration of the compute nodes in the cluster is
uneven, we settled on the following configuration.

spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled   true
spark.dynamicAllocation.minExecutors10
spark.rdd.compress  true

spark.executor.cores2
spark.executor.memory   7000m
spark.yarn.executor.memoryOverhead  1024

 We sample-tested dozens of SQL queries running in production, aiming to
find out which can run on MapReduce and which can run on Spark under the
limited resources.
 The following tips are our conclusions (a per-query engine switch is
sketched at the end of this message).
 1. If the SQL does not contain a shuffle stage, use Hive on MapReduce,
e.g. map joins and select * from table where ...
  2. SQL that joins many tables, such as select ... from table1 join table2
join table3, is highly suitable for Hive on Spark.
  3. As for multi-insert, Hive on Spark is much faster than Hive on
MapReduce.
  4. "Container killed by YARN for exceeding memory limits" can occur when
the shuffled data exceeds 10T, so we don't advise using Hive on Spark in
that case.

 Do you have more suggestions on when to use Hive on MapReduce or Hive
on Spark? After all, you are the author. ☺

  Best wishes!
  Thank you!
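
A minimal sketch of applying these tips per query; hive.execution.engine is
the standard switch, while big_table, small_table, and table1-table3 are
placeholder names used only for illustration.

-- tip 1: shuffle-free query (map join), run on MapReduce
-- the hint is honored only when hive.ignore.mapjoin.hint=false
set hive.execution.engine=mr;
select /*+mapjoin(s)*/ b.id, s.name
from big_table b join small_table s on b.id = s.id;

-- tip 2: multi-way join, run on Spark
set hive.execution.engine=spark;
select t1.id, t2.val, t3.val
from table1 t1
join table2 t2 on t1.id = t2.id
join table3 t3 on t1.id = t3.id;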


Re: Hive on Spark NPE at org.apache.hadoop.hive.ql.io.HiveInputFormat

2015-11-01 Thread Jone Zhang
Please check and attach the application master log.

2015-11-02 8:03 GMT+08:00 Jagat Singh :

> Hi,
>
> I am trying to run Hive on Spark on HDP Virtual machine 2.3
>
> Following wiki
> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
>
> I have replaced all the occurrences of hdp.version with 2.3.0.0-2557
>
> I start hive with following
>
> set hive.execution.engine=spark;
> set spark.master=yarn-client;
> set spark.executor.memory=512m;
>
> I run the query
>
> select count(*) from sample_07;
>
> The query starts and fails with following error.
>
> In console
>
> Status: Running (Hive on Spark job[0])
> Job Progress Format
> CurrentTime StageId_StageAttemptId:
> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
> [StageCost]
> 2015-11-01 23:40:26,411 Stage-0_0: 0(+1)/1 Stage-1_0: 0/1
> state = FAILED
> Status: Failed
> FAILED: Execution Error, return code 3 from
> org.apache.hadoop.hive.ql.exec.spark.SparkTask
> hive> select count(*) from sample_07;
>
> In the logs
>
> 15/11/01 23:55:36 [stdout-redir-1]: INFO client.SparkClientImpl: 2015-11-01 
> 23:55:36,313 INFO  - [pool-1-thread-1:] ~ Failed to run job 
> b8649c92-1504-43c7-8100-020b866e58da (RemoteDriver:389)
> 15/11/01 23:55:36 [stdout-redir-1]: INFO client.SparkClientImpl: 
> java.util.concurrent.ExecutionException: Exception thrown by job
> 15/11/01 23:55:36 [stdout-redir-1]: INFO client.SparkClientImpl:  at 
> org.apache.spark.JavaFutureActionWrapper.getImpl(FutureAction.scala:311)
> 15/11/01 23:55:36 [stdout-redir-1]: INFO client.SparkClientImpl:  at 
> org.apache.spark.JavaFutureActionWrapper.get(FutureAction.scala:316)
> 15/11/01 23:55:36 [stdout-redir-1]: INFO client.SparkClientImpl:  at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:382)
> 15/11/01 23:55:36 [stdout-redir-1]: INFO client.SparkClientImpl:  at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:335)
> 15/11/01 23:55:36 [stdout-redir-1]: INFO client.SparkClientImpl:  at 
> java.util.concurrent.FutureTask.run(FutureTask.java:262)
> 15/11/01 23:55:36 [stdout-redir-1]: INFO client.SparkClientImpl:  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 15/11/01 23:55:36 [stdout-redir-1]: INFO client.SparkClientImpl:  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 15/11/01 23:55:36 [stdout-redir-1]: INFO client.SparkClientImpl:  at 
> java.lang.Thread.run(Thread.java:745)
> 15/11/01 23:55:36 [stdout-redir-1]: INFO client.SparkClientImpl: Caused by: 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 
> (TID 3, sandbox.hortonworks.com): java.lang.NullPointerException
> 15/11/01 23:55:36 [stdout-redir-1]: INFO client.SparkClientImpl:  at 
> org.apache.hadoop.hive.ql.io.HiveInputFormat.init(HiveInputFormat.java:255)
> 15/11/01 23:55:36 [stdout-redir-1]: INFO client.SparkClientImpl:  at 
> org.apache.hadoop.hive.ql.io.HiveInputFormat.pushProjectionsAndFilters(HiveInputFormat.java:437)
> 15/11/01 23:55:36 [stdout-redir-1]: INFO client.SparkClientImpl:  at 
> org.apache.hadoop.hive.ql.io.HiveInputFormat.pushProjectionsAndFilters(HiveInputFormat.java:430)
> 15/11/01 23:55:36 [stdout-redir-1]: INFO client.SparkClientImpl:  at 
> org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:587)
> 15/11/01 23:55:36 [stdout-redir-1]: INFO client.SparkClientImpl:  at 
> org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:236)
> 15/11/01 23:55:36 [stdout-redir-1]: INFO client.SparkClientImpl:  at 
> org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:212)
> 15/11/01 23:55:36 [stdout-redir-1]: INFO client.SparkClientImpl:  at 
> org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
> 15/11/01 23:55:36 [stdout-redir-1]: INFO client.SparkClientImpl:  at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> 15/11/01 23:55:36 [stdout-redir-1]: INFO client.SparkClientImpl:  at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> 15/11/01 23:55:36 [stdout-redir-1]: INFO client.SparkClientImpl:  at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> 15/11/01 23:55:36 [stdout-redir-1]: INFO client.SparkClientImpl:  at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> 15/11/01 23:55:36 [stdout-redir-1]: INFO client.SparkClientImpl:  at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> 15/11/01 23:55:36 [stdout-redir-1]: INFO client.SparkClientImpl:  at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> 15/11/01 23:55:36 [stdout-redir-1]: INFO client.SparkClientImpl:  at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> 15/11/01 23:55:36 [stdout-redir-1]: INFO client

Re: Hive on Spark

2015-10-23 Thread Jone Zhang
I get this error every time I run a query on a large data set. I think
using MEMORY_AND_DISK could avoid this problem under the limited
resources.
"15/10/23 17:37:13 Reporter WARN
org.apache.spark.deploy.yarn.YarnAllocator>> Container killed by YARN for
exceeding memory limits. 7.6 GB of 7.5 GB physical memory used. Consider
boosting spark.yarn.executor.memoryOverhead."
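
A hedged sketch of the adjustment the warning itself suggests, set from the
Hive session (the value is an illustrative assumption; raise it until the
containers stop being killed):

set hive.execution.engine=spark;
set spark.yarn.executor.memoryOverhead=2048;  -- illustrative value, in MB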

2015-10-23 19:40 GMT+08:00 Xuefu Zhang :

> Yeah. for that, you cannot really cache anything through Hive on Spark.
> Could you detail more what you want to achieve?
>
> When needed, Hive on Spark uses memory+disk for storage level.
>
> On Fri, Oct 23, 2015 at 4:29 AM, Jone Zhang 
> wrote:
>
>> 1.But It's no way to set Storage Level through properties file in spark,
>> Spark provided "def persist(newLevel: StorageLevel)"
>> api only...
>>
>> 2015-10-23 19:03 GMT+08:00 Xuefu Zhang :
>>
>>> quick answers:
>>> 1. you can pretty much set any spark configuration at hive using set
>>> command.
>>> 2. no. you have to make the call.
>>>
>>>
>>>
>>> On Thu, Oct 22, 2015 at 10:32 PM, Jone Zhang 
>>> wrote:
>>>
>>>> 1.How can i set Storage Level when i use Hive on Spark?
>>>> 2.Do Spark have any intention of  dynamically determined Hive on
>>>> MapReduce or Hive on Spark, base on SQL features.
>>>>
>>>> Thanks in advance
>>>> Best regards
>>>>
>>>
>>>
>>
>


Re: Hive on Spark

2015-10-23 Thread Jone Zhang
1. But there is no way to set the storage level through a properties file
in Spark; Spark provides only the "def persist(newLevel: StorageLevel)"
API...

2015-10-23 19:03 GMT+08:00 Xuefu Zhang :

> quick answers:
> 1. you can pretty much set any spark configuration at hive using set
> command.
> 2. no. you have to make the call.
>
>
>
> On Thu, Oct 22, 2015 at 10:32 PM, Jone Zhang 
> wrote:
>
>> 1.How can i set Storage Level when i use Hive on Spark?
>> 2.Do Spark have any intention of  dynamically determined Hive on
>> MapReduce or Hive on Spark, base on SQL features.
>>
>> Thanks in advance
>> Best regards
>>
>
>


Hive on Spark

2015-10-22 Thread Jone Zhang
1. How can I set the storage level when I use Hive on Spark?
2. Is there any plan to dynamically choose between Hive on MapReduce and
Hive on Spark, based on SQL features?

Thanks in advance
Best regards
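
For question 1, a minimal sketch following the answers above: Spark
properties can be set from the Hive session, while the storage level itself
is chosen by Hive on Spark (memory plus disk) rather than by a user property;
the values below are illustrative assumptions.

set hive.execution.engine=spark;
set spark.rdd.compress=true;            -- compress serialized cached partitions
set spark.executor.memory=7000m;        -- more room to keep data in memory
set spark.yarn.executor.memoryOverhead=1024;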