Re: hive on spark job not start enough executors

2016-09-09 Thread
All the parameters except spark.executor.instances are specified in 
spark-defaults.conf in Hive's conf folder, so I think the answer is yes.

I also checked the Spark web UI while a Hive on Spark job was running; the 
parameters shown there are exactly what I specified in the config file, 
including spark.shuffle.service.enabled and spark.dynamicAllocation.enabled.


Should I specify a fixed executor.instances in the file? That wouldn't work 
well for me.
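
A sketch of one alternative: bound the executor count instead of fixing it. 
The spark-defaults.conf fragment below uses standard Spark 1.6 
dynamic-allocation properties, but the counts are illustrative assumptions, 
not values from this thread:

# Counts below are illustrative; tune to the cluster.
spark.shuffle.service.enabled             true
spark.dynamicAllocation.enabled           true
# Floor: never fewer than this many executors.
spark.dynamicAllocation.minExecutors      4
# Initial request when the session starts.
spark.dynamicAllocation.initialExecutors  8
# Ceiling: never more than this many executors.
spark.dynamicAllocation.maxExecutors      30

With a floor and a ceiling in place, dynamic allocation can still scale within 
that range even if the engine initially requests too few executors.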


By the way, the data source of my query is Parquet files. On the Hive side I 
just created an external table over the Parquet data.



Thanks,

Minghao Feng


From: Mich Talebzadeh 
Sent: Friday, September 9, 2016 4:49:55 PM
To: user
Subject: Re: hive on spark job not start enough executors

When you start Hive on Spark, do you set any parameters for the submitted job 
(or read them from an init file)?

set spark.master=yarn;
set spark.deploy.mode=client;
set spark.executor.memory=3g;
set spark.driver.memory=3g;
set spark.executor.instances=2;
set spark.ui.port=;


Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.



On 9 September 2016 at 09:30, 明浩 冯 <qiuff...@hotmail.com> wrote:

Hi there,


I encountered a problem that gives Hive on Spark very low performance.

I'm using Spark 1.6.2 and Hive 2.1.0, and I specified


spark.shuffle.service.enabled    true
spark.dynamicAllocation.enabled  true

in my spark-defaults.conf file (the file is in both the Spark and Hive conf 
folders) so that Spark jobs acquire executors dynamically.
The configuration works correctly when I run plain Spark jobs, but when I use 
Hive on Spark, it starts only a few executors, although there are more than 
enough cores and memory to start more.
For example, for the same SQL query, Spark SQL can start more than 20 
executors, but Hive on Spark starts only 3.

How can I improve the performance of Hive on Spark? Any suggestions are appreciated.

Thanks,
Minghao Feng




Re: hive throws ConcurrentModificationException when executing insert overwrite table

2016-08-17 Thread
Hi Gopal,


It works after I disabled dfs.namenode.acls.enabled.

As for the data loss, it doesn't affect me much at the moment, but I will 
track the issue in Kylin.

Thank you very much for your detailed explanation and solution. You saved me!


Best Regards,

Minghao Feng


From: Gopal Vijayaraghavan  on behalf of Gopal 
Vijayaraghavan 
Sent: Wednesday, August 17, 2016 1:18:54 PM
To: user@hive.apache.org
Subject: Re: hive throws ConcurrentModificationException when executing insert 
overwrite table


> Yes, Kylin generated the query. I'm using Kylin 1.5.3.

I would report a bug to Kylin about DISTRIBUTE BY RAND().

This is what happens when a node which ran a Map task fails and the whole
task is retried.

Assume that the first attempt of Map task0 wrote "value1" into
reducer-99, because RAND() returned 99.

Now the map task succeeds, the reducers start, and reducer-0 runs
successfully, writing its output file (_0).

But before reducer-99 runs, the node which ran Map task0 crashes.

So the engine re-runs Map task0 on another node. Except that, because RAND()
is completely random, the re-run may emit 0 instead of 99 for "value1".

The reducer-0 shuffle output from the re-run of Map task0 now contains
"value1", but no task will ever read it out or write it out, because
reducer-0 has already finished.

In short, the output of the table will not contain "value1", despite the
input and the shuffle outputs containing "value1".

I would replace DISTRIBUTE BY RAND() with SORT BY 0, which gives a random
distribution without data loss.
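
As a sketch of that rewrite, using the table and column names from the
Kylin-generated query quoted in the original post below:

INSERT OVERWRITE TABLE kylin_intermediate_prom_group_by_ws_name_cur_cube_1970010101_2010010101
SELECT TBL_HIS_UWIP_SCAN_PROM.ORDER_NAME
FROM TESTMES.TBL_HIS_UWIP_SCAN_PROM AS TBL_HIS_UWIP_SCAN_PROM
WHERE (TBL_HIS_UWIP_SCAN_PROM.START_TIME >= '1970-01-01 01:00:00'
   AND TBL_HIS_UWIP_SCAN_PROM.START_TIME < '2010-01-01 01:00:00')
-- Per the suggestion above: SORT BY a constant instead of DISTRIBUTE BY
-- RAND(), for a random distribution without retry-related data loss.
SORT BY 0;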

> But I'm still not sure how I can fix the problem. I'm a beginner with Hive
>and Kylin; can the problem be fixed by just changing the Hive or Kylin
>settings?

If you're just experimenting with Kylin right now, I recommend just
disabling the ACL settings in HDFS (this is not permissions btw, ACLs are
permissions++).

Set dfs.namenode.acls.enabled=false in core-site.xml and wherever else it
shows up in your /etc/hadoop/conf, and you should be able to avoid the race
condition.
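
For reference, the property stanza is standard Hadoop configuration XML; a
NameNode restart is typically required for the change to take effect:

<property>
  <name>dfs.namenode.acls.enabled</name>
  <value>false</value>
</property>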

Cheers,
Gopal




Re: hive throws ConcurrentModificationException when executing insert overwrite table

2016-08-16 Thread
Hi Gopal,


Thanks for your comment.

Yes, Kylin generated the query. I'm using Kylin 1.5.3.

But I'm still not sure how I can fix the problem. I'm a beginner with Hive and 
Kylin; can the problem be fixed by just changing the Hive or Kylin settings?

The total data is about 1 billion rows. I'm trying to build a cube as the base 
and then handle the daily increment. Should I split the 1 billion rows into 
hundreds of pieces and then build the cube?


Thanks,

Minghao Feng


From: Gopal Vijayaraghavan  on behalf of Gopal 
Vijayaraghavan 
Sent: Wednesday, August 17, 2016 11:10:45 AM
To: user@hive.apache.org
Subject: Re: hive throws ConcurrentModificationException when executing insert 
overwrite table


> This problem has blocked me for a whole week; does anybody have any ideas?

There might be a race condition here.




aclStatus.getEntries() is being modified without being copied (oddly, with
Kerberos it might be okay).
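
For anyone unfamiliar with the failure mode: mutating a live list returned by
an API while iterating it is exactly what raises
ConcurrentModificationException. A minimal, self-contained Java illustration
(not the actual Hive code; the class and values are made up):

import java.util.ArrayList;
import java.util.List;

public class AclMutationDemo {
    public static void main(String[] args) {
        // Stand-in for the live list returned by aclStatus.getEntries().
        List<String> entries = new ArrayList<>();
        entries.add("user::rwx");
        entries.add("group::r-x");
        // Removing from the list while iterating it makes the next
        // iterator step throw ConcurrentModificationException.
        for (String e : entries) {
            if (e.startsWith("group")) {
                entries.remove(e);
            }
        }
        // The "copied" fix alluded to above: iterate a copy instead, e.g.
        //   for (String e : new ArrayList<>(entries)) { ... }
    }
}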


>> >= '1970-01-01 01:00:00' AND TBL_HIS_UWIP_SCAN_PROM.START_TIME <
>>'2010-01-01 01:00:00') DISTRIBUTE BY RAND();

Did Kylin generate this query? This pattern is known to cause data loss
during runtime.

DISTRIBUTE BY RAND() loses data when map tasks fail.

>at org.apache.hadoop.hdfs.DFSClient.setAcl(DFSClient.java:3242)
...
>at
>org.apache.hadoop.hive.io.HdfsUtils.setFullFileStatus(HdfsUtils.java:126)

> An interesting thing is that if I narrow down the 'where' to make the
>select query return only about 300,000 lines, the insert SQL can be
>completed successfully.

Producing exactly 1 file will fix the issue.
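
A sketch of one way to force exactly one file (an illustration, not a step
prescribed in this thread): distribute on a constant so every row lands in a
single reducer, at the cost of serializing the final write through it:

INSERT OVERWRITE TABLE kylin_intermediate_prom_group_by_ws_name_cur_cube_1970010101_2010010101
SELECT TBL_HIS_UWIP_SCAN_PROM.ORDER_NAME
FROM TESTMES.TBL_HIS_UWIP_SCAN_PROM AS TBL_HIS_UWIP_SCAN_PROM
WHERE (TBL_HIS_UWIP_SCAN_PROM.START_TIME >= '1970-01-01 01:00:00'
   AND TBL_HIS_UWIP_SCAN_PROM.START_TIME < '2010-01-01 01:00:00')
-- A constant distribution key hashes every row to the same reducer,
-- so the insert writes exactly one output file.
DISTRIBUTE BY 0;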

Cheers,
Gopal



Re: hive throws ConcurrentModificationException when executing insert overwrite table

2016-08-16 Thread
Hi,


This problem has blocked me for a whole week; does anybody have any ideas?

Many thanks.


Mh F


From: 明浩 冯 
Sent: Monday, August 15, 2016 2:43:58 PM
To: user@hive.apache.org
Subject: hive throws ConcurrentModificationException when executing insert 
overwrite table


Hi everyone,


When I run the following SQL in beeline, Hive throws a 
ConcurrentModificationException. Does anybody know what's wrong with my Hive, 
or can you give me some ideas to narrow down where the problem is?


INSERT OVERWRITE TABLE kylin_intermediate_prom_group_by_ws_name_cur_cube_1970010101_2010010101
SELECT TBL_HIS_UWIP_SCAN_PROM.ORDER_NAME
FROM TESTMES.TBL_HIS_UWIP_SCAN_PROM AS TBL_HIS_UWIP_SCAN_PROM
WHERE (TBL_HIS_UWIP_SCAN_PROM.START_TIME >= '1970-01-01 01:00:00'
   AND TBL_HIS_UWIP_SCAN_PROM.START_TIME < '2010-01-01 01:00:00')
DISTRIBUTE BY RAND();


My environment:

12 nodes cluster with

Hadoop 2.7.2

Spark 1.6.2

Zookeeper 3.4.6

HBase 1.2.2

Hive 2.1.0

Kylin 1.5.3


I also list some settings from hive-site.xml that may be helpful for 
analyzing the problem:

hive.support.concurrency=true

hive.lock.manager=org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager

hive.execution.engine=spark

hive.server2.transport.mode=http

hive.server2.authentication=NONE


Actually it's one step of building a Kylin cube.  The select query returns 
about 3,000,000 lines. Here is the log I got from hive.log:


2016-08-12T18:43:07,473 INFO  [HiveServer2-Background-Pool: Thread-83]: 
status.SparkJobMonitor (:()) - 2016-08-12 18:43:07,472  Stage-0_0: 58/58 
Finished   Stage-1_0: 13/13 Finished
2016-08-12T18:43:07,476 INFO  [HiveServer2-Background-Pool: Thread-83]: 
status.SparkJobMonitor (:()) - Status: Finished successfully in 264.96 seconds
2016-08-12T18:43:07,488 INFO  [HiveServer2-Background-Pool: Thread-83]: 
spark.SparkTask (:()) - =Spark Job[85a00425-c044-4e22-b54a-f2c12feb4e82] 
statistics=
2016-08-12T18:43:07,488 INFO  [HiveServer2-Background-Pool: Thread-83]: 
spark.SparkTask (:()) - Spark Job[85a00425-c044-4e22-b54a-f2c12feb4e82] Metrics
2016-08-12T18:43:07,488 INFO  [HiveServer2-Background-Pool: Thread-83]: 
spark.SparkTask (:()) - ExecutorDeserializeTime: 157772
2016-08-12T18:43:07,488 INFO  [HiveServer2-Background-Pool: Thread-83]: 
spark.SparkTask (:()) - ExecutorRunTime: 4102583
2016-08-12T18:43:07,488 INFO  [HiveServer2-Background-Pool: Thread-83]: 
spark.SparkTask (:()) - ResultSize: 149069
2016-08-12T18:43:07,488 INFO  [HiveServer2-Background-Pool: Thread-83]: 
spark.SparkTask (:()) - JvmGCTime: 234246
2016-08-12T18:43:07,488 INFO  [HiveServer2-Background-Pool: Thread-83]: 
spark.SparkTask (:()) - ResultSerializationTime: 23
2016-08-12T18:43:07,488 INFO  [HiveServer2-Background-Pool: Thread-83]: 
spark.SparkTask (:()) - MemoryBytesSpilled: 0
2016-08-12T18:43:07,488 INFO  [HiveServer2-Background-Pool: Thread-83]: 
spark.SparkTask (:()) - DiskBytesSpilled: 0
2016-08-12T18:43:07,488 INFO  [HiveServer2-Background-Pool: Thread-83]: 
spark.SparkTask (:()) - BytesRead: 6831052047
2016-08-12T18:43:07,488 INFO  [HiveServer2-Background-Pool: Thread-83]: 
spark.SparkTask (:()) - RemoteBlocksFetched: 702
2016-08-12T18:43:07,489 INFO  [HiveServer2-Background-Pool: Thread-83]: 
spark.SparkTask (:()) - LocalBlocksFetched: 52
2016-08-12T18:43:07,489 INFO  [HiveServer2-Background-Pool: Thread-83]: 
spark.SparkTask (:()) - TotalBlocksFetched: 754
2016-08-12T18:43:07,489 INFO  [HiveServer2-Background-Pool: Thread-83]: 
spark.SparkTask (:()) - FetchWaitTime: 12
2016-08-12T18:43:07,489 INFO  [HiveServer2-Background-Pool: Thread-83]: 
spark.SparkTask (:()) - RemoteBytesRead: 2611264054
2016-08-12T18:43:07,489 INFO  [HiveServer2-Background-Pool: Thread-83]: 
spark.SparkTask (:()) - ShuffleBytesWritten: 2804791500
2016-08-12T18:43:07,489 INFO  [HiveServer2-Background-Pool: Thread-83]: 
spark.SparkTask (:()) - ShuffleWriteTime: 56641742751
2016-08-12T18:43:07,489 INFO  [HiveServer2-Background-Pool: Thread-83]: 
spark.SparkTask (:()) - HIVE
2016-08-12T18:43:07,489 INFO  [HiveServer2-Background-Pool: Thread-83]: 
spark.SparkTask (:()) - CREATED_FILES: 13
2016-08-12T18:43:07,489 INFO  [HiveServer2-Background-Pool: Thread-83]: 
spark.SparkTask (:()) - 
RECORDS_OUT_1_default.kylin_intermediate_prom_group_by_ws_name_cur_cube_1970010101_2010010101:
 271942413
2016-08-12T18:43:07,489 INFO  [HiveServer2-Background-Pool: Thread-83]: 
spark.SparkTask (:()) - RECORDS_IN: 1076808610
2016-08-12T18:43:07,489 INFO  [HiveServer2-Background-Pool: Thread-83]: 
spark.SparkTask (:()) - RECORDS_OUT_INTERMEDIATE: 271942413
2016-08-12T18:43:07,489 INFO  [HiveServer2-Background-Pool: Thread-83]: 
spark.SparkTask (:()) - DESERIALIZE_ERRORS: 0
2016-08-12T18:43:07,489 INFO  [HiveServer2-Background-Pool: Thread-83]