Hi,

this is a very common question, as many people who know SQL are used to
RDBMS such as MySQL, Oracle, or SQL Server.
The first thing to know about Hive is that it was not designed to replace
such databases, at least not when they are used for transaction processing.

This is stated in the Hive Tutorial, in the section "What Hive is NOT"
https://cwiki.apache.org/confluence/display/Hive/Tutorial

Transactionality is a feature that was added later, and as Jörn stated,
it can be fast when used with Tez + LLAP.
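
To give an idea, here is a minimal sketch of what an ACID table looks
like, assuming Hive 1.x/2.x with the transaction manager enabled (the
table name people_acid is just an illustration):

SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- ACID tables must be stored as ORC and, before Hive 3, bucketed
CREATE TABLE people_acid (id INT, name STRING, age INT)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

Even then, every single-row INSERT still launches a job, so the overhead
described below does not go away on its own.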

In the logs you sent, you can see that Hive is running a MapReduce job to
perform your task:

MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 2.7 sec   HDFS Read: 3836 HDFS
Write: 81 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 700 msec
OK

If you are familiar with MapReduce, you should know that however small
your dataset is, what takes time here is:

   1. Provisioning executors on YARN
   2. Starting one JVM per mapper and reducer (here you only have 1 mapper
   and 0 reducers, as stated by the logs). Starting a JVM generally takes
   a few seconds nowadays, except when it has to load 200 MB of Hadoop
   jars first, in which case it takes around 10 seconds.
   3. Performing the task itself (a few milliseconds in your case)
   4. Persisting the results on HDFS (which requires a few HDFS operations
   and can take from a few seconds to minutes if you write a lot of files,
   but in your case should be quick)

The main improvement that a Tez + LLAP or Spark backend brings is that the
first two steps are already done and waiting for your queries to run.
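
Provided the engine is installed on your cluster, you can switch backends
per session from the Hive shell (the tez and spark values below assume
your distribution ships those engines):

-- run the following queries on Tez instead of MapReduce
SET hive.execution.engine=tez;
-- or, if Hive-on-Spark is configured on the cluster
SET hive.execution.engine=spark;
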
Hive has been designed in the first place to go faster than RDBMS in the
cases where:
- your data is too large to fit on a single instance, and sharding is
painful
- your jobs mostly consist of analytical processing, like full-table
aggregations

In such cases, the correct way to use Hive is to partition your table by
day (if you run nightly batches) and to generate a new partition every day.
If you want to change something in your table (e.g. fix a bug), you just
regenerate the affected partition.
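
For example, a daily-partitioned layout could look like this (the ds
partition column and the staging_people and people_by_day names are
illustrative, not something from your setup):

CREATE TABLE people_by_day (id INT, name STRING, age INT)
PARTITIONED BY (ds STRING);

-- nightly batch: (re)generate the whole day's partition in one bulk write
INSERT OVERWRITE TABLE people_by_day PARTITION (ds='2017-09-11')
SELECT id, name, age FROM staging_people;

-- if you must insert rows by hand, batch them into a single statement
-- (multi-row VALUES is supported since Hive 0.14)
INSERT INTO TABLE people_by_day PARTITION (ds='2017-09-11')
VALUES (1, 'Tom A', 20), (2, 'Tom B', 30);

That way you pay the job-launch overhead once per batch instead of once
per row.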

If you need fast response times for updating and fetching single records,
plus scalability, perhaps you should look into HBase, Cassandra, or Kudu.




On Mon, Sep 11, 2017 at 9:18 PM, Jörn Franke <jornfra...@gmail.com> wrote:

> Why do you want to do single inserts?
> Hive has been designed more for bulk loads.
> In any case, newer versions of Hive 2 using Tez + LLAP improve it
> significantly (also for bulk analysis). Nevertheless, it is good practice
> not to use single inserts in an analytics system, but to combine records
> and bulk-load them.
>
> On 11. Sep 2017, at 21:01, Jinhui Qin <qin.jin...@gmail.com> wrote:
>
>
>
> Hi,
>
> I am new to Hive. I just created a simple table in hive and inserted two
> records, the first insertion took 16.4 sec, while the second took 14.3 sec.
> Why is that very slow? is this the normal performance you get in Hive using
> INSERT ? Is there a way to improve the performance of a single "insert" in
> Hive? Any help would be really appreciated. Thanks!
>
> Here is the record from a terminal in Hive shell:
>
> =========================
>
> hive> show tables;
> OK
> Time taken: 2.758 seconds
> hive> create table people(id int, name string, age int);
> OK
> Time taken: 0.283 seconds
> hive> insert into table people values(1,'Tom A', 20);
> Query ID = hive_20170911134052_04680c79-432a-43e0-827b-29a4212fbbc0
> Total jobs = 3
> Launching Job 1 out of 3
> Number of reduce tasks is set to 0 since there's no reduce operator
> Starting Job = job_1505146047428_0098, Tracking URL =
> http://iop-hadoop-bi.novalocal:8088/proxy/application_1505146047428_0098/
> Kill Command = /usr/iop/4.1.0.0/hadoop/bin/hadoop job  -kill
> job_1505146047428_0098
> Hadoop job information for Stage-1: number of mappers: 1; number of
> reducers: 0
> 2017-09-11 13:41:01,492 Stage-1 map = 0%,  reduce = 0%
> 2017-09-11 13:41:06,940 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU
> 2.7 sec
> MapReduce Total cumulative CPU time: 2 seconds 700 msec
> Ended Job = job_1505146047428_0098
> Stage-4 is selected by condition resolver.
> Stage-3 is filtered out by condition resolver.
> Stage-5 is filtered out by condition resolver.
> Moving data to: hdfs://iop-hadoop-bi.novalocal:8020/apps/hive/warehouse/
> people/.hive-staging_hive_2017-09-11_13-40-52_106_462156758110461544
> 1-1/-ext-10000
> Loading data to table default.people
> Table default.people stats: [numFiles=1, numRows=1, totalSize=11,
> rawDataSize=10]
> MapReduce Jobs Launched:
> Stage-Stage-1: Map: 1   Cumulative CPU: 2.7 sec   HDFS Read: 3836 HDFS
> Write: 81 SUCCESS
> Total MapReduce CPU Time Spent: 2 seconds 700 msec
> OK
> Time taken: 16.417 seconds
> hive> insert into table people values(1,'Tom A', 20);
> Query ID = hive_20170911134128_c8f46977-7718-4496-9a98-cce0f89ced79
> Total jobs = 3
> Launching Job 1 out of 3
> Number of reduce tasks is set to 0 since there's no reduce operator
> Starting Job = job_1505146047428_0099, Tracking URL =
> http://iop-hadoop-bi.novalocal:8088/proxy/application_1505146047428_0099/
> Kill Command = /usr/iop/4.1.0.0/hadoop/bin/hadoop job  -kill
> job_1505146047428_0099
> Hadoop job information for Stage-1: number of mappers: 1; number of
> reducers: 0
> 2017-09-11 13:41:36,289 Stage-1 map = 0%,  reduce = 0%
> 2017-09-11 13:41:40,721 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU
> 2.28 sec
> MapReduce Total cumulative CPU time: 2 seconds 280 msec
> Ended Job = job_1505146047428_0099
> Stage-4 is selected by condition resolver.
> Stage-3 is filtered out by condition resolver.
> Stage-5 is filtered out by condition resolver.
> Moving data to: hdfs://iop-hadoop-bi.novalocal:8020/apps/hive/warehouse/
> people/.hive-staging_hive_2017-09-11_13-41-28_757_445847252207124056
> 7-1/-ext-10000
> Loading data to table default.people
> Table default.people stats: [numFiles=2, numRows=2, totalSize=22,
> rawDataSize=20]
> MapReduce Jobs Launched:
> Stage-Stage-1: Map: 1   Cumulative CPU: 2.28 sec   HDFS Read: 3924 HDFS
> Write: 81 SUCCESS
> Total MapReduce CPU Time Spent: 2 seconds 280 msec
> OK
> Time taken: 14.288 seconds
> hive> exit;
> =================
>
>
> Jinhui
>
>
