Thanks All. I will implement the suggested points and share the output. Thanks again for all the help.
Thanks and Regards
Nishant Aggarwal, PMP
Cell No:- +91 99588 94305

On Fri, Aug 21, 2015 at 10:33 AM, Jörn Franke <jornfra...@gmail.com> wrote:

> Additionally, although it is a PoC, you should have a realistic data
> model, and good data-modeling practices should be followed. Joining on a
> double is not one of them; the join key should be an int. Furthermore,
> double is a type that is rarely used in most scenarios. In the business
> world you usually have something like decimal(precision,scale), where you
> define precision and scale exactly (usually x,2 for anything related to
> money). The reason is that rounding needs to be consistent across all
> analytics applications.
> You may also, of course, partition the table.
>
> On Thu, Aug 20, 2015 at 3:46 PM, Xuefu Zhang <xzh...@cloudera.com> wrote:
>
>> Please check out HIVE-11502. For your PoC, you can simply get around it
>> by using other data types instead of double.
>>
>> On Thu, Aug 20, 2015 at 2:08 AM, Nishant Aggarwal <nishant....@gmail.com> wrote:
>>
>>> Thanks for the reply, Noam. I have already tried the latter point of
>>> dividing the query, but the challenge comes during the join of the
>>> tables.
>>>
>>> Thanks and Regards
>>> Nishant Aggarwal, PMP
>>> Cell No:- +91 99588 94305
>>>
>>> On Thu, Aug 20, 2015 at 2:19 PM, Noam Hasson <noam.has...@kenshoo.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Have you looked at the counters on the Hadoop side? It's possible you
>>>> are dealing with a bad join that causes multiplication of records; if
>>>> you see a huge number of input/output records in the map/reduce phases
>>>> that keeps increasing, that's probably the case.
>>>>
>>>> Another thing I would try is to divide the job into several smaller
>>>> queries: for example, start with the filter only, then the join, and
>>>> so on.
>>>>
>>>> Noam.
>>>>
>>>> On Thu, Aug 20, 2015 at 10:55 AM, Nishant Aggarwal <nishant....@gmail.com> wrote:
>>>>
>>>>> Dear Hive Users,
>>>>>
>>>>> I am in the process of running a PoC for one of my customers,
>>>>> demonstrating the huge performance benefits of Hadoop BigData using
>>>>> Hive.
>>>>>
>>>>> The following is the problem statement I am stuck with.
>>>>>
>>>>> I have generated a large table with 28 columns (all doubles). The
>>>>> table size on disk is 70GB, with more than 450 million records (I
>>>>> ultimately created a compressed table using the ORC format to save
>>>>> disk space, bringing the table size down to < 1GB).
>>>>>
>>>>> In order to demonstrate a complex use case, I joined this table with
>>>>> itself. The following are the queries I used to create the table and
>>>>> the join query I am running.
>>>>>
>>>>> *Create Table and Loading Data, Hive parameter settings:*
>>>>> set hive.vectorized.execution.enabled = true;
>>>>> set hive.vectorized.execution.reduce.enabled = true;
>>>>> set mapred.max.split.size=100000000;
>>>>> set mapred.min.split.size=1000000;
>>>>> set hive.auto.convert.join=false;
>>>>> set hive.enforce.sorting=true;
>>>>> set hive.enforce.bucketing=true;
>>>>> set hive.exec.dynamic.partition=true;
>>>>> set hive.exec.dynamic.partition.mode=nonstrict;
>>>>> set mapreduce.reduce.input.limit=-1;
>>>>> set hive.exec.parallel = true;
>>>>>
>>>>> CREATE TABLE huge_numeric_table_orc2(col1 double,col2 double,col3
>>>>> double,col4 double,col5 double,col6 double,col7 double,col8 double,
>>>>> col9 double,col10 double,col11 double,col12 double,col13 double,
>>>>> col14 double,col15 double,col16 double,col17 double,col18 double,
>>>>> col19 double,col20 double,col21 double,col22 double,col23 double,
>>>>> col24 double,col25 double,col26 double,col27 double,col28 double)
>>>>> clustered by (col1) sorted by (col1) into 240 buckets
>>>>> STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
>>>>>
>>>>> from huge_numeric_table insert overwrite table huge_numeric_table_orc2
>>>>> select * sort by col1;
>>>>>
>>>>> *JOIN QUERY:*
>>>>>
>>>>> select (avg(t1.col1)*avg(t1.col6))/(avg(t1.col11)*avg(t1.col16)) as AVG5
>>>>> from huge_numeric_table_orc2 t1 left outer join huge_numeric_table_orc2 t2
>>>>> on t1.col1=t2.col1 where (t1.col1) > 34.11 and (t2.col1) > 10.12
>>>>>
>>>>> *The problem is that this query gets stuck at 80-85% in the reducers
>>>>> and loops forever, never finishing.*
>>>>>
>>>>> The Hive version is 1.2.
>>>>>
>>>>> Please help.
>>>>>
>>>>> Thanks and Regards
>>>>> Nishant Aggarwal, PMP
>>>>> Cell No:- +91 99588 94305
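
For reference, a minimal sketch of the data-model change Jörn suggests, assuming a hypothetical integer surrogate key (id) and illustrative decimal precisions; the column names and scales below are assumptions, not part of the original schema:

-- Hypothetical reworked schema: join on an int key and store money-like
-- values as decimal(precision,scale) instead of double.
CREATE TABLE huge_numeric_table_dec (
  id     INT,            -- surrogate join key: int, not double
  amount DECIMAL(18,2),  -- exact scale, e.g. x,2 for money
  ratio  DECIMAL(10,4)   -- precision/scale chosen per column
)
STORED AS ORC tblproperties ("orc.compress"="SNAPPY");

-- Joining on the int key avoids floating-point equality comparisons:
SELECT avg(t1.amount)
FROM huge_numeric_table_dec t1
JOIN huge_numeric_table_dec t2 ON t1.id = t2.id;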
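
Likewise, a sketch of Noam's divide-the-job approach against the tables above, materializing the filter step so record counts can be checked before the join runs (the *_filtered table names are hypothetical):

-- Step 1: filters only; the row counts of these tables can be compared
-- with the map/reduce counters to spot unexpected blow-ups early.
CREATE TABLE t1_filtered STORED AS ORC AS
SELECT col1, col6, col11, col16
FROM huge_numeric_table_orc2
WHERE col1 > 34.11;

CREATE TABLE t2_filtered STORED AS ORC AS
SELECT col1
FROM huge_numeric_table_orc2
WHERE col1 > 10.12;

-- Step 2: the join on the reduced inputs. The original WHERE on t2.col1
-- already turns the left outer join into an inner join, so a plain JOIN
-- is used here. If the reducers still stall at this step, duplicate col1
-- values multiplying rows in the self-join are the likely cause.
SELECT (avg(t1.col1) * avg(t1.col6)) / (avg(t1.col11) * avg(t1.col16)) AS avg5
FROM t1_filtered t1
JOIN t2_filtered t2 ON t1.col1 = t2.col1;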