Re: HIVE:1.2, Query taking huge time

2015-08-21 Thread Nishant Aggarwal
Thanks All.
I will implement the suggested points and share the output.

Thanks again for all the help.

Thanks and Regards
Nishant Aggarwal, PMP
Cell No:- +91 99588 94305

On Fri, Aug 21, 2015 at 10:33 AM, Jörn Franke jornfra...@gmail.com wrote:

 Additionally, although it is a PoC, you should have a realistic data model.
 Furthermore, good data modeling practices should be taken into account, and
 joining on a double is not one of them; the join key should be an int.
 Moreover, double is a type that is rarely used in most scenarios. In the
 business world you usually have something like decimal(precision,scale),
 where you define the precision and scale exactly (usually x,2 for everything
 related to money). The reason is that rounding needs to be consistent across
 all analytics applications.
 You may then also, of course, partition the table.

 On Thu, Aug 20, 2015 at 3:46 PM, Xuefu Zhang xzh...@cloudera.com wrote:

 Please check out HIVE-11502. For your PoC, you can simply get around it by
 using other data types instead of double.

 On Thu, Aug 20, 2015 at 2:08 AM, Nishant Aggarwal nishant@gmail.com
 wrote:

  Thanks for the reply, Noam. I have already tried the latter point of
  dividing the query. But the challenge comes during the joining of the
  table.
 
 
  Thanks and Regards
  Nishant Aggarwal, PMP
  Cell No:- +91 99588 94305
 
 
  On Thu, Aug 20, 2015 at 2:19 PM, Noam Hasson noam.has...@kenshoo.com
  wrote:
 
  Hi,
 
  Have you looked at the counters on the Hadoop side? It's possible you are
  dealing with a bad join which causes a multiplication of items; if you see
  a huge number of input/output records in the map/reduce phases and it keeps
  increasing, that's probably the case.

  Another thing I would try is to divide the job into several smaller
  queries, for example start with the filter only, after that the join, and
  so on.
 
  Noam.
 
  On Thu, Aug 20, 2015 at 10:55 AM, Nishant Aggarwal nishant@gmail.com wrote:
 
  Dear Hive Users,
 
  I am in the process of running a PoC for one of my customers, demonstrating
  the huge performance benefits of Hadoop/Big Data using Hive.

  Following is the problem statement I am stuck with.

  I have generated a large table with 28 columns (all doubles). The table
  size on disk is 70GB (I ultimately created a compressed table using the ORC
  format to save disk space, bringing the table size down to 1GB) with more
  than 450 million records.

  In order to demonstrate a complex use case, I joined this table with
  itself. Following are the queries I used to create the table and the join
  query I am using.
 
  *Create Table and Loading Data, Hive parameter settings:*
  set hive.vectorized.execution.enabled = true;
  set hive.vectorized.execution.reduce.enabled = true;
  set mapred.max.split.size=1;
  set mapred.min.split.size=100;
  set hive.auto.convert.join=false;
  set hive.enforce.sorting=true;
  set hive.enforce.bucketing=true;
  set hive.exec.dynamic.partition=true;
  set hive.exec.dynamic.partition.mode=nonstrict;
  set mapreduce.reduce.input.limit=-1;
  set hive.exec.parallel = true;
 
  CREATE TABLE huge_numeric_table_orc2(col1 double,col2 double,col3
  double,col4 double,col5 double,col6 double,col7 double,col8 double,col9
  double,col10 double,col11 double,col12 double,col13 double,col14
  double,col15 double,col16 double,col17 double,col18 double,col19
  double,col20 double,col21 double,col22 double,col23 double,col24
  double,col25 double,col26 double,col27 double,col28 double)
  clustered by (col1) sorted by (col1) into 240 buckets
  STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
 
  from huge_numeric_table insert overwrite table huge_numeric_table_orc2
  select * sort by col1;
 
 
  *JOIN QUERY:*
 
  select (avg(t1.col1)*avg(t1.col6))/(avg(t1.col11)*avg(t1.col16)) as AVG5
  from huge_numeric_table_orc2 t1 left outer join huge_numeric_table_orc2 t2
  on t1.col1=t2.col1 where (t1.col1)  34.11 and (t2.col1) 10.12
 
 
  *The problem is that this query gets stuck at the reducers at 80-85%, goes
  into a loop, and never finishes.*
 
  Version of Hive is 1.2.
 
  Please help.
 
 
  Thanks and Regards
  Nishant Aggarwal, PMP
  Cell No:- +91 99588 94305
 
 
 
 
 
 




Re: HIVE:1.2, Query taking huge time

2015-08-20 Thread Nishant Aggarwal
Thanks for the reply, Noam. I have already tried the latter point of dividing
the query. But the challenge comes during the joining of the table.


Thanks and Regards
Nishant Aggarwal, PMP
Cell No:- +91 99588 94305


On Thu, Aug 20, 2015 at 2:19 PM, Noam Hasson noam.has...@kenshoo.com
wrote:

 Hi,

 Have you looked at the counters on the Hadoop side? It's possible you are
 dealing with a bad join which causes a multiplication of items; if you see a
 huge number of input/output records in the map/reduce phases and it keeps
 increasing, that's probably the case.

 Another thing I would try is to divide the job into several smaller queries,
 for example start with the filter only, after that the join, and so on.

 Noam.

 On Thu, Aug 20, 2015 at 10:55 AM, Nishant Aggarwal nishant@gmail.com
 wrote:

 Dear Hive Users,

 I am in the process of running a PoC for one of my customers, demonstrating
 the huge performance benefits of Hadoop/Big Data using Hive.

 Following is the problem statement I am stuck with.

 I have generated a large table with 28 columns (all doubles). The table size
 on disk is 70GB (I ultimately created a compressed table using the ORC format
 to save disk space, bringing the table size down to 1GB) with more than
 450 million records.

 In order to demonstrate a complex use case, I joined this table with itself.
 Following are the queries I used to create the table and the join query I am
 using.

 *Create Table and Loading Data, Hive parameter settings:*
 set hive.vectorized.execution.enabled = true;
 set hive.vectorized.execution.reduce.enabled = true;
 set mapred.max.split.size=1;
 set mapred.min.split.size=100;
 set hive.auto.convert.join=false;
 set hive.enforce.sorting=true;
 set hive.enforce.bucketing=true;
 set hive.exec.dynamic.partition=true;
 set hive.exec.dynamic.partition.mode=nonstrict;
 set mapreduce.reduce.input.limit=-1;
 set hive.exec.parallel = true;

 CREATE TABLE huge_numeric_table_orc2(col1 double,col2 double,col3
 double,col4 double,col5 double,col6 double,col7 double,col8 double,col9
 double,col10 double,col11 double,col12 double,col13 double,col14
 double,col15 double,col16 double,col17 double,col18 double,col19
 double,col20 double,col21 double,col22 double,col23 double,col24
 double,col25 double,col26 double,col27 double,col28 double)
 clustered by (col1) sorted by (col1) into 240 buckets
 STORED AS ORC tblproperties ("orc.compress"="SNAPPY");

 from huge_numeric_table insert overwrite table huge_numeric_table_orc2
 select * sort by col1;


 *JOIN QUERY:*

 select (avg(t1.col1)*avg(t1.col6))/(avg(t1.col11)*avg(t1.col16)) as AVG5
 from huge_numeric_table_orc2 t1 left outer join huge_numeric_table_orc2 t2
 on t1.col1=t2.col1 where (t1.col1)  34.11 and (t2.col1) 10.12


 *The problem is that this query gets stuck at the reducers at 80-85%, goes
 into a loop, and never finishes.*

 Version of Hive is 1.2.

 Please help.


 Thanks and Regards
 Nishant Aggarwal, PMP
 Cell No:- +91 99588 94305





Re: HIVE:1.2, Query taking huge time

2015-08-20 Thread Noam Hasson
Hi,

Have you looked at the counters on the Hadoop side? It's possible you are
dealing with a bad join which causes a multiplication of items; if you see a
huge number of input/output records in the map/reduce phases and it keeps
increasing, that's probably the case.
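For example, one quick check (just a sketch, reusing the table and join column
from the original mail) is to count how many rows share each join key; a few
very frequent keys are enough to make a self-join explode:

-- rows per join key: the top keys multiply with themselves in a self-join
SELECT col1, cnt
FROM (
  SELECT col1, COUNT(*) AS cnt
  FROM huge_numeric_table_orc2
  GROUP BY col1
) t
ORDER BY cnt DESC
LIMIT 20;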

Another thing I would try is to divide the job into several smaller queries,
for example start with the filter only, after that the join, and so on.
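A rough sketch of that staging (the staging table names below are only
illustrative, and the WHERE predicates are placeholders for the ones in the
real query):

-- Stage 1: run each filter on its own and materialize the small results
CREATE TABLE t1_filtered STORED AS ORC AS
SELECT col1, col6, col11, col16
FROM huge_numeric_table_orc2
WHERE col1 > 0;   -- replace with the actual t1 predicate

CREATE TABLE t2_filtered STORED AS ORC AS
SELECT col1
FROM huge_numeric_table_orc2
WHERE col1 > 0;   -- replace with the actual t2 predicate

-- Stage 2: join only the filtered (much smaller) tables
SELECT (avg(t1.col1) * avg(t1.col6)) / (avg(t1.col11) * avg(t1.col16)) AS avg5
FROM t1_filtered t1
LEFT OUTER JOIN t2_filtered t2 ON t1.col1 = t2.col1;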

Noam.

On Thu, Aug 20, 2015 at 10:55 AM, Nishant Aggarwal nishant@gmail.com
wrote:

 Dear Hive Users,

 I am in the process of running a PoC for one of my customers, demonstrating
 the huge performance benefits of Hadoop/Big Data using Hive.

 Following is the problem statement I am stuck with.

 I have generated a large table with 28 columns (all doubles). The table size
 on disk is 70GB (I ultimately created a compressed table using the ORC format
 to save disk space, bringing the table size down to 1GB) with more than
 450 million records.

 In order to demonstrate a complex use case, I joined this table with itself.
 Following are the queries I used to create the table and the join query I am
 using.

 *Create Table and Loading Data, Hive parameter settings:*
 set hive.vectorized.execution.enabled = true;
 set hive.vectorized.execution.reduce.enabled = true;
 set mapred.max.split.size=1;
 set mapred.min.split.size=100;
 set hive.auto.convert.join=false;
 set hive.enforce.sorting=true;
 set hive.enforce.bucketing=true;
 set hive.exec.dynamic.partition=true;
 set hive.exec.dynamic.partition.mode=nonstrict;
 set mapreduce.reduce.input.limit=-1;
 set hive.exec.parallel = true;

 CREATE TABLE huge_numeric_table_orc2(col1 double,col2 double,col3
 double,col4 double,col5 double,col6 double,col7 double,col8 double,col9
 double,col10 double,col11 double,col12 double,col13 double,col14
 double,col15 double,col16 double,col17 double,col18 double,col19
 double,col20 double,col21 double,col22 double,col23 double,col24
 double,col25 double,col26 double,col27 double,col28 double)
 clustered by (col1) sorted by (col1) into 240 buckets
 STORED AS ORC tblproperties ("orc.compress"="SNAPPY");

 from huge_numeric_table insert overwrite table huge_numeric_table_orc2
 select * sort by col1;


 *JOIN QUERY:*

 select (avg(t1.col1)*avg(t1.col6))/(avg(t1.col11)*avg(t1.col16)) as AVG5
 from huge_numeric_table_orc2 t1 left outer join huge_numeric_table_orc2 t2
 on t1.col1=t2.col1 where (t1.col1)  34.11 and (t2.col1) 10.12


 *The problem is that this query gets stuck at the reducers at 80-85%, goes
 into a loop, and never finishes.*

 Version of Hive is 1.2.

 Please help.


 Thanks and Regards
 Nishant Aggarwal, PMP
 Cell No:- +91 99588 94305





Re: HIVE:1.2, Query taking huge time

2015-08-20 Thread Jörn Franke
Additionally, although it is a PoC, you should have a realistic data model.
Furthermore, good data modeling practices should be taken into account, and
joining on a double is not one of them; the join key should be an int.
Moreover, double is a type that is rarely used in most scenarios. In the
business world you usually have something like decimal(precision,scale),
where you define the precision and scale exactly (usually x,2 for everything
related to money). The reason is that rounding needs to be consistent across
all analytics applications.
You may then also, of course, partition the table.
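As a sketch of what that can look like (the table and columns below are made up
for illustration, not the actual PoC schema):

CREATE TABLE sales (
  sale_id     bigint,
  customer_id int,             -- join on an integer surrogate key, not a double
  amount      decimal(18,2)    -- money with an explicit precision and scale
)
STORED AS ORC;

-- joins then go through the int key:
-- SELECT ... FROM sales s JOIN customers c ON s.customer_id = c.customer_id;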

On Thu, Aug 20, 2015 at 3:46 PM, Xuefu Zhang xzh...@cloudera.com wrote:

 Please check out HIVE-11502. For your PoC, you can simply get around it by
 using other data types instead of double.

 On Thu, Aug 20, 2015 at 2:08 AM, Nishant Aggarwal nishant@gmail.com
 wrote:

  Thanks for the reply, Noam. I have already tried the latter point of
  dividing the query. But the challenge comes during the joining of the
  table.
 
 
  Thanks and Regards
  Nishant Aggarwal, PMP
  Cell No:- +91 99588 94305
 
 
  On Thu, Aug 20, 2015 at 2:19 PM, Noam Hasson noam.has...@kenshoo.com
  wrote:
 
  Hi,
 
  Have you looked at the counters on the Hadoop side? It's possible you are
  dealing with a bad join which causes a multiplication of items; if you see
  a huge number of input/output records in the map/reduce phases and it keeps
  increasing, that's probably the case.

  Another thing I would try is to divide the job into several smaller
  queries, for example start with the filter only, after that the join, and
  so on.
 
  Noam.
 
  On Thu, Aug 20, 2015 at 10:55 AM, Nishant Aggarwal nishant@gmail.com wrote:
 
  Dear Hive Users,
 
  I am in the process of running a PoC for one of my customers, demonstrating
  the huge performance benefits of Hadoop/Big Data using Hive.

  Following is the problem statement I am stuck with.

  I have generated a large table with 28 columns (all doubles). The table
  size on disk is 70GB (I ultimately created a compressed table using the ORC
  format to save disk space, bringing the table size down to 1GB) with more
  than 450 million records.

  In order to demonstrate a complex use case, I joined this table with
  itself. Following are the queries I used to create the table and the join
  query I am using.
 
  *Create Table and Loading Data, Hive parameter settings:*
  set hive.vectorized.execution.enabled = true;
  set hive.vectorized.execution.reduce.enabled = true;
  set mapred.max.split.size=1;
  set mapred.min.split.size=100;
  set hive.auto.convert.join=false;
  set hive.enforce.sorting=true;
  set hive.enforce.bucketing=true;
  set hive.exec.dynamic.partition=true;
  set hive.exec.dynamic.partition.mode=nonstrict;
  set mapreduce.reduce.input.limit=-1;
  set hive.exec.parallel = true;
 
  CREATE TABLE huge_numeric_table_orc2(col1 double,col2 double,col3
  double,col4 double,col5 double,col6 double,col7 double,col8 double,col9
  double,col10 double,col11 double,col12 double,col13 double,col14
  double,col15 double,col16 double,col17 double,col18 double,col19
  double,col20 double,col21 double,col22 double,col23 double,col24
  double,col25 double,col26 double,col27 double,col28 double)
  clustered by (col1) sorted by (col1) into 240 buckets
  STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
 
  from huge_numeric_table insert overwrite table huge_numeric_table_orc2
  select * sort by col1;
 
 
  *JOIN QUERY:*
 
  select (avg(t1.col1)*avg(t1.col6))/(avg(t1.col11)*avg(t1.col16)) as AVG5
  from huge_numeric_table_orc2 t1 left outer join huge_numeric_table_orc2 t2
  on t1.col1=t2.col1 where (t1.col1)  34.11 and (t2.col1) 10.12
 
 
  *The problem is that this query gets stuck at the reducers at 80-85%, goes
  into a loop, and never finishes.*
 
  Version of Hive is 1.2.
 
  Please help.
 
 
  Thanks and Regards
  Nishant Aggarwal, PMP
  Cell No:- +91 99588 94305
 
 
 
 
 
 



Re: HIVE:1.2, Query taking huge time

2015-08-20 Thread Xuefu Zhang
Please check out HIVE-11502. For your PoC, you can simply get around it by
using other data types instead of double.
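For instance, a minimal sketch (it assumes the join key can be represented in a
fixed-point type; the decimal(10,2) below is only illustrative):

-- copy the table once with the join column converted away from double
CREATE TABLE huge_numeric_table_dec STORED AS ORC AS
SELECT CAST(col1 AS decimal(10,2)) AS col1_dec, col6, col11, col16
FROM huge_numeric_table_orc2;

-- the self-join then compares decimal values instead of doubles
SELECT (avg(t1.col1_dec) * avg(t1.col6)) / (avg(t1.col11) * avg(t1.col16)) AS avg5
FROM huge_numeric_table_dec t1
LEFT OUTER JOIN huge_numeric_table_dec t2 ON t1.col1_dec = t2.col1_dec;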

On Thu, Aug 20, 2015 at 2:08 AM, Nishant Aggarwal nishant@gmail.com
wrote:

 Thanks for the reply, Noam. I have already tried the latter point of dividing
 the query. But the challenge comes during the joining of the table.


 Thanks and Regards
 Nishant Aggarwal, PMP
 Cell No:- +91 99588 94305


 On Thu, Aug 20, 2015 at 2:19 PM, Noam Hasson noam.has...@kenshoo.com
 wrote:

 Hi,

 Have you looked at the counters on the Hadoop side? It's possible you are
 dealing with a bad join which causes a multiplication of items; if you see a
 huge number of input/output records in the map/reduce phases and it keeps
 increasing, that's probably the case.

 Another thing I would try is to divide the job into several smaller queries,
 for example start with the filter only, after that the join, and so on.

 Noam.

 On Thu, Aug 20, 2015 at 10:55 AM, Nishant Aggarwal nishant@gmail.com
  wrote:

 Dear Hive Users,

 I am in the process of running a PoC for one of my customers, demonstrating
 the huge performance benefits of Hadoop/Big Data using Hive.

 Following is the problem statement I am stuck with.

 I have generated a large table with 28 columns (all doubles). The table size
 on disk is 70GB (I ultimately created a compressed table using the ORC format
 to save disk space, bringing the table size down to 1GB) with more than
 450 million records.

 In order to demonstrate a complex use case, I joined this table with itself.
 Following are the queries I used to create the table and the join query I am
 using.

 *Create Table and Loading Data, Hive parameter settings:*
 set hive.vectorized.execution.enabled = true;
 set hive.vectorized.execution.reduce.enabled = true;
 set mapred.max.split.size=1;
 set mapred.min.split.size=100;
 set hive.auto.convert.join=false;
 set hive.enforce.sorting=true;
 set hive.enforce.bucketing=true;
 set hive.exec.dynamic.partition=true;
 set hive.exec.dynamic.partition.mode=nonstrict;
 set mapreduce.reduce.input.limit=-1;
 set hive.exec.parallel = true;

 CREATE TABLE huge_numeric_table_orc2(col1 double,col2 double,col3
 double,col4 double,col5 double,col6 double,col7 double,col8 double,col9
 double,col10 double,col11 double,col12 double,col13 double,col14
 double,col15 double,col16 double,col17 double,col18 double,col19
 double,col20 double,col21 double,col22 double,col23 double,col24
 double,col25 double,col26 double,col27 double,col28 double)
 clustered by (col1) sorted by (col1) into 240 buckets
 STORED AS ORC tblproperties ("orc.compress"="SNAPPY");

 from huge_numeric_table insert overwrite table huge_numeric_table_orc2
 select * sort by col1;


 *JOIN QUERY:*

 select (avg(t1.col1)*avg(t1.col6))/(avg(t1.col11)*avg(t1.col16)) as AVG5
 from huge_numeric_table_orc2 t1 left outer join huge_numeric_table_orc2 t2
 on t1.col1=t2.col1 where (t1.col1)  34.11 and (t2.col1) 10.12


 *The problem is that this query gets stuck at the reducers at 80-85%, goes
 into a loop, and never finishes.*

 Version of Hive is 1.2.

 Please help.


 Thanks and Regards
 Nishant Aggarwal, PMP
 Cell No:- +91 99588 94305


