Thanks All. I will implement the suggested points and share the output. Thanks again for all the help.
Thanks and Regards
Nishant Aggarwal, PMP
Cell No:- +91 99588 94305

On Fri, Aug 21, 2015 at 10:33 AM, Jörn Franke <jornfra...@gmail.com> wrote:

> Additionally, although it is a PoC, you should have a realistic data
> model, and good data-modeling practices should be followed. Joining on a
> double is not one of them; the join key should be an int. Furthermore,
> double is a type that is rarely used in most scenarios. In the business
> world you usually have something like decimal(precision,scale), where you
> define precision and scale exactly (usually x,2 for anything related to
> money). The reason is that rounding needs to be consistent across all
> analytics applications.
> You may also, of course, partition the table.
>
> On Thu, Aug 20, 2015 at 3:46 PM, Xuefu Zhang <xzh...@cloudera.com> wrote:
>
>> Please check out HIVE-11502. For your PoC, you can simply get around it
>> by using other data types instead of double.
>>
>> On Thu, Aug 20, 2015 at 2:08 AM, Nishant Aggarwal <nishant....@gmail.com> wrote:
>>
>>> Thanks for the reply, Noam. I have already tried the latter point of
>>> dividing the query, but the challenge comes during the join of the
>>> tables.
>>>
>>> Thanks and Regards
>>> Nishant Aggarwal, PMP
>>> Cell No:- +91 99588 94305
>>>
>>> On Thu, Aug 20, 2015 at 2:19 PM, Noam Hasson <noam.has...@kenshoo.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Have you looked at the counters on the Hadoop side? It's possible you
>>>> are dealing with a bad join that causes multiplication of records; if
>>>> you see a huge number of input/output records in the map/reduce phases
>>>> that keeps increasing, that's probably the case.
>>>>
>>>> Another thing I would try is to divide the job into several smaller
>>>> queries: for example, start with the filter only, then the join, and
>>>> so on.
>>>>
>>>> Noam.
>>>>
>>>> On Thu, Aug 20, 2015 at 10:55 AM, Nishant Aggarwal <nishant....@gmail.com> wrote:
>>>>
>>>>> Dear Hive Users,
>>>>>
>>>>> I am in the process of running a PoC for one of my customers,
>>>>> demonstrating the huge performance benefits of Hadoop BigData using
>>>>> Hive.
>>>>>
>>>>> The following is the problem statement I am stuck with.
>>>>>
>>>>> I have generated a large table with 28 columns (all doubles). The
>>>>> table size on disk is 70GB, with more than 450 million records (I
>>>>> ultimately created a compressed table using the ORC format to save
>>>>> disk space, bringing the table size down to < 1GB).
>>>>>
>>>>> In order to demonstrate a complex use case, I joined this table with
>>>>> itself. The following are the queries I used to create the table and
>>>>> the join query I am running.
>>>>>
>>>>> *Create Table and Loading Data, Hive parameter settings:*
>>>>> set hive.vectorized.execution.enabled = true;
>>>>> set hive.vectorized.execution.reduce.enabled = true;
>>>>> set mapred.max.split.size=100000000;
>>>>> set mapred.min.split.size=1000000;
>>>>> set hive.auto.convert.join=false;
>>>>> set hive.enforce.sorting=true;
>>>>> set hive.enforce.bucketing=true;
>>>>> set hive.exec.dynamic.partition=true;
>>>>> set hive.exec.dynamic.partition.mode=nonstrict;
>>>>> set mapreduce.reduce.input.limit=-1;
>>>>> set hive.exec.parallel = true;
>>>>>
>>>>> CREATE TABLE huge_numeric_table_orc2(col1 double,col2 double,col3
>>>>> double,col4 double,col5 double,col6 double,col7 double,col8 double,
>>>>> col9 double,col10 double,col11 double,col12 double,col13 double,
>>>>> col14 double,col15 double,col16 double,col17 double,col18 double,
>>>>> col19 double,col20 double,col21 double,col22 double,col23 double,
>>>>> col24 double,col25 double,col26 double,col27 double,col28 double)
>>>>> clustered by (col1) sorted by (col1) into 240 buckets
>>>>> STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
>>>>>
>>>>> from huge_numeric_table insert overwrite table huge_numeric_table_orc2
>>>>> select * sort by col1;
>>>>>
>>>>> *JOIN QUERY:*
>>>>>
>>>>> select (avg(t1.col1)*avg(t1.col6))/(avg(t1.col11)*avg(t1.col16)) as AVG5
>>>>> from huge_numeric_table_orc2 t1 left outer join huge_numeric_table_orc2 t2
>>>>> on t1.col1=t2.col1 where (t1.col1) > 34.11 and (t2.col1) > 10.12
>>>>>
>>>>> *The problem is that this query gets stuck at 80-85% in the reducers
>>>>> and loops forever, never finishing.*
>>>>>
>>>>> The Hive version is 1.2.
>>>>>
>>>>> Please help.
>>>>>
>>>>> Thanks and Regards
>>>>> Nishant Aggarwal, PMP
>>>>> Cell No:- +91 99588 94305
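
For reference, a minimal sketch of the data-model change Jörn suggests, assuming a hypothetical integer surrogate key (id) and illustrative decimal precisions; the column names and scales below are assumptions, not part of the original schema:

-- Hypothetical reworked schema: join on an int key and store money-like
-- values as decimal(precision,scale) instead of double.
CREATE TABLE huge_numeric_table_dec (
  id     INT,            -- surrogate join key: int, not double
  amount DECIMAL(18,2),  -- exact scale, e.g. x,2 for money
  ratio  DECIMAL(10,4)   -- precision/scale chosen per column
)
STORED AS ORC tblproperties ("orc.compress"="SNAPPY");

-- Joining on the int key avoids floating-point equality comparisons:
SELECT avg(t1.amount)
FROM huge_numeric_table_dec t1
JOIN huge_numeric_table_dec t2 ON t1.id = t2.id;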
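
Likewise, a sketch of Noam's divide-the-job approach against the tables above, materializing the filter step so record counts can be checked before the join runs (the *_filtered table names are hypothetical):

-- Step 1: filters only; the row counts of these tables can be compared
-- with the map/reduce counters to spot unexpected blow-ups early.
CREATE TABLE t1_filtered STORED AS ORC AS
SELECT col1, col6, col11, col16
FROM huge_numeric_table_orc2
WHERE col1 > 34.11;

CREATE TABLE t2_filtered STORED AS ORC AS
SELECT col1
FROM huge_numeric_table_orc2
WHERE col1 > 10.12;

-- Step 2: the join on the reduced inputs. The original WHERE on t2.col1
-- already turns the left outer join into an inner join, so a plain JOIN
-- is used here. If the reducers still stall at this step, duplicate col1
-- values multiplying rows in the self-join are the likely cause.
SELECT (avg(t1.col1) * avg(t1.col6)) / (avg(t1.col11) * avg(t1.col16)) AS avg5
FROM t1_filtered t1
JOIN t2_filtered t2 ON t1.col1 = t2.col1;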