Hmm i think the execution Engine TEZ has (currently) the most optimizations on Hive. What about your hardware - is it the same? Do you have also compression on Sybase? Alternatively you need to wait for Hive for interactive analytics (tez 0.8 + llap).
> On 30 Dec 2015, at 13:47, Mich Talebzadeh <m...@peridale.co.uk> wrote: > > Hi Jorn, > > Thanks for your reply. My Hive version is 1.2.1 on Spark 1.3.1. I have not > tried it on TEZ. I tried the query on MR engine and it did nor fair better. I > also ran it without SDDDEV function and found out that the function did not > slow it down. > > I tried a simple query as follows builr in sales FACT table 1e9 rows and > dimension table times (1826 rows) > > -- > -- Get the total amount sold for each calendar month > -- > SELECT t.calendar_month_desc, SUM(s.amount_sold) > FROM sales s, times t WHERE s.time_id = t.time_id > GROUP BY t.calendar_month_desc; > > Now Sybase IQ comes back in around 30 seconds. > > Started query at Dec 30 2015 08:14:33:399AM > (48 rows affected) > Finished query at Dec 30 2015 08:15:04:640AM > > Whereas Hive with the following setting and running the same query > > set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat; > set hive.optimize.bucketmapjoin=true; > set hive.optimize.bucketmapjoin.sortedmerge=true; > > Comes back in > > 48 rows selected (1514.687 seconds) > > I don’t know what else can be done. Obviously this is all schema on read so I > am not sure I can change bucketing on FACT table based on one query alone! > > > > +--------------------------------------------------------------------+--+ > | createtab_stmt | > +--------------------------------------------------------------------+--+ > | CREATE TABLE `times`( | > | `time_id` timestamp, | > | `day_name` varchar(9), | > | `day_number_in_week` int, | > | `day_number_in_month` int, | > | `calendar_week_number` int, | > | `fiscal_week_number` int, | > | `week_ending_day` timestamp, | > | `week_ending_day_id` bigint, | > | `calendar_month_number` int, | > | `fiscal_month_number` int, | > | `calendar_month_desc` varchar(8), | > ---------- > | `days_in_fis_year` bigint, | > | `end_of_cal_year` timestamp, | > | `end_of_fis_year` timestamp) | > | CLUSTERED BY ( | > | time_id) | > | INTO 256 BUCKETS | > | ROW FORMAT SERDE | > | 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' | > | STORED AS INPUTFORMAT | > | 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' | > | OUTPUTFORMAT | > | 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' | > | LOCATION | > | 'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/times' | > | TBLPROPERTIES ( | > | 'COLUMN_STATS_ACCURATE'='true', | > | 'numFiles'='1', | > | 'numRows'='1826', | > | 'orc.bloom.filter.columns'='TIME_ID', | > | 'orc.bloom.filter.fpp'='0.05', | > | 'orc.compress'='SNAPPY', | > | 'orc.create.index'='true', | > | 'orc.row.index.stride'='10000', | > | 'orc.stripe.size'='268435456', | > | 'rawDataSize'='0', | > | 'totalSize'='11155', | > | 'transient_lastDdlTime'='1451429900') | > > ; > > > http://talebzadehmich.wordpress.com > > NOTE: The information in this email is proprietary and confidential. This > message is for the designated recipient only, if you are not the intended > recipient, you should destroy it immediately. Any information in this message > shall not be understood as given or endorsed by Peridale Technology Ltd, its > subsidiaries or their employees, unless expressly so stated. It is the > responsibility of the recipient to ensure that this email is virus free, > therefore neither Peridale Ltd, its subsidiaries nor their employees accept > any responsibility. > > From: Jörn Franke [mailto:jornfra...@gmail.com] > Sent: 30 December 2015 08:28 > To: user@hive.apache.org > Subject: Re: Running the same query on 1 billion rows fact table in Hive on > Spark compared to Sybase IQ columnar database > > Have you tried it with Hive ob TEZ? It contains (currently) more > optimizations than Hive on Spark. > I assume you use the latest Hive version. > Additionally you may want to think about calculating statistics (depending on > your configuration you need to trigger it) - I am not sure if Spark can use > them. > I am not sure if bloom filters on the columns you mention make sense. You may > also want to increase stride size (depending on your data). > Currently you bucket by a lot of fields, which may not make sense. You also > may want to sort the data by customer Id in the table. > You also seem to have a lot of reducers, which you may want to decrease. > > Have you tried without "having stddev_samp" ? Is the query exactly the same > as in Sybase? > > On 29 Dec 2015, at 11:53, Mich Talebzadeh <m...@peridale.co.uk> wrote: > > Hi, > > I have a fact table in Hive imported from Sybase IQ via SQOOP with 1 billion > rows as follows: > > show create table sales; > +-------------------------------------------------------------------------------+--+ > | createtab_stmt > | > +-------------------------------------------------------------------------------+--+ > | CREATE TABLE `sales`( > | > | `prod_id` bigint, > | > | `cust_id` bigint, > | > | `time_id` timestamp, > | > | `channel_id` bigint, > | > | `promo_id` bigint, > | > | `quantity_sold` decimal(10,0), > | > | `amount_sold` decimal(10,0)) > | > | CLUSTERED BY ( > | > | prod_id, > | > | cust_id, > | > | time_id, > | > | channel_id, > | > | promo_id) > | > | INTO 256 BUCKETS > | > | ROW FORMAT SERDE > | > | 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' > | > | STORED AS INPUTFORMAT > | > | 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' > | > | OUTPUTFORMAT > | > | 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' > | > | LOCATION > | > | 'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/sales' > | > | TBLPROPERTIES ( > | > | 'COLUMN_STATS_ACCURATE'='true', > | > | 'last_modified_by'='hduser', > | > | 'last_modified_time'='1451305626', > | > | 'numFiles'='11', > | > | 'numRows'='1000000000', > | > | 'orc.bloom.filter.columns'='PROD_ID,CUST_ID,TIME_ID,CHANNEL_ID,PROMO_ID', > | > | 'orc.bloom.filter.fpp'='0.05', > | > | 'orc.compress'='SNAPPY', > | > | 'orc.create.index'='true', > | > | 'orc.row.index.stride'='10000', > | > | 'orc.stripe.size'='268435456', > | > | 'rawDataSize'='296000000000', > | > | 'totalSize'='2678882153', > | > | 'transient_lastDdlTime'='1451305626') > | > +-------------------------------------------------------------------------------+--+ > > I use the following query to run against sales table only against Hive > > SELECT > rs.Customer_ID > , rs.Number_of_orders > , rs.Total_customer_amount > , rs.Average_order > , rs.Standard_deviation > FROM > ( > SELECT cust_id AS Customer_ID, > COUNT(amount_sold) AS Number_of_orders, > SUM(amount_sold) AS Total_customer_amount, > AVG(amount_sold) AS Average_order, > stddev_samp(amount_sold) AS Standard_deviation > FROM sales > GROUP BY cust_id > HAVING SUM(amount_sold) > 94000 > AND AVG(amount_sold) < stddev_samp(amount_sold) > ) rs > ORDER BY > -- Total_customer_amount DESC > 3 DESC > > Hive comes back in 17 minutes with 5,948 rows > > bl -f sales.hql > sales.log > Connecting to jdbc:hive2://rhes564:10010/default > Connected to: Apache Hive (version 1.2.1) > Driver: Hive JDBC (version 1.2.1) > Transaction isolation: TRANSACTION_REPEATABLE_READ > Running init script /home/hduser/dba/bin/hive_on_spark_init.hql > No rows affected (0.097 seconds) > No rows affected (0.001 seconds) > No rows affected (0.001 seconds) > No rows affected (0.038 seconds) > INFO : Warning: Using constant number 3 in order by. If you try to use > position alias when hive.groupby.orderby.position.alias is false, the > position alias will be ignored. > INFO : > Query Hive on Spark job[0] stages: > INFO : 0 > INFO : 1 > INFO : 2 > INFO : > Status: Running (Hive on Spark job[0]) > INFO : Job Progress Format > CurrentTime StageId_StageAttemptId: > SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount > [StageCost] > INFO : 2015-12-29 09:33:25,815 Stage-0_0: 0/11 Stage-1_0: 0/1009 > Stage-2_0: 0/1 > INFO : 2015-12-29 09:33:28,829 Stage-0_0: 0/11 Stage-1_0: 0/1009 > Stage-2_0: 0/1 > INFO : 2015-12-29 09:33:31,857 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 > Stage-2_0: 0/1 > INFO : 2015-12-29 09:33:34,875 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 > Stage-2_0: 0/1 > INFO : 2015-12-29 09:33:37,903 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 > Stage-2_0: 0/1 > INFO : 2015-12-29 09:33:40,918 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 > Stage-2_0: 0/1 > INFO : 2015-12-29 09:33:43,939 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 > Stage-2_0: 0/1 > INFO : 2015-12-29 09:33:46,958 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 > Stage-2_0: 0/1 > INFO : 2015-12-29 09:33:49,971 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 > Stage-2_0: 0/1 > INFO : 2015-12-29 09:33:52,991 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 > Stage-2_0: 0/1 > INFO : 2015-12-29 09:33:56,007 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 > Stage-2_0: 0/1 > > INFO : 2015-12-29 09:50:03,578 Stage-0_0: 10(+1)/11 Stage-1_0: 0/1009 > Stage-2_0: 0/1 > INFO : 2015-12-29 09:50:06,590 Stage-0_0: 10(+1)/11 Stage-1_0: 0/1009 > Stage-2_0: 0/1 > INFO : 2015-12-29 09:50:09,602 Stage-0_0: 10(+1)/11 Stage-1_0: 0/1009 > Stage-2_0: 0/1 > INFO : 2015-12-29 09:50:10,606 Stage-0_0: 11/11 Finished Stage-1_0: > 0(+2)/1009 Stage-2_0: 0/1 > INFO : 2015-12-29 09:50:11,610 Stage-0_0: 11/11 Finished Stage-1_0: > 6(+2)/1009 Stage-2_0: 0/1 > INFO : 2015-12-29 09:50:12,618 Stage-0_0: 11/11 Finished Stage-1_0: > 30(+2)/1009 Stage-2_0: 0/1 > INFO : 2015-12-29 09:50:13,622 Stage-0_0: 11/11 Finished Stage-1_0: > 59(+2)/1009 Stage-2_0: 0/1 > INFO : 2015-12-29 09:50:14,626 Stage-0_0: 11/11 Finished Stage-1_0: > 90(+2)/1009 Stage-2_0: 0/1 > INFO : 2015-12-29 09:50:15,631 Stage-0_0: 11/11 Finished Stage-1_0: > 124(+2)/1009 Stage-2_0: 0/1 > INFO : 2015-12-29 09:50:16,654 Stage-0_0: 11/11 Finished Stage-1_0: > 160(+2)/1009 Stage-2_0: 0/1 > INFO : 2015-12-29 09:50:17,659 Stage-0_0: 11/11 Finished Stage-1_0: > 193(+2)/1009 Stage-2_0: 0/1 > INFO : 2015-12-29 09:50:18,663 Stage-0_0: 11/11 Finished Stage-1_0: > 228(+2)/1009 Stage-2_0: 0/1 > INFO : 2015-12-29 09:50:19,667 Stage-0_0: 11/11 Finished Stage-1_0: > 262(+2)/1009 Stage-2_0: 0/1 > INFO : 2015-12-29 09:50:20,672 Stage-0_0: 11/11 Finished Stage-1_0: > 298(+2)/1009 Stage-2_0: 0/1 > INFO : 2015-12-29 09:50:21,679 Stage-0_0: 11/11 Finished Stage-1_0: > 338(+2)/1009 Stage-2_0: 0/1 > INFO : 2015-12-29 09:50:22,687 Stage-0_0: 11/11 Finished Stage-1_0: > 376(+2)/1009 Stage-2_0: 0/1 > INFO : 2015-12-29 09:50:23,691 Stage-0_0: 11/11 Finished Stage-1_0: > 417(+3)/1009 Stage-2_0: 0/1 > INFO : 2015-12-29 09:50:24,696 Stage-0_0: 11/11 Finished Stage-1_0: > 460(+2)/1009 Stage-2_0: 0/1 > INFO : 2015-12-29 09:50:25,699 Stage-0_0: 11/11 Finished Stage-1_0: > 502(+2)/1009 Stage-2_0: 0/1 > INFO : 2015-12-29 09:50:26,707 Stage-0_0: 11/11 Finished Stage-1_0: > 542(+2)/1009 Stage-2_0: 0/1 > INFO : 2015-12-29 09:50:27,712 Stage-0_0: 11/11 Finished Stage-1_0: > 584(+2)/1009 Stage-2_0: 0/1 > INFO : 2015-12-29 09:50:28,719 Stage-0_0: 11/11 Finished Stage-1_0: > 624(+2)/1009 Stage-2_0: 0/1 > INFO : 2015-12-29 09:50:29,730 Stage-0_0: 11/11 Finished Stage-1_0: > 667(+2)/1009 Stage-2_0: 0/1 > INFO : 2015-12-29 09:50:30,736 Stage-0_0: 11/11 Finished Stage-1_0: > 709(+3)/1009 Stage-2_0: 0/1 > INFO : 2015-12-29 09:50:31,740 Stage-0_0: 11/11 Finished Stage-1_0: > 754(+3)/1009 Stage-2_0: 0/1 > INFO : 2015-12-29 09:50:32,743 Stage-0_0: 11/11 Finished Stage-1_0: > 797(+2)/1009 Stage-2_0: 0/1 > INFO : 2015-12-29 09:50:33,747 Stage-0_0: 11/11 Finished Stage-1_0: > 844(+2)/1009 Stage-2_0: 0/1 > INFO : 2015-12-29 09:50:34,754 Stage-0_0: 11/11 Finished Stage-1_0: > 888(+2)/1009 Stage-2_0: 0/1 > INFO : 2015-12-29 09:50:35,759 Stage-0_0: 11/11 Finished Stage-1_0: > 934(+2)/1009 Stage-2_0: 0/1 > INFO : 2015-12-29 09:50:36,764 Stage-0_0: 11/11 Finished Stage-1_0: > 981(+2)/1009 Stage-2_0: 0/1 > INFO : 2015-12-29 09:50:37,768 Stage-0_0: 11/11 Finished Stage-1_0: > 1009/1009 Finished Stage-2_0: 0(+1)/1 > INFO : 2015-12-29 09:50:38,771 Stage-0_0: 11/11 Finished Stage-1_0: > 1009/1009 Finished Stage-2_0: 1/1 Finished > INFO : Status: Finished successfully in 1036.00 seconds > 5,948 rows selected (1074.817 seconds) > > So it returns 5948 rows in 17 minutes. In contrast IQ returns 5947 rows in 23 > seconds > > Sybase IQ is a columnar database so each column is created as a fast > projection index by default. In addition I have created LF (bitmap) indexes > on dimension columns (PROD_ID, CUST_ID, TIME_ID, CHANNEL_ID, PROMO_ID). Now > the query only touches CUST_ID. > > My suspicion is that it is the Standard Deviation function stddev_samp() that > could be the bottleneck? > > Thanks > > Mich Talebzadeh > > Sybase ASE 15 Gold Medal Award 2008 > A Winning Strategy: Running the most Critical Financial Data on ASE 15 > http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf > Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", > ISBN 978-0-9563693-0-7. > co-author "Sybase Transact SQL Guidelines Best Practices", ISBN > 978-0-9759693-0-4 > Publications due shortly: > Complex Event Processing in Heterogeneous Environments, ISBN: > 978-0-9563693-3-8 > Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume > one out shortly > > http://talebzadehmich.wordpress.com > > NOTE: The information in this email is proprietary and confidential. This > message is for the designated recipient only, if you are not the intended > recipient, you should destroy it immediately. Any information in this message > shall not be understood as given or endorsed by Peridale Technology Ltd, its > subsidiaries or their employees, unless expressly so stated. It is the > responsibility of the recipient to ensure that this email is virus free, > therefore neither Peridale Ltd, its subsidiaries nor their employees accept > any responsibility. >