Hi Mich,

I don't think it is related to Hive or parquet partitioning. The same issue happens when working with a non-partitioned parquet file using the Python DataFrame API. Please take a look at the following example (I had copied partition dt=2016-07-28 to a separate standalone path):

$ hdfs dfs -ls /user/test
Found 1 items
-rw-r--r--   2 hdfs supergroup   33568823 2016-09-03 11:11 /user/test/part-00000
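The "response time over 3 runs skipping the first run" figures in the session that follows can be collected with a helper along these lines (a sketch of an assumed harness, not code from the thread; the first run is discarded to absorb JVM/JIT warm-up and cache effects):

```python
import time

def bench(fn, runs=3):
    """Time fn() `runs` times, after one discarded warm-up call."""
    fn()  # warm-up run, skipped in the reported numbers
    timings = []
    for _ in range(runs):
        start = time.time()
        fn()
        timings.append(time.time() - start)
    return timings

# With Spark this would wrap the .collect() call, e.g.:
#   bench(lambda: df.groupBy().sum(*(['dd_convs'] * 57)).collect())
print(bench(lambda: sum(range(100000))))
```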
$ ./pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
      /_/

Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
SparkSession available as 'spark'.

>>> spark.read.parquet('hdfs://spark-master1.uslicer.net:8020/user/test').groupBy().sum(*(['dd_convs'] * 57)).collect()

*response time over 3 runs, skipping the first run*
0.8370630741119385
0.22276782989501953
0.7722570896148682

>>> spark.read.parquet('hdfs://spark-master1.uslicer.net:8020/user/test').groupBy().sum(*(['dd_convs'] * 58)).collect()

*response time over 3 runs, skipping the first run*
4.40575098991394
4.320873022079468
3.2484099864959717

$ hdfs dfs -get /user/test/part-00000 .
$ parquet-meta part-00000
file:        file:/data/dump/part-00000
creator:     parquet-mr

file schema: hive_schema
--------------------------------------------------------------------------------
actual_dsp_fee:                 OPTIONAL FLOAT R:0 D:1
actual_pgm_fee:                 OPTIONAL FLOAT R:0 D:1
actual_ssp_fee:                 OPTIONAL FLOAT R:0 D:1
advertiser_id:                  OPTIONAL INT32 R:0 D:1
advertiser_spent:               OPTIONAL DOUBLE R:0 D:1
anomaly_clicks:                 OPTIONAL INT64 R:0 D:1
anomaly_conversions_filtered:   OPTIONAL INT64 R:0 D:1
anomaly_conversions_unfiltered: OPTIONAL INT64 R:0 D:1
anomaly_decisions:              OPTIONAL FLOAT R:0 D:1
bid_price:                      OPTIONAL FLOAT R:0 D:1
campaign_id:                    OPTIONAL INT32 R:0 D:1
click_prob:                     OPTIONAL FLOAT R:0 D:1
clicks:                         OPTIONAL INT64 R:0 D:1
clicks_static:                  OPTIONAL INT64 R:0 D:1
conv_prob:                      OPTIONAL FLOAT R:0 D:1
conversion_id:                  OPTIONAL INT64 R:0 D:1
conversions:                    OPTIONAL INT64 R:0 D:1
creative_id:                    OPTIONAL INT32 R:0 D:1
dd_convs:                       OPTIONAL INT64 R:0 D:1
decisions:                      OPTIONAL FLOAT R:0 D:1
dmp_liveramp_margin:            OPTIONAL FLOAT R:0 D:1
dmp_liveramp_payout:            OPTIONAL FLOAT R:0 D:1
dmp_nielsen_margin:             OPTIONAL FLOAT R:0 D:1
dmp_nielsen_payout:             OPTIONAL FLOAT R:0 D:1
dmp_rapleaf_margin:             OPTIONAL FLOAT R:0 D:1
dmp_rapleaf_payout:             OPTIONAL FLOAT R:0 D:1
e:                              OPTIONAL FLOAT R:0 D:1
expected_cpa:                   OPTIONAL FLOAT R:0 D:1
expected_cpc:                   OPTIONAL FLOAT R:0 D:1
expected_payout:                OPTIONAL FLOAT R:0 D:1
first_impressions:              OPTIONAL INT64 R:0 D:1
fraud_clicks:                   OPTIONAL INT64 R:0 D:1
fraud_impressions:              OPTIONAL INT64 R:0 D:1
g:                              OPTIONAL FLOAT R:0 D:1
impressions:                    OPTIONAL FLOAT R:0 D:1
line_item_id:                   OPTIONAL INT32 R:0 D:1
mail_type:                      OPTIONAL BINARY O:UTF8 R:0 D:1
noads:                          OPTIONAL FLOAT R:0 D:1
predict_version:                OPTIONAL INT64 R:0 D:1
publisher_id:                   OPTIONAL INT32 R:0 D:1
publisher_revenue:              OPTIONAL DOUBLE R:0 D:1
pvc:                            OPTIONAL INT64 R:0 D:1
second_price:                   OPTIONAL FLOAT R:0 D:1
thirdparty_margin:              OPTIONAL FLOAT R:0 D:1
thirdparty_payout:              OPTIONAL FLOAT R:0 D:1

row group 1: RC:769163 TS:40249546 OFFSET:4
--------------------------------------------------------------------------------
actual_dsp_fee:                 FLOAT SNAPPY DO:0 FPO:4 SZ:1378278/1501551/1.09 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
actual_pgm_fee:                 FLOAT SNAPPY DO:0 FPO:1378282 SZ:39085/42374/1.08 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
actual_ssp_fee:                 FLOAT SNAPPY DO:0 FPO:1417367 SZ:1426888/1553337/1.09 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
advertiser_id:                  INT32 SNAPPY DO:0 FPO:2844255 SZ:339061/572962/1.69 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
advertiser_spent:               DOUBLE SNAPPY DO:0 FPO:3183316 SZ:2731185/3788429/1.39 VC:769163 ENC:PLAIN,RLE,PLAIN_DICTIONARY,BIT_PACKED
anomaly_clicks:                 INT64 SNAPPY DO:0 FPO:5914501 SZ:35/33/0.94 VC:769163 ENC:PLAIN,RLE,BIT_PACKED
anomaly_conversions_filtered:   INT64 SNAPPY DO:0 FPO:5914536 SZ:365/351/0.96 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
anomaly_conversions_unfiltered: INT64 SNAPPY DO:0 FPO:5914901 SZ:2734/3766/1.38 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
anomaly_decisions:              FLOAT SNAPPY DO:0 FPO:5917635 SZ:35/33/0.94 VC:769163 ENC:PLAIN,RLE,BIT_PACKED
bid_price:                      FLOAT SNAPPY DO:0 FPO:5917670 SZ:2783971/3076795/1.11 VC:769163 ENC:PLAIN,RLE,BIT_PACKED
campaign_id:                    INT32 SNAPPY DO:0 FPO:8701641 SZ:562105/881397/1.57 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
click_prob:                     FLOAT SNAPPY DO:0 FPO:9263746 SZ:2968859/3076795/1.04 VC:769163 ENC:PLAIN,RLE,BIT_PACKED
clicks:                         INT64 SNAPPY DO:0 FPO:12232605 SZ:126877/281792/2.22 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
clicks_static:                  INT64 SNAPPY DO:0 FPO:12359482 SZ:5775/8704/1.51 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
conv_prob:                      FLOAT SNAPPY DO:0 FPO:12365257 SZ:2463487/3076795/1.25 VC:769163 ENC:PLAIN,RLE,BIT_PACKED
conversion_id:                  INT64 SNAPPY DO:0 FPO:14828744 SZ:261396/438714/1.68 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
conversions:                    INT64 SNAPPY DO:0 FPO:15090140 SZ:27433/55397/2.02 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
creative_id:                    INT32 SNAPPY DO:0 FPO:15117573 SZ:1252055/1404477/1.12 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
dd_convs:                       INT64 SNAPPY DO:0 FPO:16369628 SZ:27225/55243/2.03 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
decisions:                      FLOAT SNAPPY DO:0 FPO:16396853 SZ:1016802/1271368/1.25 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
dmp_liveramp_margin:            FLOAT SNAPPY DO:0 FPO:17413655 SZ:117830/176685/1.50 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
dmp_liveramp_payout:            FLOAT SNAPPY DO:0 FPO:17531485 SZ:117830/176685/1.50 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
dmp_nielsen_margin:             FLOAT SNAPPY DO:0 FPO:17649315 SZ:53361/92349/1.73 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
dmp_nielsen_payout:             FLOAT SNAPPY DO:0 FPO:17702676 SZ:53361/92349/1.73 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
dmp_rapleaf_margin:             FLOAT SNAPPY DO:0 FPO:17756037 SZ:41845/68403/1.63 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
dmp_rapleaf_payout:             FLOAT SNAPPY DO:0 FPO:17797882 SZ:41845/68403/1.63 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
e:                              FLOAT SNAPPY DO:0 FPO:17839727 SZ:107662/204625/1.90 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
expected_cpa:                   FLOAT SNAPPY DO:0 FPO:17947389 SZ:1576432/1679656/1.07 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
expected_cpc:                   FLOAT SNAPPY DO:0 FPO:19523821 SZ:85903/100362/1.17 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
expected_payout:                FLOAT SNAPPY DO:0 FPO:19609724 SZ:2772807/3076795/1.11 VC:769163 ENC:PLAIN,RLE,BIT_PACKED
first_impressions:              INT64 SNAPPY DO:0 FPO:22382531 SZ:1000773/1284812/1.28 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
fraud_clicks:                   INT64 SNAPPY DO:0 FPO:23383304 SZ:1906/2635/1.38 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
fraud_impressions:              INT64 SNAPPY DO:0 FPO:23385210 SZ:9825/17177/1.75 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
g:                              FLOAT SNAPPY DO:0 FPO:23395035 SZ:2749688/2884522/1.05 VC:769163 ENC:PLAIN,RLE,PLAIN_DICTIONARY,BIT_PACKED
impressions:                    FLOAT SNAPPY DO:0 FPO:26144723 SZ:1091228/1315488/1.21 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
line_item_id:                   INT32 SNAPPY DO:0 FPO:27235951 SZ:878489/1182066/1.35 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
mail_type:                      BINARY SNAPPY DO:0 FPO:28114440 SZ:109886/139143/1.27 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
noads:                          FLOAT SNAPPY DO:0 FPO:28224326 SZ:9588/13934/1.45 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
predict_version:                INT64 SNAPPY DO:0 FPO:28233914 SZ:6222/7272/1.17 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
publisher_id:                   INT32 SNAPPY DO:0 FPO:28240136 SZ:12955/12932/1.00 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
publisher_revenue:              DOUBLE SNAPPY DO:0 FPO:28253091 SZ:2657357/3722991/1.40 VC:769163 ENC:PLAIN,RLE,PLAIN_DICTIONARY,BIT_PACKED
pvc:                            INT64 SNAPPY DO:0 FPO:30910448 SZ:5685/9008/1.58 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
second_price:                   FLOAT SNAPPY DO:0 FPO:30916133 SZ:2263764/2274841/1.00 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
thirdparty_margin:              FLOAT SNAPPY DO:0 FPO:33179897 SZ:192433/278050/1.44 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
thirdparty_payout:              FLOAT SNAPPY DO:0 FPO:33372330 SZ:192432/278050/1.44 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED

>>> spark.read.parquet('hdfs://spark-master1.uslicer.net:8020/user/test').groupBy().sum(*(['dd_convs'] * 57)).explain()
== Physical Plan ==
*HashAggregate(keys=[], functions=[sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), sum(dd_convs#3845L), ... 33 more fields])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), partial_sum(dd_convs#3845L), ...
33 more fields])
      +- *BatchedScan parquet [dd_convs#3845L] Format: ParquetFormat, InputPaths: hdfs://spark-master1.uslicer.net:8020/user/test, PushedFilters: [], ReadSchema: struct<dd_convs:bigint>

>>> spark.read.parquet('hdfs://spark-master1.uslicer.net:8020/user/test').groupBy().sum(*(['dd_convs'] * 58)).explain()
== Physical Plan ==
*HashAggregate(keys=[], functions=[sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), sum(dd_convs#4268L), ... 34 more fields])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), partial_sum(dd_convs#4268L), ...
34 more fields])
      +- *BatchedScan parquet [dd_convs#4268L] Format: ParquetFormat, InputPaths: hdfs://spark-master1.uslicer.net:8020/user/test, PushedFilters: [], ReadSchema: struct<dd_convs:bigint>

>Saturday, 3 September 2016, 0:22 +03:00 from Mich Talebzadeh <mich.talebza...@gmail.com>:
>
>Since you are using Spark Thrift Server (which in turn uses Hive Thrift Server) I have a suspicion that it uses the Hive optimiser, which would mean that stats do matter. However, that may be just an assumption.
>
>Have you partitioned these parquet tables?
>
>Is it worth logging into Hive and running the same queries in Hive with EXPLAIN EXTENDED SELECT ...? Can you see whether the relevant partition is picked up?
>
>HTH
>
>Dr Mich Talebzadeh
>
>LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>http://talebzadehmich.wordpress.com
>
>Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>
>On 2 September 2016 at 12:03, Сергей Романов < romano...@inbox.ru > wrote:
>>Hi Mich,
>>
>>Column x29 does not seem to be special in any way. It's a newly created table and I did not calculate stats for any columns. Actually, I can sum a single column several times in one query and hit a landslide performance drop at some "magic" point. Setting "set spark.sql.codegen.wholeStage=false" makes all requests run in a similarly slow time (slower than the original time).
>>
>>PS. Do stats even help for Spark queries?
>>SELECT field, SUM(x28) FROM parquet_table WHERE partition = 1 GROUP BY field >>0: jdbc:hive2://spark-master1.uslicer> SELECT `advertiser_id` AS >>`advertiser_id`, SUM(`dd_convs`) AS `dd_convs` FROM >>`slicer`.`573_slicer_rnd_13` WHERE dt = '2016-07-28' GROUP BY >>`advertiser_id` LIMIT 30; >>30 rows selected (1.37 seconds) >>30 rows selected (1.382 seconds) >>30 rows selected (1.399 seconds) >> >>SELECT field, SUM(x29) FROM parquet_table WHERE partition = 1 GROUP BY field >>0: jdbc:hive2://spark-master1.uslicer> SELECT `advertiser_id` AS >>`advertiser_id`, SUM(`actual_dsp_fee`) AS `actual_dsp_fee` FROM >>`slicer`.`573_slicer_rnd_13` WHERE dt = '2016-07-28' GROUP BY >>`advertiser_id` LIMIT 30; >>30 rows selected (1.379 seconds) >>30 rows selected (1.382 seconds) >>30 rows selected (1.377 seconds) >>SELECT field, SUM(x28) x repeat 40 times FROM parquet_table WHERE partition >>= 1 GROUP BY field -> 1.774s >>0: jdbc:hive2://spark-master1.uslicer> SELECT `advertiser_id` AS >>`advertiser_id`, SUM(`dd_convs`) AS `dd_convs`, SUM(`dd_convs`), >>SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), >>SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), >>SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), >>SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), >>SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), >>SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), >>SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), >>SUM(`dd_convs`),SUM(`dd_convs`) AS `dd_convs`, SUM(`dd_convs`), >>SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`) AS >>`dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`) FROM >>`slicer`.`573_slicer_rnd_13` WHERE dt = '2016-07-28' GROUP BY >>`advertiser_id` LIMIT 30; >>30 rows selected (1.774 seconds) >>30 rows selected (1.721 seconds) >>30 rows selected (2.813 seconds) >>SELECT field, SUM(x28) x repeat 41 times FROM 
parquet_table WHERE partition = >>1 GROUP BY field -> 7.314s >>0: jdbc:hive2://spark-master1.uslicer> SELECT `advertiser_id` AS >>`advertiser_id`, SUM(`dd_convs`) AS `dd_convs`, SUM(`dd_convs`), >>SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), >>SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), >>SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), >>SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), >>SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), >>SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), >>SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), >>SUM(`dd_convs`),SUM(`dd_convs`) AS `dd_convs`, SUM(`dd_convs`), >>SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`) AS >>`dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), >>SUM(`dd_convs`) FROM `slicer`.`573_slicer_rnd_13` WHERE dt = '2016-07-28' >>GROUP BY `advertiser_id` LIMIT 30; >>30 rows selected (7.314 seconds) >>30 rows selected (7.27 seconds) >>30 rows selected (7.279 seconds) >>SELECT SUM(x28) x repeat 57 times FROM parquet_table WHERE partition = 1 -> >>1.378s >>0: jdbc:hive2://spark-master1.uslicer> EXPLAIN SELECT SUM(`dd_convs`) AS >>`dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), >>SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), >>SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), >>SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), >>SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), >>SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), >>SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), >>SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),SUM(`dd_convs`) AS >>`dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), >>SUM(`dd_convs`), SUM(`dd_convs`) AS `dd_convs`, SUM(`dd_convs`), >>SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`) AS >>`dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), >>SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), >>SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), >>SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`) FROM >>`slicer`.`573_slicer_rnd_13` WHERE dt = '2016-07-28'; >>plan == Physical Plan == >>*HashAggregate(keys=[], functions=[sum(dd_convs#159025L), >>sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L), >>sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L), >>sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L), >>sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L), >>sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L), >>sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L), >>sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L), >>sum(dd_convs#159025L), sum(dd_convs#159025L), ... 33 more fields]) >>+- Exchange SinglePartition >> +- *HashAggregate(keys=[], functions=[partial_sum(dd_convs#159025L), >>partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L), >>partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L), >>partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L), >>partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L), >>partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L), >>partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L), >>partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L), >>partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L), >>partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L), >>partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L), >>partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L), >>partial_sum(dd_convs#159025L), ... 
33 more fields]) >> +- *Project [dd_convs#159025L] >> +- *BatchedScan parquet >>slicer.573_slicer_rnd_13[dd_convs#159025L,dt#159005,etl_path#159006] Format: >>ParquetFormat, InputPaths: hdfs:// >>spark-master1.uslicer.net:8020/user/hive/warehouse/slicer.db/573_slicer.. ., >>PushedFilters: [], ReadSchema: struct<dd_convs:bigint> >> >>1 row selected (1.378 seconds) >>1 row selected (1.849 seconds) >>1 row selected (2.641 seconds) >>SELECT SUM(x28) x repeat 58 times FROM parquet_table WHERE partition = 1 -> >>5.733s >>0: jdbc:hive2://spark-master1.uslicer> EXPLAIN SELECT SUM(`dd_convs`) AS >>`dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), >>SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), >>SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), >>SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), >>SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), >>SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), >>SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), >>SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),SUM(`dd_convs`) AS >>`dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), >>SUM(`dd_convs`), SUM(`dd_convs`) AS `dd_convs`, SUM(`dd_convs`), >>SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`) AS >>`dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), >>SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), >>SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), >>SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), >>SUM(`dd_convs`) FROM `slicer`.`573_slicer_rnd_13` WHERE dt = '2016-07-28'; >>plan == Physical Plan == >>*HashAggregate(keys=[], functions=[sum(dd_convs#158694L), >>sum(dd_convs#158694L), sum(dd_convs#158694L), sum(dd_convs#158694L), >>sum(dd_convs#158694L), sum(dd_convs#158694L), sum(dd_convs#158694L), >>sum(dd_convs#158694L), sum(dd_convs#158694L), 
sum(dd_convs#158694L), >>sum(dd_convs#158694L), sum(dd_convs#158694L), sum(dd_convs#158694L), >>sum(dd_convs#158694L), sum(dd_convs#158694L), sum(dd_convs#158694L), >>sum(dd_convs#158694L), sum(dd_convs#158694L), sum(dd_convs#158694L), >>sum(dd_convs#158694L), sum(dd_convs#158694L), sum(dd_convs#158694L), >>sum(dd_convs#158694L), sum(dd_convs#158694L), ... 34 more fields]) >>+- Exchange SinglePartition >> +- *HashAggregate(keys=[], functions=[partial_sum(dd_convs#158694L), >>partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L), >>partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L), >>partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L), >>partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L), >>partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L), >>partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L), >>partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L), >>partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L), >>partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L), >>partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L), >>partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L), >>partial_sum(dd_convs#158694L), ... 34 more fields]) >> +- *Project [dd_convs#158694L] >> +- *BatchedScan parquet >>slicer.573_slicer_rnd_13[dd_convs#158694L,dt#158674,etl_path#158675] Format: >>ParquetFormat, InputPaths: hdfs:// >>spark-master1.uslicer.net:8020/user/hive/warehouse/slicer.db/573_slicer.. 
., PushedFilters: [], ReadSchema: struct<dd_convs:bigint>
>>
>>1 row selected (5.733 seconds)
>>1 row selected (5.693 seconds)
>>1 row selected (4.67 seconds)
>>
>>SET spark.sql.codegen.wholeStage=false;
>>Without CodeGen: SELECT SUM(x28) x repeat 57 times FROM parquet_table WHERE partition = 1 -> 13.712s
>>Without CodeGen: SELECT SUM(x28) x repeat 58 times FROM parquet_table WHERE partition = 1 -> 13.405s
>>
>>>Thursday, 1 September 2016, 19:35 +03:00 from Mich Talebzadeh <mich.talebza...@gmail.com>:
>>>
>>>What happens if you run the following query on its own? How long does it take?
>>>
>>>SELECT field, SUM(x29) FROM parquet_table WHERE partition = 1 GROUP BY field
>>>
>>>Have stats been updated for all columns in Hive? And what is the type of the x29 field?
>>>
>>>HTH
>>>
>>>Dr Mich Talebzadeh
>>>
>>>On 1 September 2016 at 16:55, Сергей Романов < romano...@inbox.ru.invalid > wrote:
>>>>Hi,
>>>>
>>>>When I run a query like "SELECT field, SUM(x1), SUM(x2)... SUM(x28) FROM parquet_table WHERE partition = 1 GROUP BY field" it runs in under 2 seconds, but when I add just one more aggregate field to the query "SELECT field, SUM(x1), SUM(x2)... SUM(x28), SUM(x29) FROM parquet_table WHERE partition = 1 GROUP BY field" it runs in about 12 seconds.
>>>>
>>>>Why does this happen? Can I make the second query run as fast as the first one? I tried browsing the logs in TRACE mode and comparing the CODEGEN output, but everything looks pretty much the same apart from the execution time.
>>>>
>>>>Can this be related to SPARK-17115?
>>>>
>>>>I'm using Spark 2.0 Thrift Server over YARN/HDFS with partitioned parquet hive tables.
>>>>
>>>>Complete example using beeline:
>>>>
>>>>0: jdbc:hive2://spark-master1.uslicer> DESCRIBE EXTENDED `slicer`.`573_slicer_rnd_13`;
>>>>col_name,data_type,comment
>>>>actual_dsp_fee,float,NULL
>>>>actual_pgm_fee,float,NULL
>>>>actual_ssp_fee,float,NULL
>>>>advertiser_id,int,NULL
>>>>advertiser_spent,double,NULL
>>>>anomaly_clicks,bigint,NULL
>>>>anomaly_conversions_filtered,bigint,NULL
>>>>anomaly_conversions_unfiltered,bigint,NULL
>>>>anomaly_decisions,float,NULL
>>>>bid_price,float,NULL
>>>>campaign_id,int,NULL
>>>>click_prob,float,NULL
>>>>clicks,bigint,NULL
>>>>clicks_static,bigint,NULL
>>>>conv_prob,float,NULL
>>>>conversion_id,bigint,NULL
>>>>conversions,bigint,NULL
>>>>creative_id,int,NULL
>>>>dd_convs,bigint,NULL
>>>>decisions,float,NULL
>>>>dmp_liveramp_margin,float,NULL
>>>>dmp_liveramp_payout,float,NULL
>>>>dmp_nielsen_margin,float,NULL
>>>>dmp_nielsen_payout,float,NULL
>>>>dmp_rapleaf_margin,float,NULL
>>>>dmp_rapleaf_payout,float,NULL
>>>>e,float,NULL
>>>>expected_cpa,float,NULL
>>>>expected_cpc,float,NULL
>>>>expected_payout,float,NULL
>>>>first_impressions,bigint,NULL
>>>>fraud_clicks,bigint,NULL
>>>>fraud_impressions,bigint,NULL
>>>>g,float,NULL
>>>>impressions,float,NULL
>>>>line_item_id,int,NULL
>>>>mail_type,string,NULL
>>>>noads,float,NULL
>>>>predict_version,bigint,NULL
>>>>publisher_id,int,NULL
>>>>publisher_revenue,double,NULL
>>>>pvc,bigint,NULL
>>>>second_price,float,NULL
>>>>thirdparty_margin,float,NULL
>>>>thirdparty_payout,float,NULL
>>>>dt,string,NULL
>>>>etl_path,string,NULL
>>>># Partition Information,,
>>>># col_name,data_type,comment
>>>>dt,string,NULL
>>>>etl_path,string,NULL
>>>>
>>>>data_type CatalogTable(
>>>>    Table: `slicer`.`573_slicer_rnd_13`
>>>>    Owner: spark
>>>>    Created: Fri Aug 12 12:30:20 UTC 2016
>>>>    Last Access: Thu Jan 01 00:00:00 UTC 1970
>>>>    Type: MANAGED
>>>>    Schema: [`actual_dsp_fee` float, `actual_pgm_fee` float, `actual_ssp_fee` float, `advertiser_id` int, `advertiser_spent` double, `anomaly_clicks` bigint, `anomaly_conversions_filtered` bigint, `anomaly_conversions_unfiltered` bigint, `anomaly_decisions` float, `bid_price` float, `campaign_id` int, `click_prob` float, `clicks` bigint, `clicks_static` bigint, `conv_prob` float, `conversion_id` bigint, `conversions` bigint, `creative_id` int, `dd_convs` bigint, `decisions` float, `dmp_liveramp_margin` float, `dmp_liveramp_payout` float, `dmp_nielsen_margin` float, `dmp_nielsen_payout` float, `dmp_rapleaf_margin` float, `dmp_rapleaf_payout` float, `e` float, `expected_cpa` float, `expected_cpc` float, `expected_payout` float, `first_impressions` bigint, `fraud_clicks` bigint, `fraud_impressions` bigint, `g` float, `impressions` float, `line_item_id` int, `mail_type` string, `noads` float, `predict_version` bigint, `publisher_id` int, `publisher_revenue` double, `pvc` bigint, `second_price` float, `thirdparty_margin` float, `thirdparty_payout` float, `dt` string, `etl_path` string]
>>>>    Partition Columns: [`dt`, `etl_path`]
>>>>    Properties: [transient_lastDdlTime=1471005020]
>>>>    Storage(Location: hdfs://spark-master1.uslicer.net:8020/user/hive/warehouse/slicer.db/573_slicer_rnd_13, InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat, OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat, Serde: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, Properties: [serialization.format=1]))
>>>>comment
>>>>
>>>>
>>>>0: jdbc:hive2://spark-master1.uslicer> EXPLAIN SELECT `advertiser_id` AS `advertiser_id`, SUM(`conversions`) AS `conversions`, SUM(`dmp_rapleaf_margin`) AS `dmp_rapleaf_margin`, SUM(`pvc`) AS `pvc`, SUM(`dmp_nielsen_payout`) AS `dmp_nielsen_payout`, SUM(`fraud_clicks`) AS `fraud_clicks`,
SUM(`impressions`) AS `impressions`, SUM(`conv_prob`) AS `conv_prob`, SUM(`dmp_liveramp_payout`) AS `dmp_liveramp_payout`, SUM(`decisions`) AS `decisions`, SUM(`fraud_impressions`) AS `fraud_impressions`, SUM(`advertiser_spent`) AS `advertiser_spent`, SUM(`actual_ssp_fee`) AS `actual_ssp_fee`, SUM(`dmp_nielsen_margin`) AS `dmp_nielsen_margin`, SUM(`first_impressions`) AS `first_impressions`, SUM(`clicks`) AS `clicks`, SUM(`second_price`) AS `second_price`, SUM(`click_prob`) AS `click_prob`, SUM(`clicks_static`) AS `clicks_static`, SUM(`expected_payout`) AS `expected_payout`, SUM(`bid_price`) AS `bid_price`, SUM(`noads`) AS `noads`, SUM(`e`) AS `e`, SUM(`g`) AS `g`, SUM(`publisher_revenue`) AS `publisher_revenue`, SUM(`dmp_liveramp_margin`) AS `dmp_liveramp_margin`, SUM(`actual_pgm_fee`) AS `actual_pgm_fee`, SUM(`dmp_rapleaf_payout`) AS `dmp_rapleaf_payout`, SUM(`dd_convs`) AS `dd_convs`, SUM(`actual_dsp_fee`) AS `actual_dsp_fee` FROM `slicer`.`573_slicer_rnd_13` WHERE dt = '2016-07-28' GROUP BY `advertiser_id` LIMIT 30;

plan
== Physical Plan ==
CollectLimit 30
+- *HashAggregate(keys=[advertiser_id#13866], functions=[sum(conversions#13879L), sum(cast(dmp_rapleaf_margin#13887 as double)), sum(pvc#13904L), sum(cast(dmp_nielsen_payout#13886 as double)), sum(fraud_clicks#13894L), sum(cast(impressions#13897 as double)), sum(cast(conv_prob#13877 as double)), sum(cast(dmp_liveramp_payout#13884 as double)), sum(cast(decisions#13882 as double)), sum(fraud_impressions#13895L), sum(advertiser_spent#13867), sum(cast(actual_ssp_fee#13865 as double)), sum(cast(dmp_nielsen_margin#13885 as double)), sum(first_impressions#13893L), sum(clicks#13875L), sum(cast(second_price#13905 as double)), sum(cast(click_prob#13874 as double)), sum(clicks_static#13876L), sum(cast(expected_payout#13892 as double)), sum(cast(bid_price#13872 as double)), sum(cast(noads#13900 as double)), sum(cast(e#13889 as double)), sum(cast(g#13896 as double)), sum(publisher_revenue#13903), ... 5 more fields])
   +- Exchange hashpartitioning(advertiser_id#13866, 3)
      +- *HashAggregate(keys=[advertiser_id#13866], functions=[partial_sum(conversions#13879L), partial_sum(cast(dmp_rapleaf_margin#13887 as double)), partial_sum(pvc#13904L), partial_sum(cast(dmp_nielsen_payout#13886 as double)), partial_sum(fraud_clicks#13894L), partial_sum(cast(impressions#13897 as double)), partial_sum(cast(conv_prob#13877 as double)), partial_sum(cast(dmp_liveramp_payout#13884 as double)), partial_sum(cast(decisions#13882 as double)), partial_sum(fraud_impressions#13895L), partial_sum(advertiser_spent#13867), partial_sum(cast(actual_ssp_fee#13865 as double)), partial_sum(cast(dmp_nielsen_margin#13885 as double)), partial_sum(first_impressions#13893L), partial_sum(clicks#13875L), partial_sum(cast(second_price#13905 as double)), partial_sum(cast(click_prob#13874 as double)), partial_sum(clicks_static#13876L), partial_sum(cast(expected_payout#13892 as double)), partial_sum(cast(bid_price#13872 as double)), partial_sum(cast(noads#13900 as double)), partial_sum(cast(e#13889 as double)), partial_sum(cast(g#13896 as double)), partial_sum(publisher_revenue#13903), ... 5 more fields])
         +- *Project [actual_dsp_fee#13863, actual_pgm_fee#13864, actual_ssp_fee#13865, advertiser_id#13866, advertiser_spent#13867, bid_price#13872, click_prob#13874, clicks#13875L, clicks_static#13876L, conv_prob#13877, conversions#13879L, dd_convs#13881L, decisions#13882, dmp_liveramp_margin#13883, dmp_liveramp_payout#13884, dmp_nielsen_margin#13885, dmp_nielsen_payout#13886, dmp_rapleaf_margin#13887, dmp_rapleaf_payout#13888, e#13889, expected_payout#13892, first_impressions#13893L, fraud_clicks#13894L, fraud_impressions#13895L, ... 6 more fields]
            +- *BatchedScan parquet slicer.573_slicer_rnd_13[actual_dsp_fee#13863,actual_pgm_fee#13864,actual_ssp_fee#13865,advertiser_id#13866,advertiser_spent#13867,bid_price#13872,click_prob#13874,clicks#13875L,clicks_static#13876L,conv_prob#13877,conversions#13879L,dd_convs#13881L,decisions#13882,dmp_liveramp_margin#13883,dmp_liveramp_payout#13884,dmp_nielsen_margin#13885,dmp_nielsen_payout#13886,dmp_rapleaf_margin#13887,dmp_rapleaf_payout#13888,e#13889,expected_payout#13892,first_impressions#13893L,fraud_clicks#13894L,fraud_impressions#13895L,... 8 more fields] Format: ParquetFormat, InputPaths: hdfs://spark-master1.uslicer.net:8020/user/hive/warehouse/slicer.db/573_slicer..., PushedFilters: [], ReadSchema: struct<actual_dsp_fee:float,actual_pgm_fee:float,actual_ssp_fee:float,advertiser_id:int,advertise...

0: jdbc:hive2://spark-master1.uslicer> SELECT `advertiser_id` AS `advertiser_id`, SUM(`conversions`) AS `conversions`, SUM(`dmp_rapleaf_margin`) AS `dmp_rapleaf_margin`, SUM(`pvc`) AS `pvc`, SUM(`dmp_nielsen_payout`) AS `dmp_nielsen_payout`, SUM(`fraud_clicks`) AS `fraud_clicks`, SUM(`impressions`) AS `impressions`, SUM(`conv_prob`) AS `conv_prob`, SUM(`dmp_liveramp_payout`) AS `dmp_liveramp_payout`, SUM(`decisions`) AS `decisions`, SUM(`fraud_impressions`) AS `fraud_impressions`, SUM(`advertiser_spent`) AS `advertiser_spent`, SUM(`actual_ssp_fee`) AS `actual_ssp_fee`, SUM(`dmp_nielsen_margin`) AS `dmp_nielsen_margin`, SUM(`first_impressions`) AS `first_impressions`, SUM(`clicks`) AS `clicks`, SUM(`second_price`) AS `second_price`, SUM(`click_prob`) AS `click_prob`, SUM(`clicks_static`) AS `clicks_static`, SUM(`expected_payout`) AS `expected_payout`, SUM(`bid_price`) AS `bid_price`, SUM(`noads`) AS `noads`, SUM(`e`) AS `e`, SUM(`g`) AS `g`, SUM(`publisher_revenue`) AS `publisher_revenue`, SUM(`dmp_liveramp_margin`) AS `dmp_liveramp_margin`, SUM(`actual_pgm_fee`) AS `actual_pgm_fee`, SUM(`dmp_rapleaf_payout`) AS `dmp_rapleaf_payout`, SUM(`dd_convs`) AS `dd_convs`, SUM(`actual_dsp_fee`) AS `actual_dsp_fee` FROM `slicer`.`573_slicer_rnd_13` WHERE dt = '2016-07-28' GROUP BY `advertiser_id` LIMIT 30;

(results for three runs)
30 rows selected (11.904 seconds)
30 rows selected (11.703 seconds)
30 rows selected (11.52 seconds)

XXX

0: jdbc:hive2://spark-master1.uslicer> EXPLAIN SELECT `advertiser_id` AS `advertiser_id`, SUM(`conversions`) AS `conversions`, SUM(`dmp_rapleaf_margin`) AS `dmp_rapleaf_margin`, SUM(`pvc`) AS `pvc`, SUM(`dmp_nielsen_payout`) AS `dmp_nielsen_payout`, SUM(`fraud_clicks`) AS `fraud_clicks`, SUM(`impressions`) AS `impressions`, SUM(`conv_prob`) AS `conv_prob`, SUM(`dmp_liveramp_payout`) AS `dmp_liveramp_payout`, SUM(`decisions`) AS `decisions`, SUM(`fraud_impressions`) AS `fraud_impressions`, SUM(`advertiser_spent`) AS `advertiser_spent`, SUM(`actual_ssp_fee`) AS `actual_ssp_fee`, SUM(`dmp_nielsen_margin`) AS `dmp_nielsen_margin`, SUM(`first_impressions`) AS `first_impressions`, SUM(`clicks`) AS `clicks`, SUM(`second_price`) AS `second_price`, SUM(`click_prob`) AS `click_prob`, SUM(`clicks_static`) AS `clicks_static`, SUM(`expected_payout`) AS `expected_payout`, SUM(`bid_price`) AS `bid_price`, SUM(`noads`) AS `noads`, SUM(`e`) AS `e`, SUM(`g`) AS `g`, SUM(`publisher_revenue`) AS `publisher_revenue`, SUM(`dmp_liveramp_margin`) AS `dmp_liveramp_margin`, SUM(`actual_pgm_fee`) AS `actual_pgm_fee`, SUM(`dmp_rapleaf_payout`) AS `dmp_rapleaf_payout`, SUM(`dd_convs`) AS `dd_convs` FROM `slicer`.`573_slicer_rnd_13` WHERE dt = '2016-07-28' GROUP BY `advertiser_id` LIMIT 30;

plan
== Physical Plan ==
CollectLimit 30
+- *HashAggregate(keys=[advertiser_id#15269], functions=[sum(conversions#15282L), sum(cast(dmp_rapleaf_margin#15290 as double)), sum(pvc#15307L), sum(cast(dmp_nielsen_payout#15289 as double)), sum(fraud_clicks#15297L), sum(cast(impressions#15300 as double)), sum(cast(conv_prob#15280 as double)), sum(cast(dmp_liveramp_payout#15287 as double)), sum(cast(decisions#15285 as double)), sum(fraud_impressions#15298L), sum(advertiser_spent#15270), sum(cast(actual_ssp_fee#15268 as double)), sum(cast(dmp_nielsen_margin#15288 as double)), sum(first_impressions#15296L), sum(clicks#15278L), sum(cast(second_price#15308 as double)), sum(cast(click_prob#15277 as double)), sum(clicks_static#15279L), sum(cast(expected_payout#15295 as double)), sum(cast(bid_price#15275 as double)), sum(cast(noads#15303 as double)), sum(cast(e#15292 as double)), sum(cast(g#15299 as double)), sum(publisher_revenue#15306), ... 4 more fields])
   +- Exchange hashpartitioning(advertiser_id#15269, 3)
      +- *HashAggregate(keys=[advertiser_id#15269], functions=[partial_sum(conversions#15282L), partial_sum(cast(dmp_rapleaf_margin#15290 as double)), partial_sum(pvc#15307L), partial_sum(cast(dmp_nielsen_payout#15289 as double)), partial_sum(fraud_clicks#15297L), partial_sum(cast(impressions#15300 as double)), partial_sum(cast(conv_prob#15280 as double)), partial_sum(cast(dmp_liveramp_payout#15287 as double)), partial_sum(cast(decisions#15285 as double)), partial_sum(fraud_impressions#15298L), partial_sum(advertiser_spent#15270), partial_sum(cast(actual_ssp_fee#15268 as double)), partial_sum(cast(dmp_nielsen_margin#15288 as double)), partial_sum(first_impressions#15296L), partial_sum(clicks#15278L), partial_sum(cast(second_price#15308 as double)), partial_sum(cast(click_prob#15277 as double)), partial_sum(clicks_static#15279L), partial_sum(cast(expected_payout#15295 as double)), partial_sum(cast(bid_price#15275 as double)), partial_sum(cast(noads#15303 as double)), partial_sum(cast(e#15292 as double)), partial_sum(cast(g#15299 as double)), partial_sum(publisher_revenue#15306), ... 4 more fields])
         +- *Project [actual_pgm_fee#15267, actual_ssp_fee#15268, advertiser_id#15269, advertiser_spent#15270, bid_price#15275, click_prob#15277, clicks#15278L, clicks_static#15279L, conv_prob#15280, conversions#15282L, dd_convs#15284L, decisions#15285, dmp_liveramp_margin#15286, dmp_liveramp_payout#15287, dmp_nielsen_margin#15288, dmp_nielsen_payout#15289, dmp_rapleaf_margin#15290, dmp_rapleaf_payout#15291, e#15292, expected_payout#15295, first_impressions#15296L, fraud_clicks#15297L, fraud_impressions#15298L, g#15299, ... 5 more fields]
            +- *BatchedScan parquet slicer.573_slicer_rnd_13[actual_pgm_fee#15267,actual_ssp_fee#15268,advertiser_id#15269,advertiser_spent#15270,bid_price#15275,click_prob#15277,clicks#15278L,clicks_static#15279L,conv_prob#15280,conversions#15282L,dd_convs#15284L,decisions#15285,dmp_liveramp_margin#15286,dmp_liveramp_payout#15287,dmp_nielsen_margin#15288,dmp_nielsen_payout#15289,dmp_rapleaf_margin#15290,dmp_rapleaf_payout#15291,e#15292,expected_payout#15295,first_impressions#15296L,fraud_clicks#15297L,fraud_impressions#15298L,g#15299,... 7 more fields] Format: ParquetFormat, InputPaths: hdfs://spark-master1.uslicer.net:8020/user/hive/warehouse/slicer.db/573_slicer..., PushedFilters: [], ReadSchema: struct<actual_pgm_fee:float,actual_ssp_fee:float,advertiser_id:int,advertiser_spent:double,bid_pr...

0: jdbc:hive2://spark-master1.uslicer> SELECT `advertiser_id` AS `advertiser_id`, SUM(`conversions`) AS `conversions`, SUM(`dmp_rapleaf_margin`) AS `dmp_rapleaf_margin`, SUM(`pvc`) AS `pvc`, SUM(`dmp_nielsen_payout`) AS `dmp_nielsen_payout`, SUM(`fraud_clicks`) AS `fraud_clicks`, SUM(`impressions`) AS `impressions`, SUM(`conv_prob`) AS `conv_prob`, SUM(`dmp_liveramp_payout`) AS `dmp_liveramp_payout`, SUM(`decisions`) AS `decisions`, SUM(`fraud_impressions`) AS `fraud_impressions`, SUM(`advertiser_spent`) AS `advertiser_spent`, SUM(`actual_ssp_fee`) AS `actual_ssp_fee`, SUM(`dmp_nielsen_margin`) AS `dmp_nielsen_margin`, SUM(`first_impressions`) AS `first_impressions`, SUM(`clicks`) AS `clicks`, SUM(`second_price`) AS `second_price`, SUM(`click_prob`) AS `click_prob`, SUM(`clicks_static`) AS `clicks_static`, SUM(`expected_payout`) AS `expected_payout`, SUM(`bid_price`) AS `bid_price`, SUM(`noads`) AS `noads`, SUM(`e`) AS `e`, SUM(`g`) AS `g`, SUM(`publisher_revenue`) AS `publisher_revenue`, SUM(`dmp_liveramp_margin`) AS `dmp_liveramp_margin`, SUM(`actual_pgm_fee`) AS `actual_pgm_fee`, SUM(`dmp_rapleaf_payout`) AS `dmp_rapleaf_payout`, SUM(`dd_convs`) AS `dd_convs` FROM `slicer`.`573_slicer_rnd_13` WHERE dt = '2016-07-28' GROUP BY `advertiser_id` LIMIT 30;

(results for three runs)
30 rows selected (2.158 seconds)
30 rows selected (1.83 seconds)
30 rows selected (1.979 seconds)

Sergei Romanov.
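P.S. In case anyone wants to reproduce the measurement without retyping the queries, here is a minimal sketch of the kind of timing harness behind the numbers above. The helper names (`build_sum_query`, `time_query`) are mine, not part of the original runs, and `spark` is assumed to be an already-created SparkSession; the idea is just to vary how many aggregate columns the query carries and time each run, discarding the first as a warm-up.

```python
import time

def build_sum_query(n, table="slicer.573_slicer_rnd_13"):
    # Repeat the same aggregate n times to vary the width of the
    # aggregation list; mirrors groupBy().sum(*(['dd_convs'] * n)).
    sums = ", ".join("SUM(`dd_convs`) AS `s{}`".format(i) for i in range(n))
    return "SELECT {} FROM {}".format(sums, table)

def time_query(spark, n, runs=3):
    # spark: an existing SparkSession. The first run is a warm-up
    # and is not reported, matching the numbers quoted above.
    sql = build_sum_query(n)
    spark.sql(sql).collect()
    timings = []
    for _ in range(runs):
        t0 = time.time()
        spark.sql(sql).collect()
        timings.append(time.time() - t0)
    return timings
```

Comparing, say, `time_query(spark, 57)` against `time_query(spark, 58)` should show the same jump as the Dataframe API example at the top of this message.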