performance issue on big table join
Hi Devs I met a performance issue on big table join. The query takes more than 3 hours on Impala and only 3 minutes on Spark SQL on the same 5 nodes cluster. when running query, the left scanner and exchange node are very slow. Did I miss some key arguments? you can see profile file in attachment. -- Thanks & Best Regards eJztnQlcU2e6/31PEtpqW0tbHZh2ZtLOtKMzStm3284drbZ2Wm2rzp3537n39lKIlhkWL4Euc+feG9nEsAhhESIiSwhbCAEia1hkCQgBwyIgIAJCZDcEDIuI//ckYUtyQoJYN/i0x5OzvOfdn+f9/t73HMO5VGCw60sPktsPxB2ODh+Ympqb2Jla2ptZmXxtY3LMxPbYMUuHr43tjBf+dm7TM9xloP+Ji529u+O3pKOu7nZORx2dSds2b9+0yeCFlb+3A/03gN+r+/bv/ePHxIOf79tP/NOew4c+OfSxs/yJJ9xcjzk6kYj2biQ7d5ID8btv0F9uHi4uji7HiXZE+Y1fezg6ORBdjxE/cT5h52RnRPwjmUQ8vP+z/XuO7JefJBPdXYnOJDuyhxuJ+F/yoElux1zdnO1c7ElGhgZqovAGjO5zRzycne3cfti2yfB9g1986Op8wsOddMTezuWwnctx0h4y2fG4izPJxR1NkBtMUQC7x3STLok39Htlj4OzIwzI1YXoRiJ7OLm/jh5wR1Pr6OxMcnCECXf64eUPXV1cSPboUZg4N7ybq6v7lg9dXd0cHF3s3F3dtn7tcIJkbWbmYm5ragpLYfM+0jE7GBhx39dvuZ+wdyB/9bWjy1cn7NzcHd3ho0gO6D7MCPevTIyNX95HciIdt1sIfNPz+10ciGg0f2ZqbGK128R4t6kl0dTY1tTE1sTGyNzK2swS3mWst9/NzdWNvGn7frK7o7Ps9i9IbrsPuJLdiQdJzi9YmZtYWRhbmVhv2f89yV6Rk41bNn8Osx6NM3HF3y/R+8hE4p5vj8ueTSQetPtesQfPHnb9Dp6ETzJa2IcPs/sb+iDF4aWf+0judo5ORF3+Nu9+DP82G1vZHtx/+GNYGXfv//OHB/Yc+nj/QoRN5P+YfWPqrLwHy2YxXcv3ict2FT//eOiLPYePfnL0k88P7d+nZU4Zm9se/fyL3YfUnLNAn2hkY2rljJaPqZGJsaVsDz2lEikTSyP476d7F46aGpmZL/up099mYzPbPR9/fHj/x3uO7lc6ZyEL3NTZwkoWFVMzZzMzuGdqbA6f/6n80dbWcP8g3DODkbIkHlTEwsQUPS77+dEnh/Z89sm/KgeuOVKmtgf2HDlA/MPnnyjnloX8ud+YGTsv7FnDPRNzYyMTy71KkUJz0oz4sSJS5tZmRmY2skh9cujQ/sOy4HcRtSzIzf/YvdvY0la5Oi1EytSGbGZigRYa3LM2tkH3TMzNFyKyYl/2Z7wyePgTTfIOe7KRPfkrR3eS81fkv+1cNaf+Ae80sT3y4Z5DxAP7PjqiFCkTc7KFhbksKjZkC2Nz7EiZwh3TxeKDWbVQfKv2gUb2drBvdj3+FdnOiUQm2pNh8VmozybiQvFZOZsay+qUiamziTV5RZGt2NeUU2SyEVmHnDI2Vp9NikiZkq1NZPlDNCMbw8qFFSlTY2sja5O15RQZ9t0kRT6R0Ue9LLe7xH8huaF27F8cZT8diN/Kf6PV18TIePeRQ3u+OHLg86MKq71Dbrat7RwsHYzNzEyMrU0tzE2sv7a0szNxMDe1gmbsa2Orr+2tv4b+xdf2O7ceIrl/5+r2N+IeBwdoKck/tYW+x7FjtibGRqZmqFmC3Y2prbm1sYWpAbRDRLkdcnQhHiZBo/atHZqScIDaSUsTExcTuZ3cYalI+E7iwhlTzDNmmGfM1Z6xMjXFOLPMUu8w3knEf+Fk5+K/TcUIbUZN4KJNhelw9XCzJy1PkC1q8VzdfvhAFv7BvZtVr16wzuTFa62MrWAPAi9WeDq2xKPfkIjHXJ2cXL9D/Sp3u6/RsrWDvpLMNYGH3KCH8K2di7v8HNHOxeE9aMDtXZ08nF2IZHcYF/gU2Og369jQdulW3zZv/gj29F98BhvAR4f3fHxw/6GjxL+sMGL/TvwG9SM+MCE6usB4Qe8O7qP9i0rGwPxwJjnvJimy5wPjvbLfbkuZCw9tRp+1+/DnsNYe+eTQp2hAWtz0D7WWWymiaFjQf4M19esfbIlLfQBxz5EP0XNOjtANtEWNpbZPhY3X44QTabejA/kDM6Kb63e7yY5/J31gar6XaG8n8xSdHN1/+EAWIsxIU+WMVNMfLWSnxbLstFCtZMp5CduErIapRHOhmi46EX/57JODnxxF47Rqhqx4AuotfLoO2bDCcfjLgpWXx8XDHbr7MCYeztCefeUCq6NsPOK+c5fsGHnFMfSO426uHieUYq8Sc7lToSZvzMzlx8knHJ2cdn/tcewYyQ2mU5ZfKxJlip0oU2srGxsrY1NzNGnL3Y+/YPgKspR+Y0f+Bo614GDDXt5RrKgExA+IK8w5esexv7134m+w/bv81cPF3h3eYUeGWQJHALITstxz/waW5PJA5YEsZRnxd4rnLB1Cb4SjO3fU64cDP3doQGyJhz+C3STx/d2qsViRr3K/6EHy1XiXyVLOmpli56ySD/UXNS6PLF+1brcrorEsEqgvuKLOmpuhkbC0kN2Ebj6CfpNSMz6859C+zw+qb7nadITmGPVTHtUVftpfdPerdhEV8UMDg3+L98GEW5tZvYdu0LJH+20bM0uiPB+sjIzNPt6ruAW1N+TdMJPISxlCJH3v7mZ3wtUJHYfKzzk4klFj5aC4S265ZPfaEmUXeLjYfQsHivLj6GOWHVDctMzCwbtUz+uQcUTdy3iFC6q2e16LaTDGfvbySg5bnS4160Gq1QqnVos6tdIDXapRytXJ1Pw9dLNYnUzN5OVsamxkbQOrExGzt9n9u5V94OaV1W4xnzRVO93rnMYKp0NtI+pc3Mqep7Gcv31+QpaVxB1kkju0a2iHf8zxuIeb7EmoF0g8AT1XF5LbzucPHv1q3+dffGD8Ky3u3Llpi/yqIzCZpOeh1f3kyIH9+15cOuhBRj7/dLP899EfTpAIX/5x/+H/9+JhEqwDsI594erq9KKDHHHtPgF/bD5CkjO0T/a9ZWNlbk4ysTG2PwZHDzaW5rZ2JtYkC0sSycrYxMzM1MbhxYWL0ZCf27t//5E/7fnzS0f+y0keHxTodQEy9Hjt3ZcZ8l1aOgLEY26uzsTNynVUuR/c/B20jyTNllaWw9qYzQXHY7nfseBKLT8m8yllLiVMqZu7CuQzsbI1sba1sDSytrawlg1PXj8qHwwcVAwE0Bwi/7uuXr5OTRq/SDgNpw2XFeyKYluWgEVcuay+LK9fK6qV0lhV9iwluKqEQ5XHncvQ6so6s5zHatUKdG5ksnEiBm5VW1CYA+EVLUkFQS8HtgrK+wagblqVgBvuMnjlC3lcZQXi5OhCMmQG+Hj/i7fXKW+zoBs9Z8yYSfSR//S9wabYxVwrjbJrTxwOto8JSgizl06FnrWPOTfUb2+YvO0gyd3OAVYhopOrnQPaG7rB9G5fefSYo4sj+RuSg04X6++Bfd8PZEfy4pE39/+Xh+O3sOJBQ0a0d4KOLNo+5Ql1MDgC8xJ21S6uDiRZMSxIEQaHVxqOpTv2waGwm+PX6P6KO/RlOYMWzcKTNxm8rKixi3kVIdGbG6kbd5i/kZdDoqZfyyEJr2ZcJXV3Rl4/ltc3GkXBseISh+7M4UtrzjOm5/D1ubyr8B/D5K3ykMgeX8vFA9WnvX5Edg6O892IdgsF/hpaok4kNK6Lx948TLJz+AFVTGRZSYTVwpL4tZ3930guDmTrPU5O8CcJVhAPWf1cOEHcYWIJuz6742htWPIMdi6Ux88+cnSD9c3hBxc7Z0d7RbbB+mZPcvyW5PCyjOkvGjt9+cXQcBGPkdztYexf+aOLG+k4zFp4k0zA2bTJ4EV5Wz4CKzXJbdsmw88MXv/QyRE+/iP0nj/ZOS7qMrGUs9cRbGHmJ/DpB2EZuTlCu/h3WQNZuLO0bPYltcrNJsPn0TaxQ+0jsQLcZGC2fzHnvlCIW6tKattegUnb+pG8oh1W5Ni2Tds3wxR9JLPlK+O8SUsJSlpFTw2ew8uSQ/Az3CsvSXmBQXOC+jQu9o4kcij40BXWdlui5S6iM+xH3iM6231vS7SwdibDfRMTS2do30wt3L8hvrMbJgheZ4YesTBedsQCPWK17BoTE1P0kI2x6qEVV8kCt7ExslE6+PpHihoEh+mL0KoT2QwNBZF4xM3eiHgIbbPEo8fd5bs7YF0kHrVzOw57WXdoIuBIaMEWLdRG2KG6OKCB7dj//QmZQYC3yKuinZsbmutE4lKDIRL3u8g9vR9HDlqJBlcAZVNltCz7O7z/4OdKcsgxOycySYUb77BYAZ4PvbdHZdfdzYP0qqyyLWTWRoavNcONl2W4hQnZwlKuksFdK0tzhWAmu0//kIfz1zCbXI8tmBlg8lPUpMtyekHwdIX9/h/JdsdJZ1XotimqFH2sBm6rnDBbOmG94oS5uhPLyfaKoJaDbVTgM5Ehb8M/q6ZEbfvF7oPU1D0NOSHrmZV7TNW+cpPBtj3QdMAbHIgfLZiuj4zNtj1vOKdv8Jri3NFvoAF3OOoKI0be9iLai76y18nV1Vke/N4f3EnkbXqaOt2taPzk0ZPFDl5NofxMcXyZUyYP5TX0KHQ8HdSdcUOdOUVwC2FxvU++Ao0hakGhSXHwsJcZh2oA7ZosIgr/VZENS9F6ffnZIzDXlU8dgY4sfMSCVZOfeleeOFme
Re: performance issue on big table join
The profile file is damaged. Here is a screenshot for exec summary 2017-10-27 10:04 GMT+08:00 俊杰陈 : > Hi Devs > > I met a performance issue on big table join. The query takes more than 3 > hours on Impala and only 3 minutes on Spark SQL on the same 5 nodes > cluster. when running query, the left scanner and exchange node are very > slow. Did I miss some key arguments? > > you can see profile file in attachment. > > > > -- > Thanks & Best Regards > -- Thanks & Best Regards
Re: performance issue on big table join
Hi, Looks like you are joining store_sales with catalog_sales on item_sk, this kind of join condition is a many to many, which means the output number of rows will be much larger then input number of rows, not sure if this is intended. Also did you run "compute stats [TABLE_NAME]" on both tables? For a more comprehensive query try TPCDS Q17 select i_item_id ,i_item_desc ,s_state ,count(ss_quantity) as store_sales_quantitycount ,avg(ss_quantity) as store_sales_quantityave ,stddev_samp(ss_quantity) as store_sales_quantitystdev ,stddev_samp(ss_quantity)/avg(ss_quantity) as store_sales_quantitycov ,count(sr_return_quantity) as store_returns_quantitycount ,avg(sr_return_quantity) as store_returns_quantityave ,stddev_samp(sr_return_quantity) as store_returns_quantitystdev ,stddev_samp(sr_return_quantity)/avg(sr_return_quantity) as store_returns_quantitycov ,count(cs_quantity) as catalog_sales_quantitycount ,avg(cs_quantity) as catalog_sales_quantityave ,stddev_samp(cs_quantity) as catalog_sales_quantitystdev ,stddev_samp(cs_quantity)/avg(cs_quantity) as catalog_sales_quantitycov from store_sales ,store_returns ,catalog_sales ,date_dim d1 ,date_dim d2 ,date_dim d3 ,store ,item where d1.d_quarter_name = '2000Q1' and d1.d_date_sk = ss_sold_date_sk and i_item_sk = ss_item_sk and s_store_sk = ss_store_sk and ss_customer_sk = sr_customer_sk and ss_item_sk = sr_item_sk and ss_ticket_number = sr_ticket_number and sr_returned_date_sk = d2.d_date_sk and d2.d_quarter_name in ('2000Q1','2000Q2','2000Q3') and sr_customer_sk = cs_bill_customer_sk and sr_item_sk = cs_item_sk and cs_sold_date_sk = d3.d_date_sk and d3.d_quarter_name in ('2000Q1','2000Q2','2000Q3') group by i_item_id ,i_item_desc ,s_state order by i_item_id ,i_item_desc ,s_state limit 100; I recommend moving this kind of discussion on u...@impala.incubator.apache.org. On Thu, Oct 26, 2017 at 7:25 PM, 俊杰陈 wrote: > The profile file is damaged. Here is a screenshot for exec summary > > > > 2017-10-27 10:04 GMT+08:00 俊杰陈 : > >> Hi Devs >> >> I met a performance issue on big table join. The query takes more than 3 >> hours on Impala and only 3 minutes on Spark SQL on the same 5 nodes >> cluster. when running query, the left scanner and exchange node are very >> slow. Did I miss some key arguments? >> >> you can see profile file in attachment. >> >> >> >> -- >> Thanks & Best Regards >> > > > > -- > Thanks & Best Regards >
Re: performance issue on big table join
Hi JJ Consider it only takes 3mins on SparkSQL, maybe there are some mistakes in query options. Try run "set;" in impala-shell and check all query options, e.g: BATCH_SIZE: [0] DISABLE_CODEGEN: [0] RUNTIME_FILTER_MODE: GLOBAL Just a guess, thanks. 在 27/10/2017 10:25, 俊杰陈 写道: The profile file is damaged. Here is a screenshot for exec summary [cid:ii_j999ymep1_15f5ba563aeabb91] 2017-10-27 10:04 GMT+08:00 俊杰陈 mailto:cjjnj...@gmail.com>>: Hi Devs I met a performance issue on big table join. The query takes more than 3 hours on Impala and only 3 minutes on Spark SQL on the same 5 nodes cluster. when running query, the left scanner and exchange node are very slow. Did I miss some key arguments? you can see profile file in attachment. [cid:ii_j9998pph2_15f5b92f2cf47020] -- Thanks & Best Regards -- Thanks & Best Regards -- Regards, Hongxu.
Re: performance issue on big table join
Thanks Hongxu, Here are configurations on my cluster, most of them are default values. Which item do you think it may impact? ABORT_ON_DEFAULT_LIMIT_EXCEEDED: [0] ABORT_ON_ERROR: [0] ALLOW_UNSUPPORTED_FORMATS: [0] APPX_COUNT_DISTINCT: [0] BATCH_SIZE: [0] COMPRESSION_CODEC: [NONE] DEBUG_ACTION: [] DEFAULT_ORDER_BY_LIMIT: [-1] DISABLE_CACHED_READS: [0] DISABLE_CODEGEN: [0] DISABLE_OUTERMOST_TOPN: [0] DISABLE_ROW_RUNTIME_FILTERING: [0] DISABLE_STREAMING_PREAGGREGATIONS: [0] DISABLE_UNSAFE_SPILLS: [0] ENABLE_EXPR_REWRITES: [1] EXEC_SINGLE_NODE_ROWS_THRESHOLD: [100] EXPLAIN_LEVEL: [1] HBASE_CACHE_BLOCKS: [0] HBASE_CACHING: [0] MAX_BLOCK_MGR_MEMORY: [0] MAX_ERRORS: [100] MAX_IO_BUFFERS: [0] MAX_NUM_RUNTIME_FILTERS: [10] MAX_SCAN_RANGE_LENGTH: [0] MEM_LIMIT: [0] MT_DOP: [0] NUM_NODES: [0] NUM_SCANNER_THREADS: [0] OPTIMIZE_PARTITION_KEY_SCANS: [0] PARQUET_ANNOTATE_STRINGS_UTF8: [0] PARQUET_FALLBACK_SCHEMA_RESOLUTION: [0] PARQUET_FILE_SIZE: [0] PREFETCH_MODE: [1] QUERY_TIMEOUT_S: [0] REPLICA_PREFERENCE: [0] REQUEST_POOL: [] RESERVATION_REQUEST_TIMEOUT: [0] RM_INITIAL_MEM: [0] RUNTIME_BLOOM_FILTER_SIZE: [1048576] RUNTIME_FILTER_MAX_SIZE: [16777216] RUNTIME_FILTER_MIN_SIZE: [1048576] RUNTIME_FILTER_MODE: [2] RUNTIME_FILTER_WAIT_TIME_MS: [0] S3_SKIP_INSERT_STAGING: [1] SCAN_NODE_CODEGEN_THRESHOLD: [180] SCHEDULE_RANDOM_REPLICA: [0] SCRATCH_LIMIT: [-1] SEQ_COMPRESSION_MODE: [0] STRICT_MODE: [0] SUPPORT_START_OVER: [false] SYNC_DDL: [0] V_CPU_CORES: [0] 2017-10-31 15:30 GMT+08:00 Hongxu Ma : > Hi JJ > Consider it only takes 3mins on SparkSQL, maybe there are some mistakes in > query options. > Try run "set;" in impala-shell and check all query options, e.g: > BATCH_SIZE: [0] > DISABLE_CODEGEN: [0] > RUNTIME_FILTER_MODE: GLOBAL > > Just a guess, thanks. > > 在 27/10/2017 10:25, 俊杰陈 写道: > The profile file is damaged. Here is a screenshot for exec summary > [cid:ii_j999ymep1_15f5ba563aeabb91] > > > 2017-10-27 10:04 GMT+08:00 俊杰陈 mailto:cjj > nj...@gmail.com>>: > Hi Devs > > I met a performance issue on big table join. The query takes more than 3 > hours on Impala and only 3 minutes on Spark SQL on the same 5 nodes > cluster. when running query, the left scanner and exchange node are very > slow. Did I miss some key arguments? > > you can see profile file in attachment. > > [cid:ii_j9998pph2_15f5b92f2cf47020] > > -- > Thanks & Best Regards > > > > -- > Thanks & Best Regards > > > -- > Regards, > Hongxu. > -- Thanks & Best Regards
Re: performance issue on big table join
Attaching the query profile will be most helpful to investigate this issue. If you can capture the profile from the WebUI on the coordinator node it would be great. On Wed, Nov 1, 2017 at 6:22 PM, 俊杰陈 wrote: > Thanks Hongxu, > > Here are configurations on my cluster, most of them are default values. > Which item do you think it may impact? > > ABORT_ON_DEFAULT_LIMIT_EXCEEDED: [0] > ABORT_ON_ERROR: [0] > ALLOW_UNSUPPORTED_FORMATS: [0] > APPX_COUNT_DISTINCT: [0] > BATCH_SIZE: [0] > COMPRESSION_CODEC: [NONE] > DEBUG_ACTION: [] > DEFAULT_ORDER_BY_LIMIT: [-1] > DISABLE_CACHED_READS: [0] > DISABLE_CODEGEN: [0] > DISABLE_OUTERMOST_TOPN: [0] > DISABLE_ROW_RUNTIME_FILTERING: [0] > DISABLE_STREAMING_PREAGGREGATIONS: [0] > DISABLE_UNSAFE_SPILLS: [0] > ENABLE_EXPR_REWRITES: [1] > EXEC_SINGLE_NODE_ROWS_THRESHOLD: [100] > EXPLAIN_LEVEL: [1] > HBASE_CACHE_BLOCKS: [0] > HBASE_CACHING: [0] > MAX_BLOCK_MGR_MEMORY: [0] > MAX_ERRORS: [100] > MAX_IO_BUFFERS: [0] > MAX_NUM_RUNTIME_FILTERS: [10] > MAX_SCAN_RANGE_LENGTH: [0] > MEM_LIMIT: [0] > MT_DOP: [0] > NUM_NODES: [0] > NUM_SCANNER_THREADS: [0] > OPTIMIZE_PARTITION_KEY_SCANS: [0] > PARQUET_ANNOTATE_STRINGS_UTF8: [0] > PARQUET_FALLBACK_SCHEMA_RESOLUTION: [0] > PARQUET_FILE_SIZE: [0] > PREFETCH_MODE: [1] > QUERY_TIMEOUT_S: [0] > REPLICA_PREFERENCE: [0] > REQUEST_POOL: [] > RESERVATION_REQUEST_TIMEOUT: [0] > RM_INITIAL_MEM: [0] > RUNTIME_BLOOM_FILTER_SIZE: [1048576] > RUNTIME_FILTER_MAX_SIZE: [16777216] > RUNTIME_FILTER_MIN_SIZE: [1048576] > RUNTIME_FILTER_MODE: [2] > RUNTIME_FILTER_WAIT_TIME_MS: [0] > S3_SKIP_INSERT_STAGING: [1] > SCAN_NODE_CODEGEN_THRESHOLD: [180] > SCHEDULE_RANDOM_REPLICA: [0] > SCRATCH_LIMIT: [-1] > SEQ_COMPRESSION_MODE: [0] > STRICT_MODE: [0] > SUPPORT_START_OVER: [false] > SYNC_DDL: [0] > V_CPU_CORES: [0] > > 2017-10-31 15:30 GMT+08:00 Hongxu Ma : > > > Hi JJ > > Consider it only takes 3mins on SparkSQL, maybe there are some mistakes > in > > query options. > > Try run "set;" in impala-shell and check all query options, e.g: > > BATCH_SIZE: [0] > > DISABLE_CODEGEN: [0] > > RUNTIME_FILTER_MODE: GLOBAL > > > > Just a guess, thanks. > > > > 在 27/10/2017 10:25, 俊杰陈 写道: > > The profile file is damaged. Here is a screenshot for exec summary > > [cid:ii_j999ymep1_15f5ba563aeabb91] > > > > > > 2017-10-27 10:04 GMT+08:00 俊杰陈 mailto:cjj > > nj...@gmail.com>>: > > Hi Devs > > > > I met a performance issue on big table join. The query takes more than 3 > > hours on Impala and only 3 minutes on Spark SQL on the same 5 nodes > > cluster. when running query, the left scanner and exchange node are very > > slow. Did I miss some key arguments? > > > > you can see profile file in attachment. > > > > [cid:ii_j9998pph2_15f5b92f2cf47020] > > > > -- > > Thanks & Best Regards > > > > > > > > -- > > Thanks & Best Regards > > > > > > -- > > Regards, > > Hongxu. > > > > > > -- > Thanks & Best Regards >
Re: performance issue on big table join
Hi Mostafa Cheng already put the profile in thread. Here is another profile for impala release version. you can also see the attachment. 2017-11-02 9:30 GMT+08:00 Mostafa Mokhtar : > Attaching the query profile will be most helpful to investigate this issue. > > If you can capture the profile from the WebUI on the coordinator node it > would be great. > > On Wed, Nov 1, 2017 at 6:22 PM, 俊杰陈 wrote: > > > Thanks Hongxu, > > > > Here are configurations on my cluster, most of them are default values. > > Which item do you think it may impact? > > > > ABORT_ON_DEFAULT_LIMIT_EXCEEDED: [0] > > ABORT_ON_ERROR: [0] > > ALLOW_UNSUPPORTED_FORMATS: [0] > > APPX_COUNT_DISTINCT: [0] > > BATCH_SIZE: [0] > > COMPRESSION_CODEC: [NONE] > > DEBUG_ACTION: [] > > DEFAULT_ORDER_BY_LIMIT: [-1] > > DISABLE_CACHED_READS: [0] > > DISABLE_CODEGEN: [0] > > DISABLE_OUTERMOST_TOPN: [0] > > DISABLE_ROW_RUNTIME_FILTERING: [0] > > DISABLE_STREAMING_PREAGGREGATIONS: [0] > > DISABLE_UNSAFE_SPILLS: [0] > > ENABLE_EXPR_REWRITES: [1] > > EXEC_SINGLE_NODE_ROWS_THRESHOLD: [100] > > EXPLAIN_LEVEL: [1] > > HBASE_CACHE_BLOCKS: [0] > > HBASE_CACHING: [0] > > MAX_BLOCK_MGR_MEMORY: [0] > > MAX_ERRORS: [100] > > MAX_IO_BUFFERS: [0] > > MAX_NUM_RUNTIME_FILTERS: [10] > > MAX_SCAN_RANGE_LENGTH: [0] > > MEM_LIMIT: [0] > > MT_DOP: [0] > > NUM_NODES: [0] > > NUM_SCANNER_THREADS: [0] > > OPTIMIZE_PARTITION_KEY_SCANS: [0] > > PARQUET_ANNOTATE_STRINGS_UTF8: [0] > > PARQUET_FALLBACK_SCHEMA_RESOLUTION: [0] > > PARQUET_FILE_SIZE: [0] > > PREFETCH_MODE: [1] > > QUERY_TIMEOUT_S: [0] > > REPLICA_PREFERENCE: [0] > > REQUEST_POOL: [] > > RESERVATION_REQUEST_TIMEOUT: [0] > > RM_INITIAL_MEM: [0] > > RUNTIME_BLOOM_FILTER_SIZE: [1048576] > > RUNTIME_FILTER_MAX_SIZE: [16777216] > > RUNTIME_FILTER_MIN_SIZE: [1048576] > > RUNTIME_FILTER_MODE: [2] > > RUNTIME_FILTER_WAIT_TIME_MS: [0] > > S3_SKIP_INSERT_STAGING: [1] > > SCAN_NODE_CODEGEN_THRESHOLD: [180] > > SCHEDULE_RANDOM_REPLICA: [0] > > SCRATCH_LIMIT: [-1] > > SEQ_COMPRESSION_MODE: [0] > > STRICT_MODE: [0] > > SUPPORT_START_OVER: [false] > > SYNC_DDL: [0] > > V_CPU_CORES: [0] > > > > 2017-10-31 15:30 GMT+08:00 Hongxu Ma : > > > > > Hi JJ > > > Consider it only takes 3mins on SparkSQL, maybe there are some mistakes > > in > > > query options. > > > Try run "set;" in impala-shell and check all query options, e.g: > > > BATCH_SIZE: [0] > > > DISABLE_CODEGEN: [0] > > > RUNTIME_FILTER_MODE: GLOBAL > > > > > > Just a guess, thanks. > > > > > > 在 27/10/2017 10:25, 俊杰陈 写道: > > > The profile file is damaged. Here is a screenshot for exec summary > > > [cid:ii_j999ymep1_15f5ba563aeabb91] > > > > > > > > > 2017-10-27 10:04 GMT+08:00 俊杰陈 mailto:cjj > > > nj...@gmail.com>>: > > > Hi Devs > > > > > > I met a performance issue on big table join. The query takes more than > 3 > > > hours on Impala and only 3 minutes on Spark SQL on the same 5 nodes > > > cluster. when running query, the left scanner and exchange node are > very > > > slow. Did I miss some key arguments? > > > > > > you can see profile file in attachment. > > > > > > [cid:ii_j9998pph2_15f5b92f2cf47020] > > > > > > -- > > > Thanks & Best Regards > > > > > > > > > > > > -- > > > Thanks & Best Regards > > > > > > > > > -- > > > Regards, > > > Hongxu. > > > > > > > > > > > -- > > Thanks & Best Regards > > > -- Thanks & Best Regards Query (id=db497c13276e70de:38c671cf): Summary: Session ID: ee4f844616a8170f:5cf00c4759ee93af Session Type: BEESWAX Start Time: 2017-11-01 15:57:09.150268000 End Time: 2017-11-01 18:56:53.614915000 Query Type: QUERY Query State: FINISHED Query Status: OK Impala Version: impalad version 2.8.0-SNAPSHOT RELEASE (build 4a61dbd636d052c97b829617bb39b970d834
Re: performance issue on big table join
+user list 2017-11-02 9:57 GMT+08:00 俊杰陈 : > Hi Mostafa > > Cheng already put the profile in thread. > > Here is another profile for impala release version. you can also see the > attachment. > > > 2017-11-02 9:30 GMT+08:00 Mostafa Mokhtar : > >> Attaching the query profile will be most helpful to investigate this >> issue. >> >> If you can capture the profile from the WebUI on the coordinator node it >> would be great. >> >> On Wed, Nov 1, 2017 at 6:22 PM, 俊杰陈 wrote: >> >> > Thanks Hongxu, >> > >> > Here are configurations on my cluster, most of them are default values. >> > Which item do you think it may impact? >> > >> > ABORT_ON_DEFAULT_LIMIT_EXCEEDED: [0] >> > ABORT_ON_ERROR: [0] >> > ALLOW_UNSUPPORTED_FORMATS: [0] >> > APPX_COUNT_DISTINCT: [0] >> > BATCH_SIZE: [0] >> > COMPRESSION_CODEC: [NONE] >> > DEBUG_ACTION: [] >> > DEFAULT_ORDER_BY_LIMIT: [-1] >> > DISABLE_CACHED_READS: [0] >> > DISABLE_CODEGEN: [0] >> > DISABLE_OUTERMOST_TOPN: [0] >> > DISABLE_ROW_RUNTIME_FILTERING: [0] >> > DISABLE_STREAMING_PREAGGREGATIONS: [0] >> > DISABLE_UNSAFE_SPILLS: [0] >> > ENABLE_EXPR_REWRITES: [1] >> > EXEC_SINGLE_NODE_ROWS_THRESHOLD: [100] >> > EXPLAIN_LEVEL: [1] >> > HBASE_CACHE_BLOCKS: [0] >> > HBASE_CACHING: [0] >> > MAX_BLOCK_MGR_MEMORY: [0] >> > MAX_ERRORS: [100] >> > MAX_IO_BUFFERS: [0] >> > MAX_NUM_RUNTIME_FILTERS: [10] >> > MAX_SCAN_RANGE_LENGTH: [0] >> > MEM_LIMIT: [0] >> > MT_DOP: [0] >> > NUM_NODES: [0] >> > NUM_SCANNER_THREADS: [0] >> > OPTIMIZE_PARTITION_KEY_SCANS: [0] >> > PARQUET_ANNOTATE_STRINGS_UTF8: [0] >> > PARQUET_FALLBACK_SCHEMA_RESOLUTION: [0] >> > PARQUET_FILE_SIZE: [0] >> > PREFETCH_MODE: [1] >> > QUERY_TIMEOUT_S: [0] >> > REPLICA_PREFERENCE: [0] >> > REQUEST_POOL: [] >> > RESERVATION_REQUEST_TIMEOUT: [0] >> > RM_INITIAL_MEM: [0] >> > RUNTIME_BLOOM_FILTER_SIZE: [1048576] >> > RUNTIME_FILTER_MAX_SIZE: [16777216] >> > RUNTIME_FILTER_MIN_SIZE: [1048576] >> > RUNTIME_FILTER_MODE: [2] >> > RUNTIME_FILTER_WAIT_TIME_MS: [0] >> > S3_SKIP_INSERT_STAGING: [1] >> > SCAN_NODE_CODEGEN_THRESHOLD: [180] >> > SCHEDULE_RANDOM_REPLICA: [0] >> > SCRATCH_LIMIT: [-1] >> > SEQ_COMPRESSION_MODE: [0] >> > STRICT_MODE: [0] >> > SUPPORT_START_OVER: [false] >> > SYNC_DDL: [0] >> > V_CPU_CORES: [0] >> > >> > 2017-10-31 15:30 GMT+08:00 Hongxu Ma : >> > >> > > Hi JJ >> > > Consider it only takes 3mins on SparkSQL, maybe there are some >> mistakes >> > in >> > > query options. >> > > Try run "set;" in impala-shell and check all query options, e.g: >> > > BATCH_SIZE: [0] >> > > DISABLE_CODEGEN: [0] >> > > RUNTIME_FILTER_MODE: GLOBAL >> > > >> > > Just a guess, thanks. >> > > >> > > 在 27/10/2017 10:25, 俊杰陈 写道: >> > > The profile file is damaged. Here is a screenshot for exec summary >> > > [cid:ii_j999ymep1_15f5ba563aeabb91] >> > > >> > > >> > > 2017-10-27 10:04 GMT+08:00 俊杰陈 mailto:cjj >> > > nj...@gmail.com>>: >> > > Hi Devs >> > > >> > > I met a performance issue on big table join. The query takes more >> than 3 >> > > hours on Impala and only 3 minutes on Spark SQL on the same 5 nodes >> > > cluster. when running query, the left scanner and exchange node are >> very >> > > slow. Did I miss some key arguments? >> > > >> > > you can see profile file in attachment. >> > > >> > > [cid:ii_j9998pph2_15f5b92f2cf47020] >> > > >> > > -- >> > > Thanks & Best Regards >> > > >> > > >> > > >> > > -- >> > > Thanks & Best Regards >> > > >> > > >> > > -- >> > > Regards, >> > > Hongxu. >> > > >> > >> > >> > >> > -- >> > Thanks & Best Regards >> > >> > > > > -- > Thanks & Best Regards > -- Thanks & Best Regards
RE: performance issue on big table join
Thanks Junjie for the profile file. Another observation is that Impala has a much lower CPU utilization than Spark SQL (Impala is on the left and Spark SQL is on the right). [cid:image001.jpg@01D353C5.E69BA8A0] I think the major reason here might be that Impala is single thread based execution for non-scanner operator (join, aggregation) in single node [1]. Do you have any suggestion that we can improve its performance especially improving its CPU utilization? Please correct me if I miss anything here. [1] https://issues.apache.org/jira/browse/IMPALA-3902 Best Regards Ferdinand Xu From: 俊杰陈 [mailto:cjjnj...@gmail.com] Sent: Thursday, November 2, 2017 10:00 AM To: dev@impala ; u...@impala.incubator.apache.org Subject: Re: performance issue on big table join +user list 2017-11-02 9:57 GMT+08:00 俊杰陈 mailto:cjjnj...@gmail.com>>: Hi Mostafa Cheng already put the profile in thread. Here is another profile for impala release version. you can also see the attachment. 2017-11-02 9:30 GMT+08:00 Mostafa Mokhtar mailto:mmokh...@cloudera.com>>: Attaching the query profile will be most helpful to investigate this issue. If you can capture the profile from the WebUI on the coordinator node it would be great. On Wed, Nov 1, 2017 at 6:22 PM, 俊杰陈 mailto:cjjnj...@gmail.com>> wrote: > Thanks Hongxu, > > Here are configurations on my cluster, most of them are default values. > Which item do you think it may impact? > > ABORT_ON_DEFAULT_LIMIT_EXCEEDED: [0] > ABORT_ON_ERROR: [0] > ALLOW_UNSUPPORTED_FORMATS: [0] > APPX_COUNT_DISTINCT: [0] > BATCH_SIZE: [0] > COMPRESSION_CODEC: [NONE] > DEBUG_ACTION: [] > DEFAULT_ORDER_BY_LIMIT: [-1] > DISABLE_CACHED_READS: [0] > DISABLE_CODEGEN: [0] > DISABLE_OUTERMOST_TOPN: [0] > DISABLE_ROW_RUNTIME_FILTERING: [0] > DISABLE_STREAMING_PREAGGREGATIONS: [0] > DISABLE_UNSAFE_SPILLS: [0] > ENABLE_EXPR_REWRITES: [1] > EXEC_SINGLE_NODE_ROWS_THRESHOLD: [100] > EXPLAIN_LEVEL: [1] > HBASE_CACHE_BLOCKS: [0] > HBASE_CACHING: [0] > MAX_BLOCK_MGR_MEMORY: [0] > MAX_ERRORS: [100] > MAX_IO_BUFFERS: [0] > MAX_NUM_RUNTIME_FILTERS: [10] > MAX_SCAN_RANGE_LENGTH: [0] > MEM_LIMIT: [0] > MT_DOP: [0] > NUM_NODES: [0] > NUM_SCANNER_THREADS: [0] > OPTIMIZE_PARTITION_KEY_SCANS: [0] > PARQUET_ANNOTATE_STRINGS_UTF8: [0] > PARQUET_FALLBACK_SCHEMA_RESOLUTION: [0] > PARQUET_FILE_SIZE: [0] > PREFETCH_MODE: [1] > QUERY_TIMEOUT_S: [0] > REPLICA_PREFERENCE: [0] > REQUEST_POOL: [] > RESERVATION_REQUEST_TIMEOUT: [0] > RM_INITIAL_MEM: [0] > RUNTIME_BLOOM_FILTER_SIZE: [1048576] > RUNTIME_FILTER_MAX_SIZE: [16777216] > RUNTIME_FILTER_MIN_SIZE: [1048576] > RUNTIME_FILTER_MODE: [2] > RUNTIME_FILTER_WAIT_TIME_MS: [0] > S3_SKIP_INSERT_STAGING: [1] > SCAN_NODE_CODEGEN_THRESHOLD: [180] > SCHEDULE_RANDOM_REPLICA: [0] > SCRATCH_LIMIT: [-1] > SEQ_COMPRESSION_MODE: [0] > STRICT_MODE: [0] > SUPPORT_START_OVER: [false] > SYNC_DDL: [0] > V_CPU_CORES: [0] > > 2017-10-31 15:30 GMT+08:00 Hongxu Ma > mailto:inte...@outlook.com>>: > > > Hi JJ > > Consider it only takes 3mins on SparkSQL, maybe there are some mistakes > in > > query options. > > Try run "set;" in impala-shell and check all query options, e.g: > > BATCH_SIZE: [0] > > DISABLE_CODEGEN: [0] > > RUNTIME_FILTER_MODE: GLOBAL > > > > Just a guess, thanks. > > > > 在 27/10/2017 10:25, 俊杰陈 写道: > > The profile file is damaged. Here is a screenshot for exec summary > > [cid:ii_j999ymep1_15f5ba563aeabb91] > > > > > > 2017-10-27 10:04 GMT+08:00 俊杰陈 > > mailto:cjjnj...@gmail.com><mailto:cjj<mailto:cjj> > > nj...@gmail.com<mailto:nj...@gmail.com>>>: > > Hi Devs > > > > I met a performance issue on big table join. The query takes more than 3 > > hours on Impala and only 3 minutes on Spark SQL on the same 5 nodes > > cluster. when running query, the left scanner and exchange node are very > > slow. Did I miss some key arguments? > > > > you can see profile file in attachment. > > > > [cid:ii_j9998pph2_15f5b92f2cf47020] > > > > -- > > Thanks & Best Regards > > > > > > > > -- > > Thanks & Best Regards > > > > > > -- > > Regards, > > Hongxu. > > > > > > -- > Thanks & Best Regards > -- Thanks & Best Regards -- Thanks & Best Regards
Re: performance issue on big table join
Thanks LL. Your query options look good. As Xu Cheng mentioned, I also noticed that Impala do hash join slowly in some big data situations. Very curious to the root cause. 在 02/11/2017 10:00, 俊杰陈 写道: +user list 2017-11-02 9:57 GMT+08:00 俊杰陈 <mailto:cjjnj...@gmail.com>: Hi Mostafa Cheng already put the profile in thread. Here is another profile for impala release version. you can also see the attachment. 2017-11-02 9:30 GMT+08:00 Mostafa Mokhtar <mailto:mmokh...@cloudera.com>: Attaching the query profile will be most helpful to investigate this issue. If you can capture the profile from the WebUI on the coordinator node it would be great. On Wed, Nov 1, 2017 at 6:22 PM, 俊杰陈 <mailto:cjjnj...@gmail.com> wrote: Thanks Hongxu, Here are configurations on my cluster, most of them are default values. Which item do you think it may impact? ABORT_ON_DEFAULT_LIMIT_EXCEEDED: [0] ABORT_ON_ERROR: [0] ALLOW_UNSUPPORTED_FORMATS: [0] APPX_COUNT_DISTINCT: [0] BATCH_SIZE: [0] COMPRESSION_CODEC: [NONE] DEBUG_ACTION: [] DEFAULT_ORDER_BY_LIMIT: [-1] DISABLE_CACHED_READS: [0] DISABLE_CODEGEN: [0] DISABLE_OUTERMOST_TOPN: [0] DISABLE_ROW_RUNTIME_FILTERING: [0] DISABLE_STREAMING_PREAGGREGATIONS: [0] DISABLE_UNSAFE_SPILLS: [0] ENABLE_EXPR_REWRITES: [1] EXEC_SINGLE_NODE_ROWS_THRESHOLD: [100] EXPLAIN_LEVEL: [1] HBASE_CACHE_BLOCKS: [0] HBASE_CACHING: [0] MAX_BLOCK_MGR_MEMORY: [0] MAX_ERRORS: [100] MAX_IO_BUFFERS: [0] MAX_NUM_RUNTIME_FILTERS: [10] MAX_SCAN_RANGE_LENGTH: [0] MEM_LIMIT: [0] MT_DOP: [0] NUM_NODES: [0] NUM_SCANNER_THREADS: [0] OPTIMIZE_PARTITION_KEY_SCANS: [0] PARQUET_ANNOTATE_STRINGS_UTF8: [0] PARQUET_FALLBACK_SCHEMA_RESOLUTION: [0] PARQUET_FILE_SIZE: [0] PREFETCH_MODE: [1] QUERY_TIMEOUT_S: [0] REPLICA_PREFERENCE: [0] REQUEST_POOL: [] RESERVATION_REQUEST_TIMEOUT: [0] RM_INITIAL_MEM: [0] RUNTIME_BLOOM_FILTER_SIZE: [1048576] RUNTIME_FILTER_MAX_SIZE: [16777216] RUNTIME_FILTER_MIN_SIZE: [1048576] RUNTIME_FILTER_MODE: [2] RUNTIME_FILTER_WAIT_TIME_MS: [0] S3_SKIP_INSERT_STAGING: [1] SCAN_NODE_CODEGEN_THRESHOLD: [180] SCHEDULE_RANDOM_REPLICA: [0] SCRATCH_LIMIT: [-1] SEQ_COMPRESSION_MODE: [0] STRICT_MODE: [0] SUPPORT_START_OVER: [false] SYNC_DDL: [0] V_CPU_CORES: [0] 2017-10-31 15:30 GMT+08:00 Hongxu Ma <mailto:inte...@outlook.com>: Hi JJ Consider it only takes 3mins on SparkSQL, maybe there are some mistakes in query options. Try run "set;" in impala-shell and check all query options, e.g: BATCH_SIZE: [0] DISABLE_CODEGEN: [0] RUNTIME_FILTER_MODE: GLOBAL Just a guess, thanks. 在 27/10/2017 10:25, 俊杰陈 写道: The profile file is damaged. Here is a screenshot for exec summary [cid:ii_j999ymep1_15f5ba563aeabb91] 2017-10-27 10:04 GMT+08:00 俊杰陈 mailto:cjjnj...@gmail.com><mailto:cjj nj...@gmail.com><mailto:cjjnj...@gmail.com>>: Hi Devs I met a performance issue on big table join. The query takes more than 3 hours on Impala and only 3 minutes on Spark SQL on the same 5 nodes cluster. when running query, the left scanner and exchange node are very slow. Did I miss some key arguments? you can see profile file in attachment. [cid:ii_j9998pph2_15f5b92f2cf47020] -- Thanks & Best Regards -- Thanks & Best Regards -- Regards, Hongxu. -- Thanks & Best Regards -- Thanks & Best Regards -- Regards, Hongxu.
Re: performance issue on big table join
See my response on the other thread you started. The probe side of joins are are executed in a single thread per host. Impala can run multiple builds in parallel - but each build uses only a single thread. A single query might not be able to max out your CPU, but most realistic workloads run several queries concurrently. On Thu, Nov 2, 2017 at 12:22 AM, Hongxu Ma wrote: > Thanks LL. Your query options look good. > > As Xu Cheng mentioned, I also noticed that Impala do hash join slowly in > some big data situations. > Very curious to the root cause. > > > 在 02/11/2017 10:00, 俊杰陈 写道: > > +user list > > 2017-11-02 9:57 GMT+08:00 俊杰陈 : > > > Hi Mostafa > > Cheng already put the profile in thread. > > Here is another profile for impala release version. you can also see the > attachment. > > > 2017-11-02 9:30 GMT+08:00 Mostafa Mokhtar > : > > > Attaching the query profile will be most helpful to investigate this > issue. > > If you can capture the profile from the WebUI on the coordinator node it > would be great. > > On Wed, Nov 1, 2017 at 6:22 PM, 俊杰陈 > wrote: > > > Thanks Hongxu, > > Here are configurations on my cluster, most of them are default values. > Which item do you think it may impact? > > ABORT_ON_DEFAULT_LIMIT_EXCEEDED: [0] > ABORT_ON_ERROR: [0] > ALLOW_UNSUPPORTED_FORMATS: [0] > APPX_COUNT_DISTINCT: [0] > BATCH_SIZE: [0] > COMPRESSION_CODEC: [NONE] > DEBUG_ACTION: [] > DEFAULT_ORDER_BY_LIMIT: [-1] > DISABLE_CACHED_READS: [0] > DISABLE_CODEGEN: [0] > DISABLE_OUTERMOST_TOPN: [0] > DISABLE_ROW_RUNTIME_FILTERING: [0] > DISABLE_STREAMING_PREAGGREGATIONS: [0] > DISABLE_UNSAFE_SPILLS: [0] > ENABLE_EXPR_REWRITES: [1] > EXEC_SINGLE_NODE_ROWS_THRESHOLD: [100] > EXPLAIN_LEVEL: [1] > HBASE_CACHE_BLOCKS: [0] > HBASE_CACHING: [0] > MAX_BLOCK_MGR_MEMORY: [0] > MAX_ERRORS: [100] > MAX_IO_BUFFERS: [0] > MAX_NUM_RUNTIME_FILTERS: [10] > MAX_SCAN_RANGE_LENGTH: [0] > MEM_LIMIT: [0] > MT_DOP: [0] > NUM_NODES: [0] > NUM_SCANNER_THREADS: [0] > OPTIMIZE_PARTITION_KEY_SCANS: [0] > PARQUET_ANNOTATE_STRINGS_UTF8: [0] > PARQUET_FALLBACK_SCHEMA_RESOLUTION: [0] > PARQUET_FILE_SIZE: [0] > PREFETCH_MODE: [1] > QUERY_TIMEOUT_S: [0] > REPLICA_PREFERENCE: [0] > REQUEST_POOL: [] > RESERVATION_REQUEST_TIMEOUT: [0] > RM_INITIAL_MEM: [0] > RUNTIME_BLOOM_FILTER_SIZE: [1048576] > RUNTIME_FILTER_MAX_SIZE: [16777216] > RUNTIME_FILTER_MIN_SIZE: [1048576] > RUNTIME_FILTER_MODE: [2] > RUNTIME_FILTER_WAIT_TIME_MS: [0] > S3_SKIP_INSERT_STAGING: [1] > SCAN_NODE_CODEGEN_THRESHOLD: [180] > SCHEDULE_RANDOM_REPLICA: [0] > SCRATCH_LIMIT: [-1] > SEQ_COMPRESSION_MODE: [0] > STRICT_MODE: [0] > SUPPORT_START_OVER: [false] > SYNC_DDL: [0] > V_CPU_CORES: [0] > > 2017-10-31 15:30 GMT+08:00 Hongxu Ma > : > > > Hi JJ > Consider it only takes 3mins on SparkSQL, maybe there are some > > mistakes > > in > > query options. > Try run "set;" in impala-shell and check all query options, e.g: > BATCH_SIZE: [0] > DISABLE_CODEGEN: [0] > RUNTIME_FILTER_MODE: GLOBAL > > Just a guess, thanks. > > 在 27/10/2017 10:25, 俊杰陈 写道: > The profile file is damaged. Here is a screenshot for exec summary > [cid:ii_j999ymep1_15f5ba563aeabb91] > > > 2017-10-27 10:04 GMT+08:00 俊杰陈 mailto:cjj > nj...@gmail.com> >: > Hi Devs > > I met a performance issue on big table join. The query takes more > > than 3 > > hours on Impala and only 3 minutes on Spark SQL on the same 5 nodes > cluster. when running query, the left scanner and exchange node are > > very > > slow. Did I miss some key arguments? > > you can see profile file in attachment. > > [cid:ii_j9998pph2_15f5b92f2cf47020] > > -- > Thanks & Best Regards > > > > -- > Thanks & Best Regards > > > -- > Regards, > Hongxu. > > > > > -- > Thanks & Best Regards > > > > > -- > Thanks & Best Regards > > > > > -- > Regards, > Hongxu. > >
Re: performance issue on big table join
Thanks Alex to reply again. Do we have plan to support multi-thread join/aggregation? Or it is intented to be single thread to maximum query throughput? 2017-11-03 0:32 GMT+08:00 Alexander Behm : > See my response on the other thread you started. The probe side of joins > are are executed in a single thread per host. Impala can run multiple > builds in parallel - but each build uses only a single thread. > A single query might not be able to max out your CPU, but most realistic > workloads run several queries concurrently. > > On Thu, Nov 2, 2017 at 12:22 AM, Hongxu Ma wrote: > > > Thanks LL. Your query options look good. > > > > As Xu Cheng mentioned, I also noticed that Impala do hash join slowly in > > some big data situations. > > Very curious to the root cause. > > > > > > 在 02/11/2017 10:00, 俊杰陈 写道: > > > > +user list > > > > 2017-11-02 9:57 GMT+08:00 俊杰陈 : > > > > > > Hi Mostafa > > > > Cheng already put the profile in thread. > > > > Here is another profile for impala release version. you can also see the > > attachment. > > > > > > 2017-11-02 9:30 GMT+08:00 Mostafa Mokhtar < > mmokh...@cloudera.com>: > > > > > > Attaching the query profile will be most helpful to investigate this > > issue. > > > > If you can capture the profile from the WebUI on the coordinator node it > > would be great. > > > > On Wed, Nov 1, 2017 at 6:22 PM, 俊杰陈 < > cjjnj...@gmail.com> wrote: > > > > > > Thanks Hongxu, > > > > Here are configurations on my cluster, most of them are default values. > > Which item do you think it may impact? > > > > ABORT_ON_DEFAULT_LIMIT_EXCEEDED: [0] > > ABORT_ON_ERROR: [0] > > ALLOW_UNSUPPORTED_FORMATS: [0] > > APPX_COUNT_DISTINCT: [0] > > BATCH_SIZE: [0] > > COMPRESSION_CODEC: [NONE] > > DEBUG_ACTION: [] > > DEFAULT_ORDER_BY_LIMIT: [-1] > > DISABLE_CACHED_READS: [0] > > DISABLE_CODEGEN: [0] > > DISABLE_OUTERMOST_TOPN: [0] > > DISABLE_ROW_RUNTIME_FILTERING: [0] > > DISABLE_STREAMING_PREAGGREGATIONS: [0] > > DISABLE_UNSAFE_SPILLS: [0] > > ENABLE_EXPR_REWRITES: [1] > > EXEC_SINGLE_NODE_ROWS_THRESHOLD: [100] > > EXPLAIN_LEVEL: [1] > > HBASE_CACHE_BLOCKS: [0] > > HBASE_CACHING: [0] > > MAX_BLOCK_MGR_MEMORY: [0] > > MAX_ERRORS: [100] > > MAX_IO_BUFFERS: [0] > > MAX_NUM_RUNTIME_FILTERS: [10] > > MAX_SCAN_RANGE_LENGTH: [0] > > MEM_LIMIT: [0] > > MT_DOP: [0] > > NUM_NODES: [0] > > NUM_SCANNER_THREADS: [0] > > OPTIMIZE_PARTITION_KEY_SCANS: [0] > > PARQUET_ANNOTATE_STRINGS_UTF8: [0] > > PARQUET_FALLBACK_SCHEMA_RESOLUTION: [0] > > PARQUET_FILE_SIZE: [0] > > PREFETCH_MODE: [1] > > QUERY_TIMEOUT_S: [0] > > REPLICA_PREFERENCE: [0] > > REQUEST_POOL: [] > > RESERVATION_REQUEST_TIMEOUT: [0] > > RM_INITIAL_MEM: [0] > > RUNTIME_BLOOM_FILTER_SIZE: [1048576] > > RUNTIME_FILTER_MAX_SIZE: [16777216] > > RUNTIME_FILTER_MIN_SIZE: [1048576] > > RUNTIME_FILTER_MODE: [2] > > RUNTIME_FILTER_WAIT_TIME_MS: [0] > > S3_SKIP_INSERT_STAGING: [1] > > SCAN_NODE_CODEGEN_THRESHOLD: [180] > > SCHEDULE_RANDOM_REPLICA: [0] > > SCRATCH_LIMIT: [-1] > > SEQ_COMPRESSION_MODE: [0] > > STRICT_MODE: [0] > > SUPPORT_START_OVER: [false] > > SYNC_DDL: [0] > > V_CPU_CORES: [0] > > > > 2017-10-31 15:30 GMT+08:00 Hongxu Ma < > inte...@outlook.com>: > > > > > > Hi JJ > > Consider it only takes 3mins on SparkSQL, maybe there are some > > > > mistakes > > > > in > > > > query options. > > Try run "set;" in impala-shell and check all query options, e.g: > > BATCH_SIZE: [0] > > DISABLE_CODEGEN: [0] > > RUNTIME_FILTER_MODE: GLOBAL > > > > Just a guess, thanks. > > > > 在 27/10/2017 10:25, 俊杰陈 写道: > > The profile file is damaged. Here is a screenshot for exec summary > > [cid:ii_j999ymep1_15f5ba563aeabb91] > > > > > > 2017-10-27 10:04 GMT+08:00 俊杰陈 mailto:cjj > > nj...@gmail.com> >: > > Hi Devs > > > > I met a performance issue on big table join. The query takes more > > > > than 3 > > > > hours on Impala and only 3 minutes on Spark SQL on the same 5 nodes > > cluster. when running query, the left scanner and exchange node are > > > > very > > > > slow. Did I miss some key arguments? > > > > you can see profile file in attachment. > > > > [cid:ii_j9998pph2_15f5b92f2cf47020] > > > > -- > > Thanks & Best Regards > > > > > > > > -- > > Thanks & Best Regards > > > > > > -- > > Regards, > > Hongxu. > > > > > > > > > > -- > > Thanks & Best Regards > > > > > > > > > > -- > > Thanks & Best Regards > > > > > > > > > > -- > > Regards, > > Hongxu. > > > > > -- Thanks & Best Regards
Re: performance issue on big table join
You are welcome. You can track IMPALA-3902 to see our progress on supporting fully multi-threaded execution. Support for multi-threaded aggregations is already available. Certain other queries will also work in multi-threaded mode. The big limitation is that distributed joins and unions do not yet work (local joins for nested types are ok). We even enable multi-threading by default for some operations like COMPUTE STATS on Parquet. You can play around with multi-threaded execution using the MT_DOP query option. On Thu, Nov 2, 2017 at 6:17 PM, 俊杰陈 wrote: > Thanks Alex to reply again. > > Do we have plan to support multi-thread join/aggregation? Or it is > intented to be single thread to maximum query throughput? > > > > 2017-11-03 0:32 GMT+08:00 Alexander Behm : > >> See my response on the other thread you started. The probe side of joins >> are are executed in a single thread per host. Impala can run multiple >> builds in parallel - but each build uses only a single thread. >> A single query might not be able to max out your CPU, but most realistic >> workloads run several queries concurrently. >> >> On Thu, Nov 2, 2017 at 12:22 AM, Hongxu Ma wrote: >> >> > Thanks LL. Your query options look good. >> > >> > As Xu Cheng mentioned, I also noticed that Impala do hash join slowly in >> > some big data situations. >> > Very curious to the root cause. >> > >> > >> > 在 02/11/2017 10:00, 俊杰陈 写道: >> > >> > +user list >> > >> > 2017-11-02 9:57 GMT+08:00 俊杰陈 > >: >> > >> > >> > Hi Mostafa >> > >> > Cheng already put the profile in thread. >> > >> > Here is another profile for impala release version. you can also see the >> > attachment. >> > >> > >> > 2017-11-02 9:30 GMT+08:00 Mostafa Mokhtar < >> mmokh...@cloudera.com>: >> > >> > >> > Attaching the query profile will be most helpful to investigate this >> > issue. >> > >> > If you can capture the profile from the WebUI on the coordinator node it >> > would be great. >> > >> > On Wed, Nov 1, 2017 at 6:22 PM, 俊杰陈 < >> cjjnj...@gmail.com> wrote: >> > >> > >> > Thanks Hongxu, >> > >> > Here are configurations on my cluster, most of them are default values. >> > Which item do you think it may impact? >> > >> > ABORT_ON_DEFAULT_LIMIT_EXCEEDED: [0] >> > ABORT_ON_ERROR: [0] >> > ALLOW_UNSUPPORTED_FORMATS: [0] >> > APPX_COUNT_DISTINCT: [0] >> > BATCH_SIZE: [0] >> > COMPRESSION_CODEC: [NONE] >> > DEBUG_ACTION: [] >> > DEFAULT_ORDER_BY_LIMIT: [-1] >> > DISABLE_CACHED_READS: [0] >> > DISABLE_CODEGEN: [0] >> > DISABLE_OUTERMOST_TOPN: [0] >> > DISABLE_ROW_RUNTIME_FILTERING: [0] >> > DISABLE_STREAMING_PREAGGREGATIONS: [0] >> > DISABLE_UNSAFE_SPILLS: [0] >> > ENABLE_EXPR_REWRITES: [1] >> > EXEC_SINGLE_NODE_ROWS_THRESHOLD: [100] >> > EXPLAIN_LEVEL: [1] >> > HBASE_CACHE_BLOCKS: [0] >> > HBASE_CACHING: [0] >> > MAX_BLOCK_MGR_MEMORY: [0] >> > MAX_ERRORS: [100] >> > MAX_IO_BUFFERS: [0] >> > MAX_NUM_RUNTIME_FILTERS: [10] >> > MAX_SCAN_RANGE_LENGTH: [0] >> > MEM_LIMIT: [0] >> > MT_DOP: [0] >> > NUM_NODES: [0] >> > NUM_SCANNER_THREADS: [0] >> > OPTIMIZE_PARTITION_KEY_SCANS: [0] >> > PARQUET_ANNOTATE_STRINGS_UTF8: [0] >> > PARQUET_FALLBACK_SCHEMA_RESOLUTION: [0] >> > PARQUET_FILE_SIZE: [0] >> > PREFETCH_MODE: [1] >> > QUERY_TIMEOUT_S: [0] >> > REPLICA_PREFERENCE: [0] >> > REQUEST_POOL: [] >> > RESERVATION_REQUEST_TIMEOUT: [0] >> > RM_INITIAL_MEM: [0] >> > RUNTIME_BLOOM_FILTER_SIZE: [1048576] >> > RUNTIME_FILTER_MAX_SIZE: [16777216] >> > RUNTIME_FILTER_MIN_SIZE: [1048576] >> > RUNTIME_FILTER_MODE: [2] >> > RUNTIME_FILTER_WAIT_TIME_MS: [0] >> > S3_SKIP_INSERT_STAGING: [1] >> > SCAN_NODE_CODEGEN_THRESHOLD: [180] >> > SCHEDULE_RANDOM_REPLICA: [0] >> > SCRATCH_LIMIT: [-1] >> > SEQ_COMPRESSION_MODE: [0] >> > STRICT