performance issue on big table join

2017-10-26 Thread 俊杰陈
Hi Devs

I met a performance issue on a big table join. The query takes more than 3
hours on Impala but only 3 minutes on Spark SQL on the same 5-node
cluster. While the query is running, the left-side scan and exchange nodes
are very slow. Did I miss some key settings?

You can find the profile file in the attachment.


-- 
Thanks & Best Regards

Re: performance issue on big table join

2017-10-26 Thread 俊杰陈
The profile file is damaged. Here is a screenshot of the exec summary.
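
For reference, the same exec summary can also be printed as plain text in
impala-shell right after the query finishes (a minimal sketch, assuming a
reasonably recent impala-shell):

    summary;   -- per-operator times, row counts and peak memory of the last query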


2017-10-27 10:04 GMT+08:00 俊杰陈 :

> Hi Devs
>
> I met a performance issue on big table join. The query takes more than 3
> hours on Impala and only 3 minutes on Spark SQL on the same 5 nodes
> cluster. when running query,  the left scanner and exchange node are very
> slow.  Did I miss some key arguments?
>
> you can see profile file in attachment.
>
>
> ​
> --
> Thanks & Best Regards
>



-- 
Thanks & Best Regards


Re: performance issue on big table join

2017-10-26 Thread Mostafa Mokhtar
Hi,

Looks like you are joining store_sales with catalog_sales on item_sk. This
kind of join condition is many-to-many, which means the number of output
rows will be much larger than the number of input rows; not sure if this is
intended.
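
A rough way to gauge that fan-out per join key (a sketch only, assuming the
standard TPC-DS column names ss_item_sk / cs_item_sk):

    select ss.ss_item_sk,
           ss.ss_rows * cs.cs_rows as joined_rows_for_this_key
    from   (select ss_item_sk, count(*) as ss_rows
            from store_sales group by ss_item_sk) ss
    join   (select cs_item_sk, count(*) as cs_rows
            from catalog_sales group by cs_item_sk) cs
      on ss.ss_item_sk = cs.cs_item_sk
    order by joined_rows_for_this_key desc
    limit 10;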

Also, did you run "compute stats [TABLE_NAME]" on both tables?
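
For example (illustrative only; substitute your own table names):

    compute stats store_sales;
    compute stats catalog_sales;
    -- verify that row counts and per-column stats are now populated
    show table stats store_sales;
    show column stats store_sales;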

For a more comprehensive query, try TPC-DS Q17:

select  i_item_id
       ,i_item_desc
       ,s_state
       ,count(ss_quantity) as store_sales_quantitycount
       ,avg(ss_quantity) as store_sales_quantityave
       ,stddev_samp(ss_quantity) as store_sales_quantitystdev
       ,stddev_samp(ss_quantity)/avg(ss_quantity) as store_sales_quantitycov
       ,count(sr_return_quantity) as store_returns_quantitycount
       ,avg(sr_return_quantity) as store_returns_quantityave
       ,stddev_samp(sr_return_quantity) as store_returns_quantitystdev
       ,stddev_samp(sr_return_quantity)/avg(sr_return_quantity) as store_returns_quantitycov
       ,count(cs_quantity) as catalog_sales_quantitycount
       ,avg(cs_quantity) as catalog_sales_quantityave
       ,stddev_samp(cs_quantity) as catalog_sales_quantitystdev
       ,stddev_samp(cs_quantity)/avg(cs_quantity) as catalog_sales_quantitycov
 from store_sales
     ,store_returns
     ,catalog_sales
     ,date_dim d1
     ,date_dim d2
     ,date_dim d3
     ,store
     ,item
 where d1.d_quarter_name = '2000Q1'
   and d1.d_date_sk = ss_sold_date_sk
   and i_item_sk = ss_item_sk
   and s_store_sk = ss_store_sk
   and ss_customer_sk = sr_customer_sk
   and ss_item_sk = sr_item_sk
   and ss_ticket_number = sr_ticket_number
   and sr_returned_date_sk = d2.d_date_sk
   and d2.d_quarter_name in ('2000Q1','2000Q2','2000Q3')
   and sr_customer_sk = cs_bill_customer_sk
   and sr_item_sk = cs_item_sk
   and cs_sold_date_sk = d3.d_date_sk
   and d3.d_quarter_name in ('2000Q1','2000Q2','2000Q3')
 group by i_item_id
         ,i_item_desc
         ,s_state
 order by i_item_id
         ,i_item_desc
         ,s_state
 limit 100;


I recommend moving this kind of discussion to
u...@impala.incubator.apache.org.

On Thu, Oct 26, 2017 at 7:25 PM, 俊杰陈  wrote:

> The profile file is damaged. Here is a screenshot for exec summary
>
> ​
>
> 2017-10-27 10:04 GMT+08:00 俊杰陈 :
>
>> Hi Devs
>>
>> I met a performance issue on big table join. The query takes more than 3
>> hours on Impala and only 3 minutes on Spark SQL on the same 5 nodes
>> cluster. when running query,  the left scanner and exchange node are very
>> slow.  Did I miss some key arguments?
>>
>> you can see profile file in attachment.
>>
>>
>> ​
>> --
>> Thanks & Best Regards
>>
>
>
>
> --
> Thanks & Best Regards
>


Re: performance issue on big table join

2017-10-31 Thread Hongxu Ma
Hi JJ
Considering it only takes 3 minutes on Spark SQL, maybe there is a mistake in
the query options.
Try running "set;" in impala-shell and check all query options, e.g.:
BATCH_SIZE: [0]
DISABLE_CODEGEN: [0]
RUNTIME_FILTER_MODE: GLOBAL
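
A minimal impala-shell session to inspect and adjust them (illustrative only;
these are standard Impala query options):

    set;                             -- list every query option and its current value
    set runtime_filter_mode=GLOBAL;  -- make sure runtime filters are shipped across nodes
    set explain_level=2;             -- more detail in explain output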

Just a guess, thanks.

On 27/10/2017 10:25, 俊杰陈 wrote:
The profile file is damaged. Here is a screenshot for exec summary

2017-10-27 10:04 GMT+08:00 俊杰陈 <cjjnj...@gmail.com>:
Hi Devs

I met a performance issue on big table join. The query takes more than 3 hours 
on Impala and only 3 minutes on Spark SQL on the same 5 nodes cluster. when 
running query,  the left scanner and exchange node are very slow.  Did I miss 
some key arguments?

you can see profile file in attachment.

--
Thanks & Best Regards



--
Thanks & Best Regards


--
Regards,
Hongxu.


Re: performance issue on big table join

2017-11-01 Thread 俊杰陈
Thanks Hongxu,

Here are the configurations on my cluster; most of them are default values.
Which item do you think may have an impact?

ABORT_ON_DEFAULT_LIMIT_EXCEEDED: [0]
ABORT_ON_ERROR: [0]
ALLOW_UNSUPPORTED_FORMATS: [0]
APPX_COUNT_DISTINCT: [0]
BATCH_SIZE: [0]
COMPRESSION_CODEC: [NONE]
DEBUG_ACTION: []
DEFAULT_ORDER_BY_LIMIT: [-1]
DISABLE_CACHED_READS: [0]
DISABLE_CODEGEN: [0]
DISABLE_OUTERMOST_TOPN: [0]
DISABLE_ROW_RUNTIME_FILTERING: [0]
DISABLE_STREAMING_PREAGGREGATIONS: [0]
DISABLE_UNSAFE_SPILLS: [0]
ENABLE_EXPR_REWRITES: [1]
EXEC_SINGLE_NODE_ROWS_THRESHOLD: [100]
EXPLAIN_LEVEL: [1]
HBASE_CACHE_BLOCKS: [0]
HBASE_CACHING: [0]
MAX_BLOCK_MGR_MEMORY: [0]
MAX_ERRORS: [100]
MAX_IO_BUFFERS: [0]
MAX_NUM_RUNTIME_FILTERS: [10]
MAX_SCAN_RANGE_LENGTH: [0]
MEM_LIMIT: [0]
MT_DOP: [0]
NUM_NODES: [0]
NUM_SCANNER_THREADS: [0]
OPTIMIZE_PARTITION_KEY_SCANS: [0]
PARQUET_ANNOTATE_STRINGS_UTF8: [0]
PARQUET_FALLBACK_SCHEMA_RESOLUTION: [0]
PARQUET_FILE_SIZE: [0]
PREFETCH_MODE: [1]
QUERY_TIMEOUT_S: [0]
REPLICA_PREFERENCE: [0]
REQUEST_POOL: []
RESERVATION_REQUEST_TIMEOUT: [0]
RM_INITIAL_MEM: [0]
RUNTIME_BLOOM_FILTER_SIZE: [1048576]
RUNTIME_FILTER_MAX_SIZE: [16777216]
RUNTIME_FILTER_MIN_SIZE: [1048576]
RUNTIME_FILTER_MODE: [2]
RUNTIME_FILTER_WAIT_TIME_MS: [0]
S3_SKIP_INSERT_STAGING: [1]
SCAN_NODE_CODEGEN_THRESHOLD: [180]
SCHEDULE_RANDOM_REPLICA: [0]
SCRATCH_LIMIT: [-1]
SEQ_COMPRESSION_MODE: [0]
STRICT_MODE: [0]
SUPPORT_START_OVER: [false]
SYNC_DDL: [0]
V_CPU_CORES: [0]

2017-10-31 15:30 GMT+08:00 Hongxu Ma :

> Hi JJ
> Consider it only takes 3mins on SparkSQL, maybe there are some mistakes in
> query options.
> Try run "set;" in impala-shell and check all query options, e.g:
> BATCH_SIZE: [0]
> DISABLE_CODEGEN: [0]
> RUNTIME_FILTER_MODE: GLOBAL
>
> Just a guess, thanks.
>
> 在 27/10/2017 10:25, 俊杰陈 写道:
> The profile file is damaged. Here is a screenshot for exec summary
> [cid:ii_j999ymep1_15f5ba563aeabb91]
> ​
>
> 2017-10-27 10:04 GMT+08:00 俊杰陈 mailto:cjj
> nj...@gmail.com>>:
> Hi Devs
>
> I met a performance issue on big table join. The query takes more than 3
> hours on Impala and only 3 minutes on Spark SQL on the same 5 nodes
> cluster. when running query,  the left scanner and exchange node are very
> slow.  Did I miss some key arguments?
>
> you can see profile file in attachment.
>
> [cid:ii_j9998pph2_15f5b92f2cf47020]
> ​
> --
> Thanks & Best Regards
>
>
>
> --
> Thanks & Best Regards
>
>
> --
> Regards,
> Hongxu.
>



-- 
Thanks & Best Regards


Re: performance issue on big table join

2017-11-01 Thread Mostafa Mokhtar
Attaching the query profile would be most helpful for investigating this issue.

If you could capture the profile from the WebUI on the coordinator node, that
would be great.
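
For reference (assuming default ports), the profile is linked from the
coordinator's debug WebUI at http://<coordinator-host>:25000/queries, and it
can also be dumped as text in impala-shell right after the query finishes:

    profile;   -- prints the full runtime profile of the most recent query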

On Wed, Nov 1, 2017 at 6:22 PM, 俊杰陈  wrote:

> Thanks Hongxu,
>
> Here are configurations on my cluster,  most of them are default values.
> Which item do you think it may impact?
>
> ABORT_ON_DEFAULT_LIMIT_EXCEEDED: [0]
> ABORT_ON_ERROR: [0]
> ALLOW_UNSUPPORTED_FORMATS: [0]
> APPX_COUNT_DISTINCT: [0]
> BATCH_SIZE: [0]
> COMPRESSION_CODEC: [NONE]
> DEBUG_ACTION: []
> DEFAULT_ORDER_BY_LIMIT: [-1]
> DISABLE_CACHED_READS: [0]
> DISABLE_CODEGEN: [0]
> DISABLE_OUTERMOST_TOPN: [0]
> DISABLE_ROW_RUNTIME_FILTERING: [0]
> DISABLE_STREAMING_PREAGGREGATIONS: [0]
> DISABLE_UNSAFE_SPILLS: [0]
> ENABLE_EXPR_REWRITES: [1]
> EXEC_SINGLE_NODE_ROWS_THRESHOLD: [100]
> EXPLAIN_LEVEL: [1]
> HBASE_CACHE_BLOCKS: [0]
> HBASE_CACHING: [0]
> MAX_BLOCK_MGR_MEMORY: [0]
> MAX_ERRORS: [100]
> MAX_IO_BUFFERS: [0]
> MAX_NUM_RUNTIME_FILTERS: [10]
> MAX_SCAN_RANGE_LENGTH: [0]
> MEM_LIMIT: [0]
> MT_DOP: [0]
> NUM_NODES: [0]
> NUM_SCANNER_THREADS: [0]
> OPTIMIZE_PARTITION_KEY_SCANS: [0]
> PARQUET_ANNOTATE_STRINGS_UTF8: [0]
> PARQUET_FALLBACK_SCHEMA_RESOLUTION: [0]
> PARQUET_FILE_SIZE: [0]
> PREFETCH_MODE: [1]
> QUERY_TIMEOUT_S: [0]
> REPLICA_PREFERENCE: [0]
> REQUEST_POOL: []
> RESERVATION_REQUEST_TIMEOUT: [0]
> RM_INITIAL_MEM: [0]
> RUNTIME_BLOOM_FILTER_SIZE: [1048576]
> RUNTIME_FILTER_MAX_SIZE: [16777216]
> RUNTIME_FILTER_MIN_SIZE: [1048576]
> RUNTIME_FILTER_MODE: [2]
> RUNTIME_FILTER_WAIT_TIME_MS: [0]
> S3_SKIP_INSERT_STAGING: [1]
> SCAN_NODE_CODEGEN_THRESHOLD: [180]
> SCHEDULE_RANDOM_REPLICA: [0]
> SCRATCH_LIMIT: [-1]
> SEQ_COMPRESSION_MODE: [0]
> STRICT_MODE: [0]
> SUPPORT_START_OVER: [false]
> SYNC_DDL: [0]
> V_CPU_CORES: [0]
>
> 2017-10-31 15:30 GMT+08:00 Hongxu Ma :
>
> > Hi JJ
> > Consider it only takes 3mins on SparkSQL, maybe there are some mistakes
> in
> > query options.
> > Try run "set;" in impala-shell and check all query options, e.g:
> > BATCH_SIZE: [0]
> > DISABLE_CODEGEN: [0]
> > RUNTIME_FILTER_MODE: GLOBAL
> >
> > Just a guess, thanks.
> >
> > 在 27/10/2017 10:25, 俊杰陈 写道:
> > The profile file is damaged. Here is a screenshot for exec summary
> > [cid:ii_j999ymep1_15f5ba563aeabb91]
> > ​
> >
> > 2017-10-27 10:04 GMT+08:00 俊杰陈 mailto:cjj
> > nj...@gmail.com>>:
> > Hi Devs
> >
> > I met a performance issue on big table join. The query takes more than 3
> > hours on Impala and only 3 minutes on Spark SQL on the same 5 nodes
> > cluster. when running query,  the left scanner and exchange node are very
> > slow.  Did I miss some key arguments?
> >
> > you can see profile file in attachment.
> >
> > [cid:ii_j9998pph2_15f5b92f2cf47020]
> > ​
> > --
> > Thanks & Best Regards
> >
> >
> >
> > --
> > Thanks & Best Regards
> >
> >
> > --
> > Regards,
> > Hongxu.
> >
>
>
>
> --
> Thanks & Best Regards
>


Re: performance issue on big table join

2017-11-01 Thread 俊杰陈
Hi Mostafa

Cheng already put the profile in the thread.

Here is another profile for the Impala release version; you can also see the
attachment.


2017-11-02 9:30 GMT+08:00 Mostafa Mokhtar :

> Attaching the query profile will be most helpful to investigate this issue.
>
> If you can capture the profile from the WebUI on the coordinator node it
> would be great.
>
> On Wed, Nov 1, 2017 at 6:22 PM, 俊杰陈  wrote:
>
> > Thanks Hongxu,
> >
> > Here are configurations on my cluster,  most of them are default values.
> > Which item do you think it may impact?
> >
> > ABORT_ON_DEFAULT_LIMIT_EXCEEDED: [0]
> > ABORT_ON_ERROR: [0]
> > ALLOW_UNSUPPORTED_FORMATS: [0]
> > APPX_COUNT_DISTINCT: [0]
> > BATCH_SIZE: [0]
> > COMPRESSION_CODEC: [NONE]
> > DEBUG_ACTION: []
> > DEFAULT_ORDER_BY_LIMIT: [-1]
> > DISABLE_CACHED_READS: [0]
> > DISABLE_CODEGEN: [0]
> > DISABLE_OUTERMOST_TOPN: [0]
> > DISABLE_ROW_RUNTIME_FILTERING: [0]
> > DISABLE_STREAMING_PREAGGREGATIONS: [0]
> > DISABLE_UNSAFE_SPILLS: [0]
> > ENABLE_EXPR_REWRITES: [1]
> > EXEC_SINGLE_NODE_ROWS_THRESHOLD: [100]
> > EXPLAIN_LEVEL: [1]
> > HBASE_CACHE_BLOCKS: [0]
> > HBASE_CACHING: [0]
> > MAX_BLOCK_MGR_MEMORY: [0]
> > MAX_ERRORS: [100]
> > MAX_IO_BUFFERS: [0]
> > MAX_NUM_RUNTIME_FILTERS: [10]
> > MAX_SCAN_RANGE_LENGTH: [0]
> > MEM_LIMIT: [0]
> > MT_DOP: [0]
> > NUM_NODES: [0]
> > NUM_SCANNER_THREADS: [0]
> > OPTIMIZE_PARTITION_KEY_SCANS: [0]
> > PARQUET_ANNOTATE_STRINGS_UTF8: [0]
> > PARQUET_FALLBACK_SCHEMA_RESOLUTION: [0]
> > PARQUET_FILE_SIZE: [0]
> > PREFETCH_MODE: [1]
> > QUERY_TIMEOUT_S: [0]
> > REPLICA_PREFERENCE: [0]
> > REQUEST_POOL: []
> > RESERVATION_REQUEST_TIMEOUT: [0]
> > RM_INITIAL_MEM: [0]
> > RUNTIME_BLOOM_FILTER_SIZE: [1048576]
> > RUNTIME_FILTER_MAX_SIZE: [16777216]
> > RUNTIME_FILTER_MIN_SIZE: [1048576]
> > RUNTIME_FILTER_MODE: [2]
> > RUNTIME_FILTER_WAIT_TIME_MS: [0]
> > S3_SKIP_INSERT_STAGING: [1]
> > SCAN_NODE_CODEGEN_THRESHOLD: [180]
> > SCHEDULE_RANDOM_REPLICA: [0]
> > SCRATCH_LIMIT: [-1]
> > SEQ_COMPRESSION_MODE: [0]
> > STRICT_MODE: [0]
> > SUPPORT_START_OVER: [false]
> > SYNC_DDL: [0]
> > V_CPU_CORES: [0]
> >
> > 2017-10-31 15:30 GMT+08:00 Hongxu Ma :
> >
> > > Hi JJ
> > > Consider it only takes 3mins on SparkSQL, maybe there are some mistakes
> > in
> > > query options.
> > > Try run "set;" in impala-shell and check all query options, e.g:
> > > BATCH_SIZE: [0]
> > > DISABLE_CODEGEN: [0]
> > > RUNTIME_FILTER_MODE: GLOBAL
> > >
> > > Just a guess, thanks.
> > >
> > > 在 27/10/2017 10:25, 俊杰陈 写道:
> > > The profile file is damaged. Here is a screenshot for exec summary
> > > [cid:ii_j999ymep1_15f5ba563aeabb91]
> > > ​
> > >
> > > 2017-10-27 10:04 GMT+08:00 俊杰陈 mailto:cjj
> > > nj...@gmail.com>>:
> > > Hi Devs
> > >
> > > I met a performance issue on big table join. The query takes more than
> 3
> > > hours on Impala and only 3 minutes on Spark SQL on the same 5 nodes
> > > cluster. when running query,  the left scanner and exchange node are
> very
> > > slow.  Did I miss some key arguments?
> > >
> > > you can see profile file in attachment.
> > >
> > > [cid:ii_j9998pph2_15f5b92f2cf47020]
> > > ​
> > > --
> > > Thanks & Best Regards
> > >
> > >
> > >
> > > --
> > > Thanks & Best Regards
> > >
> > >
> > > --
> > > Regards,
> > > Hongxu.
> > >
> >
> >
> >
> > --
> > Thanks & Best Regards
> >
>



-- 
Thanks & Best Regards
Query (id=db497c13276e70de:38c671cf):
  Summary:
Session ID: ee4f844616a8170f:5cf00c4759ee93af
Session Type: BEESWAX
Start Time: 2017-11-01 15:57:09.150268000
End Time: 2017-11-01 18:56:53.614915000
Query Type: QUERY
Query State: FINISHED
Query Status: OK
Impala Version: impalad version 2.8.0-SNAPSHOT RELEASE (build 
4a61dbd636d052c97b829617bb39b970d834

Re: performance issue on big table join

2017-11-01 Thread 俊杰陈
+user list

2017-11-02 9:57 GMT+08:00 俊杰陈 :

> Hi Mostafa
>
> Cheng already put the profile in thread.
>
> Here is another profile for impala release version. you can also see the
> attachment.
>
>
> 2017-11-02 9:30 GMT+08:00 Mostafa Mokhtar :
>
>> Attaching the query profile will be most helpful to investigate this
>> issue.
>>
>> If you can capture the profile from the WebUI on the coordinator node it
>> would be great.
>>
>> On Wed, Nov 1, 2017 at 6:22 PM, 俊杰陈  wrote:
>>
>> > Thanks Hongxu,
>> >
>> > Here are configurations on my cluster,  most of them are default values.
>> > Which item do you think it may impact?
>> >
>> > ABORT_ON_DEFAULT_LIMIT_EXCEEDED: [0]
>> > ABORT_ON_ERROR: [0]
>> > ALLOW_UNSUPPORTED_FORMATS: [0]
>> > APPX_COUNT_DISTINCT: [0]
>> > BATCH_SIZE: [0]
>> > COMPRESSION_CODEC: [NONE]
>> > DEBUG_ACTION: []
>> > DEFAULT_ORDER_BY_LIMIT: [-1]
>> > DISABLE_CACHED_READS: [0]
>> > DISABLE_CODEGEN: [0]
>> > DISABLE_OUTERMOST_TOPN: [0]
>> > DISABLE_ROW_RUNTIME_FILTERING: [0]
>> > DISABLE_STREAMING_PREAGGREGATIONS: [0]
>> > DISABLE_UNSAFE_SPILLS: [0]
>> > ENABLE_EXPR_REWRITES: [1]
>> > EXEC_SINGLE_NODE_ROWS_THRESHOLD: [100]
>> > EXPLAIN_LEVEL: [1]
>> > HBASE_CACHE_BLOCKS: [0]
>> > HBASE_CACHING: [0]
>> > MAX_BLOCK_MGR_MEMORY: [0]
>> > MAX_ERRORS: [100]
>> > MAX_IO_BUFFERS: [0]
>> > MAX_NUM_RUNTIME_FILTERS: [10]
>> > MAX_SCAN_RANGE_LENGTH: [0]
>> > MEM_LIMIT: [0]
>> > MT_DOP: [0]
>> > NUM_NODES: [0]
>> > NUM_SCANNER_THREADS: [0]
>> > OPTIMIZE_PARTITION_KEY_SCANS: [0]
>> > PARQUET_ANNOTATE_STRINGS_UTF8: [0]
>> > PARQUET_FALLBACK_SCHEMA_RESOLUTION: [0]
>> > PARQUET_FILE_SIZE: [0]
>> > PREFETCH_MODE: [1]
>> > QUERY_TIMEOUT_S: [0]
>> > REPLICA_PREFERENCE: [0]
>> > REQUEST_POOL: []
>> > RESERVATION_REQUEST_TIMEOUT: [0]
>> > RM_INITIAL_MEM: [0]
>> > RUNTIME_BLOOM_FILTER_SIZE: [1048576]
>> > RUNTIME_FILTER_MAX_SIZE: [16777216]
>> > RUNTIME_FILTER_MIN_SIZE: [1048576]
>> > RUNTIME_FILTER_MODE: [2]
>> > RUNTIME_FILTER_WAIT_TIME_MS: [0]
>> > S3_SKIP_INSERT_STAGING: [1]
>> > SCAN_NODE_CODEGEN_THRESHOLD: [180]
>> > SCHEDULE_RANDOM_REPLICA: [0]
>> > SCRATCH_LIMIT: [-1]
>> > SEQ_COMPRESSION_MODE: [0]
>> > STRICT_MODE: [0]
>> > SUPPORT_START_OVER: [false]
>> > SYNC_DDL: [0]
>> > V_CPU_CORES: [0]
>> >
>> > 2017-10-31 15:30 GMT+08:00 Hongxu Ma :
>> >
>> > > Hi JJ
>> > > Consider it only takes 3mins on SparkSQL, maybe there are some
>> mistakes
>> > in
>> > > query options.
>> > > Try run "set;" in impala-shell and check all query options, e.g:
>> > > BATCH_SIZE: [0]
>> > > DISABLE_CODEGEN: [0]
>> > > RUNTIME_FILTER_MODE: GLOBAL
>> > >
>> > > Just a guess, thanks.
>> > >
>> > > 在 27/10/2017 10:25, 俊杰陈 写道:
>> > > The profile file is damaged. Here is a screenshot for exec summary
>> > > [cid:ii_j999ymep1_15f5ba563aeabb91]
>> > > ​
>> > >
>> > > 2017-10-27 10:04 GMT+08:00 俊杰陈 mailto:cjj
>> > > nj...@gmail.com>>:
>> > > Hi Devs
>> > >
>> > > I met a performance issue on big table join. The query takes more
>> than 3
>> > > hours on Impala and only 3 minutes on Spark SQL on the same 5 nodes
>> > > cluster. when running query,  the left scanner and exchange node are
>> very
>> > > slow.  Did I miss some key arguments?
>> > >
>> > > you can see profile file in attachment.
>> > >
>> > > [cid:ii_j9998pph2_15f5b92f2cf47020]
>> > > ​
>> > > --
>> > > Thanks & Best Regards
>> > >
>> > >
>> > >
>> > > --
>> > > Thanks & Best Regards
>> > >
>> > >
>> > > --
>> > > Regards,
>> > > Hongxu.
>> > >
>> >
>> >
>> >
>> > --
>> > Thanks & Best Regards
>> >
>>
>
>
>
> --
> Thanks & Best Regards
>



-- 
Thanks & Best Regards


RE: performance issue on big table join

2017-11-01 Thread Xu, Cheng A
Thanks Junjie for the profile file. Another observation is that Impala has a
much lower CPU utilization than Spark SQL (Impala is on the left and Spark SQL
is on the right).

[inline image: CPU utilization charts, Impala (left) vs. Spark SQL (right)]
I think the major reason here might be that Impala uses single-threaded
execution for non-scanner operators (join, aggregation) on each node [1].
Do you have any suggestions on how we can improve its performance, especially
its CPU utilization? Please correct me if I miss anything here.

[1] https://issues.apache.org/jira/browse/IMPALA-3902

Best Regards
Ferdinand Xu

From: 俊杰陈 [mailto:cjjnj...@gmail.com]
Sent: Thursday, November 2, 2017 10:00 AM
To: dev@impala ; 
u...@impala.incubator.apache.org
Subject: Re: performance issue on big table join

+user list

2017-11-02 9:57 GMT+08:00 俊杰陈 mailto:cjjnj...@gmail.com>>:
Hi Mostafa

Cheng already put the profile in thread.

Here is another profile for impala release version. you can also see the 
attachment.


2017-11-02 9:30 GMT+08:00 Mostafa Mokhtar 
mailto:mmokh...@cloudera.com>>:
Attaching the query profile will be most helpful to investigate this issue.

If you can capture the profile from the WebUI on the coordinator node it
would be great.

On Wed, Nov 1, 2017 at 6:22 PM, 俊杰陈 
mailto:cjjnj...@gmail.com>> wrote:

> Thanks Hongxu,
>
> Here are configurations on my cluster,  most of them are default values.
> Which item do you think it may impact?
>
> ABORT_ON_DEFAULT_LIMIT_EXCEEDED: [0]
> ABORT_ON_ERROR: [0]
> ALLOW_UNSUPPORTED_FORMATS: [0]
> APPX_COUNT_DISTINCT: [0]
> BATCH_SIZE: [0]
> COMPRESSION_CODEC: [NONE]
> DEBUG_ACTION: []
> DEFAULT_ORDER_BY_LIMIT: [-1]
> DISABLE_CACHED_READS: [0]
> DISABLE_CODEGEN: [0]
> DISABLE_OUTERMOST_TOPN: [0]
> DISABLE_ROW_RUNTIME_FILTERING: [0]
> DISABLE_STREAMING_PREAGGREGATIONS: [0]
> DISABLE_UNSAFE_SPILLS: [0]
> ENABLE_EXPR_REWRITES: [1]
> EXEC_SINGLE_NODE_ROWS_THRESHOLD: [100]
> EXPLAIN_LEVEL: [1]
> HBASE_CACHE_BLOCKS: [0]
> HBASE_CACHING: [0]
> MAX_BLOCK_MGR_MEMORY: [0]
> MAX_ERRORS: [100]
> MAX_IO_BUFFERS: [0]
> MAX_NUM_RUNTIME_FILTERS: [10]
> MAX_SCAN_RANGE_LENGTH: [0]
> MEM_LIMIT: [0]
> MT_DOP: [0]
> NUM_NODES: [0]
> NUM_SCANNER_THREADS: [0]
> OPTIMIZE_PARTITION_KEY_SCANS: [0]
> PARQUET_ANNOTATE_STRINGS_UTF8: [0]
> PARQUET_FALLBACK_SCHEMA_RESOLUTION: [0]
> PARQUET_FILE_SIZE: [0]
> PREFETCH_MODE: [1]
> QUERY_TIMEOUT_S: [0]
> REPLICA_PREFERENCE: [0]
> REQUEST_POOL: []
> RESERVATION_REQUEST_TIMEOUT: [0]
> RM_INITIAL_MEM: [0]
> RUNTIME_BLOOM_FILTER_SIZE: [1048576]
> RUNTIME_FILTER_MAX_SIZE: [16777216]
> RUNTIME_FILTER_MIN_SIZE: [1048576]
> RUNTIME_FILTER_MODE: [2]
> RUNTIME_FILTER_WAIT_TIME_MS: [0]
> S3_SKIP_INSERT_STAGING: [1]
> SCAN_NODE_CODEGEN_THRESHOLD: [180]
> SCHEDULE_RANDOM_REPLICA: [0]
> SCRATCH_LIMIT: [-1]
> SEQ_COMPRESSION_MODE: [0]
> STRICT_MODE: [0]
> SUPPORT_START_OVER: [false]
> SYNC_DDL: [0]
> V_CPU_CORES: [0]
>
> 2017-10-31 15:30 GMT+08:00 Hongxu Ma 
> mailto:inte...@outlook.com>>:
>
> > Hi JJ
> > Consider it only takes 3mins on SparkSQL, maybe there are some mistakes
> in
> > query options.
> > Try run "set;" in impala-shell and check all query options, e.g:
> > BATCH_SIZE: [0]
> > DISABLE_CODEGEN: [0]
> > RUNTIME_FILTER_MODE: GLOBAL
> >
> > Just a guess, thanks.
> >
> > 在 27/10/2017 10:25, 俊杰陈 写道:
> > The profile file is damaged. Here is a screenshot for exec summary
> > [cid:ii_j999ymep1_15f5ba563aeabb91]
> > ​
> >
> > 2017-10-27 10:04 GMT+08:00 俊杰陈 
> > mailto:cjjnj...@gmail.com><mailto:cjj<mailto:cjj>
> > nj...@gmail.com<mailto:nj...@gmail.com>>>:
> > Hi Devs
> >
> > I met a performance issue on big table join. The query takes more than 3
> > hours on Impala and only 3 minutes on Spark SQL on the same 5 nodes
> > cluster. when running query,  the left scanner and exchange node are very
> > slow.  Did I miss some key arguments?
> >
> > you can see profile file in attachment.
> >
> > [cid:ii_j9998pph2_15f5b92f2cf47020]
> > ​
> > --
> > Thanks & Best Regards
> >
> >
> >
> > --
> > Thanks & Best Regards
> >
> >
> > --
> > Regards,
> > Hongxu.
> >
>
>
>
> --
> Thanks & Best Regards
>



--
Thanks & Best Regards



--
Thanks & Best Regards


Re: performance issue on big table join

2017-11-02 Thread Hongxu Ma
Thanks JJ. Your query options look good.

As Xu Cheng mentioned, I have also noticed that Impala does hash joins slowly
in some big-data situations.
I am very curious about the root cause.

On 02/11/2017 10:00, 俊杰陈 wrote:

+user list

2017-11-02 9:57 GMT+08:00 俊杰陈 <mailto:cjjnj...@gmail.com>:



Hi Mostafa

Cheng already put the profile in thread.

Here is another profile for impala release version. you can also see the
attachment.


2017-11-02 9:30 GMT+08:00 Mostafa Mokhtar 
<mailto:mmokh...@cloudera.com>:



Attaching the query profile will be most helpful to investigate this
issue.

If you can capture the profile from the WebUI on the coordinator node it
would be great.

On Wed, Nov 1, 2017 at 6:22 PM, 俊杰陈 
<mailto:cjjnj...@gmail.com> wrote:



Thanks Hongxu,

Here are configurations on my cluster,  most of them are default values.
Which item do you think it may impact?

ABORT_ON_DEFAULT_LIMIT_EXCEEDED: [0]
ABORT_ON_ERROR: [0]
ALLOW_UNSUPPORTED_FORMATS: [0]
APPX_COUNT_DISTINCT: [0]
BATCH_SIZE: [0]
COMPRESSION_CODEC: [NONE]
DEBUG_ACTION: []
DEFAULT_ORDER_BY_LIMIT: [-1]
DISABLE_CACHED_READS: [0]
DISABLE_CODEGEN: [0]
DISABLE_OUTERMOST_TOPN: [0]
DISABLE_ROW_RUNTIME_FILTERING: [0]
DISABLE_STREAMING_PREAGGREGATIONS: [0]
DISABLE_UNSAFE_SPILLS: [0]
ENABLE_EXPR_REWRITES: [1]
EXEC_SINGLE_NODE_ROWS_THRESHOLD: [100]
EXPLAIN_LEVEL: [1]
HBASE_CACHE_BLOCKS: [0]
HBASE_CACHING: [0]
MAX_BLOCK_MGR_MEMORY: [0]
MAX_ERRORS: [100]
MAX_IO_BUFFERS: [0]
MAX_NUM_RUNTIME_FILTERS: [10]
MAX_SCAN_RANGE_LENGTH: [0]
MEM_LIMIT: [0]
MT_DOP: [0]
NUM_NODES: [0]
NUM_SCANNER_THREADS: [0]
OPTIMIZE_PARTITION_KEY_SCANS: [0]
PARQUET_ANNOTATE_STRINGS_UTF8: [0]
PARQUET_FALLBACK_SCHEMA_RESOLUTION: [0]
PARQUET_FILE_SIZE: [0]
PREFETCH_MODE: [1]
QUERY_TIMEOUT_S: [0]
REPLICA_PREFERENCE: [0]
REQUEST_POOL: []
RESERVATION_REQUEST_TIMEOUT: [0]
RM_INITIAL_MEM: [0]
RUNTIME_BLOOM_FILTER_SIZE: [1048576]
RUNTIME_FILTER_MAX_SIZE: [16777216]
RUNTIME_FILTER_MIN_SIZE: [1048576]
RUNTIME_FILTER_MODE: [2]
RUNTIME_FILTER_WAIT_TIME_MS: [0]
S3_SKIP_INSERT_STAGING: [1]
SCAN_NODE_CODEGEN_THRESHOLD: [180]
SCHEDULE_RANDOM_REPLICA: [0]
SCRATCH_LIMIT: [-1]
SEQ_COMPRESSION_MODE: [0]
STRICT_MODE: [0]
SUPPORT_START_OVER: [false]
SYNC_DDL: [0]
V_CPU_CORES: [0]

2017-10-31 15:30 GMT+08:00 Hongxu Ma 
<mailto:inte...@outlook.com>:



Hi JJ
Consider it only takes 3mins on SparkSQL, maybe there are some


mistakes


in


query options.
Try run "set;" in impala-shell and check all query options, e.g:
BATCH_SIZE: [0]
DISABLE_CODEGEN: [0]
RUNTIME_FILTER_MODE: GLOBAL

Just a guess, thanks.

On 27/10/2017 10:25, 俊杰陈 wrote:
The profile file is damaged. Here is a screenshot for exec summary

2017-10-27 10:04 GMT+08:00 俊杰陈 <cjjnj...@gmail.com>:
Hi Devs

I met a performance issue on big table join. The query takes more


than 3


hours on Impala and only 3 minutes on Spark SQL on the same 5 nodes
cluster. when running query,  the left scanner and exchange node are


very


slow.  Did I miss some key arguments?

you can see profile file in attachment.

--
Thanks & Best Regards



--
Thanks & Best Regards


--
Regards,
Hongxu.






--
Thanks & Best Regards









--
Thanks & Best Regards









--
Regards,
Hongxu.


Re: performance issue on big table join

2017-11-02 Thread Alexander Behm
See my response on the other thread you started. The probe side of a join is
executed in a single thread per host. Impala can run multiple
builds in parallel - but each build uses only a single thread.
A single query might not be able to max out your CPU, but most realistic
workloads run several queries concurrently.

On Thu, Nov 2, 2017 at 12:22 AM, Hongxu Ma  wrote:

> Thanks LL. Your query options look good.
>
> As Xu Cheng mentioned, I also noticed that Impala do hash join slowly in
> some big data situations.
> Very curious to the root cause.
>
>
> 在 02/11/2017 10:00, 俊杰陈 写道:
>
> +user list
>
> 2017-11-02 9:57 GMT+08:00 俊杰陈  :
>
>
> Hi Mostafa
>
> Cheng already put the profile in thread.
>
> Here is another profile for impala release version. you can also see the
> attachment.
>
>
> 2017-11-02 9:30 GMT+08:00 Mostafa Mokhtar  
> :
>
>
> Attaching the query profile will be most helpful to investigate this
> issue.
>
> If you can capture the profile from the WebUI on the coordinator node it
> would be great.
>
> On Wed, Nov 1, 2017 at 6:22 PM, 俊杰陈   
> wrote:
>
>
> Thanks Hongxu,
>
> Here are configurations on my cluster,  most of them are default values.
> Which item do you think it may impact?
>
> ABORT_ON_DEFAULT_LIMIT_EXCEEDED: [0]
> ABORT_ON_ERROR: [0]
> ALLOW_UNSUPPORTED_FORMATS: [0]
> APPX_COUNT_DISTINCT: [0]
> BATCH_SIZE: [0]
> COMPRESSION_CODEC: [NONE]
> DEBUG_ACTION: []
> DEFAULT_ORDER_BY_LIMIT: [-1]
> DISABLE_CACHED_READS: [0]
> DISABLE_CODEGEN: [0]
> DISABLE_OUTERMOST_TOPN: [0]
> DISABLE_ROW_RUNTIME_FILTERING: [0]
> DISABLE_STREAMING_PREAGGREGATIONS: [0]
> DISABLE_UNSAFE_SPILLS: [0]
> ENABLE_EXPR_REWRITES: [1]
> EXEC_SINGLE_NODE_ROWS_THRESHOLD: [100]
> EXPLAIN_LEVEL: [1]
> HBASE_CACHE_BLOCKS: [0]
> HBASE_CACHING: [0]
> MAX_BLOCK_MGR_MEMORY: [0]
> MAX_ERRORS: [100]
> MAX_IO_BUFFERS: [0]
> MAX_NUM_RUNTIME_FILTERS: [10]
> MAX_SCAN_RANGE_LENGTH: [0]
> MEM_LIMIT: [0]
> MT_DOP: [0]
> NUM_NODES: [0]
> NUM_SCANNER_THREADS: [0]
> OPTIMIZE_PARTITION_KEY_SCANS: [0]
> PARQUET_ANNOTATE_STRINGS_UTF8: [0]
> PARQUET_FALLBACK_SCHEMA_RESOLUTION: [0]
> PARQUET_FILE_SIZE: [0]
> PREFETCH_MODE: [1]
> QUERY_TIMEOUT_S: [0]
> REPLICA_PREFERENCE: [0]
> REQUEST_POOL: []
> RESERVATION_REQUEST_TIMEOUT: [0]
> RM_INITIAL_MEM: [0]
> RUNTIME_BLOOM_FILTER_SIZE: [1048576]
> RUNTIME_FILTER_MAX_SIZE: [16777216]
> RUNTIME_FILTER_MIN_SIZE: [1048576]
> RUNTIME_FILTER_MODE: [2]
> RUNTIME_FILTER_WAIT_TIME_MS: [0]
> S3_SKIP_INSERT_STAGING: [1]
> SCAN_NODE_CODEGEN_THRESHOLD: [180]
> SCHEDULE_RANDOM_REPLICA: [0]
> SCRATCH_LIMIT: [-1]
> SEQ_COMPRESSION_MODE: [0]
> STRICT_MODE: [0]
> SUPPORT_START_OVER: [false]
> SYNC_DDL: [0]
> V_CPU_CORES: [0]
>
> 2017-10-31 15:30 GMT+08:00 Hongxu Ma  
> :
>
>
> Hi JJ
> Consider it only takes 3mins on SparkSQL, maybe there are some
>
> mistakes
>
> in
>
> query options.
> Try run "set;" in impala-shell and check all query options, e.g:
> BATCH_SIZE: [0]
>     DISABLE_CODEGEN: [0]
> RUNTIME_FILTER_MODE: GLOBAL
>
> Just a guess, thanks.
>
> 在 27/10/2017 10:25, 俊杰陈 写道:
> The profile file is damaged. Here is a screenshot for exec summary
> [cid:ii_j999ymep1_15f5ba563aeabb91]
> ​
>
> 2017-10-27 10:04 GMT+08:00 俊杰陈 mailto:cjj
> nj...@gmail.com> >:
> Hi Devs
>
> I met a performance issue on big table join. The query takes more
>
> than 3
>
> hours on Impala and only 3 minutes on Spark SQL on the same 5 nodes
> cluster. when running query,  the left scanner and exchange node are
>
> very
>
> slow.  Did I miss some key arguments?
>
> you can see profile file in attachment.
>
> [cid:ii_j9998pph2_15f5b92f2cf47020]
> ​
> --
> Thanks & Best Regards
>
>
>
> --
> Thanks & Best Regards
>
>
> --
> Regards,
> Hongxu.
>
>
>
>
> --
> Thanks & Best Regards
>
>
>
>
> --
> Thanks & Best Regards
>
>
>
>
> --
> Regards,
> Hongxu.
>
>


Re: performance issue on big table join

2017-11-02 Thread 俊杰陈
Thanks Alex for replying again.

Do we have a plan to support multi-threaded join/aggregation? Or is it
intended to stay single-threaded to maximize query throughput?



2017-11-03 0:32 GMT+08:00 Alexander Behm :

> See my response on the other thread you started. The probe side of joins
> are are executed in a single thread per host. Impala can run multiple
> builds in parallel - but each build uses only a single thread.
> A single query might not be able to max out your CPU, but most realistic
> workloads run several queries concurrently.
>
> On Thu, Nov 2, 2017 at 12:22 AM, Hongxu Ma  wrote:
>
> > Thanks LL. Your query options look good.
> >
> > As Xu Cheng mentioned, I also noticed that Impala do hash join slowly in
> > some big data situations.
> > Very curious to the root cause.
> >
> >
> > 在 02/11/2017 10:00, 俊杰陈 写道:
> >
> > +user list
> >
> > 2017-11-02 9:57 GMT+08:00 俊杰陈  :
> >
> >
> > Hi Mostafa
> >
> > Cheng already put the profile in thread.
> >
> > Here is another profile for impala release version. you can also see the
> > attachment.
> >
> >
> > 2017-11-02 9:30 GMT+08:00 Mostafa Mokhtar  <
> mmokh...@cloudera.com>:
> >
> >
> > Attaching the query profile will be most helpful to investigate this
> > issue.
> >
> > If you can capture the profile from the WebUI on the coordinator node it
> > would be great.
> >
> > On Wed, Nov 1, 2017 at 6:22 PM, 俊杰陈  <
> cjjnj...@gmail.com> wrote:
> >
> >
> > Thanks Hongxu,
> >
> > Here are configurations on my cluster,  most of them are default values.
> > Which item do you think it may impact?
> >
> > ABORT_ON_DEFAULT_LIMIT_EXCEEDED: [0]
> > ABORT_ON_ERROR: [0]
> > ALLOW_UNSUPPORTED_FORMATS: [0]
> > APPX_COUNT_DISTINCT: [0]
> > BATCH_SIZE: [0]
> > COMPRESSION_CODEC: [NONE]
> > DEBUG_ACTION: []
> > DEFAULT_ORDER_BY_LIMIT: [-1]
> > DISABLE_CACHED_READS: [0]
> > DISABLE_CODEGEN: [0]
> > DISABLE_OUTERMOST_TOPN: [0]
> > DISABLE_ROW_RUNTIME_FILTERING: [0]
> > DISABLE_STREAMING_PREAGGREGATIONS: [0]
> > DISABLE_UNSAFE_SPILLS: [0]
> > ENABLE_EXPR_REWRITES: [1]
> > EXEC_SINGLE_NODE_ROWS_THRESHOLD: [100]
> > EXPLAIN_LEVEL: [1]
> > HBASE_CACHE_BLOCKS: [0]
> > HBASE_CACHING: [0]
> > MAX_BLOCK_MGR_MEMORY: [0]
> > MAX_ERRORS: [100]
> > MAX_IO_BUFFERS: [0]
> > MAX_NUM_RUNTIME_FILTERS: [10]
> > MAX_SCAN_RANGE_LENGTH: [0]
> > MEM_LIMIT: [0]
> > MT_DOP: [0]
> > NUM_NODES: [0]
> > NUM_SCANNER_THREADS: [0]
> > OPTIMIZE_PARTITION_KEY_SCANS: [0]
> > PARQUET_ANNOTATE_STRINGS_UTF8: [0]
> > PARQUET_FALLBACK_SCHEMA_RESOLUTION: [0]
> > PARQUET_FILE_SIZE: [0]
> > PREFETCH_MODE: [1]
> > QUERY_TIMEOUT_S: [0]
> > REPLICA_PREFERENCE: [0]
> > REQUEST_POOL: []
> > RESERVATION_REQUEST_TIMEOUT: [0]
> > RM_INITIAL_MEM: [0]
> > RUNTIME_BLOOM_FILTER_SIZE: [1048576]
> > RUNTIME_FILTER_MAX_SIZE: [16777216]
> > RUNTIME_FILTER_MIN_SIZE: [1048576]
> > RUNTIME_FILTER_MODE: [2]
> > RUNTIME_FILTER_WAIT_TIME_MS: [0]
> > S3_SKIP_INSERT_STAGING: [1]
> > SCAN_NODE_CODEGEN_THRESHOLD: [180]
> > SCHEDULE_RANDOM_REPLICA: [0]
> > SCRATCH_LIMIT: [-1]
> > SEQ_COMPRESSION_MODE: [0]
> > STRICT_MODE: [0]
> > SUPPORT_START_OVER: [false]
> > SYNC_DDL: [0]
> > V_CPU_CORES: [0]
> >
> > 2017-10-31 15:30 GMT+08:00 Hongxu Ma  <
> inte...@outlook.com>:
> >
> >
> > Hi JJ
> > Consider it only takes 3mins on SparkSQL, maybe there are some
> >
> > mistakes
> >
> > in
> >
> > query options.
> > Try run "set;" in impala-shell and check all query options, e.g:
> > BATCH_SIZE: [0]
> > DISABLE_CODEGEN: [0]
> > RUNTIME_FILTER_MODE: GLOBAL
> >
> > Just a guess, thanks.
> >
> > 在 27/10/2017 10:25, 俊杰陈 写道:
> > The profile file is damaged. Here is a screenshot for exec summary
> > [cid:ii_j999ymep1_15f5ba563aeabb91]
> > ​
> >
> > 2017-10-27 10:04 GMT+08:00 俊杰陈 mailto:cjj
> > nj...@gmail.com> >:
> > Hi Devs
> >
> > I met a performance issue on big table join. The query takes more
> >
> > than 3
> >
> > hours on Impala and only 3 minutes on Spark SQL on the same 5 nodes
> > cluster. when running query,  the left scanner and exchange node are
> >
> > very
> >
> > slow.  Did I miss some key arguments?
> >
> > you can see profile file in attachment.
> >
> > [cid:ii_j9998pph2_15f5b92f2cf47020]
> > ​
> > --
> > Thanks & Best Regards
> >
> >
> >
> > --
> > Thanks & Best Regards
> >
> >
> > --
> > Regards,
> > Hongxu.
> >
> >
> >
> >
> > --
> > Thanks & Best Regards
> >
> >
> >
> >
> > --
> > Thanks & Best Regards
> >
> >
> >
> >
> > --
> > Regards,
> > Hongxu.
> >
> >
>



-- 
Thanks & Best Regards


Re: performance issue on big table join

2017-11-02 Thread Alexander Behm
You are welcome.

You can track IMPALA-3902 to see our progress on supporting fully
multi-threaded execution.

Support for multi-threaded aggregations is already available. Certain other
queries will also work in multi-threaded mode. The big limitation is that
distributed joins and unions do not yet work (local joins for nested types
are ok).
We even enable multi-threading by default for some operations like COMPUTE
STATS on Parquet.

You can play around with multi-threaded execution using the MT_DOP query
option.
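
For example (illustrative only; whether a given query is allowed to run with
MT_DOP > 0 depends on the Impala version and on the plan shape, since
distributed joins are not yet supported in this mode):

    set mt_dop=4;                        -- per-node degree of parallelism
    select count(*) from store_sales;    -- a scan/aggregation query that can use it
    set mt_dop=0;                        -- back to the default (0)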

On Thu, Nov 2, 2017 at 6:17 PM, 俊杰陈  wrote:

> Thanks Alex to reply again.
>
> Do we have plan to support multi-thread join/aggregation?  Or it is
> intented to be single thread to maximum query throughput?
>
>
>
> 2017-11-03 0:32 GMT+08:00 Alexander Behm :
>
>> See my response on the other thread you started. The probe side of joins
>> are are executed in a single thread per host. Impala can run multiple
>> builds in parallel - but each build uses only a single thread.
>> A single query might not be able to max out your CPU, but most realistic
>> workloads run several queries concurrently.
>>
>> On Thu, Nov 2, 2017 at 12:22 AM, Hongxu Ma  wrote:
>>
>> > Thanks LL. Your query options look good.
>> >
>> > As Xu Cheng mentioned, I also noticed that Impala do hash join slowly in
>> > some big data situations.
>> > Very curious to the root cause.
>> >
>> >
>> > 在 02/11/2017 10:00, 俊杰陈 写道:
>> >
>> > +user list
>> >
>> > 2017-11-02 9:57 GMT+08:00 俊杰陈  > >:
>> >
>> >
>> > Hi Mostafa
>> >
>> > Cheng already put the profile in thread.
>> >
>> > Here is another profile for impala release version. you can also see the
>> > attachment.
>> >
>> >
>> > 2017-11-02 9:30 GMT+08:00 Mostafa Mokhtar  <
>> mmokh...@cloudera.com>:
>> >
>> >
>> > Attaching the query profile will be most helpful to investigate this
>> > issue.
>> >
>> > If you can capture the profile from the WebUI on the coordinator node it
>> > would be great.
>> >
>> > On Wed, Nov 1, 2017 at 6:22 PM, 俊杰陈  <
>> cjjnj...@gmail.com> wrote:
>> >
>> >
>> > Thanks Hongxu,
>> >
>> > Here are configurations on my cluster,  most of them are default values.
>> > Which item do you think it may impact?
>> >
>> > ABORT_ON_DEFAULT_LIMIT_EXCEEDED: [0]
>> > ABORT_ON_ERROR: [0]
>> > ALLOW_UNSUPPORTED_FORMATS: [0]
>> > APPX_COUNT_DISTINCT: [0]
>> > BATCH_SIZE: [0]
>> > COMPRESSION_CODEC: [NONE]
>> > DEBUG_ACTION: []
>> > DEFAULT_ORDER_BY_LIMIT: [-1]
>> > DISABLE_CACHED_READS: [0]
>> > DISABLE_CODEGEN: [0]
>> > DISABLE_OUTERMOST_TOPN: [0]
>> > DISABLE_ROW_RUNTIME_FILTERING: [0]
>> > DISABLE_STREAMING_PREAGGREGATIONS: [0]
>> > DISABLE_UNSAFE_SPILLS: [0]
>> > ENABLE_EXPR_REWRITES: [1]
>> > EXEC_SINGLE_NODE_ROWS_THRESHOLD: [100]
>> > EXPLAIN_LEVEL: [1]
>> > HBASE_CACHE_BLOCKS: [0]
>> > HBASE_CACHING: [0]
>> > MAX_BLOCK_MGR_MEMORY: [0]
>> > MAX_ERRORS: [100]
>> > MAX_IO_BUFFERS: [0]
>> > MAX_NUM_RUNTIME_FILTERS: [10]
>> > MAX_SCAN_RANGE_LENGTH: [0]
>> > MEM_LIMIT: [0]
>> > MT_DOP: [0]
>> > NUM_NODES: [0]
>> > NUM_SCANNER_THREADS: [0]
>> > OPTIMIZE_PARTITION_KEY_SCANS: [0]
>> > PARQUET_ANNOTATE_STRINGS_UTF8: [0]
>> > PARQUET_FALLBACK_SCHEMA_RESOLUTION: [0]
>> > PARQUET_FILE_SIZE: [0]
>> > PREFETCH_MODE: [1]
>> > QUERY_TIMEOUT_S: [0]
>> > REPLICA_PREFERENCE: [0]
>> > REQUEST_POOL: []
>> > RESERVATION_REQUEST_TIMEOUT: [0]
>> > RM_INITIAL_MEM: [0]
>> > RUNTIME_BLOOM_FILTER_SIZE: [1048576]
>> > RUNTIME_FILTER_MAX_SIZE: [16777216]
>> > RUNTIME_FILTER_MIN_SIZE: [1048576]
>> > RUNTIME_FILTER_MODE: [2]
>> > RUNTIME_FILTER_WAIT_TIME_MS: [0]
>> > S3_SKIP_INSERT_STAGING: [1]
>> > SCAN_NODE_CODEGEN_THRESHOLD: [180]
>> > SCHEDULE_RANDOM_REPLICA: [0]
>> > SCRATCH_LIMIT: [-1]
>> > SEQ_COMPRESSION_MODE: [0]
>> > STRICT