Hi long CASE can be problem. Why you don't use a dictionary table and join?
Regards Pavel 2015-03-31 10:53 GMT+02:00 Kevin Viraud <kevin.vir...@rocket-internet.de>: > Hi, > > > > I have an issue with a rather large CASE WHEN and I cannot figure out why > it is so slow... > > > > First, here is my test query : > > > > SELECT CASE WHEN dwh_company_id = 1 > > > THEN CASE > > > > > WHEN wv.source ~ '^$' THEN 'Not tracked' > > > > WHEN wv.source ~ '^1$' THEN 'Not tracked1' > > > WHEN wv.source ~ '^2$' THEN 'Not tracked2' > > > WHEN wv.source ~ '^3$' THEN 'Not tracked3' > > > WHEN wv.source ~ '^4$' THEN 'Not tracked4' > > > WHEN wv.source ~ '^5$' THEN 'Not tracked5' > > > WHEN wv.source ~ '^6$' THEN 'Not tracked6' > > > WHEN wv.source ~ '^7$' THEN 'Not tracked7' > > > WHEN wv.source ~ '^8$' THEN 'Not tracked8' > > > WHEN wv.source ~ '^9$' THEN 'Not tracked9' > > > WHEN wv.source ~ '^10$' THEN 'Not tracked10' > > > WHEN wv.source ~ '^11$' THEN 'Not tracked11' > > > WHEN wv.source ~ '^12$' THEN 'Not tracked12' > > > WHEN wv.source ~ '^13$' THEN 'Not tracked13' > > > WHEN wv.source ~ '^14$' THEN 'Not tracked14' > > > WHEN wv.source ~ '^15$' THEN 'Not tracked15' > > > WHEN wv.source ~ '^16$' THEN 'Not tracked16' > > > WHEN wv.source ~ '^17$' THEN 'Not tracked17' > > > WHEN wv.source ~ '^18$' THEN 'Not tracked18' > > > WHEN wv.source ~ '^19$' THEN 'Not tracked19' > > > WHEN wv.source ~ '^20$' THEN 'Not tracked20' > > > WHEN wv.source ~ '^21$' THEN 'Not tracked21' > > > WHEN wv.source ~ '^22$' THEN 'Not tracked22' > > > WHEN wv.source ~ '^23$' THEN 'Not tracked23' > > > WHEN wv.source ~ '^24$' THEN 'Not tracked24' > > > WHEN wv.source ~ '^25$' THEN 'Not tracked25' > > > WHEN wv.source ~ '^26$' THEN 'Not tracked26' > > > WHEN wv.source ~ '^27$' THEN 'Not tracked27' > > > WHEN wv.source ~ '^28$' THEN 'Not tracked28' > > > --WHEN wv.source ~ '^29$' THEN 'Not tracked29' > > > WHEN wv.source ~ '^30$' THEN 'Not tracked30' > > > WHEN wv.source ~ '^31$' THEN 'Not tracked31' > > > WHEN wv.source ~ '^32$' THEN 'Not tracked32' > > > END > > ELSE > > 'Others' > > END as channel > > FROM ( > > SELECT wv.id, > > wv.ga_id, > > split_part(wv.ga_source_medium, ' / > ', 1) as source, > > ga.dwh_source_id, > > s.dwh_company_id > > FROM marketing.web_visits wv > > INNER JOIN dwh_metadata.google_analytics ga > ON ga.ga_id = wv.ga_id > > INNER JOIN dwh_manager.sources s ON > ga.dwh_source_id =s.dwh_source_id > > --WHERE s.dwh_company_id = 1 > > LIMIT 100000 > > ) wv > > > > > > This is a pretty simple case, my subquery (or CTE when using WITH > statement) should return 5 fields with more or less this structure : > > Id : character(32) > > Ga_id : bigint > > Source : character(32) > > Medium : character(32) > > dwh_company_id : bigint > > > > On top of which I apply a case when statement… > > > > Now the weird thing is, using this query I notice a significant drop in > performance as the “case when” is getting bigger. If I run the query as if, > I get the following exec plain and execution time: > > Subquery Scan on wv (cost=6.00..29098.17 rows=100000 width=36) (actual > time=0.828..22476.917 rows=100000 loops=1) > > Buffers: shared hit=3136 > > -> Limit (cost=6.00..11598.17 rows=100000 width=58) (actual > time=0.209..133.429 rows=100000 loops=1) > > Buffers: shared hit=3136 > > -> Hash Join (cost=6.00..1069811.24 rows=9228690 width=58) > (actual time=0.208..119.297 rows=100000 loops=1) > > Hash Cond: (wv_1.ga_id = ga.ga_id) > > Buffers: shared hit=3136 > > -> Seq Scan on web_visits wv_1 (cost=0.00..877005.78 > rows=20587078 width=50) (actual time=0.004..18.412 rows=100000 loops=1) > > Buffers: shared hit=3133 > > -> Hash (cost=5.50..5.50 rows=40 width=12) (actual > time=0.184..0.184 rows=111 loops=1) > > Buckets: 1024 Batches: 1 Memory Usage: 5kB > > Buffers: shared hit=3 > > -> Hash Join (cost=1.88..5.50 rows=40 width=12) > (actual time=0.056..0.148 rows=111 loops=1) > > Hash Cond: (ga.dwh_source_id = s.dwh_source_id) > > Buffers: shared hit=3 > > -> Seq Scan on google_analytics ga > (cost=0.00..2.89 rows=89 width=8) (actual time=0.007..0.028 rows=111 > loops=1) > > Buffers: shared hit=2 > > -> Hash (cost=1.39..1.39 rows=39 width=8) > (actual time=0.042..0.042 rows=56 loops=1) > > Buckets: 1024 Batches: 1 Memory Usage: > 3kB > > Buffers: shared hit=1 > > -> Seq Scan on sources s > (cost=0.00..1.39 rows=39 width=8) (actual time=0.005..0.020 rows=56 > loops=1) > > Buffers: shared hit=1 > > Planning time: 0.599 ms > > Execution time: 22486.216 ms > > > > Then try commenting out only one line in the case when and the query run > 10x faster : > > > > Subquery Scan on wv (cost=6.00..28598.17 rows=100000 width=36) (actual > time=0.839..2460.002 rows=100000 loops=1) > > Buffers: shared hit=3136 > > -> Limit (cost=6.00..11598.17 rows=100000 width=58) (actual > time=0.210..112.043 rows=100000 loops=1) > > Buffers: shared hit=3136 > > -> Hash Join (cost=6.00..1069811.24 rows=9228690 width=58) > (actual time=0.209..99.513 rows=100000 loops=1) > > Hash Cond: (wv_1.ga_id = ga.ga_id) > > Buffers: shared hit=3136 > > -> Seq Scan on web_visits wv_1 (cost=0.00..877005.78 > rows=20587078 width=50) (actual time=0.004..14.048 rows=100000 loops=1) > > Buffers: shared hit=3133 > > -> Hash (cost=5.50..5.50 rows=40 width=12) (actual > time=0.184..0.184 rows=111 loops=1) > > Buckets: 1024 Batches: 1 Memory Usage: 5kB > > Buffers: shared hit=3 > > -> Hash Join (cost=1.88..5.50 rows=40 width=12) > (actual time=0.058..0.146 rows=111 loops=1) > > Hash Cond: (ga.dwh_source_id = s.dwh_source_id) > > Buffers: shared hit=3 > > -> Seq Scan on google_analytics ga > (cost=0.00..2.89 rows=89 width=8) (actual time=0.007..0.025 rows=111 > loops=1) > > Buffers: shared hit=2 > > -> Hash (cost=1.39..1.39 rows=39 width=8) > (actual time=0.042..0.042 rows=56 loops=1) > > Buckets: 1024 Batches: 1 Memory Usage: > 3kB > > Buffers: shared hit=1 > > -> Seq Scan on sources s > (cost=0.00..1.39 rows=39 width=8) (actual time=0.006..0.021 rows=56 > loops=1) > > Buffers: shared hit=1 > > Planning time: 0.583 ms > > Execution time: 2467.484 ms > > > > Why this drop in performance for only one (in this simple example) > condition ? I do not really understand it. If I add more conditions to the > query (let say 1 or 2) it is also getting slower. And it’s not a few ms, it > is around 5 sec or so. (which is huge considering I only take in my example > 1/500 of my data with LIMIT. > > > > Before we deviate from the problem I have (which is why the sudden drop of > performance) let me clarify a few things about this query : > > - The purpose is not to rewrite it, with a join or whatever, the > case when actually comes from a function which is auto-generated by another > app we have > > - My example is pretty simple and regex expressions could be > replaced by equals, the real case when query contains way more complicated > regex > > - This is subset of my CASE WHEN, it is much bigger, I cut it at > the “bottleneck” point for this post. > > > > Thanks a lot. > > > > Best Regards, > > > > Kevin > > >