Question: consolidating strpos searches?

James Addison Sat, 04 Jan 2025 09:16:46 -0800

Hello and Happy New Year,

I'd like to validate or reject a performance-related PostgreSQL patch
idea, regarding multiple strpos calls that operate on a common text
input.


>From local testing using PGSQL 17.2, and searching the mailing lists,
my understanding is that each function expression (including any that
are duplicates) in a SQL query is evaluated independently.

To check that, I constructed a pessimal demonstration (psql session
snippet attached here for reference) by generating an extremely long
random text blob ending with the phrase "known suffix".

Running queries against that text locally I encountered the runtimes
(annotated in the SQL comments) below:

  WHERE strpos(v, 'known') > 0; -- 2 seconds
  WHERE strpos(v, 'known') > 0 AND strpos(v, 'suffix') > 0; -- 4 seconds
  WHERE strpos(v, 'known') > 0 AND strpos(v, 'suffix') > 0 AND
strpos(v, 'suffix') > 0; -- 6 seconds
  WHERE strpos(v, 'known') > 0 AND strpos(v, 'suffix') > 0 AND
strpos(v, 'suffix') > 0 AND strpos(v, 'suffix') > 0; -- 8 seconds

In other words: each additional strpos(value, ...) expression
increased the evaluation time by a similar, significant duration of
two seconds.  This seems to confirm the basis that each expression is
currently evaluated separately, by an independent read from the input
text.

I'd like to suggest the introduction of a documented multiple string
matching algorithm[1], to yield results for each of multiple strpos
calls while only reading through their common input text at-most once.

In the context of considering writing a patch: would the complexity of
implementing such a feature for PostgreSQL be worth the potential
performance benefits?  And either way, is there more I should learn
about and consider?  How would I provide convincing supporting
evidence if I do write a patch?

Thank you,
James

[1] Such as those described in "Flexible Pattern Matching in Strings",
authored by G Navarro and M Raffinot, 2002, published by Cambridge
University Press

testing=> CREATE EXTENSION pgcrypto;
CREATE EXTENSION
testing=> CREATE TEMPORARY TABLE test AS SELECT 
string_agg(gen_random_bytes(1024)::text, '') || 'known suffix' AS value FROM 
generate_series(1, 512*512);
SELECT 1
testing=> EXPLAIN ANALYZE SELECT COUNT(*) FROM test WHERE strpos(value, 
'known') > 0;
                                                 QUERY PLAN                     
                             
-------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=31.53..31.54 rows=1 width=8) (actual time=2113.529..2113.530 
rows=1 loops=1)
   ->  Seq Scan on test  (cost=0.00..30.40 rows=453 width=0) (actual 
time=2086.105..2113.521 rows=1 loops=1)
         Filter: (strpos(value, 'known'::text) > 0)
 Planning Time: 0.205 ms
 Execution Time: 2113.600 ms
(5 rows)

testing=> EXPLAIN ANALYZE SELECT COUNT(*) FROM test WHERE strpos(value, 
'known') > 0 AND strpos(value, 'suffix') > 0;
                                                 QUERY PLAN                     
                             
-------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=37.58..37.59 rows=1 width=8) (actual time=4196.635..4196.636 
rows=1 loops=1)
   ->  Seq Scan on test  (cost=0.00..37.20 rows=151 width=0) (actual 
time=4140.200..4196.626 rows=1 loops=1)
         Filter: ((strpos(value, 'known'::text) > 0) AND (strpos(value, 
'suffix'::text) > 0))
 Planning Time: 0.111 ms
 Execution Time: 4196.683 ms
(5 rows)

testing=> EXPLAIN ANALYZE SELECT COUNT(*) FROM test WHERE strpos(value, 
'known') > 0 AND strpos(value, 'suffix') > 0 AND strpos(value, 'suffix') > 0;
                                                              QUERY PLAN        
                                                      
--------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=44.38..44.39 rows=1 width=8) (actual time=6269.283..6269.284 
rows=1 loops=1)
   ->  Seq Scan on test  (cost=0.00..44.00 rows=151 width=0) (actual 
time=6179.754..6269.275 rows=1 loops=1)
         Filter: ((strpos(value, 'known'::text) > 0) AND (strpos(value, 
'suffix'::text) > 0) AND (strpos(value, 'suffix'::text) > 0))
 Planning Time: 0.101 ms
 Execution Time: 6269.329 ms
(5 rows)

testing=> EXPLAIN ANALYZE SELECT COUNT(*) FROM test WHERE strpos(value, 
'known') > 0 AND strpos(value, 'suffix') > 0 AND strpos(value, 'suffix') > 0 
AND strpos(value, 'suffix') > 0;
                                                                                
  QUERY PLAN                                                                    
              
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=51.18..51.19 rows=1 width=8) (actual time=8190.171..8190.172 
rows=1 loops=1)
   ->  Seq Scan on test  (cost=0.00..50.80 rows=151 width=0) (actual 
time=8072.736..8190.162 rows=1 loops=1)
         Filter: ((strpos(value, 'known'::text) > 0) AND (strpos(value, 
'suffix'::text) > 0) AND (strpos(value, 'suffix'::text) > 0) AND (strpos(value, 
'suffix'::text) > 0))
 Planning Time: 0.118 ms
 Execution Time: 8190.222 ms
(5 rows)

Question: consolidating strpos searches?

Reply via email to