Hi,

I suggest adding a new function, regexp_positions(),
which works exactly like regexp_matches(),
except it returns int4range[] start/end positions for *where* the matches 
occurs.

I first thought I could live without this function,
and just get the positions using strpos(),
but as Andreas Karlsson kindly helped me understand,
that naive idea doesn't always work.

Andreas provided this pedagogic example
to demonstrate the problem:

SELECT regexp_matches('caaabaaa', '(?<=b)(a+)', 'g');
regexp_matches
----------------
{aaa}
(1 row)

If we would try to use strpos() to find the position,
based on the returned matched substring,
we would get the wrong answer:

SELECT strpos('caaabaaa','aaa');
strpos
--------
      2
(1 row)

Sure, there is "aaa" at position 2,
but that's not where the match occurred,
since the (?<=b) means "positive lookbehind of the character b",
so the match actually occurred at position 6,
where there is a "b" before the "aaa".

Using regexp_positions(), we can now get the correct answer:

SELECT regexp_positions('caaabaaa', '(?<=b)(a+)', 'g');
regexp_positions
------------------
{"[6,9)"}
(1 row)

Some more examples from the regress/sql/strings.sql,
showing both regexp_matches() and regexp_positions()
for the same examples, as they both return the same structure,
but with different types:

SELECT regexp_matches('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 
'g');
regexp_matches
----------------
{bar,beque}
{bazil,barf}
(2 rows)

SELECT regexp_positions('foobarbequebazilbarfbonk', $re$(b[^b]+)(b[^b]+)$re$, 
'g');
   regexp_positions
-----------------------
{"[4,7)","[7,12)"}
{"[12,17)","[17,21)"}
(2 rows)

I've added documentation and tests.

Forgive me for just sending a patch without much discussion on the list,
but it was so easy to implement, so I thought an implementation can
help the discussion on if this is something we want or not.

A few words on the implementation:
I copied build_regexp_match_result() to a new function 
build_regexp_positions_result(),
and removed the string parts, replacing it with code to make ranges instead.
Maybe there are common parts that could be put into some helper-function,
but I think not, since the functions are two small for it to be worth it.

Thanks to David Fetter for the idea on using ranges.

Based on HEAD (f5a5773a9dc4185414fe538525e20d8512c2ba35).

/Joel

Attachment: 0001-regexp-positions.patch
Description: Binary data

Reply via email to