Re: Can Postgres beat Oracle for regexp_count?

2022-02-03 Thread Merlin Moncure
On Wed, Feb 2, 2022 at 4:26 PM Tom Lane  wrote:
>
> "David G. Johnston"  writes:
> > Given we don't have a regexp_count function this isn't surprising...
>
> FYI, it's there in HEAD.
>
> In the meantime, you could possibly do something like
>
> =# select count(*) from regexp_matches('My High Street', '([A-Z][a-z]+[\s])', 
> 'g');
>  count
> ---
>  2
> (1 row)

alternate version:
postgres=# select array_upper(regexp_split_to_array('My High Street My
High Street', 'My High Street'), 1) - 1;
 ?column?
──
2

can help to slide this into complex queries a little bit easier by
avoiding the aggregation :-).

merlin




Re: Can Postgres beat Oracle for regexp_count?

2022-02-03 Thread Shaozhong SHI
Hi, David,

Many thanks.

I am investigating into transformation of data quality validation through
automation with application of Postgres/PostGIS.

Regards,

David

On Thu, 3 Feb 2022 at 13:00, David G. Johnston 
wrote:

>
>
> On Thursday, February 3, 2022, Shaozhong SHI 
> wrote:
>
>>
>> Is it correct to say that this ?: construction of a regex can be applied
>> for checking whether cell values meet specifications?
>>
>>>
>>>
> It does exactly what our examples shows it does.  I don’t understand what
> you mean above but if that helps you remember its purpose, great.
>
> David J.
>
>


Re: Can Postgres beat Oracle for regexp_count?

2022-02-03 Thread David G. Johnston
On Thursday, February 3, 2022, Shaozhong SHI  wrote:

>
> Is it correct to say that this ?: construction of a regex can be applied
> for checking whether cell values meet specifications?
>
>>
>>
It does exactly what our examples shows it does.  I don’t understand what
you mean above but if that helps you remember its purpose, great.

David J.


Re: Can Postgres beat Oracle for regexp_count?

2022-02-03 Thread Shaozhong SHI
Many thanks, Tom,

select regexp_matches('My High Street', '(?:[A-Z][a-z]+[\s]*)+', 'g');
looks very interesting.

I did read the documentation, but found it is difficult to read.
Particularly, the documentation on the use ?: does not state clear sense.
There is only limited explanation on ?:.

Is it correct to say that this ?: construction of a regex can be applied
for checking whether cell values meet specifications?

Regards,

David

On Thu, 3 Feb 2022 at 05:59, Tom Lane  wrote:

> Shaozhong SHI  writes:
> > The following has been attempted but no luck.
>
> > select regexp_matches('My High Street', '([A-Z][a-z]+[\s]*)+', 'g')
> > It is intended to match 'My High Street, but it turned out only 'Street'
> > was matched.
>
> You've got the parentheses in the wrong place, ie inside not outside the
> "+" quantifier.  Per the fine manual [1], the result is determined by the
> last match of quantified capturing parens.
>
> You could avoid using any capturing parens, so that the result is
> the whole match:
>
> regression=# select regexp_matches('My High Street',
> '(?:[A-Z][a-z]+[\s]*)+', 'g');
>regexp_matches
> 
>  {"My High Street"}
> (1 row)
>
> or you could do
>
> regression=# select regexp_matches('My High Street',
> '(([A-Z][a-z]+[\s]*)+)', 'g');
>   regexp_matches
> ---
>  {"My High Street",Street}
> (1 row)
>
> but then you have two sets of capturing parens and you get results for
> both, so you might prefer
>
> regression=# select regexp_matches('My High Street',
> '((?:[A-Z][a-z]+[\s]*)+)', 'g');
>regexp_matches
> 
>  {"My High Street"}
> (1 row)
>
> In any case, there's no substitute for reading the manual.
>
> regards, tom lane
>
> [1]
> https://www.postgresql.org/docs/current/functions-matching.html#FUNCTIONS-POSIX-REGEXP
>


Re: Can Postgres beat Oracle for regexp_count?

2022-02-02 Thread Tom Lane
Shaozhong SHI  writes:
> The following has been attempted but no luck.

> select regexp_matches('My High Street', '([A-Z][a-z]+[\s]*)+', 'g')
> It is intended to match 'My High Street, but it turned out only 'Street'
> was matched.

You've got the parentheses in the wrong place, ie inside not outside the
"+" quantifier.  Per the fine manual [1], the result is determined by the
last match of quantified capturing parens.

You could avoid using any capturing parens, so that the result is
the whole match:

regression=# select regexp_matches('My High Street', '(?:[A-Z][a-z]+[\s]*)+', 
'g');
   regexp_matches   

 {"My High Street"}
(1 row)

or you could do

regression=# select regexp_matches('My High Street', '(([A-Z][a-z]+[\s]*)+)', 
'g');
  regexp_matches   
---
 {"My High Street",Street}
(1 row)

but then you have two sets of capturing parens and you get results for
both, so you might prefer

regression=# select regexp_matches('My High Street', '((?:[A-Z][a-z]+[\s]*)+)', 
'g');
   regexp_matches   

 {"My High Street"}
(1 row)

In any case, there's no substitute for reading the manual.

regards, tom lane

[1] 
https://www.postgresql.org/docs/current/functions-matching.html#FUNCTIONS-POSIX-REGEXP




Re: Can Postgres beat Oracle for regexp_count?

2022-02-02 Thread David G. Johnston
On Wed, Feb 2, 2022 at 10:26 PM Shaozhong SHI 
wrote:

>
> select regexp_matches('My High Street', '([A-Z][a-z]+[\s]*)+', 'g')
> It is intended to match 'My High Street, but it turned out only 'Street'
> was matched.
>
>
I'm too tired to find the documentation for why you saw your result but
basically you only have a single capturing parentheses pair and since
you've quantified that you end up with just the last capture that was found
- Street.  If you want to capture the entire found expression you need to
capture the quantifier.  So put parentheses around the entire regexp.

select regexp_matches('My High Street', '(([A-Z][a-z]+[\s]*)+)', 'g')

You now have a two element array, slots filled left-to-right based upon the
opening parenthesis.  So {"My High Street",Street}

To get rid of the undesired Street and only return a single element array
you need to make the inner parentheses non-capturing.

select regexp_matches('My High Street', '((?:[A-Z][a-z]+[\s]*)+)', 'g')

David J.


Re: Can Postgres beat Oracle for regexp_count?

2022-02-02 Thread Shaozhong SHI
Hi, Tom, Lane,

On Wed, 2 Feb 2022 at 22:26, Tom Lane  wrote:

> "David G. Johnston"  writes:
> > Given we don't have a regexp_count function this isn't surprising...
>
> FYI, it's there in HEAD.
>
> In the meantime, you could possibly do something like
>
> =# select count(*) from regexp_matches('My High Street',
> '([A-Z][a-z]+[\s])', 'g');
>  count
> ---
>  2
> (1 row)
>
> (Note that 2 is the correct answer given that there's no space
> after the third word; I trust Oracle agrees.)
>
> Can the whole 3 or 4 or 5 to be matched as 1?
>

The following has been attempted but no luck.

select regexp_matches('My High Street', '([A-Z][a-z]+[\s]*)+', 'g')
It is intended to match 'My High Street, but it turned out only 'Street'
was matched.

Regards,  David


Re: Can Postgres beat Oracle for regexp_count?

2022-02-02 Thread Tom Lane
"David G. Johnston"  writes:
> Given we don't have a regexp_count function this isn't surprising...

FYI, it's there in HEAD.

In the meantime, you could possibly do something like

=# select count(*) from regexp_matches('My High Street', '([A-Z][a-z]+[\s])', 
'g');
 count 
---
 2
(1 row)

(Note that 2 is the correct answer given that there's no space
after the third word; I trust Oracle agrees.)

regards, tom lane




Re: Can Postgres beat Oracle for regexp_count?

2022-02-02 Thread David G. Johnston
On Wed, Feb 2, 2022 at 1:20 PM Shaozhong SHI  wrote:

> It has been found that regexp_count works brilliantly in Oracle.
>

What query exactly did you execute in Oracle that you wish to see if an
equivalent can be formulated in PostgreSQL?

>
> However, it is not easy to replicate that in Postgres.
>

Given we don't have a regexp_count function this isn't surprising...



>   The following codes have been experimented but without any luck.
>
> select regexp_matches('My High Street', '([A-Z][a-z]+[\s])', 'g')
>
> select regexp_matches('My High Street', '([A-Z][a-z]+[\s]*)', 'g')
>

See my first point.

David J.


Can Postgres beat Oracle for regexp_count?

2022-02-02 Thread Shaozhong SHI
It has been found that regexp_count works brilliantly in Oracle.

However, it is not easy to replicate that in Postgres.  The following codes
have been experimented but without any luck.

select regexp_matches('My High Street', '([A-Z][a-z]+[\s])', 'g')

select regexp_matches('My High Street', '([A-Z][a-z]+[\s]*)', 'g')

County occurrences of 'My High Street' in one of the following strings:

'My High Street'1
'' 0
'My High Street My High Street'   2

Can anyone enlighten all of us?

Regards,

David