[Python-ideas] Regex timeouts

2022-02-15 Thread Ben Rudiak-Gould
On Tue, Feb 15, 2022 at 6:13 PM Chris Angelico  wrote:

> Once upon a time, a "regular expression" was a regular grammar. That is no
> longer the case.
>

I use "regex" for the weird backtracking minilanguages and deliberately
never call them "regular expressions". (I was under the impression that the
Perl documentation observed the same convention but that doesn't seem to be
true.)

> Once upon a time, a regular expression could be broadly compatible with
> multiple different parser engines.
>

I think there never was such a time, at least not if you wanted syntactic
compatibility.

> Is there any sort of standardization of regexp syntax and semantics[...]?
>

I'm not sure there needs to be. There is no standardization of
programming-language syntax in general, unless you count conventions like
{...} for blocks which Python ignores.

The problem as I see it isn't that the syntax isn't standardized. It's that
the syntax, to the extent it is standardized, is terrible. The only
traditional Unix tool whose regular expression syntax isn't godawful is
lex, and unfortunately that isn't the one that caught on.


[Python-ideas] Re: Regex timeouts

2022-02-15 Thread David Mertz, Ph.D.
I know this is probably too much self-promotion, but I really enjoyed
writing this less than a year ago: https://gnosis.cx/regex/ (The Puzzling
Quirks of Regular Expressions).

It's like other puzzle books, but for programmers. You should certainly
still get Friedl's book if you don't have it. You can read mine online in a
couple of versions. But the printed one is nice looking, and the artwork
shows better.

In particular, I'd love to send signed copies to any of the folks with
familiar names here. Maybe pay me for shipping, but you don't have to
(within the US, media mail is cheap, elsewhere in the world is expensive,
unfortunately). Or buy from Lulu if you don't care about an autograph... Or
read it for free, of course.

I get a little bit into the obscure theory of this thread. But mostly I
think you'll just laugh, and experience occasional confusion.

On Tue, Feb 15, 2022, 9:46 PM Tim Peters  wrote:

> [Chris Angelico ]
> > Is there any sort of standardization of regexp syntax and semantics,
>
> Sure. "The nice thing about standards is that you have so many to
> choose from" ;-) For example, POSIX defines a regexp flavor so it can
> specify what things like grep do. The ECMAScript standard defines its
> own flavor, ditto Java, etc.
>
>
> > or does everyone just extend it in their own directions, borrowing
> > ideas from each other to give some not-always-false assurance of
> > compatibility?
>
> In real life, everyone strives to copy what Perl does, because regexps
> are ubiquitous in Perl and Larry Wall worked hard at the time to put
> in every useful regexp feature everyone else already had, but with
> more uniform syntax. Perl's love of regexps strikes me as clinically
> pathological, but that doesn't diminish my respect for the relative
> sanity Perl brought to this area.
>
> There's a little bit flowing _into_ Perl too. An example close to my
> heart: Guido and I obtained Larry's promise that he'd never use a (?P
> prefix, so that Python could use that for its own purposes. Which
> amounted to introducing the concept of named groups. Which Perl in
> turn later adopted - although Perl dropped the "P" for named groups.
>
> Ah - I see MRAB replied while I was typing this, saying much the same.
> But I'm wordier, so I won't waste the effort ;-)


[Python-ideas] Re: Regex timeouts

2022-02-15 Thread Tim Peters
[Chris Angelico ]
> Is there any sort of standardization of regexp syntax and semantics,

Sure. "The nice thing about standards is that you have so many to
choose from" ;-) For example, POSIX defines a regexp flavor so it can
specify what things like grep do. The ECMAScript standard defines its
own flavor, ditto Java, etc.


> or does everyone just extend it in their own directions, borrowing
> ideas from each other to give some not-always-false assurance of
> compatibility?

In real life, everyone strives to copy what Perl does, because regexps
are ubiquitous in Perl and Larry Wall worked hard at the time to put
in every useful regexp feature everyone else already had, but with
more uniform syntax. Perl's love of regexps strikes me as clinically
pathological, but that doesn't diminish my respect for the relative
sanity Perl brought to this area.

There's a little bit flowing _into_ Perl too. An example close to my
heart: Guido and I obtained Larry's promise that he'd never use a (?P
prefix, so that Python could use that for its own purposes. Which
amounted to introducing the concept of named groups. Which Perl in
turn later adopted - although Perl dropped the "P" for named groups.

Ah - I see MRAB replied while I was typing this, saying much the same.
But I'm wordier, so I won't waste the effort ;-)


[Python-ideas] Re: Regex timeouts

2022-02-15 Thread MRAB

On 2022-02-16 02:11, Chris Angelico wrote:
> On Wed, 16 Feb 2022 at 12:56, Tim Peters wrote:
>> Regexps keep "evolving"...
>
> Once upon a time, a "regular expression" was a regular grammar. That
> is no longer the case.
>
> Once upon a time, a regular expression could be broadly compatible
> with multiple different parser engines. That is being constantly
> eroded.
>
> So far, I think they still count as expressions. That's about all we
> can depend on.
>
> Is there any sort of standardization of regexp syntax and semantics,
> or does everyone just extend it in their own directions, borrowing
> ideas from each other to give some not-always-false assurance of
> compatibility?

The only regex standard I know of is the POSIX standard, but I don't 
know of a common implementation that follows it. Most tend to follow 
Perl, although Perl borrowed named groups from Python, with a 
slightly different syntax ("(?<name>...)" instead of "(?P<name>...)").
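
For reference, a quick sketch of the named-group spelling in Python's re
(the date pattern is just an illustration):

import re

m = re.match(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", "2022-02-15")
print(m.group("year"), m["month"], m.groupdict())
# 2022 02 {'year': '2022', 'month': '02', 'day': '15'}
# Perl (5.10+) and .NET spell the same groups (?<year>...) etc.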



[Python-ideas] Re: Regex timeouts

2022-02-15 Thread Chris Angelico
On Wed, 16 Feb 2022 at 12:56, Tim Peters  wrote:
> Regexps keep "evolving"...

Once upon a time, a "regular expression" was a regular grammar. That
is no longer the case.

Once upon a time, a regular expression could be broadly compatible
with multiple different parser engines. That is being constantly
eroded.

So far, I think they still count as expressions. That's about all we
can depend on.

Is there any sort of standardization of regexp syntax and semantics,
or does everyone just extend it in their own directions, borrowing
ideas from each other to give some not-always-false assurance of
compatibility?

ChrisA


[Python-ideas] Re: Regex timeouts

2022-02-15 Thread Tim Peters
[Steven D'Aprano ]
> After this thread, I no longer trust that "easy" regexes will do what
> they "obviously" look like they should do :-(
>
> I'm not trying to be funny or snarky. I *thought* I had a reasonable
> understanding of regexes, and now I have learned that I don't, and that
> the regexes I've been writing don't do what I thought they did, and
> presumedly the only reason they haven't blown up in my face (either
> performance-wise, or the wrong output) is blind luck.

Reading Friedl's book is a cure for the confusion, but not for the angst ;-)

I believe the single most practical addition in recent decades has
been the introduction of "possessive quantifiers". This is a variant of
the "greedy" quantifiers that does what most people at the start
_believe_ they do: one-and-done. After its initial match, backtracking
into it fails. So, e.g., \s++ matches the longest string of whitespace
at the time, period. Why "++"? Regexps ;-) It's essentially gibberish
syntax that previously didn't have a sensible meaning.

For example,

>>> regex.search("^x+[a-z]{4}k", "xk")


is what we're used to if we're paying attention: sucking up as many
x's as possible fails to match (there's nothing for [a-z]{4} to match
except the trailing "k"). But we keep backtracking into it, trying to
match one less "x" at a time, until [a-z]{4} finally matches the
rightmost 4 x's.

But make it possessive and the match as a whole  fails right away:

>>> regex.search("^x++[a-z]{4}k", "xk")

++ refuses to give back any of what it matched the first time.

At this point there are probably more regexp engines that support this
feature than don't. Python's re does not, but the regex extension
does. Cutting unwanted chances for backtracking greatly cuts the
chance of stumbling into timing disasters.
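
For the curious, a small runnable sketch of the difference using the
third-party regex module (the test string is arbitrary):

import regex  # third-party: pip install regex

s = "x" * 20 + "k"

# Greedy: x+ grabs all the x's, then backtracks until [a-z]{4} can claim
# the last four of them, so the overall match succeeds.
print(regex.search(r"^x+[a-z]{4}k", s))      # matches

# Possessive: x++ gives nothing back, so [a-z]{4} finds nothing to match
# and the attempt fails immediately, with no backtracking.
print(regex.search(r"^x++[a-z]{4}k", s))     # None

# ++ is sugar for an atomic group:
print(regex.search(r"^(?>x+)[a-z]{4}k", s))  # None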

Where does that leave Python? Pretty much aging itself into
obsolescence. Regexps keep "evolving", it appears Fredrik lost
interest in keeping up long before he died, and nobody else has
stepped up. regex _has_ kept up, but isn't in the core. So "install
regex" is ever more the best advice.

Note that just slamming possessive quantifiers into CPython's engine
isn't a good approach for more than just the obvious reasons:
possessive quantifiers are themselves just syntax sugar (or chili
peppers) for one instance of a more general new feature, "atomic
groups". Another that's all but a de facto industry standard now,
which Python's re doesn't support (but regex does). Putting just part
of that in is half-assed.


> Now I have *three* problems :-(

You're quite welcome ;-)


[Python-ideas] Re: Regex timeouts

2022-02-15 Thread Chris Angelico
On Wed, 16 Feb 2022 at 10:15, Steven D'Aprano  wrote:
>
> On Tue, Feb 15, 2022 at 11:51:41PM +0900, Stephen J. Turnbull wrote:
>
> > scanf just isn't powerful enough.  For example, consider parsing user
> > input dates: scanf("%d/%d/%d", &year, &month, &day).  This is nice and
> > simple, but handling "2022-02-15" as well requires a bit of thinking
> > and several extra statements in C.  In Python, I guess it would
> > probably look something like
> >
> > year, sep1, month, sep2, day = scanf("%d%c%d%c%d")
> > if not ('/' == sep1 == sep2 or '-' == sep1 == sep2):
> > raise DateFormatUnacceptableError
> > # range checks for month and day go here
>
> Assuming that scanf raises if there is no match, I would probably go
> with:

Having scanf raise is one option; another option would be to have it
return a partial result, which would raise ValueError when unpacked in
this simple way. (Partial results are FAR easier to debug than a
simple "didn't match", plus they can be extremely useful in some
situations.)

> try:
> # Who writes ISO-8601 dates using slashes?
> day, month, year = scanf("%d/%d/%d")
> if ALLOW_TWO_DIGIT_YEARS and len(year) == 2:
> year = "20" + year
> except ScanError:
> year, month, day = scanf("%d-%d-%d")

It all depends on what your goal is. Do you want to support multiple
different formats (d/m/y, y-m-d, etc)? Do you want one format with
multiple options for delimiter? Is it okay if someone mismatches
delimiters?

Most likely, I'd not care if someone uses y/m-d, but I wouldn't allow
d/m/y or m/d/y, so I'd write it like this:

year, month, day = scanf("%d%*[-/]%d%*[-/]%d")

But realistically, if we're doing actual ISO 8601 date parsing, then
*not one of these is correct*, and we should be using an actual ISO
8601 library :) The simple cases like log file parsing are usually
consuming the output of exactly one program, so you can mandate the
delimiter completely. Here's something that can parse the output of
'git blame':

commit, name, y, mo, d, h, mi, s, tz, line, text = \
    scanf("%s (%s %d-%d-%d %d:%d:%d %d %d) %s")

(Of course, you should use --porcelain instead, but this is an example.)

There's a spectrum of needs, and a spectrum of tools that can fulfil
them. At one extreme, simple method calls, the "in" operator, etc -
very limited, very fast, easy to read. At the other extreme, full-on
language parsers with detailed grammars. In between? Well, sscanf is a
bit simpler than regexp, REXX's parse is probably somewhere near
sscanf, SNOBOL is probably a bit to the right of regexp, etc, etc,
etc. We shouldn't have to stick to a single tool just because it's
capable of spanning a wide range.

> I think that
>
> year, sep1, month, sep2, day = \
>     re.match(r"(\d+)([-/])(\d+)([-/])(\d+)").groups()
>
> might do it (until Tim or Chris tell me that actually is wrong).
>
> Or use \2 as you suggest later on.

Yeah, \2 much more clearly expresses the intent of "take either of
these characters, and then match another of that character".

ChrisA


[Python-ideas] Re: Regex timeouts

2022-02-15 Thread Chris Angelico
On Wed, 16 Feb 2022 at 09:28, Steven D'Aprano  wrote:
>
> On Wed, Feb 16, 2022 at 01:02:44AM +1100, Chris Angelico wrote:
>
> > Yeah, regexes always look terrible when they're used for simple
> > examples :) But try matching a line that has (somewhere in it) the
> > word "spam", then whitespace, then a number (or if you prefer: then a
> > sequence of ASCII digits). It's easy to write "spam\s+[0-9]+"
>
> After this thread, I no longer trust that "easy" regexes will do what
> they "obviously" look like they should do :-(
>
> I'm not trying to be funny or snarky.

(That must be rare!)

> I *thought* I had a reasonable
> understanding of regexes, and now I have learned that I don't, and that
> the regexes I've been writing don't do what I thought they did, and
> presumedly the only reason they haven't blown up in my face (either
> performance-wise, or the wrong output) is blind luck.
>
> Now I have *three* problems :-(
>

I think it's one of those cases where it normally doesn't matter that
they don't technically do quite what you thought. Pretending that a
regex matches in a simpler way than it actually does is like
pretending that the earth is a sphere: technically wrong, but almost
always close enough. It's only in the rare cases that it matters, and
they usually only show up with the regexps that are so complicated
that I wouldn't trust them to not be buggy anyway. (Debugging a regexp
is a PAIN, when your main response is just "nope didn't match".)

ChrisA


[Python-ideas] Re: Regex timeouts

2022-02-15 Thread Steven D'Aprano
On Tue, Feb 15, 2022 at 11:51:41PM +0900, Stephen J. Turnbull wrote:

> scanf just isn't powerful enough.  For example, consider parsing user
> input dates: scanf("%d/%d/%d", &year, &month, &day).  This is nice and
> simple, but handling "2022-02-15" as well requires a bit of thinking
> and several extra statements in C.  In Python, I guess it would
> probably look something like
> 
> year, sep1, month, sep2, day = scanf("%d%c%d%c%d")
> if not ('/' == sep1 == sep2 or '-' == sep1 == sep2):
> raise DateFormatUnacceptableError
> # range checks for month and day go here

Assuming that scanf raises if there is no match, I would probably go 
with:

try:
# Who writes ISO-8601 dates using slashes?
day, month, year = scanf("%d/%d/%d")
if ALLOW_TWO_DIGIT_YEARS and len(year) == 2:
year = "20" + year
except ScanError:
year, month, day = scanf("%d-%d-%d")


> which isn't too bad, though.  But
> 
> year, month, day = re.match(r"(\d+)[-/](\d+)[-/](\d+)").groups()
> if not sep1 == sep2:
> raise DateFormatUnacceptableError
> # range checks for month and day go here

Doesn't that raise an exception?

NameError: name 'sep1' is not defined

I think that 

year, sep1, month, sep2, day = \
    re.match(r"(\d+)([-/])(\d+)([-/])(\d+)").groups()

might do it (until Tim or Chris tell me that actually is wrong).

Or use \2 as you suggest later on.

> expresses the intent a lot more clearly, I think.

No, I don't think it does. The scanf (hypothetical) solution is a lot 
closer to my intent.

But yes, regexes are more powerful: you can implement scanf using 
regexes, but you can't implement regexes using scanf.
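
For instance, here is a minimal sketch of a scanf built on re. It supports
only %d, %s and %c, takes an explicit text argument, and is purely
illustrative, not a proposal:

import re

_SPECIFIERS = {"%d": r"([-+]?\d+)", "%s": r"(\S+)", "%c": r"(.)"}

def scanf(fmt, text):
    # Escape the literal parts of the format, then swap each conversion
    # specifier for a capture group.
    pattern = re.escape(fmt)
    for spec, group in _SPECIFIERS.items():
        pattern = pattern.replace(re.escape(spec), group)
    match = re.match(pattern, text)
    if match is None:
        raise ValueError(f"{text!r} does not match format {fmt!r}")
    return match.groups()

print(scanf("%d-%d-%d", "2022-02-15"))   # ('2022', '02', '15')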


-- 
Steve


[Python-ideas] New convenience attribute pathlib.Path.stems

2022-02-15 Thread Clay Gerrard
>>> from pathlib import Path
>>> p = Path('/etc/swift/object.ring.gz')
>>> p.suffix
'.gz'
>>> p.suffixes
['.ring', '.gz']
>>> p.stem
'object.ring'
>>> p.stems
Traceback (most recent call last):
  File "", line 1, in 
AttributeError: 'PosixPath' object has no attribute 'stems'

I think it would have been nice if .stems = ['object', '.ring']

The idiomatic answer for this person's definition of "true" stem:

https://stackoverflow.com/questions/31890341/clean-way-to-get-the-true-stem-of-a-path-object

... seemed to [ab]use with_suffix(''), but .stems[0] might be more obvious.
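
For comparison, here is that workaround spelled out as a loop (a sketch of
the with_suffix('') approach, stripping every suffix):

from pathlib import Path

p = Path('/etc/swift/object.ring.gz')
while p.suffix:           # peel off '.gz', then '.ring'
    p = p.with_suffix('')
print(p.name)             # 'object', i.e. the proposed p.stems[0]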

What do you think?

-- 
Clay Gerrard


[Python-ideas] Re: Regex timeouts

2022-02-15 Thread Steven D'Aprano
On Wed, Feb 16, 2022 at 01:02:44AM +1100, Chris Angelico wrote:

> Yeah, regexes always look terrible when they're used for simple
> examples :) But try matching a line that has (somewhere in it) the
> word "spam", then whitespace, then a number (or if you prefer: then a
> sequence of ASCII digits). It's easy to write "spam\s+[0-9]+" 

After this thread, I no longer trust that "easy" regexes will do what 
they "obviously" look like they should do :-(

I'm not trying to be funny or snarky. I *thought* I had a reasonable 
understanding of regexes, and now I have learned that I don't, and that 
the regexes I've been writing don't do what I thought they did, and 
presumedly the only reason they haven't blown up in my face (either 
performance-wise, or the wrong output) is blind luck.

Now I have *three* problems :-(


-- 
Steve


[Python-ideas] Re: Regex timeouts

2022-02-15 Thread J.B. Langston
How embarrassing... I apologize for all the signature garbage at the end of my 
message.


[Python-ideas] Re: Regex timeouts

2022-02-15 Thread Chris Angelico
On Wed, 16 Feb 2022 at 01:54, Stephen J. Turnbull wrote:
> The Zawinski quote is motivated by the perception that people seem to
> think that simplicity lies in minimizing the number of tools you need
> to learn.  REXX and SNOBOL pattern matching quite a bit more
> specialized to particular tools than regexps.  That is, all regexp
> implementations support the same basic language which is sufficient
> for most tasks most programmers want regexps for.
>

The problem is that that's an illusion. If you restrict yourself to
the subset that's supported by every regexp implementation, you'll
quickly find tasks that you can't handle. If you restrict yourself to
what you THINK is the universal subset, you end up with something that
has a subtle difference when you use it somewhere else (I've had this
problem with grep and Python, where a metacharacter in one was a plain
character in the other - also frequently a problem between grep and
sed, with the consequent "what do I need to escape?" problem).

But as the OP has found, regexps are a hammer that, for some nail-like
problems, will whack an arbitrary number of times before hitting. So I
guess the question isn't "why are regular expressions so popular" but
"why are other things not ALSO popular". I honestly think that scanf
parsing, if implemented ad-hoc by different programming languages and
extended to their needs, would end up no less different from each
other than different regexp engines are - the most-used parts would
also be the most-compatible, just like with regexps.

ChrisA


[Python-ideas] Re: Regex timeouts

2022-02-15 Thread J.B. Langston
Tim Peters wrote:
> """
> Some people, when confronted with a problem, think “I know, I'll use
> regular expressions.”  Now they have two problems.
> - Jamie Zawinski
> """

Maybe so, but I'm committed now :).  I have dozens of regexes to parse specific 
log messages I'm interested in. I made a little DSL that uses regexes with 
capture groups, and if the regex matches, takes the resulting groupdict and 
optionally applies further transformations on the individual fields. This 
allows me to very concisely specify what I want to extract before doing further 
analysis and aggregation on the resulting fields.  For example:

flush_end = Rule(
    Capture(
        # Matches log lines like these two examples:
        # Completed flushing /u01/data02/tb_tbi_project02_prd/data_launch_index-4a5f72725b7211eaab635720a1b8a299/aa-26507-bti-Data.db (46.528MiB) for commitlog position CommitLogPosition(segmentId=1615955816662, position=223538288)
        # Completed flushing /dse/data02/OpsCenter/rollup_state-7b621931ab7511e8b862810a639403e5/bb-21969-bti-Data.db (7.763MiB/2.197MiB on disk/1 files) for commitlog position CommitLogPosition(segmentId=1637403836277, position=9927158)
        # The group names "filename" and "files" are assumed here; the other
        # names match the Convert() fields below.
        r"Completed flushing (?P<filename>[^ ]+) "
        r"\((?P<bytes_flushed>[^)/]+)(/(?P<bytes_on_disk>[^ ]+) on disk/"
        r"(?P<files>[^ ]+) files)?\) for commitlog position "
        r"CommitLogPosition\(segmentId=(?P<commitlog_segment>[^,]+), "
        r"position=(?P<commitlog_position>[^)]+)\)"
    ),
    Convert(
        normval,
        "bytes_flushed",
        "bytes_on_disk",
        "commitlog_segment",
        "commitlog_position",
    ),
    table_from_sstable,
)

I know there are specialized tools like logstash but it's nice to be able to 
specify the extraction and subsequent analysis together in Python. 

> reason to change that. Naive regexps are both clumsy and prone to bad
> timing in many tasks that "should be" very easy to express. For
> example, "now match up to the next occurrence of 'X'". In SNOBOL and
> Icon, that's trivial. 75% of regexp users will write ".*X", with scant
> understanding that it may match way more than they intended.
> Another 20% will write ".*?X", with scant understanding that may
> extend beyond _just_ "the next" X in some cases. That leaves the happy
> 5% who write "[^X]*X", which finally says what they intended from the
> start.

If you look in my regex in the example above, you will see that the "[^X]*X" is 
exactly what I did. The pathological case arose from a simple typo where I had 
an extra + after a capture group that I failed to notice, and which somehow 
worked correctly on the expected input but ran forever when the expected 
terminating character appeared more times than expected in the input string.
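
For anyone who hasn't been bitten by this yet, a toy illustration of that
failure mode (the pattern and input are made up, not my actual rule):

import re
import time

text = "a" * 22 + "!"    # the expected terminator ("b") never appears

patterns = [
    r"a+b",              # fails quickly: only one way to split up the a's
    r"(a+)+b",           # one stray "+" means exponential backtracking
]
for pattern in patterns:
    start = time.perf_counter()
    re.search(pattern, text)
    print(pattern, time.perf_counter() - start)
# The second pattern's running time roughly doubles with each extra "a".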


[Python-ideas] Re: Regex timeouts

2022-02-15 Thread J.B. Langston
>> A regex that's vulnerable to pathological behavior is a DoS attack waiting
>> to happen. Especially when used for parsing log data (which might contain
>> untrusted data). If possible, we should make it harder for people to shoot
>> themselves in the feet.
And this is exactly what happened to me. I have a job that
automatically parses logs as they are uploaded, and a log came in that had
an unexpected pattern that triggered pathological behavior in my regex that
did not occur when processing the expected input.  This caused the import
pipeline to back up for many hours before I noticed and fixed it.


> While definitely not as bad and not as likely as SQL injection, I think
> the possibility of regex DoS is totally missing in the stdlib re docs.
> Should there be something added there about if you need to put user input
> into an expression, best practice is to re.escape it?
>

Unless I am missing something, I don't see how re.escape would have helped
me here. I wasn't trying to treat arbitrary input as a regex, so escaping
the regex characters in it wouldn't have done anything to help me. The
problem is that a regex *that I wrote* had a bug in it that caused
pathological behavior, but it wasn't found during testing because it only
occurred when matching against an unexpected input.
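
For reference, re.escape only covers the injection case, i.e. making
untrusted text match literally instead of being interpreted as pattern
syntax. A quick sketch:

import re

user_input = "spam+eggs (42)"              # untrusted text
pattern = re.escape(user_input)            # treat it as literal characters
print(pattern)   # e.g. spam\+eggs\ \(42\)  (exact escaping varies by version)
print(bool(re.search(pattern, "... spam+eggs (42) ...")))     # True
print(bool(re.search(user_input, "... spam+eggs (42) ...")))  # False: "+" and "(...)" are pattern syntax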



On Mon, Feb 14, 2022 at 3:59 PM Nick Timkovich wrote:

> A regex that's vulnerable to pathological behavior is a DoS attack waiting
>> to happen. Especially when used for parsing log data (which might contain
>> untrusted data). If possible, we should make it harder for people to shoot
>> themselves in the feet.
>>
>
> While definitely not as bad and not as likely as SQL injection, I think
> the possibility of regex DoS is totally missing in the stdlib re docs.
> Should there be something added there about if you need to put user input
> into an expression, best practice is to re.escape it?
>
>



[Python-ideas] Re: Regex timeouts

2022-02-15 Thread MRAB

On 2022-02-15 06:05, Tim Peters wrote:
> [Steven D'Aprano]
>> I've been interested in the existence of SNOBOL string scanning for
>> a long time, but I know very little about it.
>>
>> How does it differ from regexes, and why have programming languages
>> pretty much standardised on regexes rather than other forms of string
>> matching?
>
> What we call "regexps" today contain all sorts of things that aren't
> in the original formal definition of "regular expressions". For
> example, even the ubiquitous "^" and "$" (start- and end-of-line
> assertions) go beyond what the phrase formally means.
>
> So the question is ill-defined. When Perl added recursive regular
> expressions, I'm not sure there's any real difference in theoretical
> capability remaining. Without that, though, and for example, you can't
> write a regular expression that matches strings with balanced
> parentheses ("regexps can't count"), while I earlier posted a simple
> 2-liner in SNOBOL that implements such a thing (patterns in SNOBOL can
> freely invoke other patterns, including themselves).
>
> As to why regexps prevailed, traction! They are useful tools, and
> _started_ life as pretty simple things, with small, elegant, and
> efficient implementations. Feature creep and "faster! faster! faster!"
> turned the implementations more into bottomless pits now ;-)
>
> Adoption breeds more adoption in the computer world. They have no real
> competition anymore. The same sociological illness has also cursed us,
> e.g., with an eternity of floating point signed zeroes ;-)
>
> Chris didn't say this, but I will: I'm amazed that things much
> _simpler_ than regexps, like his scanf and REXX PARSE
> examples, haven't spread more. Simple solutions to simple problems are
> very appealing to me. Although, to be fair, I get a kick too out of
> massive overkill ;-)

Regexes were simple to start with, so only a few metacharacters were 
needed, the remaining characters being treated as literals.


As new features were added, the existing metacharacters were used in new 
ways that had been illegal until then in order to remain 
backwards-compatible.


Add to that that there are multiple implementations with differing (and 
sometimes only slightly differing) features and behaviours.


It's a good example of evolution: often messy, and resulting in clunky 
designs.



[Python-ideas] Re: Regex timeouts

2022-02-15 Thread Stephen J. Turnbull
Tim Peters writes:

 > Chris didn't say this, but I will: I'm amazed that things much
 > _simpler_ than regexps, like his scanf and REXX PARSE
 > examples, haven't spread more.

scanf just isn't powerful enough.  For example, consider parsing user
input dates: scanf("%d/%d/%d", &year, &month, &day).  This is nice and
simple, but handling "2022-02-15" as well requires a bit of thinking
and several extra statements in C.  In Python, I guess it would
probably look something like

year, sep1, month, sep2, day = scanf("%d%c%d%c%d")
if not ('/' == sep1 == sep2 or '-' == sep1 == sep2):
raise DateFormatUnacceptableError
# range checks for month and day go here

which isn't too bad, though.  But

year, month, day = re.match(r"(\d+)[-/](\d+)[-/](\d+)").groups()
if not sep1 == sep2:
raise DateFormatUnacceptableError
# range checks for month and day go here

expresses the intent a lot more clearly, I think.  Sure, it's easy to
write uninterpretable regexps, but up to that point regexps are very
expressive.  And that example can be reduced to one line (plus the
comment) at the expense of a less symmetric, slightly less readable
expression like r"(\d+)([-/])(\d+)\2(\d+)".  Some folks might like
that one better.
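
For concreteness, a runnable sketch of that one-liner form (the function
name and error handling are mine, purely illustrative):

import re

class DateFormatUnacceptableError(ValueError):
    pass

def parse_date(text):
    # \2 forces the second separator to be whatever the first one was,
    # so "2022-02-15" and "2022/02/15" pass but "2022-02/15" does not.
    m = re.fullmatch(r"(\d+)([-/])(\d+)\2(\d+)", text)
    if m is None:
        raise DateFormatUnacceptableError(text)
    year, _, month, day = m.groups()
    # range checks for month and day go here
    return int(year), int(month), int(day)

print(parse_date("2022-02-15"))   # (2022, 2, 15)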

 > Simple solutions to simple problems are very appealing to me.

The Zawinski quote is motivated by the perception that people seem to
think that simplicity lies in minimizing the number of tools you need
to learn.  REXX and SNOBOL pattern matching are quite a bit more
specialized to particular tools than regexps.  That is, all regexp
implementations support the same basic language which is sufficient
for most tasks most programmers want regexps for.

I think you'd need to implement such a facility in a very popular
scripting language such as sh, Perl, or Python for it to have the
success of regexps.

 > Although, to be fair, I get a kick too out of massive overkill ;l-)

Don't we all, though?

Steve



[Python-ideas] Re: Regex timeouts

2022-02-15 Thread Chris Angelico
On Wed, 16 Feb 2022 at 00:55, Steven D'Aprano  wrote:
>
> On Tue, Feb 15, 2022 at 05:39:33AM -0600, Tim Peters wrote:
>
> > ([^s]|s(?!pam))*spam
> >
> > Bingo. That pattern is easy enough to understand
>
> You and I have very different definitions of the word "easy" :-)
>
> > (if not to invent the
> > first time): we can chew up a character if it's not an "s", or if it
> > is an "s" but one _not_ followed immediately by "pam".
>
> It is times like this that I am reminded why I prefer to just call
> string.find("spam") :-)
>

Yeah, regexes always look terrible when they're used for simple
examples :) But try matching a line that has (somewhere in it) the
word "spam", then whitespace, then a number (or if you prefer: then a
sequence of ASCII digits). It's easy to write "spam\s+[0-9]+" and not
nearly as easy to write it with method calls. So it makes sense that,
when you add a restriction like "the word spam must be the first
instance of that in the line" (maybe not common with words, but it
certainly would be if you're scanning for a colon or other separator),
it should still be written that way.
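
A quick sketch of what I mean (the sample line is invented):

import re

line = "INFO consumed spam   1523 times today"
m = re.search(r"spam\s+([0-9]+)", line)
if m:
    print(m.group(1))          # 1523

# The nearest method-call spelling is already a lot clumsier:
rest = line[line.find("spam") + len("spam"):].lstrip()
digits = ""
for ch in rest:
    if not ch.isdigit():
        break
    digits += ch
print(digits)                  # 1523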

To be honest, I don't think I've ever used method calls for
complicated parsing. It's just way too messy. Much easier to reach for
a regex, sscanf pattern, or other tool - even if it's not technically
perfect.

ChrisA


[Python-ideas] Re: Regex timeouts

2022-02-15 Thread Steven D'Aprano
On Tue, Feb 15, 2022 at 05:39:33AM -0600, Tim Peters wrote:

> ([^s]|s(?!pam))*spam
> 
> Bingo. That pattern is easy enough to understand

You and I have very different definitions of the word "easy" :-)

> (if not to invent the
> first time): we can chew up a character if it's not an "s", or if it
> is an "s" but one _not_ followed immediately by "pam".

It is times like this that I am reminded why I prefer to just call 
string.find("spam") :-)


-- 
Steve


[Python-ideas] Re: Regex timeouts

2022-02-15 Thread Tim Peters
[Tim, on trying to match only the next instance of "spam"]
> Assertions aren't needed, but it is nightmarish to get right.

Followed by a nightmare that got it wrong. My apologies - that's what
I get for trying to show off ;-)

It's actually far easier if assertions are used, and I'm too old to
bother trying to repair the non-assertion mess:

([^s]|s(?!pam))*spam

Bingo. That pattern is easy enough to understand (if not to invent the
first time): we can chew up a character if it's not an "s", or if it
is an "s" but one _not_ followed immediately by "pam".

The "s" at the start of a matched "spam" doesn't satisfy either
alternative, so it can't go beyond the leftmost following "spam".

Of course the same trick can be used to find just the next occurrence
of any fixed string of at least two characters.
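
A quick demonstration of the difference (the test string is arbitrary):

import re

text = "xxspamyyspamzz"

# Greedy: runs through the *last* "spam".
print(re.match(r".*spam", text).group())                 # xxspamyyspam

# Lazy: stops at the first "spam" here, though a larger pattern can
# still drag it past that point when backtracking is forced later.
print(re.match(r".*?spam", text).group())                # xxspam

# The assertion form: can never get past the leftmost "spam".
print(re.match(r"([^s]|s(?!pam))*spam", text).group())   # xxspam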


[Python-ideas] Re: Please consider mentioning property without setter when an attribute can't be set

2022-02-15 Thread André Roberge
On Fri, Feb 11, 2022 at 5:39 AM Steven D'Aprano  wrote:

> On Thu, Feb 10, 2022 at 02:27:42PM -0800, Neil Girdhar wrote:
>
> > AttributeError: can't set attribute 'f'
> >
> > This can be a pain to debug when the property is buried in a base class.
>
> > Would it make sense to mention the reason why the attribute can't be
> > set,
> > namely that it's on a property without a setter?
>
> I have no objection to changing the error message, I'm sure it's a small
> enough change that you should just open a ticket on b.p.o. for it. But I
> don't expect that it will be particularly useful either.
>
> If you can't set an attribute on an object, aren't there three obvious
> causes to check?
>

obvious?  See below.

>
> - the object has no __dict__, and so has no attributes at all;
>   e.g. trying to set an attribute on a float;
>
> - the object has slots, but 'f' is not one of them;
>
> - or 'f' is a property with no setter (or a setter that raises
>   AttributeError).
>
> Have I missed any common cases?
>

>>> from collections import namedtuple
>>> Point = namedtuple('point', ('x', 'y'))
>>> p = Point(2, 3)
>>> p.x = 4
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: can't set attribute

Would this qualify as "obvious"?
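
For comparison, the property case being discussed looks like this (a
minimal sketch; the exact message text varies across Python versions):

class Config:
    @property
    def f(self):
        return 42

Config().f = 1
# AttributeError: can't set attribute 'f'
# (newer Pythons also name the owning class and say the property has no setter)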



> Maybe reporting "can't set property 'f'" is good enough.

+1
André

>
> --
> Steve