[issue405358] Python2.0 re module: greedy regexp bug

2022-04-10 Thread admin


Change by admin :


--
github: None -> 34046

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue429357] non-greedy regexp duplicating match bug

2022-04-10 Thread admin


Change by admin :


--
github: None -> 34572




[issue408936] Python2.0 re module: greedy regexp bug 2

2022-04-10 Thread admin


Change by admin :


--
github: None -> 34154




[issue231635] ConfigParser module regexp issue

2022-04-10 Thread admin


Change by admin :


--
github: None -> 33890




[issue46410] TypeError when parsing regexp with unicode named character sequence escape

2022-03-19 Thread Serhiy Storchaka

Serhiy Storchaka  added the comment:

>>> import unicodedata
>>> unicodedata.lookup('KEYCAP NUMBER SIGN')
'#️'
>>> print(ascii(unicodedata.lookup('KEYCAP NUMBER SIGN')))
'#\ufe0f\u20e3'

Support of Unicode Named Character Sequences in the unicodeescape codec and in 
the RE parser would be a new feature.

--
components: +Interpreter Core, Unicode
nosy: +serhiy.storchaka, vstinner
type: behavior -> enhancement
versions: +Python 3.11 -Python 3.10




[issue46410] TypeError when parsing regexp with unicode named character sequence escape

2022-01-18 Thread Matthew Barnett


Matthew Barnett  added the comment:

They're not supported in string literals either:

Python 3.10.1 (tags/v3.10.1:2cd268a, Dec  6 2021, 19:10:37) [MSC v.1929 64 bit 
(AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> "\N{KEYCAP NUMBER SIGN}"
  File "<stdin>", line 1
    "\N{KEYCAP NUMBER SIGN}"
    ^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in 
position 0-21: unknown Unicode character name

--




[issue46410] TypeError when parsing regexp with unicode named character sequence escape

2022-01-17 Thread Jirka Marsik


New submission from Jirka Marsik :

re.compile(r"\N{name of Unicode Named Character Sequence}"), e.g. 
re.compile(r"\N{KEYCAP NUMBER SIGN}"), raises a TypeError. The regular 
expression parser relies on 'unicodedata' to look up character names. The 
'unicodedata' module recently added support for Unicode Named Character 
Sequences (https://www.unicode.org/Public/13.0.0/ucd/NamedSequences.txt). 
Trying to use these named character sequences in a regular expression leads to 
a 'TypeError', as the regexp parser tries to call 'ord' on a string of length 
greater than 1.
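
The failure described above can be reproduced without involving the regexp parser at all. This is a minimal sketch, assuming a Python version whose 'unicodedata' knows named sequences (3.3 or later); the variable names are illustrative only:

```python
import unicodedata

# A named character *sequence* resolves to a multi-character string...
sequence = unicodedata.lookup("KEYCAP NUMBER SIGN")
print(ascii(sequence), len(sequence))

# ...so ord(), which accepts exactly one character, rejects it.
try:
    ord(sequence)
    ord_error = None
except TypeError as exc:
    ord_error = exc
print(type(ord_error).__name__)
```

Any code path that feeds the result of unicodedata.lookup() straight into ord() would presumably hit the same error.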

--
components: Regular Expressions
messages: 410770
nosy: ezio.melotti, jirkamarsik, mrabarnett
priority: normal
severity: normal
status: open
title: TypeError when parsing regexp with unicode named character sequence 
escape
type: behavior
versions: Python 3.10

___
Python tracker 
<https://bugs.python.org/issue46410>
___



[issue27898] regexp performance degradation between 2.7.6 and 2.7.12

2020-10-19 Thread Gregory P. Smith


Gregory P. Smith  added the comment:

2.7 is end of life.

If you have regular expression performance issues with something in 3, please 
open a new issue.

--
nosy: +gregory.p.smith
resolution:  -> wont fix
stage:  -> resolved
status: pending -> closed




[issue40043] RegExp Conditional Construct (?(id/name)yes-pattern|no-pattern) Problem

2020-03-26 Thread Matthew Barnett


Matthew Barnett  added the comment:

That's what searching does!

Does the pattern match here? If not, advance by one character and try again. 
Repeat until a match is found or you've reached the end.
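
The difference between matching and searching can be seen directly with the pattern from this issue. This is a sketch that assumes the test string is '<user@host.com', the example the re documentation gives for this pattern; the variable names are illustrative:

```python
import re

pattern = r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)'
text = '<user@host.com'

# re.match only tries position 0, where '<' is consumed and '>' is then required.
anchored = re.match(pattern, text)

# re.search retries at position 1, where group 1 is unset and '$' is used instead.
found = re.search(pattern, text)

print(anchored)
print(found.group(0), found.span(), found.group(1))
```

The search succeeds because the attempt that starts after the '<' never sets group 1, so the no-pattern branch applies.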

--




[issue40043] RegExp Conditional Construct (?(id/name)yes-pattern|no-pattern) Problem

2020-03-25 Thread Leon Hampton


Leon Hampton  added the comment:

Matthew Barnett & SilentGhost,
Thank you for your prompt responses. (Really prompt. Amazing!)
SilentGhost,
Regarding your response, I used re.search, not re.match. When I used re.match, 
the regex failed. When I used re.search, it matched.
Here are my tests.

Your example (cut-and-pasted):
x = re.match(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)', '<user@host.com')
[...]

I understand the re.match failing, since it always starts at the beginning of 
the string, but why did re.search succeed? After failing with the yes-pattern, 
when the regex engine backtracked to the (<)? did it decide not to match the 
'<' at all and skip the character? Seems like it. What do you think?

I am running Python 3.7 via Spyder 4.1.1 on Windows 10.

Respectfully,
Leon

--




[issue40043] RegExp Conditional Construct (?(id/name)yes-pattern|no-pattern) Problem

2020-03-22 Thread SilentGhost


SilentGhost  added the comment:

Leon, this most likely is not a bug, not because of what's stated in the 
documentation, but because you're most likely not testing what you think you 
are testing. Here is the test that you should be doing:

>>> re.match(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)', '<user@host.com')
>>>

No match. If there is a different output in your setup, please provide both the 
output and the details of your system and Python installation.

--
nosy: +SilentGhost
stage:  -> resolved
status: open -> closed




[issue40043] RegExp Conditional Construct (?(id/name)yes-pattern|no-pattern) Problem

2020-03-22 Thread Matthew Barnett


Matthew Barnett  added the comment:

The documentation is talking about whether it'll match at the current position 
in the string. It's not a bug.

--
resolution:  -> not a bug




[issue40043] RegExp Conditional Construct (?(id/name)yes-pattern|no-pattern) Problem

2020-03-22 Thread Leon Hampton


Leon Hampton  added the comment:

Hello,
There may be a bug in the implementation of the Conditional Construct of 
Regular Expressions, namely the (?(id/name)yes-pattern|no-pattern).
In the Regular Expression documentation 
(https://docs.python.org/3.7/library/re.html), in the portion about the 
Conditional Construct, it gives this sample pattern 
'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)' and states that the pattern WILL NOT MATCH 
this string '<user@host.com'. [...]

--
title: RegExp Conditional Construct (?(id/name)yes-pattern|no-pattern) Problem

___
Python tracker 
<https://bugs.python.org/issue40043>
___



[issue27898] regexp performance degradation between 2.7.6 and 2.7.12

2019-09-09 Thread Serhiy Storchaka


Change by Serhiy Storchaka :


--
status: open -> pending




[issue37687] Invalid regexp should rise exception

2019-07-25 Thread Matthew Barnett


Matthew Barnett  added the comment:

For historical reasons, if it isn't valid as a repeat then it's a literal. This 
is true in other regex implementations, and is by no means unique to the re 
module.
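
The fallback to a literal can be checked in a few lines. This is a sketch of the behaviour as described above, not a statement about every regex engine; the names are illustrative:

```python
import re

# '{data}' is not a valid repeat count, so the braces are taken literally.
literal = re.compile(r"string{data}")
# '{2}' is a valid repeat count, so it quantifies the preceding 'a'.
repeat = re.compile(r"a{2}")

literal_ok = literal.fullmatch("string{data}") is not None
repeat_ok = repeat.fullmatch("aa") is not None
print(literal_ok, repeat_ok)
```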

--
resolution:  -> not a bug
stage:  -> resolved
status: open -> closed




[issue37687] Invalid regexp should rise exception

2019-07-25 Thread belegnar


New submission from belegnar :

`re.error` should be raised on `re.compile("string{data}")` because the manual 
says only numbers are valid within `{}`

--
components: Regular Expressions
messages: 348458
nosy: belegnar, ezio.melotti, mrabarnett
priority: normal
severity: normal
status: open
title: Invalid regexp should rise exception
versions: Python 3.6

___
Python tracker 
<https://bugs.python.org/issue37687>
___



[issue35543] re.sub is only replacing max. of 2 string found by regexp.

2018-12-20 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:

The third argument of re.sub() is the maximal number of replacements. re.I == 2.

sub(pattern, repl, string, count=0, flags=0)
Return the string obtained by replacing the leftmost
non-overlapping occurrences of the pattern in string by the
replacement repl.  repl can be either a string or a callable;
if a string, backslash escapes in it are processed.  If it is
a callable, it's passed the Match object and must return
a replacement string to be used.
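
A short sketch of the mix-up, using a cut-down version of the reporter's data. The shortened pattern and the variable names are illustrative; passing the flag by keyword is the usual way to avoid the problem:

```python
import re

data = '"B3" "DOWN" "LU" "MPC"'
pattern = r'"B\d"|"DOWN"|"LU"|"MPC"'

# A positional re.I lands in the count slot; since re.I == 2 this is equivalent:
limited, limited_n = re.subn(pattern, '', data, count=int(re.I))

# Naming the parameter applies the flag and leaves count unlimited.
full, full_n = re.subn(pattern, '', data, flags=re.I)

print(limited_n, full_n)
```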

--
nosy: +serhiy.storchaka
resolution:  -> not a bug
stage:  -> resolved
status: open -> closed




[issue35543] re.sub is only replacing max. of 2 string found by regexp.

2018-12-20 Thread Sagar


New submission from Sagar :

Below are the logs:

>>> dat = '"10GE" "4x" "AMPC" "B3" "BUILTIN" "DOWN" "LU" "SFP+" "ether" "xe" "DOWN" "MPC" "BUILTIN"'
>>> type = re.subn(r'\"BUILTIN\"|\"B\d\"|\"I\d\"|\"LU\"|\"Trinity\"|\"Trio\"|\"DOWN\"|\"UNKNOWN\"|'
...                r'^AND$|\"Q\"|\"MPC\"|\"EA\d\"|\"3D\"', '', dat, re.I)
>>> type
('"10GE" "4x" "AMPC"   "DOWN" "LU" "SFP+" "ether" "xe" "DOWN" "MPC" "BUILTIN"', 2)
>>> dat = '"10GE" "4x" "AMPC"   "DOWN" "LU" "SFP+" "ether" "xe" "DOWN" "MPC" "BUILTIN"'
>>> type = re.subn(r'\"BUILTIN\"|\"B\d\"|\"I\d\"|\"LU\"|\"Trinity\"|\"Trio\"|\"DOWN\"|\"UNKNOWN\"|'
...                r'^AND$|\"Q\"|\"MPC\"|\"EA\d\"|\"3D\"', '', dat, re.I)
>>> type
('"10GE" "4x" "AMPC" "SFP+" "ether" "xe" "DOWN" "MPC" "BUILTIN"', 2)
>>> dat = '"10GE" "4x" "AMPC" "SFP+" "ether" "xe" "DOWN" "MPC" "BUILTIN"'
>>> type = re.subn(r'\"BUILTIN\"|\"B\d\"|\"I\d\"|\"LU\"|\"Trinity\"|\"Trio\"|\"DOWN\"|\"UNKNOWN\"|'
...                r'^AND$|\"Q\"|\"MPC\"|\"EA\d\"|\"3D\"', '', dat, re.I)
>>> type
('"10GE" "4x" "AMPC" "SFP+" "ether" "xe"   "BUILTIN"', 2)
>>> dat = '"10GE" "4x" "AMPC" "SFP+" "ether" "xe"   "BUILTIN"'
>>> type = re.subn(r'\"BUILTIN\"|\"B\d\"|\"I\d\"|\"LU\"|\"Trinity\"|\"Trio\"|\"DOWN\"|\"UNKNOWN\"|'
...                r'^AND$|\"Q\"|\"MPC\"|\"EA\d\"|\"3D\"', '', dat, re.I)
>>> type
('"10GE" "4x" "AMPC" "SFP+" "ether" "xe"   ', 1)
>>>

--
components: Library (Lib)
messages: 332198
nosy: saga
priority: normal
severity: normal
status: open
title: re.sub is only replacing max. of 2 string found by regexp.
type: behavior
versions: Python 3.5

___
Python tracker 
<https://bugs.python.org/issue35543>
___



Re: Reg python regexp

2018-03-21 Thread Youta TAKAOKA
sankarramanv,

It seems to me that this task does not need both Python AND shell. Python
alone can do it, and so can the shell alone.

Of course, there may be restrictions that force you to use both. (The real
world is full of such troublesome matters!)
If you *really* need to use `lgrep`, try the `-f` option.
`lgrep -f` treats the pattern as fixed text, not as a regexp.

On Wed, 21 Mar 2018 at 20:35, Rhodri James <rho...@kynesim.co.uk> wrote:

> On 21/03/18 10:44, sankarram...@gmail.com wrote:
> > Hi,
> >
> > I have a requirement.
> >
> > cmd="cat |grep -c 'if [ -t 1 ]; then mesg n 2>/dev/null; fi'"
> >
> > I need to escape only the square brackets in above variable since its
> not grepping without escaping the brackets.
>
> You need to escape the square brackets as you normally would for your
> shell, with backslashes I presume.  Then you need to escape the
> backslashes so they aren't interpreted specially by Python, with more
> backslashes.
>
> --
> Rhodri James *-* Kynesim Ltd
> --
> https://mail.python.org/mailman/listinfo/python-list
>
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Reg python regexp

2018-03-21 Thread Rhodri James

On 21/03/18 10:44, sankarram...@gmail.com wrote:

Hi,

I have a requirement.

cmd="cat |grep -c 'if [ -t 1 ]; then mesg n 2>/dev/null; fi'"

I need to escape only the square brackets in above variable since its not 
grepping without escaping the brackets.


You need to escape the square brackets as you normally would for your 
shell, with backslashes I presume.  Then you need to escape the 
backslashes so they aren't interpreted specially by Python, with more 
backslashes.


--
Rhodri James *-* Kynesim Ltd


Re: Reg python regexp

2018-03-21 Thread Chris Angelico
On Wed, Mar 21, 2018 at 9:44 PM,   wrote:
> Hi,
>
> I have a requirement.
>
> cmd="cat |grep -c 'if [ -t 1 ]; then mesg n 2>/dev/null; fi'"
>
> I need to escape only the square brackets in above variable since its not 
> grepping without escaping the brackets.
>
> Please help.

You're putting this into a Python script. Why not use Python to search
the file instead of grep? That'd also eliminate the superfluous "cat
file |" at the start.

Python is not a shell language. You don't have to, and shouldn't,
write everything by invoking other programs.
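
A rough sketch of what that could look like: counting the lines that contain the text as a plain substring, so that no bracket needs escaping. The helper name and the sample lines are made up for illustration:

```python
def count_matching_lines(lines, needle):
    """Count the lines that contain needle as a plain substring."""
    return sum(1 for line in lines if needle in line)


needle = "if [ -t 1 ]; then mesg n 2>/dev/null; fi"
sample = [
    "# ~/.profile\n",
    "if [ -t 1 ]; then mesg n 2>/dev/null; fi\n",
    "export PATH\n",
]
match_count = count_matching_lines(sample, needle)
print(match_count)
```

An open file object can be passed in place of the list, since iterating over a file yields its lines.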

ChrisA


Re: Reg python regexp

2018-03-21 Thread Paul Moore
Hi,
You don't need a regexp for this, the "replace" method on a string
will do what you want:

>>> s = 'this is a [string'
>>> print(s.replace('[', '\\['))
this is a \[string
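
Applied to the command from the question, the same approach might look like this. It is a sketch that escapes both kinds of bracket and leaves everything else untouched; whether grep then matches as intended still depends on the shell quoting:

```python
cmd = "cat |grep -c 'if [ -t 1 ]; then mesg n 2>/dev/null; fi'"

# Chain two replace() calls: one for '[' and one for ']'.
escaped = cmd.replace('[', '\\[').replace(']', '\\]')
print(escaped)
```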

Paul


On 21 March 2018 at 10:44,  <sankarram...@gmail.com> wrote:
> Hi,
>
> I have a requirement.
>
> cmd="cat |grep -c 'if [ -t 1 ]; then mesg n 2>/dev/null; fi'"
>
> I need to escape only the square brackets in above variable since its not 
> grepping without escaping the brackets.
>
> Please help.
>
> Thanks.
> --
> https://mail.python.org/mailman/listinfo/python-list


Reg python regexp

2018-03-21 Thread sankarramanv
Hi,

I have a requirement.

cmd="cat |grep -c 'if [ -t 1 ]; then mesg n 2>/dev/null; fi'"

I need to escape only the square brackets in above variable since its not 
grepping without escaping the brackets.

Please help.

Thanks.


Re: RegExp - please help me!

2017-12-27 Thread szykcech
> (?s)struct (.+?)\s*\{\s*(.+?)\s*\};

Thank you Vlastimil Brom for regexp and for explanation!


Re: RegExp - please help me!

2017-12-27 Thread Lele Gaifax
szykc...@gmail.com writes:

> Please help me with this regexp or tell me that I need to do this another way.

I think that using regexps to parse those structures is fragile and difficult
to get right[0], as there are lots of corner cases (comments, complex types,
...).

I'd suggest using a tool designed to do that, for example pycparser[1], that
provides the required infrastructure to parse C units into an AST: from there
you can easily extract interesting pieces and write out in whatever format you
need.

As an example, I used it to extract[2] enums and defines from PostgreSQL C
headers[3] and rewrite them as Python definitions[4].

Good luck,
ciao, lele.

[0] http://regex.info/blog/2006-09-15/247
[1] https://github.com/eliben/pycparser
[2] https://github.com/lelit/pg_query/blob/master/tools/extract_enums.py
[3] 
https://github.com/lfittl/libpg_query/blob/43ce2e8cdf54e4e1e8b0352e37adbd72e568e100/src/postgres/include/nodes/parsenodes.h
[4] https://github.com/lelit/pg_query/blob/master/pg_query/enums/parsenodes.py
-- 
nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri
real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia.
l...@metapensiero.it  | -- Fortunato Depero, 1929.



Re: RegExp - please help me! (Posting On Python-List Prohibited)

2017-12-26 Thread szykcech
On Tuesday, 26 December 2017 at 21:53:14 UTC+1, Lawrence D’Oliveiro wrote:
> On Wednesday, December 27, 2017 at 2:15:21 AM UTC+13, szyk...@gmail.com wrote:
> > struct (.+)\s*{\s*(.+)\s*};
> 
> You realize that “.” matches anything? Whereas I think you want to match 
> non-whitespace in those places.

I realize that. I want to skip whitespace at the beginning and at the end 
and match the entire body of the C++ struct declaration (with whitespace inside 
as well). Maybe I should use the "".strip(" ").strip("\t").strip("\n") function 
after matching?
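
For reference, a small sketch of how the two variants behave. Calling strip() with no argument removes all leading and trailing whitespace in one pass, while the chained form can leave some behind when different whitespace characters are mixed. The sample string is made up:

```python
body = "\n\t int mVariable1;\n    bool mVariable3;  \n"

# No argument: every leading/trailing whitespace character is removed.
trimmed = body.strip()

# Chained single-character strips run in a fixed order and can stop early.
chained = body.strip(" ").strip("\t").strip("\n")

print(repr(trimmed))
print(repr(chained))
```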


Re: RegExp - please help me!

2017-12-26 Thread Peter Pearson
On Tue, 26 Dec 2017 05:14:55 -0800 (PST), szykc...@gmail.com wrote:
[snip]
> So: I develop regexp which to my mind should work, but it doesn't and
> I don't know why. The broken regexp is like this: 
> struct (.+)\s*{\s*(.+)\s*};
[snip]

You'll probably get better help faster if you can present your problem
as a couple lines of code, and ask "Why does this print XXX, when I'm
expecting it to print YYY?"  (Sorry I'm not smart enough to give you
an answer to your actual question.)

-- 
To email me, substitute nowhere->runbox, invalid->com.


Re: RegExp - please help me!

2017-12-26 Thread Vlastimil Brom
Hi,
I can't comment on whether this is the right approach, as I have no
experience with C++, but as for the regular expression itself, I believe
the following might work for your sample data (I am not sure, however,
what other details or variants should be taken into account):

(?s)struct (.+?)\s*\{\s*(.+?)\s*\};

i.e. you should escape the metacharacters { } with a backslash \
and, in the current form of the pattern, you also need the DOTALL flag
(set via (?s) in the example above) so that . also matches newlines.
(Alternatively, you could use \n specifically in the pattern where
needed.) It is possible that an online regex tester sets some flags
implicitly.

I believe the non-greedy quantifier +? is suitable here: it matches as
little as possible; otherwise the pattern would match from the first to
the last struct in the source text at once.

It seems the multiline flag is not needed here, as the pattern contains
no metacharacters that it affects.
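
A small sketch of the suggested pattern at work, together with the greedy variant for comparison. The two-struct sample text and the variable names are invented for illustration:

```python
import re

source = """struct First
{
int mVariable1;
QString mVariable2;
};
struct Second
{
bool mVariable3;
};
"""

# Non-greedy groups stop at the first closing '};' after each 'struct'.
pattern = re.compile(r"(?s)struct (.+?)\s*\{\s*(.+?)\s*\};")
structs = pattern.findall(source)

# Greedy groups swallow everything up to the last '};' in the text.
greedy = re.findall(r"(?s)struct (.+)\s*\{\s*(.+)\s*\};", source)

for name, body in structs:
    print(name, body.splitlines())
print(len(greedy))
```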

hth,
   vbr

=

2017-12-26 14:14 GMT+01:00, szykc...@gmail.com <szykc...@gmail.com>:
> Hi
> I use the online Python regexp editor https://pythex.org/ with the
> "multiline" option.
> I want to use my regexp in a Python script to automatically generate part
> of my program written in C++ (database structure and serialization
> functions). To do this I need: 1) the C++ struct name and 2) the struct
> definition. I need the struct definition because inline functions can
> appear below it and would break further regexp filtering (against
> variables).
>
> So: I developed a regexp which to my mind should work, but it doesn't and I
> don't know why. The broken regexp is:
> struct (.+)\s*{\s*(.+)\s*};
> As you can see it has two groups: struct name and struct definition.
> It fails even for such a simple structure:
> struct Structure
> {
> int mVariable1;
> QString mVariable2;
> bool mVariable3
> };
>
> Please help me with this regexp or tell me that I need to do this another
> way.
>
> thanks, happy Christmas, and happy New Year
> Szyk Cech
> --
> https://mail.python.org/mailman/listinfo/python-list
>


RegExp - please help me!

2017-12-26 Thread szykcech
Hi
I use the online Python regexp editor https://pythex.org/ with the "multiline" 
option.
I want to use my regexp in a Python script to automatically generate part of 
my program written in C++ (database structure and serialization functions). 
To do this I need: 1) the C++ struct name and 2) the struct definition. I need 
the struct definition because inline functions can appear below it and would 
break further regexp filtering (against variables).

So: I developed a regexp which to my mind should work, but it doesn't and I 
don't know why. The broken regexp is:
struct (.+)\s*{\s*(.+)\s*};
As you can see it has two groups: struct name and struct definition.
It fails even for such a simple structure:
struct Structure
{
int mVariable1;
QString mVariable2;
bool mVariable3
};

Please help me with this regexp or tell me that I need to do this another way.

thanks, happy Christmas, and happy New Year
Szyk Cech


How to use a regexp here

2017-12-08 Thread Cecil Westerhof
I have a script that was running perfectly for some time. It uses:
array = [elem for elem in output if 'CPU_TEMP' in elem]

But because output has changed, I have to check for CPU_TEMP at the beginning
of the line. What would be the best way to implement this?

--
Cecil Westerhof
Senior Software Engineer
LinkedIn: http://www.linkedin.com/in/cecilwesterhof



Re: How to use a regexp here

2017-12-08 Thread Cecil Westerhof
Rick Johnson  writes:

>> There is now also a line that starts with: PCH_CPU_TEMP:
>> And I do not want that one.
>
> Yes. But be aware, that while the `str.startswith(target)`
> method is indeed more efficient than a more generalized
> "inclusion test", if the target is not _always_ at the
> beginning of the string, then your code is going to skip
> right over valid match like a round stone skipping across
> the surface of a glass-smooth lake. But if you are sure the
> target will always be at the beginning of the string, then
> it is the best choice.

Yes, I am sure it is always at the beginning of the line. (It is output from
the Linux sensors command.)

--
Cecil Westerhof
Senior Software Engineer
LinkedIn: http://www.linkedin.com/in/cecilwesterhof



Re: How to use a regexp here

2017-12-08 Thread breamoreboy
On Monday, December 4, 2017 at 9:44:27 AM UTC, Cecil Westerhof wrote:
> I have a script that was running perfectly for some time. It uses:
> array = [elem for elem in output if 'CPU_TEMP' in elem]
>
> But because output has changed, I have to check for CPU_TEMP at the
> beginning of the line. What would be the best way to implement this?
>
> --
> Cecil Westerhof
> Senior Software Engineer
> LinkedIn: http://www.linkedin.com/in/cecilwesterhof

Use https://docs.python.org/3/library/stdtypes.html#str.startswith instead of
the test for `in`.
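
A quick sketch of the difference, assuming the sensor output looks roughly like the made-up lines below (the real format was not posted in this thread):

```python
output = [
    "CPU_TEMP: +45.0 C",
    "PCH_CPU_TEMP: +50.0 C",
    "SYS_TEMP: +38.0 C",
]

# Substring test: also picks up PCH_CPU_TEMP.
by_substring = [elem for elem in output if 'CPU_TEMP' in elem]

# Prefix test: only lines that begin with CPU_TEMP.
by_prefix = [elem for elem in output if elem.startswith('CPU_TEMP')]

print(by_substring)
print(by_prefix)
```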

--
Kindest regards.

Mark Lawrence.



Re: How to use a regexp here

2017-12-08 Thread Cecil Westerhof
Neil Cerutti  writes:

> On 2017-12-04, Cecil Westerhof  wrote:
>> Joel Goldstick  writes:
>>
>>> On Mon, Dec 4, 2017 at 5:21 AM, Ned Batchelder 
>>> wrote:
>>>
>>>> On 12/4/17 4:36 AM, Cecil Westerhof wrote:
>>>>
>>>>> I have a script that was running perfectly for some time. It uses:
>>>>>  array = [elem for elem in output if 'CPU_TEMP' in elem]
>>>>>
>>>>> But because output has changed, I have to check for CPU_TEMP at the
>>>>> beginning of the line. What would be the best way to implement this?
>>>>>
>>>>>
>>>> No need for a regex just yet:
>>>>
>>>> array = [elem for elem in output if elem.startswith('CPU_TEMP')]
>>>>
>>>> (btw, note that the result of this expression is a list, not an array, for
>>>> future Googling.)
>>>>
>>>> --Ned.
>>>> --
>>>> https://mail.python.org/mailman/listinfo/python-list
>>>>
>>>
>>> I like Ned's clear answer, but I'm wondering why the original code would
>>> fail because the substring is at the start of the line, since 'in' would
>>> still be true no matter where the desired string is placed.  It would be
>>> useful to see some sample data of the old data, and the new data
>>
>> There is now also a line that starts with:
>> PCH_CPU_TEMP:
>>
>> And I do not want that one.
>
> You'll probably want to include the ':' in the startswith check,
> in case someday they also add CPU_TEMP_SOMETHING:.

I already did. And to be really sure also included a space after it.

--
Cecil Westerhof
Senior Software Engineer
LinkedIn: http://www.linkedin.com/in/cecilwesterhof



Re: How to use a regexp here

2017-12-08 Thread Neil Cerutti
On 2017-12-04, Cecil Westerhof  wrote:
> Joel Goldstick  writes:
>
>> On Mon, Dec 4, 2017 at 5:21 AM, Ned Batchelder 
>> wrote:
>>
>>> On 12/4/17 4:36 AM, Cecil Westerhof wrote:
>>>
>>>> I have a script that was running perfectly for some time. It uses:
>>>>  array = [elem for elem in output if 'CPU_TEMP' in elem]
>>>>
>>>> But because output has changed, I have to check for CPU_TEMP at the
>>>> beginning of the line. What would be the best way to implement this?
>>>>
>>>>
>>> No need for a regex just yet:
>>>
>>> array = [elem for elem in output if elem.startswith('CPU_TEMP')]
>>>
>>> (btw, note that the result of this expression is a list, not an array, for
>>> future Googling.)
>>>
>>> --Ned.
>>> --
>>> https://mail.python.org/mailman/listinfo/python-list
>>>
>>
>> I like Ned's clear answer, but I'm wondering why the original code would
>> fail because the substring is at the start of the line, since 'in' would
>> still be true no matter where the desired string is placed.  It would be
>> useful to see some sample data of the old data, and the new data
>
> There is now also a line that starts with:
> PCH_CPU_TEMP:
>
> And I do not want that one.

You'll probably want to include the ':' in the startswith check, in case
someday they also add CPU_TEMP_SOMETHING:.
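
To illustrate the point about the colon, here is a sketch with invented sensor lines. CPU_TEMP_SOMETHING is the hypothetical future label mentioned above, not something the sensors command is known to print:

```python
output = [
    "CPU_TEMP: +45.0 C",
    "CPU_TEMP_SOMETHING: +47.0 C",
    "PCH_CPU_TEMP: +50.0 C",
]

# Without the colon, a longer label with the same prefix slips through.
loose = [elem for elem in output if elem.startswith('CPU_TEMP')]

# With the colon, only the exact label matches.
strict = [elem for elem in output if elem.startswith('CPU_TEMP:')]

print(loose)
print(strict)
```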

--
Neil Cerutti



Re: How to use a regexp here

2017-12-08 Thread Terry Reedy
On 12/4/2017 11:14 AM, Ned Batchelder wrote:
> On 12/4/17 9:13 AM, Rick Johnson wrote:
>> Perhaps it's not politically correct for me to say this, but
>> i've never been one who cared much about political
>> correctness, so i'm just going to say it...
>
> Cecil, feel free to ignore the rest of Rick's message.  His messages are
> famous for their outrageous and/or abrasive tone, something he seems to
> revel in.  Luckily, it's not typical of the Python community.

Or take Rick's 'rest' as a suggestion to reread Library Reference chapters 2,
3, 4 and in particular 4.7.

As for your idea of an RE, '^' matches the beginning of a line, and '$' the
end, though using .startswith, and .endswith, are easier if no other RE syntax
is needed for matching.
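
For comparison, a sketch of the regexp route, again with invented sensor lines. On a list of separate lines '^' anchors at the start of each string; on one multi-line string the re.MULTILINE flag is needed for '^' and '$' to apply to every line:

```python
import re

output = [
    "PCH_CPU_TEMP: +50.0 C",
    "CPU_TEMP: +45.0 C",
]

# One string per line: '^' anchors at the start of each string.
line_start = re.compile(r'^CPU_TEMP:')
matches = [elem for elem in output if line_start.search(elem)]

# One string holding all lines: re.MULTILINE makes '^' and '$' work per line.
blob = "\n".join(output)
found = re.findall(r'^CPU_TEMP:.*$', blob, flags=re.MULTILINE)

print(matches)
print(found)
```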

--
Terry Jan Reedy



Re: How to use a regexp here

2017-12-08 Thread Ned Batchelder
On 12/4/17 9:13 AM, Rick Johnson wrote:
> Perhaps it's not politically correct for me to say this, but
> i've never been one who cared much about political
> correctness, so i'm just going to say it...

Cecil, feel free to ignore the rest of Rick's message.  His messages are
famous for their outrageous and/or abrasive tone, something he seems to revel
in.  Luckily, it's not typical of the Python community.

--Ned.



Re: How to use a regexp here

2017-12-08 Thread Joel Goldstick
On Mon, Dec 4, 2017 at 5:21 AM, Ned Batchelder  wrote:

> On 12/4/17 4:36 AM, Cecil Westerhof wrote:
>
>> I have a script that was running perfectly for some time. It uses:
>>  array = [elem for elem in output if 'CPU_TEMP' in elem]
>>
>> But because output has changed, I have to check for CPU_TEMP at the
>> beginning of the line. What would be the best way to implement this?
>>
>>
> No need for a regex just yet:
>
> array = [elem for elem in output if elem.startswith('CPU_TEMP')]
>
> (btw, note that the result of this expression is a list, not an array, for
> future Googling.)
>
> --Ned.
> --
> https://mail.python.org/mailman/listinfo/python-list
>

I like Ned's clear answer, but I'm wondering why the original code would fail
because the substring is at the start of the line, since 'in' would still be
true no matter where the desired string is placed.  It would be useful to see
some sample data of the old data, and the new data

--
Joel Goldstick
http://joelgoldstick.com/blog
http://cc-baseballstats.info/stats/birthdays



Re: How to use a regexp here

2017-12-08 Thread Rick Johnson
Cecil Westerhof wrote:
> Joel Goldstick writes:

[...]

> > I like Ned's clear answer, but I'm wondering why the
> > original code would fail because the substring is at the
> > start of the line, since 'in' would still be true no
> > matter where the desired string is placed.  It would be
> > useful to see some sample data of the old data, and the
> > new data

@Goldstick

"Inclusion testing" will return false positives when the target is part of a
larger structure (aka: word). Observe:

>>> s = "Complex is better than complicated."
>>> "plex" in s
True
>>> "om" in s
True
>>> s.count("om")
2

I'm sure you already know this, and only made the comment because you did not
have all the data, but i thought it would be good to mention for any lurkers
who may be watching.

> There is now also a line that starts with: PCH_CPU_TEMP:
> And I do not want that one.

Yes. But be aware, that while the `str.startswith(target)` method is indeed
more efficient than a more generalized "inclusion test", if the target is not
_always_ at the beginning of the string, then your code is going to skip right
over valid match like a round stone skipping across the surface of a
glass-smooth lake. But if you are sure the target will always be at the
beginning of the string, then it is the best choice.

> --
> Cecil Westerhof
> Senior Software Engineer

Perhaps it's not politically correct for me to say this, but i've never been
one who cared much about political correctness, so i'm just going to say it...

If you really are a "_Senior_ software engineer", and that title is not simply
an ego-booster bestowed by your boss to a one-person-dev-team in order to avoid
pay raises, then i would expect more competence from someone who holds such an
esteemed title.

And even *IF* you are only vaguely familiar with Python, and even *IF*, you
rarely use Python in your projects, i don't think it's too much to ask of a
~~Senior~~ Software Engineer that they possess the basic skills required to
peruse the Python documentation and decide which method is most appropriate for
the situation at hand. And if you're using Python on a regular basis, then you
should be intimately familiar with _all_ methods of each major type.

Granted, your question did "hint" about the possibility of using a regexp
(although, based on the data you have provided so far, a string method will
suffice), but i would also expect a ~~Senior~~ Software Engineer to not only be
knowledgeable of regexps, but also know when they are a strength and when they
are a weakness.

Now, there are one of two ways you can take this advice:

(1) You can take it as a personal attack; get all huffy
about it; drop to the floor and flail your arms and legs
like a petulant two-year-old who didn't get the toy he
wanted; and learn nothing in the process.

or

(2) You can take it as what it is -> constructive criticism;
shower me with gratitude[1]; and become a better person and
a better programmer in the process.

The choice is yours.


[1] Well, i had to sneak something in there for myself, after all, it is the
season of giving, yes? O:-)

Here comes santa claws...
Here comes santa claws...
...



Re: How to use a regexp here

2017-12-08 Thread Cecil Westerhof
Joel Goldstick  writes:

> On Mon, Dec 4, 2017 at 5:21 AM, Ned Batchelder 
> wrote:
>
>> On 12/4/17 4:36 AM, Cecil Westerhof wrote:
>>
>>> I have a script that was running perfectly for some time. It uses:
>>>  array = [elem for elem in output if 'CPU_TEMP' in elem]
>>>
>>> But because output has changed, I have to check for CPU_TEMP at the
>>> beginning of the line. What would be the best way to implement this?
>>>
>>>
>> No need for a regex just yet:
>>
>> array = [elem for elem in output if elem.startswith('CPU_TEMP')]
>>
>> (btw, note that the result of this expression is a list, not an array, for
>> future Googling.)
>>
>> --Ned.
>> --
>> https://mail.python.org/mailman/listinfo/python-list
>>
>
> I like Ned's clear answer, but I'm wondering why the original code would
> fail because the substring is at the start of the line, since 'in' would
> still be true no matter where the desired string is placed.  It would be
> useful to see some sample data of the old data, and the new data

There is now also a line that starts with:
PCH_CPU_TEMP:

And I do not want that one.

--
Cecil Westerhof
Senior Software Engineer
LinkedIn: http://www.linkedin.com/in/cecilwesterhof



Re: How to use a regexp here

2017-12-08 Thread Ned Batchelder
On 12/4/17 4:36 AM, Cecil Westerhof wrote:
> I have a script that was running perfectly for some time. It uses:
>  array = [elem for elem in output if 'CPU_TEMP' in elem]
>
> But because output has changed, I have to check for CPU_TEMP at the
> beginning of the line. What would be the best way to implement this?
>

No need for a regex just yet:

    array = [elem for elem in output if elem.startswith('CPU_TEMP')]

(btw, note that the result of this expression is a list, not an array, for
future Googling.)

--Ned.



Re: How to use a regexp here

2017-12-08 Thread Cecil Westerhof
Ned Batchelder  writes:

> On 12/4/17 4:36 AM, Cecil Westerhof wrote:
>> I have a script that was running perfectly for some time. It uses:
>>  array = [elem for elem in output if 'CPU_TEMP' in elem]
>>
>> But because output has changed, I have to check for CPU_TEMP at the
>> beginning of the line. What would be the best way to implement this?
>>
>
> No need for a regex just yet:
>
>     array = [elem for elem in output if elem.startswith('CPU_TEMP')]

Yes, that is it. I should have known that. :'-(

--
Cecil Westerhof
Senior Software Engineer
LinkedIn: http://www.linkedin.com/in/cecilwesterhof



Re: How to use a regexp here

2017-12-04 Thread Cecil Westerhof
Neil Cerutti  writes:

> On 2017-12-04, Cecil Westerhof  wrote:
>> Joel Goldstick  writes:
>>
>>> On Mon, Dec 4, 2017 at 5:21 AM, Ned Batchelder 
>>> wrote:
>>>
>>>> On 12/4/17 4:36 AM, Cecil Westerhof wrote:
>>>>
>>>>> I have a script that was running perfectly for some time. It uses:
>>>>>  array = [elem for elem in output if 'CPU_TEMP' in elem]
>>>>>
>>>>> But because output has changed, I have to check for CPU_TEMP at the
>>>>> beginning of the line. What would be the best way to implement this?
>>>>>
>>>>>
>>>> No need for a regex just yet:
>>>>
>>>> array = [elem for elem in output if elem.startswith('CPU_TEMP')]
>>>>
>>>> (btw, note that the result of this expression is a list, not an array, for
>>>> future Googling.)
>>>>
>>>> --Ned.
>>>> --
>>>> https://mail.python.org/mailman/listinfo/python-list
>>>>
>>>
>>> I like Ned's clear answer, but I'm wondering why the original code would
>>> fail because the substring is at the start of the line, since 'in' would
>>> still be true no matter where the desired string is placed.  It would be
>>> useful to see some sample data of the old data, and the new data
>>
>> There is now also a line that starts with:
>> PCH_CPU_TEMP:
>>
>> And I do not want that one.
>
> You'll probably want to include the ':' in the startswith check,
> in case someday they also add CPU_TEMP_SOMETHING:.

I already did. And to be really sure also included a space after it.

-- 
Cecil Westerhof
Senior Software Engineer
LinkedIn: http://www.linkedin.com/in/cecilwesterhof


Re: How to use a regexp here

2017-12-04 Thread Cecil Westerhof
Rick Johnson  writes:

>> There is now also a line that starts with: PCH_CPU_TEMP:
>> And I do not want that one.
>
> Yes. But be aware that while the `str.startswith(target)`
> method is indeed more efficient than a more generalized
> "inclusion test", if the target is not _always_ at the
> beginning of the string, then your code is going to skip
> right over valid matches like a round stone skipping across
> the surface of a glass-smooth lake. But if you are sure the
> target will always be at the beginning of the string, then
> it is the best choice.

Yes, I am sure it is always at the beginning of the line. (It is
output from the Linux sensors command.)

-- 
Cecil Westerhof
Senior Software Engineer
LinkedIn: http://www.linkedin.com/in/cecilwesterhof


Re: How to use a regexp here

2017-12-04 Thread Neil Cerutti
On 2017-12-04, Cecil Westerhof  wrote:
> Joel Goldstick  writes:
>
>> On Mon, Dec 4, 2017 at 5:21 AM, Ned Batchelder 
>> wrote:
>>
>>> On 12/4/17 4:36 AM, Cecil Westerhof wrote:
>>>
>>>> I have a script that was running perfectly for some time. It uses:
>>>>  array = [elem for elem in output if 'CPU_TEMP' in elem]
>>>>
>>>> But because output has changed, I have to check for CPU_TEMP at the
>>>> beginning of the line. What would be the best way to implement this?
>>>>
>>>>
>>> No need for a regex just yet:
>>>
>>> array = [elem for elem in output if elem.startswith('CPU_TEMP')]
>>>
>>> (btw, note that the result of this expression is a list, not an array, for
>>> future Googling.)
>>>
>>> --Ned.
>>> --
>>> https://mail.python.org/mailman/listinfo/python-list
>>>
>>
>> I like Ned's clear answer, but I'm wondering why the original code would
>> fail because the substring is at the start of the line, since 'in' would
>> still be true no matter where the desired string is placed.  It would be
>> useful to see some sample data of the old data, and the new data
>
> There is now also a line that starts with:
> PCH_CPU_TEMP:
>
> And I do not want that one.

You'll probably want to include the ':' in the startswith check,
in case someday they also add CPU_TEMP_SOMETHING:.
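
A minimal sketch of the three checks discussed in this thread, side by side. The sample lines are made up for illustration (they only imitate the style of `sensors` output and are not taken from the original post), and the variable names are hypothetical:

```python
# Hypothetical sample lines; the real `sensors` output may look different.
output = [
    "CPU_TEMP:           +42.0 C",
    "PCH_CPU_TEMP:       +51.0 C",
    "CPU_TEMP_SOMETHING: +40.0 C",
]

# Inclusion test: matches the label anywhere in the line, so all three pass.
by_inclusion = [elem for elem in output if 'CPU_TEMP' in elem]

# Prefix test: drops PCH_CPU_TEMP but still keeps CPU_TEMP_SOMETHING.
by_prefix = [elem for elem in output if elem.startswith('CPU_TEMP')]

# Prefix test including the colon: keeps only the exact label.
by_label = [elem for elem in output if elem.startswith('CPU_TEMP:')]

print(len(by_inclusion), len(by_prefix), len(by_label))
```

Including the trailing ':' (and, if the format guarantees it, a space after it) ties the check to the whole label rather than to a prefix of it.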

-- 
Neil Cerutti



Re: How to use a regexp here

2017-12-04 Thread Terry Reedy

On 12/4/2017 11:14 AM, Ned Batchelder wrote:
> On 12/4/17 9:13 AM, Rick Johnson wrote:
>> Perhaps it's not politically correct for me to say this, but
>> i've never been one who cared much about political
>> correctness, so i'm just going to say it...
>
> Cecil, feel free to ignore the rest of Rick's message.  His messages are
> famous for their outrageous and/or abrasive tone, something he seems to
> revel in.  Luckily, it's not typical of the Python community.

Or take Rick's 'rest' as a suggestion to reread Library Reference
chapters 2, 3, 4 and in particular 4.7.


As for your idea of an RE, '^' matches the beginning of a line, and '$'
the end, though using .startswith and .endswith is easier if no other
RE syntax is needed for matching.
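
For comparison, a small sketch of the regex route next to the string-method route. The sample text is invented for illustration. Note that in Python's `re` module, '^' and '$' match at every line boundary only when `re.MULTILINE` is given; without it they anchor to the start and end of the whole string:

```python
import re

# Hypothetical two-line sample, not taken from the thread.
text = "CPU_TEMP:     +42.0 C\nPCH_CPU_TEMP: +51.0 C\n"

# With re.MULTILINE, '^' matches at the start of each line and '$' at its end.
line_start = re.compile(r'^CPU_TEMP:.*$', re.MULTILINE)
with_regex = line_start.findall(text)

# The same selection with a plain string method, one line at a time.
with_method = [line for line in text.splitlines()
               if line.startswith('CPU_TEMP:')]

print(with_regex == with_method)
```

Both approaches select the same line here, which is why the string method is the simpler choice when no other regex syntax is needed.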


--
Terry Jan Reedy




Re: How to use a regexp here

2017-12-04 Thread Ned Batchelder

On 12/4/17 9:13 AM, Rick Johnson wrote:
> Perhaps it's not politically correct for me to say this, but
> i've never been one who cared much about political
> correctness, so i'm just going to say it...

Cecil, feel free to ignore the rest of Rick's message.  His messages are
famous for their outrageous and/or abrasive tone, something he seems to
revel in.  Luckily, it's not typical of the Python community.


--Ned.


Re: How to use a regexp here

2017-12-04 Thread Rick Johnson
Cecil Westerhof wrote:
> Joel Goldstick writes:

[...]

> > I like Ned's clear answer, but I'm wondering why the
> > original code would fail because the substring is at the
> > start of the line, since 'in' would still be true no
> > matter where the desired string is placed.  It would be
> > useful to see some sample data of the old data, and the
> > new data

@Goldstick 

"Inclusion testing" will return false positives when the
target is part of a larger structure (aka: word). Observe:

>>> s = "Complex is better than complicated."
>>> "plex" in s
True
>>> "om" in s
True
>>> s.count("om")
2

I'm sure you already know this, and only made the comment
because you did not have all the data, but i thought it would
be good to mention for any lurkers who may be watching.

> There is now also a line that starts with: PCH_CPU_TEMP:
> And I do not want that one.

Yes. But be aware that while the `str.startswith(target)`
method is indeed more efficient than a more generalized
"inclusion test", if the target is not _always_ at the
beginning of the string, then your code is going to skip
right over valid matches like a round stone skipping across
the surface of a glass-smooth lake. But if you are sure the
target will always be at the beginning of the string, then
it is the best choice.

> -- 
> Cecil Westerhof
> Senior Software Engineer

Perhaps it's not politically correct for me to say this, but
i've never been one who cared much about political
correctness, so i'm just going to say it...

If you really are a "_Senior_ software engineer", and that
title is not simply an ego-booster bestowed by your boss to
a one-person-dev-team in order to avoid pay raises, then i
would expect more competence from someone who holds such an
esteemed title.

And even *IF* you are only vaguely familiar with Python, and
even *IF* you rarely use Python in your projects, i don't
think it's too much to ask of a ~~Senior~~ Software Engineer
that they possess the basic skills required to peruse the
Python documentation and decide which method is most
appropriate for the situation at hand. And if you're using
Python on a regular basis, then you should be intimately
familiar with _all_ methods of each major type. 

Granted, your question did "hint" about the possibility of
using a regexp (although, based on the data you have
provided so far, a string method will suffice), but i would
also expect a ~~Senior~~ Software Engineer to not only be
knowledgeable of regexps, but also know when they are a
strength and when they are a weakness.

Now, there are two ways you can take this advice:

(1) You can take it as a personal attack; get all huffy
about it; drop to the floor and flail your arms and legs
like a petulant two-year-old who didn't get the toy he
wanted; and learn nothing in the process.

or

(2) You can take it as what it is -> constructive criticism;
shower me with gratitude[1]; and become a better person and
a better programmer in the process.

The choice is yours. 


[1] Well, i had to sneak something in there for myself,
after all, it is the season of giving, yes? O:-)

Here comes santa claws...
Here comes santa claws...
...



Re: How to use a regexp here

2017-12-04 Thread Cecil Westerhof
Joel Goldstick  writes:

> On Mon, Dec 4, 2017 at 5:21 AM, Ned Batchelder 
> wrote:
>
>> On 12/4/17 4:36 AM, Cecil Westerhof wrote:
>>
>>> I have a script that was running perfectly for some time. It uses:
>>>  array = [elem for elem in output if 'CPU_TEMP' in elem]
>>>
>>> But because output has changed, I have to check for CPU_TEMP at the
>>> beginning of the line. What would be the best way to implement this?
>>>
>>>
>> No need for a regex just yet:
>>
>> array = [elem for elem in output if elem.startswith('CPU_TEMP')]
>>
>> (btw, note that the result of this expression is a list, not an array, for
>> future Googling.)
>>
>> --Ned.
>> --
>> https://mail.python.org/mailman/listinfo/python-list
>>
>
> I like Ned's clear answer, but I'm wondering why the original code would
> fail because the substring is at the start of the line, since 'in' would
> still be true no matter where the desired string is placed.  It would be
> useful to see some sample data of the old data, and the new data

There is now also a line that starts with:
PCH_CPU_TEMP:

And I do not want that one.

-- 
Cecil Westerhof
Senior Software Engineer
LinkedIn: http://www.linkedin.com/in/cecilwesterhof


Re: How to use a regexp here

2017-12-04 Thread Cecil Westerhof
Ned Batchelder  writes:

> On 12/4/17 4:36 AM, Cecil Westerhof wrote:
>> I have a script that was running perfectly for some time. It uses:
>>  array = [elem for elem in output if 'CPU_TEMP' in elem]
>>
>> But because output has changed, I have to check for CPU_TEMP at the
>> beginning of the line. What would be the best way to implement this?
>>
>
> No need for a regex just yet:
>
>     array = [elem for elem in output if elem.startswith('CPU_TEMP')]

Yes, that is it. I should have known that. :'-(

-- 
Cecil Westerhof
Senior Software Engineer
LinkedIn: http://www.linkedin.com/in/cecilwesterhof


Re: How to use a regexp here

2017-12-04 Thread Joel Goldstick
On Mon, Dec 4, 2017 at 5:21 AM, Ned Batchelder 
wrote:

> On 12/4/17 4:36 AM, Cecil Westerhof wrote:
>
>> I have a script that was running perfectly for some time. It uses:
>>  array = [elem for elem in output if 'CPU_TEMP' in elem]
>>
>> But because output has changed, I have to check for CPU_TEMP at the
>> beginning of the line. What would be the best way to implement this?
>>
>>
> No need for a regex just yet:
>
> array = [elem for elem in output if elem.startswith('CPU_TEMP')]
>
> (btw, note that the result of this expression is a list, not an array, for
> future Googling.)
>
> --Ned.
> --
> https://mail.python.org/mailman/listinfo/python-list
>

I like Ned's clear answer, but I'm wondering why the original code would
fail because the substring is at the start of the line, since 'in' would
still be true no matter where the desired string is placed.  It would be
useful to see some sample data of the old data, and the new data

-- 
Joel Goldstick
http://joelgoldstick.com/blog
http://cc-baseballstats.info/stats/birthdays


Re: How to use a regexp here

2017-12-04 Thread Ned Batchelder

On 12/4/17 4:36 AM, Cecil Westerhof wrote:
> I have a script that was running perfectly for some time. It uses:
>  array = [elem for elem in output if 'CPU_TEMP' in elem]
>
> But because output has changed, I have to check for CPU_TEMP at the
> beginning of the line. What would be the best way to implement this?

No need for a regex just yet:

    array = [elem for elem in output if elem.startswith('CPU_TEMP')]

(btw, note that the result of this expression is a list, not an array, 
for future Googling.)


--Ned.


Re: How to use a regexp here

2017-12-04 Thread breamoreboy
On Monday, December 4, 2017 at 9:44:27 AM UTC, Cecil Westerhof wrote:
> I have a script that was running perfectly for some time. It uses:
> array = [elem for elem in output if 'CPU_TEMP' in elem]
> 
> But because output has changed, I have to check for CPU_TEMP at the
> beginning of the line. What would be the best way to implement this?
> 
> -- 
> Cecil Westerhof
> Senior Software Engineer
> LinkedIn: http://www.linkedin.com/in/cecilwesterhof

Use https://docs.python.org/3/library/stdtypes.html#str.startswith instead of 
the test for `in`.

--
Kindest regards.

Mark Lawrence.


How to use a regexp here

2017-12-04 Thread Cecil Westerhof
I have a script that was running perfectly for some time. It uses:
array = [elem for elem in output if 'CPU_TEMP' in elem]

But because output has changed, I have to check for CPU_TEMP at the
beginning of the line. What would be the best way to implement this?

-- 
Cecil Westerhof
Senior Software Engineer
LinkedIn: http://www.linkedin.com/in/cecilwesterhof


Re: Validating regexp

2017-08-10 Thread Larry Martell
On Thu, Aug 10, 2017 at 11:42 AM, alister via Python-list
<python-list@python.org> wrote:
> On Thu, 10 Aug 2017 09:38:49 -0400, Larry Martell wrote:
>
>> On Wed, Aug 9, 2017 at 8:33 PM, Cameron Simpson <c...@cskk.id.au> wrote:
>>> On 09Aug2017 10:46, Jon Ribbens <jon+use...@unequivocal.eu> wrote:
>>>>
>>>> On 2017-08-09, Cameron Simpson <c...@cskk.id.au> wrote:
>>>>>
>>>>> On 08Aug2017 17:31, Jon Ribbens <jon+use...@unequivocal.eu> wrote:
>>>>>>
>>>>>> ... but bear in mind, there have been ways of doing
>>>>>> denial-of-service attacks with valid-but-nasty regexps in the past,
>>>>>> and I wouldn't want to rely on there not being any now.
>>>>>
>>>>>
>>>>> The ones I've seen still require some input length (I'm thinking
>>>>> exponential rematch backoff stuff here). I suspect that if your test
>>>>> query matches the RE against a fixed empty string it is hard to be
>>>>> exploited. i.e. I think most of this stuff isn't expensive in terms
>>>>> of compiling the regexp but in executing it against text.
>>>>
>>>>
>>>> Well yes, but presumably if the OP is receiving regexps from users
>>>> they will be executed against text sooner or later.
>>>
>>>
>>> True, but the OP (Larry) was after validation.
>>>
>>> The risk then depends on the degree of trust in the user. If the user
>>> is a random person-from-the-internets, sure there's a risk there.
>>> However, if the regexp is part of some internal configuration being set
>>> up by trusted people (eg staff pursuing a goal) then validation will
>>> normally be enough.
>>>
>>> Of course, that is a call for Larry to make, not us, but it needs to be
>>> borne in mind by him.
>>
>> The input comes from in house people, not from the internet.
>
> The question would still be whether the input should be trusted, and I
> would still say no: accidental errors can cause as much damage as
> malicious input if not correctly sanitised.

The regexp is used for a db query; the worst that can happen is they run
some mega query that slows the db down. Which has happened. So we tell
them not to do that anymore.

Doctor it hurts when I do this.
Then don't do it.


Re: Validating regexp

2017-08-10 Thread alister via Python-list
On Thu, 10 Aug 2017 09:38:49 -0400, Larry Martell wrote:

> On Wed, Aug 9, 2017 at 8:33 PM, Cameron Simpson <c...@cskk.id.au> wrote:
>> On 09Aug2017 10:46, Jon Ribbens <jon+use...@unequivocal.eu> wrote:
>>>
>>> On 2017-08-09, Cameron Simpson <c...@cskk.id.au> wrote:
>>>>
>>>> On 08Aug2017 17:31, Jon Ribbens <jon+use...@unequivocal.eu> wrote:
>>>>>
>>>>> ... but bear in mind, there have been ways of doing
>>>>> denial-of-service attacks with valid-but-nasty regexps in the past,
>>>>> and I wouldn't want to rely on there not being any now.
>>>>
>>>>
>>>> The ones I've seen still require some input length (I'm thinking
>>>> exponential rematch backoff stuff here). I suspect that if your test
>>>> query matches the RE against a fixed empty string it is hard to be
>>>> exploited. i.e. I think most of this stuff isn't expensive in terms
>>>> of compiling the regexp but in executing it against text.
>>>
>>>
>>> Well yes, but presumably if the OP is receiving regexps from users
>>> they will be executed against text sooner or later.
>>
>>
>> True, but the OP (Larry) was after validation.
>>
>> The risk then depends on the degree of trust in the user. If the user
>> is a random person-from-the-internets, sure there's a risk there.
>> However, if the regexp is part of some internal configuration being set
>> up by trusted people (eg staff pursuing a goal) then validation will
>> normally be enough.
>>
>> Of course, that is a call for Larry to make, not us, but it needs to be
>> borne in mind by him.
> 
> The input comes from in house people, not from the internet.

The question would still be whether the input should be trusted, and I would
still say no: accidental errors can cause as much damage as malicious input if
not correctly sanitised.

My experience with regexes is insufficient to help with any of the rest
of this query.

-- 
For some reason, this fortune reminds everyone of Marvin Zelkowitz.


Re: Validating regexp

2017-08-10 Thread Larry Martell
On Wed, Aug 9, 2017 at 8:33 PM, Cameron Simpson <c...@cskk.id.au> wrote:
> On 09Aug2017 10:46, Jon Ribbens <jon+use...@unequivocal.eu> wrote:
>>
>> On 2017-08-09, Cameron Simpson <c...@cskk.id.au> wrote:
>>>
>>> On 08Aug2017 17:31, Jon Ribbens <jon+use...@unequivocal.eu> wrote:
>>>>
>>>> ... but bear in mind, there have been ways of doing denial-of-service
>>>> attacks with valid-but-nasty regexps in the past, and I wouldn't want
>>>> to rely on there not being any now.
>>>
>>>
>>> The ones I've seen still require some input length (I'm thinking
>>> exponential rematch backoff stuff here). I suspect that if your test
>>> query matches the RE against a fixed empty string it is hard to be
>>> exploited. i.e. I think most of this stuff isn't expensive in terms of
>>> compiling the regexp but in executing it against text.
>>
>>
>> Well yes, but presumably if the OP is receiving regexps from users
>> they will be executed against text sooner or later.
>
>
> True, but the OP (Larry) was after validation.
>
> The risk then depends on the degree of trust in the user. If the user is a
> random person-from-the-internets, sure there's a risk there. However, if the
> regexp is part of some internal configuration being set up by trusted people
> (eg staff pursuing a goal) then validation will normally be enough.
>
> Of course, that is a call for Larry to make, not us, but it needs to be
> borne in mind by him.

The input comes from in house people, not from the internet.


Re: Validating regexp

2017-08-10 Thread Jon Ribbens
On 2017-08-10, Cameron Simpson <c...@cskk.id.au> wrote:
> On 09Aug2017 10:46, Jon Ribbens <jon+use...@unequivocal.eu> wrote:
>>On 2017-08-09, Cameron Simpson <c...@cskk.id.au> wrote:
>>> On 08Aug2017 17:31, Jon Ribbens <jon+use...@unequivocal.eu> wrote:
>>>>... but bear in mind, there have been ways of doing denial-of-service
>>>>attacks with valid-but-nasty regexps in the past, and I wouldn't want
>>>>to rely on there not being any now.
>>>
>>> The ones I've seen still require some input length (I'm thinking
>>> exponential rematch backoff stuff here). I suspect that if your
>>> test query matches the RE against a fixed empty string it is hard
>>> to be exploited. i.e. I think most of this stuff isn't expensive
>>> in terms of compiling the regexp but in executing it against text.
>>
>>Well yes, but presumably if the OP is receiving regexps from users
>>they will be executed against text sooner or later.
>
> True, but the OP (Larry) was after validation.
>
> The risk then depends on the degree of trust in the user. If the user is a 
> random person-from-the-internets, sure there's a risk there. However, if the 
> regexp is part of some internal configuration being set up by trusted people 
> (eg staff pursuing a goal) then validation will normally be enough.
>
> Of course, that is a call for Larry to make, not us, but it needs to be borne
> in mind by him.

Yes... hence my mentioning it.


Re: Validating regexp

2017-08-09 Thread Cameron Simpson

On 09Aug2017 10:46, Jon Ribbens <jon+use...@unequivocal.eu> wrote:
> On 2017-08-09, Cameron Simpson <c...@cskk.id.au> wrote:
>> On 08Aug2017 17:31, Jon Ribbens <jon+use...@unequivocal.eu> wrote:
>>> ... but bear in mind, there have been ways of doing denial-of-service
>>> attacks with valid-but-nasty regexps in the past, and I wouldn't want
>>> to rely on there not being any now.
>>
>> The ones I've seen still require some input length (I'm thinking exponential
>> rematch backoff stuff here). I suspect that if your test query matches the RE
>> against a fixed empty string it is hard to be exploited. i.e. I think most of
>> this stuff isn't expensive in terms of compiling the regexp but in
>> executing it against text.
>
> Well yes, but presumably if the OP is receiving regexps from users
> they will be executed against text sooner or later.

True, but the OP (Larry) was after validation.

The risk then depends on the degree of trust in the user. If the user is a
random person-from-the-internets, sure there's a risk there. However, if the
regexp is part of some internal configuration being set up by trusted people
(eg staff pursuing a goal) then validation will normally be enough.

Of course, that is a call for Larry to make, not us, but it needs to be borne
in mind by him.


Cheers,
Cameron Simpson <c...@cskk.id.au> (formerly c...@zip.com.au)


Re: Validating regexp

2017-08-09 Thread Larry Martell
On Wed, Aug 9, 2017 at 6:13 AM, Peter Heitzer
<peter.heit...@rz.uni-regensburg.de> wrote:
> Larry Martell <larry.mart...@gmail.com> wrote:
>>On Tue, Aug 8, 2017 at 12:51 PM, Chris Angelico <ros...@gmail.com> wrote:
>>> On Wed, Aug 9, 2017 at 2:37 AM, Larry Martell <larry.mart...@gmail.com> 
>>> wrote:
>>>> Anyone have any code or know of any packages for validating a regexp?
>>>>
>>>> I have an app that allows users to enter regexps for db searching.
>>>> When a user enters an invalid one (e.g. 'A|B|' is one I just saw) it
>>>> causes downstream issues. I'd like to flag it at entry time.
>>>
>>> re.compile()? Although I'm not sure that 'A|B|' is actually invalid.
>>> But re.compile("(") throws.
>
>>Yeah, it does not throw for 'A|B|' - but mysql chokes on it with empty
>>subexpression for regexp' I'd like to flag it before it gets to SQL.
>
> Then you need to do a real sql query with the regex and check if it throws.

Yes, that is what I ended up doing.


Re: Validating regexp

2017-08-09 Thread TheSeeker
On Tuesday, August 8, 2017 at 11:38:34 AM UTC-5, larry@gmail.com wrote:
> Anyone have any code or know of any packages for validating a regexp?
> 
> I have an app that allows users to enter regexps for db searching.
> When a user enters an invalid one (e.g. 'A|B|' is one I just saw) it
> causes downstream issues. I'd like to flag it at entry time.

Hello,

IIRC, there is a built-in regexp builder/tester in Boa Constructor:
http://boa-constructor.sourceforge.net/

I used this a long time ago.

Duane


Re: Validating regexp

2017-08-09 Thread Jon Ribbens
On 2017-08-09, Cameron Simpson <c...@cskk.id.au> wrote:
> On 08Aug2017 17:31, Jon Ribbens <jon+use...@unequivocal.eu> wrote:
>>On 2017-08-08, Chris Angelico <ros...@gmail.com> wrote:
>>> On Wed, Aug 9, 2017 at 2:57 AM, Larry Martell <larry.mart...@gmail.com> 
>>> wrote:
>>>> Yeah, it does not throw for 'A|B|' - but mysql chokes on it with empty
>>>> subexpression for regexp' I'd like to flag it before it gets to SQL.
>>>
>>> Okay, so your definition of validity is "what MySQL will accept". In
>>> that case, I'd feed it to MySQL and see if it accepts it. Regexps are
>>> sufficiently varied that you really need to use the same engine for
>>> validation as for execution.
>>
>>... but bear in mind, there have been ways of doing denial-of-service
>>attacks with valid-but-nasty regexps in the past, and I wouldn't want
>>to rely on there not being any now.
>
> The ones I've seen still require some input length (I'm thinking exponential 
> rematch backoff stuff here). I suspect that if your test query matches the RE 
> against a fixed empty string it is hard to be exploited. i.e. I think most of 
> this stuff isn't expensive in terms of compiling the regexp but in
> executing it against text.

Well yes, but presumably if the OP is receiving regexps from users
they will be executed against text sooner or later.


Re: Validating regexp

2017-08-09 Thread Peter Heitzer
Larry Martell <larry.mart...@gmail.com> wrote:
>On Tue, Aug 8, 2017 at 12:51 PM, Chris Angelico <ros...@gmail.com> wrote:
>> On Wed, Aug 9, 2017 at 2:37 AM, Larry Martell <larry.mart...@gmail.com> 
>> wrote:
>>> Anyone have any code or know of any packages for validating a regexp?
>>>
>>> I have an app that allows users to enter regexps for db searching.
>>> When a user enters an invalid one (e.g. 'A|B|' is one I just saw) it
>>> causes downstream issues. I'd like to flag it at entry time.
>>
>> re.compile()? Although I'm not sure that 'A|B|' is actually invalid.
>> But re.compile("(") throws.

>Yeah, it does not throw for 'A|B|' - but mysql chokes on it with empty
>subexpression for regexp' I'd like to flag it before it gets to SQL.

Then you need to do a real sql query with the regex and check if it throws.
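
For readers following along, a minimal sketch of why a Python-side check is not enough here. `compiles_in_python` is a hypothetical helper name, not something from the thread; it only reports whether Python's own `re` module accepts the pattern, which says nothing about MySQL's regex dialect:

```python
import re

def compiles_in_python(pattern):
    """Return True if Python's `re` module can compile `pattern`."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

print(compiles_in_python("("))      # unbalanced parenthesis
print(compiles_in_python("A|B|"))   # trailing empty alternative
```

`'A|B|'` passes this check even though, per Larry's report, MySQL rejects it, hence the advice to test against the real engine.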

-- 
Dipl.-Inform(FH) Peter Heitzer, peter.heit...@rz.uni-regensburg.de


Re: Validating regexp

2017-08-08 Thread Cameron Simpson

On 08Aug2017 17:31, Jon Ribbens <jon+use...@unequivocal.eu> wrote:
> On 2017-08-08, Chris Angelico <ros...@gmail.com> wrote:
>> On Wed, Aug 9, 2017 at 2:57 AM, Larry Martell <larry.mart...@gmail.com> wrote:
>>> Yeah, it does not throw for 'A|B|' - but mysql chokes on it with empty
>>> subexpression for regexp' I'd like to flag it before it gets to SQL.
>>
>> Okay, so your definition of validity is "what MySQL will accept". In
>> that case, I'd feed it to MySQL and see if it accepts it. Regexps are
>> sufficiently varied that you really need to use the same engine for
>> validation as for execution.
>
> ... but bear in mind, there have been ways of doing denial-of-service
> attacks with valid-but-nasty regexps in the past, and I wouldn't want
> to rely on there not being any now.


The ones I've seen still require some input length (I'm thinking exponential 
rematch backoff stuff here). I suspect that if your test query matches the RE 
against a fixed empty string it is hard to be exploited. i.e. I think most of 
this stuff isn't expensive in terms of compiling the regexp but in executing it 
against text.
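
A rough illustration of that point using Python's `re` module rather than MySQL's engine, so treat it as an analogy only. The pattern `(a+)+$` is a commonly cited example of nested quantifiers, and `timed_search` is a hypothetical helper; the sketch assumes a backtracking engine, where compiling is cheap and the cost shows up only when matching against longer non-matching input:

```python
import re
import time

# Compiling the pattern is fast; nothing pathological happens here.
pattern = re.compile(r'(a+)+$')

def timed_search(text):
    """Return (match, seconds) for one search of `pattern` over `text`."""
    start = time.perf_counter()
    match = pattern.search(text)
    return match, time.perf_counter() - start

# Against an empty string there is nothing to backtrack over.
empty_match, empty_secs = timed_search('')

# Longer runs of 'a' followed by a non-matching 'b' force the engine to
# try many ways of splitting the run before giving up.
short_match, short_secs = timed_search('a' * 12 + 'b')
longer_match, longer_secs = timed_search('a' * 18 + 'b')

print(empty_secs, short_secs, longer_secs)
```

The timings vary by machine and Python version, so none are quoted here; the inputs are kept deliberately short so the script finishes quickly.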


Happy to hear falsifications of my beliefs here.

Cheers,
Cameron Simpson <c...@cskk.id.au> (formerly c...@zip.com.au)


Re: Validating regexp

2017-08-08 Thread Jon Ribbens
On 2017-08-08, Chris Angelico <ros...@gmail.com> wrote:
> On Wed, Aug 9, 2017 at 2:57 AM, Larry Martell <larry.mart...@gmail.com> wrote:
>> Yeah, it does not throw for 'A|B|' - but mysql chokes on it with empty
>> subexpression for regexp' I'd like to flag it before it gets to SQL.
>
> Okay, so your definition of validity is "what MySQL will accept". In
> that case, I'd feed it to MySQL and see if it accepts it. Regexps are
> sufficiently varied that you really need to use the same engine for
> validation as for execution.

... but bear in mind, there have been ways of doing denial-of-service
attacks with valid-but-nasty regexps in the past, and I wouldn't want
to rely on there not being any now.


Re: Validating regexp

2017-08-08 Thread Chris Angelico
On Wed, Aug 9, 2017 at 2:57 AM, Larry Martell <larry.mart...@gmail.com> wrote:
> On Tue, Aug 8, 2017 at 12:51 PM, Chris Angelico <ros...@gmail.com> wrote:
>> On Wed, Aug 9, 2017 at 2:37 AM, Larry Martell <larry.mart...@gmail.com> 
>> wrote:
>>> Anyone have any code or know of any packages for validating a regexp?
>>>
>>> I have an app that allows users to enter regexps for db searching.
>>> When a user enters an invalid one (e.g. 'A|B|' is one I just saw) it
>>> causes downstream issues. I'd like to flag it at entry time.
>>
>> re.compile()? Although I'm not sure that 'A|B|' is actually invalid.
>> But re.compile("(") throws.
>
> Yeah, it does not throw for 'A|B|' - but mysql chokes on it with empty
> subexpression for regexp' I'd like to flag it before it gets to SQL.

Okay, so your definition of validity is "what MySQL will accept". In
that case, I'd feed it to MySQL and see if it accepts it. Regexps are
sufficiently varied that you really need to use the same engine for
validation as for execution.

ChrisA


Re: Validating regexp

2017-08-08 Thread Larry Martell
On Tue, Aug 8, 2017 at 12:57 PM, Larry Martell <larry.mart...@gmail.com> wrote:
> On Tue, Aug 8, 2017 at 12:51 PM, Chris Angelico <ros...@gmail.com> wrote:
>> On Wed, Aug 9, 2017 at 2:37 AM, Larry Martell <larry.mart...@gmail.com> 
>> wrote:
>>> Anyone have any code or know of any packages for validating a regexp?
>>>
>>> I have an app that allows users to enter regexps for db searching.
>>> When a user enters an invalid one (e.g. 'A|B|' is one I just saw) it
>>> causes downstream issues. I'd like to flag it at entry time.
>>
>> re.compile()? Although I'm not sure that 'A|B|' is actually invalid.
>> But re.compile("(") throws.
>
> Yeah, it does not throw for 'A|B|' - but mysql chokes on it with empty
> subexpression for regexp' I'd like to flag it before it gets to SQL.

I guess I will have to do a test query with it and catch the error.
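
A minimal sketch of such a test query, assuming a DB-API 2.0 cursor from a MySQL driver that uses the `%s` parameter style. The function name `regexp_accepted_by_db` and the `db_error` parameter are hypothetical; pass whatever exception class your driver raises for a bad statement.

```python
def regexp_accepted_by_db(cursor, pattern, db_error=Exception):
    """Return (True, None) if the server accepts `pattern`, else (False, message)."""
    try:
        # Match against a fixed empty string so the probe itself stays cheap.
        # The pattern is passed as a bound parameter, never interpolated into SQL.
        cursor.execute("SELECT '' REGEXP %s", (pattern,))
        cursor.fetchall()
    except db_error as exc:
        return False, str(exc)
    return True, None
```

Running this at entry time means the same engine that will execute the search also decides whether the pattern is valid.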


Re: Validating regexp

2017-08-08 Thread Chris Angelico
On Wed, Aug 9, 2017 at 2:37 AM, Larry Martell <larry.mart...@gmail.com> wrote:
> Anyone have any code or know of any packages for validating a regexp?
>
> I have an app that allows users to enter regexps for db searching.
> When a user enters an invalid one (e.g. 'A|B|' is one I just saw) it
> causes downstream issues. I'd like to flag it at entry time.

re.compile()? Although I'm not sure that 'A|B|' is actually invalid.
But re.compile("(") throws.

ChrisA


Re: Validating regexp

2017-08-08 Thread Larry Martell
On Tue, Aug 8, 2017 at 12:51 PM, Chris Angelico <ros...@gmail.com> wrote:
> On Wed, Aug 9, 2017 at 2:37 AM, Larry Martell <larry.mart...@gmail.com> wrote:
>> Anyone have any code or know of any packages for validating a regexp?
>>
>> I have an app that allows users to enter regexps for db searching.
>> When a user enters an invalid one (e.g. 'A|B|' is one I just saw) it
>> causes downstream issues. I'd like to flag it at entry time.
>
> re.compile()? Although I'm not sure that 'A|B|' is actually invalid.
> But re.compile("(") throws.

Yeah, it does not throw for 'A|B|' - but MySQL chokes on it with 'empty
subexpression for regexp'. I'd like to flag it before it gets to SQL.


Re: Validating regexp

2017-08-08 Thread Skip Montanaro
> I have an app that allows users to enter regexps for db searching.
> When a user enters an invalid one (e.g. 'A|B|' is one I just saw) it
> causes downstream issues. I'd like to flag it at entry time.

Just call re.compile(...) on it and catch any exceptions, modulo
caveats about operating with unvalidated user input.

Skip
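
A small sketch of that suggestion; the helper name `python_regex_error` is made up for illustration. It only checks Python's own regexp dialect, so a pattern such as 'A|B|' passes here even if another engine rejects it.

```python
import re


def python_regex_error(pattern):
    """Return None if Python's re module accepts `pattern`, else the error text."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None


if __name__ == '__main__':
    for candidate in ('A|B|', '('):
        print(repr(candidate), python_regex_error(candidate))
```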


Re: Validating regexp

2017-08-08 Thread MRAB

On 2017-08-08 17:37, Larry Martell wrote:
> Anyone have any code or know of any packages for validating a regexp?
>
> I have an app that allows users to enter regexps for db searching.
> When a user enters an invalid one (e.g. 'A|B|' is one I just saw) it
> causes downstream issues. I'd like to flag it at entry time.

Couldn't you just try to compile the regex and catch any exception?

Also, in what way is 'A|B|' invalid?


Validating regexp

2017-08-08 Thread Larry Martell
Anyone have any code or know of any packages for validating a regexp?

I have an app that allows users to enter regexps for db searching.
When a user enters an invalid one (e.g. 'A|B|' is one I just saw) it
causes downstream issues. I'd like to flag it at entry time.


Re: Different behaviour of regexp in 3.6.0b2

2016-10-15 Thread Lele Gaifax
Serhiy Storchaka  writes:

> Seems the documentation is not accurate. Could you file a report on
> https://bugs.python.org/ ?

Thank you to everybody who answered!

Here it is: http://bugs.python.org/issue28450

ciao, lele.
-- 
nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri
real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia.
l...@metapensiero.it  | -- Fortunato Depero, 1929.



Re: Different behaviour of regexp in 3.6.0b2

2016-10-15 Thread Peter Otten
Serhiy Storchaka wrote:

> On 14.10.16 20:01, Peter Otten wrote:
>> >>> def double_bs(s): return "\\\\".join(s.split("\\"))
>> ...
>
> Just use s.replace('\\', r'\\').

D'oh!



Re: Different behaviour of regexp in 3.6.0b2

2016-10-15 Thread Serhiy Storchaka

On 14.10.16 19:15, Chris Angelico wrote:
> I wasn't specifically aware that the re module was doing the same
> thing, but it'll be from the same purpose and goal. The idea is that,
> for instance, Windows path names in non-raw string literals will no
> longer behave differently based on whether the path is "my_user" or
> "the_other_user". Definite improvement.

The re module emitted deprecation warnings in 3.5. In 3.6 the warnings
become errors. The idea is that this allows new special sequences
(like \p{...} or \R) to be added in the future.





Re: Different behaviour of regexp in 3.6.0b2

2016-10-15 Thread Serhiy Storchaka

On 14.10.16 20:01, Peter Otten wrote:
> Lele Gaifax wrote:
>> So, how am I supposed to achieve the mentioned intent? By doubling the
>> escape in the replacement?
>
> If there are no escape sequences aimed to be handled by re.sub() you can
> escape the replacement wholesale:
>
> >>> re.sub(r'\s+', re.escape(r'\s+'), 'foo bar')
> 'foo\\s\\+bar'
>
> OK, that probably escaped too much. Second attempt:
>
> >>> re.sub(r'\s+', lambda m: r'\s+', 'foo bar')
> 'foo\\s+bar'
>
> Better? If that's too much work at runtime:
>
> >>> def double_bs(s): return "\\\\".join(s.split("\\"))
> ...
> >>> re.sub(r'\s+', double_bs(r'\s+'), 'foo bar')
> 'foo\\s+bar'

Just use s.replace('\\', r'\\').




Re: Different behaviour of regexp in 3.6.0b2

2016-10-15 Thread Serhiy Storchaka

On 14.10.16 18:40, Lele Gaifax wrote:
> Hi all,
>
> trying out pgcli with Python 3.6.0b2 I got an error related to what seems to be
> a different behaviour, or even a bug, of re.sub().
>
> The original intent is to replace spaces within a string with the regular
> expression \s+ (see
> https://github.com/dbcli/pgcli/blob/master/pgcli/packages/prioritization.py#L11,
> ignore the fact that the re.sub() call seems suboptimal).
>
> With Python 3.5.2 it is straightforward:
>
>   $ python3.5
>   Python 3.5.2+ (default, Sep 22 2016, 12:18:14)
>   [GCC 6.2.0 20160927] on linux
>   Type "help", "copyright", "credits" or "license" for more information.
>   >>> import re
>   >>> re.sub(r'\s+', r'\s+', 'foo bar')
>   'foo\\s+bar'
>
> While Python 3.6.0b2 gives:
>
>   $ python3.6
>   Python 3.6.0b2+ (default, Oct 11 2016, 08:30:05)
>   [GCC 6.2.0 20160927] on linux
>   Type "help", "copyright", "credits" or "license" for more information.
>   >>> import re
>   >>> re.sub(r'\s+', r'\s+', 'foo bar')
>   Traceback (most recent call last):
>     File "/usr/local/python3.6/lib/python3.6/sre_parse.py", line 945, in parse_template
>       this = chr(ESCAPES[this][1])
>   KeyError: '\\s'
>
>   During handling of the above exception, another exception occurred:
>
>   Traceback (most recent call last):
>     File "<stdin>", line 1, in <module>
>     File "/usr/local/python3.6/lib/python3.6/re.py", line 191, in sub
>       return _compile(pattern, flags).sub(repl, string, count)
>     File "/usr/local/python3.6/lib/python3.6/re.py", line 326, in _subx
>       template = _compile_repl(template, pattern)
>     File "/usr/local/python3.6/lib/python3.6/re.py", line 317, in _compile_repl
>       return sre_parse.parse_template(repl, pattern)
>     File "/usr/local/python3.6/lib/python3.6/sre_parse.py", line 948, in parse_template
>       raise s.error('bad escape %s' % this, len(this))
>   sre_constants.error: bad escape \s at position 0
>
> According to the documentation
> (https://docs.python.org/3.6/library/re.html#re.sub),
> “unknown escapes [in the repl argument] such as \& are left alone”.
>
> Am I missing something, or is this a regression?


Unknown escapes consisting of "\" followed by an ASCII letter are errors 
in 3.6 (and warnings in 3.5). It seems the documentation is not accurate. 
Could you file a report on https://bugs.python.org/ ?
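
A short sketch of the rule described above; `try_sub` is a hypothetical helper, not part of the re module. It assumes a current Python 3 release, where a backslash followed by an unknown ASCII letter in the replacement raises `re.error`; earlier releases may only warn, or accept it silently as 3.5 does in the session above.

```python
import re


def try_sub(repl):
    """Run re.sub with `repl` as the template; return the result or the error text."""
    try:
        return re.sub(r'\s+', repl, 'foo bar')
    except re.error as exc:
        return 'error: %s' % exc


if __name__ == '__main__':
    print(try_sub(r'\s+'))    # backslash + ASCII letter: rejected
    print(try_sub(r'\&'))     # backslash + non-letter: left alone
    print(try_sub(r'\\s+'))   # escaped backslash: one literal backslash in the result
```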





Re: Different behaviour of regexp in 3.6.0b2

2016-10-14 Thread Chris Angelico
On Sat, Oct 15, 2016 at 4:48 AM, Ned Batchelder  wrote:
> On Friday, October 14, 2016 at 1:27:09 PM UTC-4, Chris Angelico wrote:
>> On Sat, Oct 15, 2016 at 4:12 AM, Ned Batchelder  
>> wrote:
>> > There doesn't seem to be a change to string literals at all. It's only a
>> > change in the regex engine.
>> >
>> > Python 3.6.0b2 (default, Oct 10 2016, 21:30:05)
>> > [GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
>> > Type "help", "copyright", "credits" or "license" for more information.
>> > >>> "\s"
>> > '\\s'
>>
>> Try with -Wall. To avoid breaking every novice Windows program ever
>> written (bar two or three), it's only a warning for now.
>
> I see. I'm not sure how novice users will know to enable warnings that are
> off by default.  Do people regularly run their code with -Wall? I never have,
> and I don't know how I would have seen these warnings.

Tools like ipython can choose to enable warnings by default. I'm not
sure if they do or not, but it'd be a good thing.

ChrisA


Re: Different behaviour of regexp in 3.6.0b2

2016-10-14 Thread Ned Batchelder
On Friday, October 14, 2016 at 1:27:09 PM UTC-4, Chris Angelico wrote:
> On Sat, Oct 15, 2016 at 4:12 AM, Ned Batchelder  
> wrote:
> > There doesn't seem to be a change to string literals at all. It's only a
> > change in the regex engine.
> >
> > Python 3.6.0b2 (default, Oct 10 2016, 21:30:05)
> > [GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
> > Type "help", "copyright", "credits" or "license" for more information.
> > >>> "\s"
> > '\\s'
> 
> Try with -Wall. To avoid breaking every novice Windows program ever
> written (bar two or three), it's only a warning for now.

I see. I'm not sure how novice users will know to enable warnings that are
off by default.  Do people regularly run their code with -Wall? I never have,
and I don't know how I would have seen these warnings.

--Ned.


Re: Different behaviour of regexp in 3.6.0b2

2016-10-14 Thread Lele Gaifax
Ned Batchelder  writes:

> On Friday, October 14, 2016 at 12:50:44 PM UTC-4, Lele Gaifax wrote:
>> Chris Angelico  writes:
>> 
>> > There's a shift as of 3.6 to make unrecognized alphabetic escapes into
>> > errors, or at least warnings.
>> 
>> But we are talking about raw strings here, specifically r'\s+'.
>> 
>> I agree that with plain strings it's a plus.
>
> The raw string means the regex engine gets three characters: backslash,
> s, plus.  It then has to decide what backslash-s means. In 3.6, this is
> an error.  You'll need to escape the backslash for the regex engine:
>
> >>> re.sub(r'\s+', r'\\s+', 'foo bar')
> 'foo\\s+bar'

Thanks for the clarification.

I tested the above syntax and it works flawlessly on 2.7, 3.5 and 3.6b2, and I
will therefore suggest it on https://github.com/dbcli/pgcli/issues/595.

bye, lele.
-- 
nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri
real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia.
l...@metapensiero.it  | -- Fortunato Depero, 1929.



Re: Different behaviour of regexp in 3.6.0b2

2016-10-14 Thread Chris Angelico
On Sat, Oct 15, 2016 at 4:12 AM, Ned Batchelder  wrote:
> There doesn't seem to be a change to string literals at all. It's only a
> change in the regex engine.
>
> Python 3.6.0b2 (default, Oct 10 2016, 21:30:05)
> [GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> "\s"
> '\\s'

Try with -Wall. To avoid breaking every novice Windows program ever
written (bar two or three), it's only a warning for now.

ChrisA


Re: Different behaviour of regexp in 3.6.0b2

2016-10-14 Thread Ned Batchelder
On Friday, October 14, 2016 at 1:00:12 PM UTC-4, Chris Angelico wrote:
> On Sat, Oct 15, 2016 at 3:45 AM, Lele Gaifax  wrote:
> > Chris Angelico  writes:
> >
> >> There's a shift as of 3.6 to make unrecognized alphabetic escapes into
> >> errors, or at least warnings.
> >
> > But we are talking about raw strings here, specifically r'\s+'.
> >
> > I agree that with plain strings it's a plus.
> 
> Right; the main change is for non-raw string literals, but it looks
> like the same change was made to regular expressions at the same time.
> IMO that's a good thing - the rule is simply "starting with 3.6, you
> should avoid \Z for any upper- or lower-case Z that doesn't have a
> documented meaning".

There doesn't seem to be a change to string literals at all. It's only a
change in the regex engine.

Python 3.6.0b2 (default, Oct 10 2016, 21:30:05)
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> "\s"
'\\s'

--Ned.


Re: Different behaviour of regexp in 3.6.0b2

2016-10-14 Thread Ned Batchelder
On Friday, October 14, 2016 at 12:50:44 PM UTC-4, Lele Gaifax wrote:
> Chris Angelico  writes:
> 
> > There's a shift as of 3.6 to make unrecognized alphabetic escapes into
> > errors, or at least warnings.
> 
> But we are talking about raw strings here, specifically r'\s+'.
> 
> I agree that with plain strings it's a plus.

The raw string means the regex engine gets three characters: backslash,
s, plus.  It then has to decide what backslash-s means. In 3.6, this is
an error.  You'll need to escape the backslash for the regex engine:

>>> re.sub(r'\s+', r'\\s+', 'foo bar')
'foo\\s+bar'
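
To make the character counting concrete, here is a small sketch (the variable names are just for illustration). The raw literal hands the regex engine three characters; the escaped replacement has four, which the engine turns back into a single backslash followed by `s+`.

```python
import re

template = r'\s+'
print(len(template), list(template))   # three characters: backslash, s, plus

escaped = r'\\s+'
print(len(escaped), list(escaped))     # four characters: two backslashes, s, plus

result = re.sub(r'\s+', escaped, 'foo bar')
print(result)                          # the text foo\s+bar, with a single backslash
```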

--Ned.


Re: Different behaviour of regexp in 3.6.0b2

2016-10-14 Thread Peter Otten
Lele Gaifax wrote:

> Peter Otten <__pete...@web.de> writes:
> 
>> Lele Gaifax wrote:
>>
>>> The original intent is to replace spaces within a string with the
>>> regular expression \s+ (see
>>> ...
>>> Accordingly to the documentation
>>> (https://docs.python.org/3.6/library/re.html#re.sub) “unknown escapes
>>> [in the repl argument] such as \& are left alone”.
> 
>> According to
>>
>> https://docs.python.org/dev/library/re.html#re.sub
>>
>> rejection of \s is intentional
>>
>> """
>> Changed in version 3.6: Unknown escapes consisting of '\' and an ASCII
>> letter now are errors.
>> """
> 
> So, how am I supposed to achieve the mentioned intent? By doubling the
> escape in the replacement?

If there are no escape sequences aimed to be handled by re.sub() you can 
escape the replacement wholesale:

>>> re.sub(r'\s+', re.escape(r'\s+'), 'foo bar')
'foo\\s\\+bar'

OK, that probably escaped too much. Second attempt:

>>> re.sub(r'\s+', lambda m: r'\s+', 'foo bar')
'foo\\s+bar'

Better? If that's too much work at runtime:

>>> def double_bs(s): return "\\\\".join(s.split("\\"))
... 
>>> re.sub(r'\s+', double_bs(r'\s+'), 'foo bar')
'foo\\s+bar'
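
The two working approaches can be wrapped up as follows. This is only a sketch: `sub_literal` and `escape_template` are hypothetical names, and `escape_template` protects backslashes only, which is all the replacement template treats specially.

```python
import re


def sub_literal(pattern, replacement, text):
    """Replace matches of `pattern` with `replacement` taken verbatim."""
    # A callable's return value is inserted as-is, with no escape processing.
    return re.sub(pattern, lambda match: replacement, text)


def escape_template(replacement):
    """Double every backslash so the string survives template processing unchanged."""
    return replacement.replace('\\', '\\\\')


if __name__ == '__main__':
    print(sub_literal(r'\s+', r'\s+', 'foo bar'))
    print(re.sub(r'\s+', escape_template(r'\s+'), 'foo bar'))
```

The callable form avoids template parsing altogether; the escaping form is handy when the replacement string has to be prepared once and passed around.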

>> though IMHO the traceback needs a cleanup.
> 
> And the documentation as well, to clarify the fact immediately, without
> assuming one will scroll down to the "changed in version" part (at least,
> that is what seem the rule in other parts of the manual).
> 
> Thank you,
> ciao, lele.




Re: Different behaviour of regexp in 3.6.0b2

2016-10-14 Thread Chris Angelico
On Sat, Oct 15, 2016 at 3:45 AM, Lele Gaifax  wrote:
> Chris Angelico  writes:
>
>> There's a shift as of 3.6 to make unrecognized alphabetic escapes into
>> errors, or at least warnings.
>
> But we are talking about raw strings here, specifically r'\s+'.
>
> I agree that with plain strings it's a plus.

Right; the main change is for non-raw string literals, but it looks
like the same change was made to regular expressions at the same time.
IMO that's a good thing - the rule is simply "starting with 3.6, you
should avoid \Z for any upper- or lower-case Z that doesn't have a
documented meaning".

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Different behaviour of regexp in 3.6.0b2

2016-10-14 Thread Lele Gaifax
Lele Gaifax  writes:

> And the documentation as well, to clarify the fact immediately, without
> assuming one will scroll down to the "changed in version" part (at least, that
> is what seem the rule in other parts of the manual).

Also, I'd prefer the "Changed in 3.6" be less ambiguous whether it refers to
the `pattern` or to the `repl` argument, or to both.

ciao, lele.
-- 
nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri
real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia.
l...@metapensiero.it  | -- Fortunato Depero, 1929.



Re: Different behaviour of regexp in 3.6.0b2

2016-10-14 Thread Lele Gaifax
Chris Angelico  writes:

> There's a shift as of 3.6 to make unrecognized alphabetic escapes into
> errors, or at least warnings.

But we are talking about raw strings here, specifically r'\s+'.

I agree that with plain strings it's a plus.

ciao, lele.
-- 
nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri
real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia.
l...@metapensiero.it  | -- Fortunato Depero, 1929.



Re: Different behaviour of regexp in 3.6.0b2

2016-10-14 Thread Lele Gaifax
Peter Otten <__pete...@web.de> writes:

> Lele Gaifax wrote:
>
>> The original intent is to replace spaces within a string with the regular
>> expression \s+ (see
>> ...
>> Accordingly to the documentation
>> (https://docs.python.org/3.6/library/re.html#re.sub) “unknown escapes [in
>> the repl argument] such as \& are left alone”.

> According to
>
> https://docs.python.org/dev/library/re.html#re.sub
>
> rejection of \s is intentional
>
> """
> Changed in version 3.6: Unknown escapes consisting of '\' and an ASCII 
> letter now are errors.
> """

So, how am I supposed to achieve the mentioned intent? By doubling the escape
in the replacement?

> though IMHO the traceback needs a cleanup.

And the documentation as well, to clarify the fact immediately, without
assuming one will scroll down to the "changed in version" part (at least, that
seems to be the rule in other parts of the manual).

Thank you,
ciao, lele.
-- 
nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri
real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia.
l...@metapensiero.it  | -- Fortunato Depero, 1929.



Re: Different behaviour of regexp in 3.6.0b2

2016-10-14 Thread Peter Otten
Lele Gaifax wrote:

> Hi all,
> 
> trying out pgcli with Python 3.6.0b2 I got an error related to what seem a
> different behaviour, or even a bug, of re.sub().
> 
> The original intent is to replace spaces within a string with the regular
> expression \s+ (see
> https://github.com/dbcli/pgcli/blob/master/pgcli/packages/prioritization.py#L11,
> ignore the fact that the re.sub() call seem underoptimal).
> 
> With Python 3.5.2 is straightforward:
> 
>   $ python3.5
>   Python 3.5.2+ (default, Sep 22 2016, 12:18:14)
>   [GCC 6.2.0 20160927] on linux
>   Type "help", "copyright", "credits" or "license" for more information.
>   >>> import re
>   >>> re.sub(r'\s+', r'\s+', 'foo bar')
>   'foo\\s+bar'
> 
> While Python 3.6.0b2 gives:
> 
>   $ python3.6
>   Python 3.6.0b2+ (default, Oct 11 2016, 08:30:05)
>   [GCC 6.2.0 20160927] on linux
>   Type "help", "copyright", "credits" or "license" for more information.
>   >>> import re
>   >>> re.sub(r'\s+', r'\s+', 'foo bar')
>   Traceback (most recent call last):
>     File "/usr/local/python3.6/lib/python3.6/sre_parse.py", line 945, in parse_template
>       this = chr(ESCAPES[this][1])
>   KeyError: '\\s'
>
>   During handling of the above exception, another exception occurred:
>
>   Traceback (most recent call last):
>     File "<stdin>", line 1, in <module>
>     File "/usr/local/python3.6/lib/python3.6/re.py", line 191, in sub
>       return _compile(pattern, flags).sub(repl, string, count)
>     File "/usr/local/python3.6/lib/python3.6/re.py", line 326, in _subx
>       template = _compile_repl(template, pattern)
>     File "/usr/local/python3.6/lib/python3.6/re.py", line 317, in _compile_repl
>       return sre_parse.parse_template(repl, pattern)
>     File "/usr/local/python3.6/lib/python3.6/sre_parse.py", line 948, in parse_template
>       raise s.error('bad escape %s' % this, len(this))
>   sre_constants.error: bad escape \s at position 0
> 
> Accordingly to the documentation
> (https://docs.python.org/3.6/library/re.html#re.sub) “unknown escapes [in
> the repl argument] such as \& are left alone”.
> 
> Am I missing something, or is this a regression?

According to

https://docs.python.org/dev/library/re.html#re.sub

rejection of \s is intentional

"""
Changed in version 3.6: Unknown escapes consisting of '\' and an ASCII 
letter now are errors.
"""

though IMHO the traceback needs a cleanup.



Re: Different behaviour of regexp in 3.6.0b2

2016-10-14 Thread Chris Angelico
On Sat, Oct 15, 2016 at 2:40 AM, Lele Gaifax  wrote:
> Accordingly to the documentation 
> (https://docs.python.org/3.6/library/re.html#re.sub)
> “unknown escapes [in the repl argument] such as \& are left alone”.
>
> Am I missing something, or is this a regression?

Further down, you'll find this note:

Changed in version 3.6: Unknown escapes consisting of '\' and an ASCII
letter now are errors.

There's a shift as of 3.6 to make unrecognized alphabetic escapes into
errors, or at least warnings.

rosuav@sikorsky:~$ python3 -Wall
Python 3.7.0a0 (default:a78446a65b1d+, Sep 29 2016, 02:01:55)
[GCC 6.1.1 20160802] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> "C:\Documents\my_user"
sys:1: DeprecationWarning: invalid escape sequence '\D'
sys:1: DeprecationWarning: invalid escape sequence '\m'
'C:\\Documents\\my_user'
>>>
rosuav@sikorsky:~$ python3.5 -Wall
Python 3.5.2+ (default, Sep 22 2016, 12:18:14)
[GCC 6.2.0 20160914] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> "C:\Documents\my_user"
'C:\\Documents\\my_user'
>>>

I wasn't specifically aware that the re module was doing the same
thing, but it'll be from the same purpose and goal. The idea is that,
for instance, Windows path names in non-raw string literals will no
longer behave differently based on whether the path is "my_user" or
"the_other_user". Definite improvement.

ChrisA
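
For anyone who does not normally run with -Wall, the same warning can be surfaced from inside a program. This is a sketch using the standard `warnings` module; the warning category is DeprecationWarning in the session above, and newer releases may report SyntaxWarning instead.

```python
import warnings

# Source text of a non-raw string literal containing unrecognised escapes.
source = r'"C:\Documents\my_user"'

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')   # show every warning, for this block only
    value = eval(compile(source, '<example>', 'eval'))

for w in caught:
    print(w.category.__name__, w.message)
print(repr(value))
```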


Different behaviour of regexp in 3.6.0b2

2016-10-14 Thread Lele Gaifax
Hi all,

trying out pgcli with Python 3.6.0b2 I got an error related to what seems to be
a different behaviour, or even a bug, of re.sub().

The original intent is to replace spaces within a string with the regular
expression \s+ (see
https://github.com/dbcli/pgcli/blob/master/pgcli/packages/prioritization.py#L11,
ignore the fact that the re.sub() call seems suboptimal).

With Python 3.5.2 it is straightforward:

  $ python3.5
  Python 3.5.2+ (default, Sep 22 2016, 12:18:14) 
  [GCC 6.2.0 20160927] on linux
  Type "help", "copyright", "credits" or "license" for more information.
  >>> import re
  >>> re.sub(r'\s+', r'\s+', 'foo bar')
  'foo\\s+bar'

While Python 3.6.0b2 gives:

  $ python3.6
  Python 3.6.0b2+ (default, Oct 11 2016, 08:30:05) 
  [GCC 6.2.0 20160927] on linux
  Type "help", "copyright", "credits" or "license" for more information.
  >>> import re
  >>> re.sub(r'\s+', r'\s+', 'foo bar')
  Traceback (most recent call last):
    File "/usr/local/python3.6/lib/python3.6/sre_parse.py", line 945, in parse_template
      this = chr(ESCAPES[this][1])
  KeyError: '\\s'

  During handling of the above exception, another exception occurred:

  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/usr/local/python3.6/lib/python3.6/re.py", line 191, in sub
      return _compile(pattern, flags).sub(repl, string, count)
    File "/usr/local/python3.6/lib/python3.6/re.py", line 326, in _subx
      template = _compile_repl(template, pattern)
    File "/usr/local/python3.6/lib/python3.6/re.py", line 317, in _compile_repl
      return sre_parse.parse_template(repl, pattern)
    File "/usr/local/python3.6/lib/python3.6/sre_parse.py", line 948, in parse_template
      raise s.error('bad escape %s' % this, len(this))
  sre_constants.error: bad escape \s at position 0

According to the documentation
(https://docs.python.org/3.6/library/re.html#re.sub),
“unknown escapes [in the repl argument] such as \& are left alone”.

Am I missing something, or is this a regression?

In the meantime, I will alert the pgcli people.

Thanks in advance,
bye, lele.
-- 
nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri
real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia.
l...@metapensiero.it  | -- Fortunato Depero, 1929.



[issue27898] regexp performance degradation between 2.7.6 and 2.7.12

2016-09-02 Thread Steve Newcomb

Steve Newcomb added the comment:

On 08/30/2016 12:46 PM, Raymond Hettinger wrote:
> Raymond Hettinger added the comment:
>
> It would be helpful if you ... make a small set of regular expressions that 
> demonstrate the performance regression.
>
Done.  Attachments:

test.py : Code that exercises re.sub() and outputs a profile report.

test_output_2.7.6.txt : Output of test.py under Python 2.7.6.

test_output_2.7.12.txt : Output of test.py under Python 2.7.12.

p17-188.htm -- test data: public information from the U.S. Internal 
Revenue Service.

Equivalent hardware was used in both cases.

The outputs show that 2.7.12's re.sub() takes 1.2 times as long as 
2.7.6's.  It's a significant difference, but...

...it was not the dramatic degradation I expected to find in this 
exercise.  Therefore I attempted to tease what I was looking for out of 
the profile stats I already uploaded to this site, made from actual 
production runs.  My attempts are all found in an hg repository that can 
be downloaded from 
sftp://s...@coolheads.com//files/py-re-perform-276-2712 using password 
bysIe20H .

I do not feel the latter work took me where I wanted to go, and I think 
the reason is that, at least for purposes of our application, Python 
2.7.12 has been so extensively refactored since Python 2.7.6.  So it's 
an apples-to-oranges comparison, apparently.  Still, the performance 
difference for re.sub() is quite dramatic, and re.sub() is the only 
comparable function whose performance dramatically worsened: in our 
application, 2.7.12's re.sub() takes 3.04 times as long as 2.7.6's.

The good news, of course, is that by and large the performance of the 
other *comparable* functions largely improved, often dramatically.  But 
at least in our application, it doesn't come close to making up for the 
degradation in re.sub().

My by-the-gut bottom line: somebody who really knows the re module 
should take a deep look at re.sub().  Why would re.sub(), unlike all 
others, take so much longer to run, while *every* other function in the 
re module gets (often much) faster?  It feels like there's a bug 
somewhere in re.sub().
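
For a quick cross-version comparison that does not depend on the profiler, something like the following could be run under each interpreter. It is a sketch only: `HREF`, `SAMPLE`, `run_once` and `best_time` are hypothetical stand-ins, not the pattern or data from the attached test.py.

```python
import re
import timeit

# Hypothetical stand-ins for the real pattern and data in the attachments.
HREF = re.compile(r'href=(["\'])(.*?)\1')
SAMPLE = '<a href="/a/b.htm">link</a>\n' * 2000


def run_once():
    """One re.sub() pass over the sample, with a callable replacement."""
    return HREF.sub(lambda m: m.group(0), SAMPLE)


def best_time(repeats=5):
    """Best wall-clock time of several single runs, to reduce noise."""
    return min(timeit.repeat(run_once, number=1, repeat=repeats))


if __name__ == '__main__':
    print(best_time())
```

Taking the minimum of several runs on the same hardware gives a steadier number to compare between 2.7.6 and 2.7.12 than a single profiled run.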

Steve Newcomb

--
Added file: http://bugs.python.org/file44335/test.py
Added file: http://bugs.python.org/file44336/test_output_2.7.6.txt
Added file: http://bugs.python.org/file44337/p17-188.htm
Added file: http://bugs.python.org/file44338/test_output_2.7.12.txt

#!/usr/bin/env python2

import codecs, profile, os, re, sys

hrefRE = re.compile(
''.join(
[
r'href=',
r'(?P["\'])',
r'(?P',
r'.*?',
r')',
r'(?=quote)',
],
),
)
###
onePathSegmentMS = ''.join(
[
r'(?P<_pathSeg>',
r'(',
r'/?',
r'(',
r'(?!',
r'[ \t\r\n]+',
r'$',
r')',
u'[^%s]' % ( re.escape( r'/?#')),
r')+',
r'|',
r'/',
r')',
r')',
],
)
onePathSegmentRE = re.compile( onePathSegmentMS)

###
uriMS = r''.join(
(
r'(?P',  ## leading whitespace is OK and ignorable; see http://dev.w3.org/html5/spec-LC/urls.html
r'[ \t\r\n]+',
r')?',
r'(',
r'(?P',
r'https?',  
r')',
r':\/{0,2}',   ## accounts for encountered error: only 0 or 1 slash instead of 2
r')?',
r'(?P',
r'(?P',
r'(',
r'(?P<_userinfo>',
r'[^%s]+' % ( re.escape( r'@/[:?#')),
r')',
re.escape( '@'),
r')?',
r')',
r'(?P',
r'(?P',
re.escape( r'['),
r')?',
r'(',
r'(?P',
r'(',
r'[0-9]{1,3}%s' % ( re.escape( r'.')),
r'){3}',
r'[0-9]{1,3}',
r')',
r'|',
r'(?P',
r'(',
r'[0-9A-Fa-f]{0,4}%s' % ( re.escape( ':')),
r'){1,7}',
r'[0-9A-Fa-f]{0,4}',
r')',
r'|',
r'(?P',
r'(',
r'[^%s]+?' % ( re.escape( r']:/?#')),  ## this may have dots
r'\.',
r')+',
r'(?P',  ## top-level domain, e.g. "com", "gov" etc.
r'(',

[issue27898] regexp performance degradation between 2.7.6 and 2.7.12

2016-09-01 Thread Steve Newcomb

Steve Newcomb added the comment:

On 09/01/2016 05:01 PM, Steve Newcomb wrote:
>
>> The outputs show that 2.7.12's re.sub() takes 1.2 times as long as 
>> 2.7.6's.  It's a significant difference, but...
>>
>> ...it was not the dramatic degradation I expected to find in this 
>> exercise.
On second (third?) thought, the degree of degradation could easily 
depend on the source data being processed.  Maybe test.py does, in fact, 
demonstrate the problem, but the test data I used (p17-188.htm) do not 
demonstrate a terribly severe case of the problem.

--




[issue27898] regexp performance degradation between 2.7.6 and 2.7.12

2016-09-01 Thread STINNER Victor

Changes by STINNER Victor :


--
nosy: +haypo
type:  -> performance




[issue27898] regexp performance degradation between 2.7.6 and 2.7.12

2016-09-01 Thread Steve Newcomb

Steve Newcomb added the comment:

Oops.  The correct url is sftp://coolheads.com/files/py-re-perform-276v2712/

On 09/01/2016 04:52 PM, Steve Newcomb wrote:
> On 08/30/2016 12:46 PM, Raymond Hettinger wrote:
>> Raymond Hettinger added the comment:
>>
>> It would be helpful if you ... make a small set of regular 
>> expressions that demonstrate the performance regression.
>>
> Done.  Attachments:
>
> test.py : Code that exercises re.sub() and outputs a profile report.
>
> test_output_2.7.6.txt : Output of test.py under Python 2.7.6.
>
> test_output_2.7.12.txt : Output of test.py under Python 2.7.12.
>
> p17.188.htm -- test data: public information from the U.S. Internal 
> Revenue Service.
>
> Equivalent hardware was used in both cases.
>
> The outputs show that 2.7.12's re.sub() takes 1.2 times as long as 
> 2.7.6's.  It's a significant difference, but...
>
> ...it was not the dramatic degradation I expected to find in this 
> exercise.  Therefore I attempted to tease what I was looking for out 
> of the profile stats I already uploaded to this site, made from actual 
> production runs.  My attempts are all found in an hg repository that 
> can be downloaded from 
> sftp://s...@coolheads.com//files/py-re-perform-276-2712 using password 
> bysIe20H .
>
> I do not feel the latter work took me where I wanted to go, and I 
> think the reason is that, at least for purposes of our application, 
> Python 2.7.12 has been so extensively refactored since Python 2.7.6.  
> So it's an apples-to-oranges comparison, apparently.  Still, the 
> performance difference for re.sub() is quite dramatic, and re.sub() 
> is the only comparable function whose performance dramatically 
> worsened: in our application, 2.7.12's re.sub() takes 3.04 times as 
> long as 2.7.6's.
>
> The good news, of course, is that the performance of the 
> other *comparable* functions largely improved, often dramatically.  
> But at least in our application, it doesn't come close to making up 
> for the degradation in re.sub().
>
> My by-the-gut bottom line: somebody who really knows the re module 
> should take a deep look at re.sub().  Why would re.sub(), unlike all 
> others, take so much longer to run, while *every* other function in 
> the re module gets (often much) faster?  It feels like there's a bug 
> somewhere in re.sub().
>
> Steve Newcomb
>

--




[issue27898] regexp performance degradation between 2.7.6 and 2.7.12

2016-08-30 Thread Steve Newcomb

Steve Newcomb added the comment:

On 08/30/2016 01:24 PM, Serhiy Storchaka wrote:
> Serhiy Storchaka added the comment:
>
> According to your profile results all re functions are 2.5-4 times faster 
> under 2.7.12 than under 2.7.6. Maybe I am misinterpreting it?
I can't explain the profiler's report.  I'm kind of glad that you, too, 
find it baffling.  Is it possible that the profiler doesn't actually 
work predictably in the multiprocessing context?  If so, one thing I can 
*easily* do is to disable multiprocessing in that code and see what the 
profiler reports are then.  It will take all night, but I'm beginning to 
think it would be worthwhile, because it might point the finger of blame 
at either the multiprocessing module or the re module, but not both at once.

(I originally provided a "disable multiprocessing" capability in that 
code in order to use the Python debugger with it.  It would kind of make 
sense if the profiler had limitations similar to those of the debugger.)
>
> Note that 96-99% of time (2847.099 of 2980.718 seconds under 2.7.6 and 
> 4474.890 of 4519.872 seconds under 2.7.12) is spent in posix.waitpid. The 
> rest of time is larger under 2.7.6 (2980.718 - 2847.099 = 133.619) than under 
> 2.7.12 (4519.872 - 4474.890 = 44.982).
Yeah, I'm beginning to wonder if those strange statistics, too, are 
artifacts of using a single-process profiler in a multiprocessing context.
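
One way to check the hypothesis above is to profile inside each worker rather than in the parent, since cProfile records only the process it runs in and a parent that merely waits on its children would mostly show waiting time. The following is a minimal sketch written for Python 3, not the code from this report: the pattern, the `work` and `profiled_work` names, and the sample inputs are all hypothetical.

```python
import cProfile
import io
import multiprocessing
import pstats
import re

# Hypothetical pattern standing in for the real workload's expressions.
PATTERN = re.compile(r'\d+')

def work(text):
    # The regular-expression work that should show up in the profile.
    return PATTERN.sub('#', text)

def profiled_work(text):
    # Profile the call inside the worker process itself and return the
    # report as text, so the parent can collect one report per task.
    profiler = cProfile.Profile()
    result = profiler.runcall(work, text)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats('cumulative').print_stats(5)
    return result, stream.getvalue()

def main():
    with multiprocessing.Pool(2) as pool:
        for result, report in pool.map(profiled_work, ['a1b22', 'c333']):
            print(result)
            print(report)

if __name__ == '__main__':
    main()
```

Comparing such per-worker reports between interpreter versions would show whether the time goes into re calls or elsewhere, independently of what the parent process spends waiting.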

--




[issue27898] regexp performance degradation between 2.7.6 and 2.7.12

2016-08-30 Thread Steve Newcomb

Steve Newcomb added the comment:

On 08/30/2016 12:46 PM, Raymond Hettinger wrote:
> Raymond Hettinger added the comment:
>
> It would be helpful if you could run "hg bisect" with your set-up to isolate 
> the change that causes the problem.
I don't think I understand you.  There's no difference in the Python 
code we're using in both cases.  The only differences, AFAIK, are in the 
Python interpreter and in the Linux distribution.  I'm not qualified to 
analyze the differences in the latter items.
> Alternatively, make a small set of regular expressions that demonstrate 
> the performance regression.
It will be hard to do that, because the code is so complex, and because 
debugging in the multiprocessing context is so hairy. Still, it's the 
only approach I can think of, too.  Sigh.  I'm thinking about it.

--




[issue27898] regexp performance degradation between 2.7.6 and 2.7.12

2016-08-30 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

According to your profile results all re functions are 2.5-4 times faster under 
2.7.12 than under 2.7.6. Maybe I am misinterpreting it?

Note that 96-99% of time (2847.099 of 2980.718 seconds under 2.7.6 and 4474.890 
of 4519.872 seconds under 2.7.12) is spent in posix.waitpid. The rest of time 
is larger under 2.7.6 (2980.718 - 2847.099 = 133.619) than under 2.7.12 
(4519.872 - 4474.890 = 44.982).
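
The arithmetic above can be reproduced with a few lines of Python. This is only a sketch that re-derives the numbers quoted in this message; the `summarize` helper and the `profiles` dictionary are names introduced here for illustration.

```python
# Figures copied from the profile totals quoted above:
# (total seconds, seconds spent in posix.waitpid).
profiles = {
    '2.7.6': (2980.718, 2847.099),
    '2.7.12': (4519.872, 4474.890),
}

def summarize(total, waitpid):
    # Time spent outside waitpid, and waitpid's share of the total in percent.
    rest = total - waitpid
    share = 100.0 * waitpid / total
    return rest, share

for version, (total, waitpid) in sorted(profiles.items()):
    rest, share = summarize(total, waitpid)
    print('%-7s waitpid %.1f%%  rest %.3f s' % (version, share, rest))
```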

--
nosy: +serhiy.storchaka




[issue27898] regexp performance degradation between 2.7.6 and 2.7.12

2016-08-30 Thread Raymond Hettinger

Raymond Hettinger added the comment:

It would be helpful if you could run "hg bisect" with your set-up to isolate 
the change that causes the problem.  Alternatively, make a small set of regular 
expressions that demonstrate the performance regression.
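
A small reproducer of the kind requested here could look roughly like the sketch below, run unchanged under both interpreters. The pattern and the input text are placeholders, not taken from the reporter's application, and `run_sub` and `benchmark` are hypothetical helper names.

```python
import re
import sys
import timeit

# Placeholder pattern and input; substitute expressions and data from the
# real workload to get a meaningful comparison.
PATTERN = re.compile(r'(\w+)=(\w+)')
TEXT = 'key=value; ' * 1000

def run_sub():
    # Swap the two sides of every key=value pair.
    return PATTERN.sub(r'\2=\1', TEXT)

def benchmark(repeat=5, number=100):
    # Take the best of several repeats to reduce timing noise.
    return min(timeit.repeat(run_sub, repeat=repeat, number=number))

if __name__ == '__main__':
    print('%s: best run %.4f s' % (sys.version.split()[0], benchmark()))
```

Running the same file under each interpreter and comparing the printed times would isolate re.sub() from multiprocessing and from the rest of the application.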

--
nosy: +rhettinger




[issue27898] regexp performance degradation between 2.7.6 and 2.7.12

2016-08-30 Thread Steve Newcomb

New submission from Steve Newcomb:

Our most regular-expression-processing-intensive Python 2.7 code takes 2.5x 
more execution time in 2.7.12 than it did in 2.7.6.  I discovered this after 
upgrading from Ubuntu 14.04 to Ubuntu 16.04.  Basically this code runs 
thousands of compiled regular expressions on thousands of texts.  Both the 
multiprocessing module and the re module are heavily used.

See attached profiler outputs, which look quite different in several respects.  
I used the profiling module to profile the same Python code, processing the 
same data, using the same hardware, under both Ubuntu 14.04 (Python 2.7.6) and 
Ubuntu 16.04 (Python 2.7.12).  

It is striking, for example, that cPickle.load appears so prominently in the 
2.7.12 profile -- a fact which appears to implicate the multiprocessing module 
somehow.  But I suspect that the re module is more likely the main source of 
the problem, because the execution times of other production steps -- steps 
that do not call the multiprocessing module -- also appear to be extended to a 
degree that is roughly proportional to the amount of regular expression 
processing done in those other steps.

I will happily provide any further information I can.  Any insights about this 
surprisingly severe performance degradation would be welcome.

--
files: profiles_2.7.6_vs_2.7.12
messages: 273932
nosy: steve.newcomb
priority: normal
severity: normal
status: open
title: regexp performance degradation between 2.7.6 and 2.7.12
Added file: http://bugs.python.org/file44277/profiles_2.7.6_vs_2.7.12

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue27898>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



RegExp help

2016-02-10 Thread Larry Martell
Given this string:

>>> s = """|Type=Foo
... |Side=Left"""
>>> print s
|Type=Foo
|Side=Left

I can match with this:

>>> m = re.search(r'^\|Type=(.*)$\n^\|Side=(.*)$',s,re.MULTILINE)
>>> print m.group(0)
|Type=Foo
|Side=Left
>>> print m.group(1)
Foo
>>> print m.group(2)
Left

But when I try to substitute, it doesn't work:

>>> rn = re.sub(r'^\|Type=(.*)$^\|Side=(.*)$', r'|Side Type=\2 \1',s,re.MULTILINE)
>>> print rn
|Type=Foo
|Side=Left

What very stupid thing am I doing wrong?
-- 
https://mail.python.org/mailman/listinfo/python-list
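
Two things differ between the working search and the failing substitution above: the substitution pattern has no `\n` between the two lines, and the fourth positional argument of re.sub() is `count`, not `flags`, so re.MULTILINE is taken as a replacement count. A sketch of one possible fix, written for Python 3 and assuming the intended output is a single combined line:

```python
import re

s = "|Type=Foo\n|Side=Left"

# Same pattern as the working re.search() call, including the newline.
pattern = r'^\|Type=(.*)$\n^\|Side=(.*)$'

# Pass the flag by keyword so it is not interpreted as count.
rn = re.sub(pattern, r'|Side Type=\2 \1', s, flags=re.MULTILINE)
print(rn)  # prints: |Side Type=Left Foo
```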

