subject:"RegEx question"

Re: a regex question

2019-10-25 Thread dieter

Maggie Q Roth  writes:
> There are two primary types of lines in the log:
>
> 60.191.38.xx/
> 42.120.161.xx   /archives/1005
>
> I know how to write regex to match each line, but don't get the good result
> with one regex to match both lines.
>
> Can you help?

When I look at these lines, I see 2 fields separated by whitespace
(note that two example lines are far too few to guess the
proper pattern). I would not use a regular expression
in this case, but the `split` string method.

A regular expression for this pattern could be `(\S+)\s+(.*)`, which reads:
a non-empty sequence of non-whitespace characters (assigned to group 1),
whitespace, then any sequence (assigned to group 2).
(Note that the regular expression above is given at the
regex level; the string in your Python code may look slightly different.)
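
To make the comparison concrete, here is a minimal sketch of both approaches. The sample line comes from the question; the variable names are made up for illustration, and writing the pattern as a raw string is an assumption about how it would appear in Python source.

```python
import re

# Sample line from the question: an address, whitespace, then a path.
line = "42.120.161.xx   /archives/1005"

# Approach 1: str.split with no separator splits on runs of whitespace;
# maxsplit=1 keeps everything after the first gap as a single field.
fields = line.split(None, 1)

# Approach 2: the pattern from the reply, written as a raw string.
match = re.match(r"(\S+)\s+(.*)", line)
groups = match.groups() if match else None

print(fields)
print(groups)
```

If a line contains no whitespace at all, `split` returns a one-element list and the pattern does not match, so either way the caller has to decide how to treat such lines.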

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: a regex question

2019-10-25 Thread Antoon Pardon

On 25/10/19 12:22, Maggie Q Roth wrote:
> Hello
>
> There are two primary types of lines in the log:
>
> 60.191.38.xx/
> 42.120.161.xx   /archives/1005
>
> I know how to write regex to match each line, but don't get the good result
> with one regex to match both lines.

Could you provide the regexes that you have for each line?

-- 
Antoon.
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: a regex question

2019-10-25 Thread Brian Oney via Python-list




On October 25, 2019 12:22:44 PM GMT+02:00, Maggie Q Roth  
wrote:
>Hello
>
>There are two primary types of lines in the log:
>
>60.191.38.xx/
>42.120.161.xx   /archives/1005
>
>I know how to write regex to match each line, but don't get the good
>result
>with one regex to match both lines.

What is a good result?

There is an re.MULTILINE flag. Did you try that? What does that do?
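
As a rough illustration of what that flag changes: with re.MULTILINE, `^` and `$` match at every line boundary rather than only at the ends of the whole string. The sketch below assumes each log line has two fields separated by spaces or tabs; the log text and the pattern are illustrative, not taken from the original poster's code.

```python
import re

# Assumption: every line is "address<spaces or tabs>path".
log = "60.191.38.xx /\n42.120.161.xx   /archives/1005\n"

# [ \t]+ is used instead of \s+ so the separator cannot swallow a newline.
line_pattern = r"^(\S+)[ \t]+(\S+)$"

# Without the flag, ^ only matches at the very start of the string.
without_flag = re.findall(line_pattern, log)

# With the flag, ^ and $ also match at each line start and end.
with_flag = re.findall(line_pattern, log, re.MULTILINE)

print(without_flag)
print(with_flag)
```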

-- 
https://mail.python.org/mailman/listinfo/python-list

a regex question

2019-10-25 Thread Maggie Q Roth

Hello

There are two primary types of lines in the log:

60.191.38.xx/
42.120.161.xx   /archives/1005

I know how to write regex to match each line, but don't get the good result
with one regex to match both lines.

Can you help?

Thanks,
Maggie
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Regex Question

2012-08-18 Thread Frank Koshti

On Aug 18, 12:22 pm, Jussi Piitulainen 
wrote:
> Frank Koshti writes:
> > not always placed in HTML, and even in HTML, they may appear in
> > strange places, such as Hello. My specific issue
> > is I need to match, process and replace $foo(x=3), knowing that
> > (x=3) is optional, and the token might appear simply as $foo.
>
> > To do this, I decided to use:
>
> > re.compile('\$\w*\(?.*?\)').findall(mystring)
>
> > the issue with this is it doesn't match $foo by itself, and requires
> > there to be () at the end.
>
> Adding a ? after the meant-to-be-optional expression would let the
> regex engine know what you want. You can also separate the mandatory
> and the optional part in the regex to receive pairs as matches. The
> test program below prints this:
>
> >$foo()$foo(bar=3)$$$foo($)$foo($bar(v=0))etc
> ('$foo', '')
> ('$foo', '(bar=3)')
> ('$foo', '($)')
> ('$foo', '')
> ('$bar', '(v=0)')
>
> Here is the program:
>
> import re
>
> def grab(text):
>     p = re.compile(r'([$]\w+)([(][^()]+[)])?')
>     return re.findall(p, text)
>
> def test(html):
>     print(html)
>     for hit in grab(html):
>         print(hit)
>
> if __name__ == '__main__':
>     test('>$foo()$foo(bar=3)$$$foo($)$foo($bar(v=0))etc')
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Regex Question

2012-08-18 Thread python

Steven,

Well done!!!

Regards,
Malcolm
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Regex Question

2012-08-18 Thread Jussi Piitulainen

Frank Koshti writes:

> not always placed in HTML, and even in HTML, they may appear in
> strange places, such as Hello. My specific issue
> is I need to match, process and replace $foo(x=3), knowing that
> (x=3) is optional, and the token might appear simply as $foo.
> 
> To do this, I decided to use:
> 
> re.compile('\$\w*\(?.*?\)').findall(mystring)
> 
> the issue with this is it doesn't match $foo by itself, and requires
> there to be () at the end.

Adding a ? after the meant-to-be-optional expression would let the
regex engine know what you want. You can also separate the mandatory
and the optional part in the regex to receive pairs as matches. The
test program below prints this:

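>$foo()$foo(bar=3)$$$foo($)$foo($bar(v=0))etc
('$foo', '')
('$foo', '(bar=3)')
('$foo', '($)')
('$foo', '')
('$bar', '(v=0)')

Here is the program:

import re

def grab(text):
    p = re.compile(r'([$]\w+)([(][^()]+[)])?')
    return re.findall(p, text)

def test(html):
    print(html)
    for hit in grab(html):
        print(hit)

if __name__ == '__main__':
    test('>$foo()$foo(bar=3)$$$foo($)$foo($bar(v=0))etc')
-- 
http://mail.python.org/mailman/listinfo/python-list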

Re: Regex Question

2012-08-18 Thread Frank Koshti

On Aug 18, 11:48 am, Peter Otten <__pete...@web.de> wrote:
> Frank Koshti wrote:
> > I need to match, process and replace $foo(x=3), knowing that (x=3) is
> > optional, and the token might appear simply as $foo.
>
> > To do this, I decided to use:
>
> > re.compile('\$\w*\(?.*?\)').findall(mystring)
>
> > the issue with this is it doesn't match $foo by itself, and requires
> > there to be () at the end.
>
> >>> s = """
> ... $foo1
> ... $foo2()
> ... $foo3(anything could go here)
> ... """
> >>> re.compile("(\$\w+(?:\(.*?\))?)").findall(s)
> ['$foo1', '$foo2()', '$foo3(anything could go here)']

PERFECT-
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Regex Question

2012-08-18 Thread Vlastimil Brom

2012/8/18 Frank Koshti :
> Hey Steven,
>
> Thank you for the detailed (and well-written) tutorial on this very
> issue. I actually learned a few things! Though, I still have
> unresolved questions.
>
> The reason I don't want to use an XML parser is because the tokens are
> not always placed in HTML, and even in HTML, they may appear in
> strange places, such as Hello. My specific issue is
> I need to match, process and replace $foo(x=3), knowing that (x=3) is
> optional, and the token might appear simply as $foo.
>
> To do this, I decided to use:
>
> re.compile('\$\w*\(?.*?\)').findall(mystring)
>
> the issue with this is it doesn't match $foo by itself, and requires
> there to be () at the end.
>
> Thanks,
> Frank
> --
> http://mail.python.org/mailman/listinfo/python-list

Hi,
Although I don't quite get the pattern you are using (with respect to
the specified task), you most likely need raw string syntax for the
pattern, e.g.: r"...", instead of "...", or you have to double all
backslashes (which should be escaped), i.e. \\w etc.

I am likely misunderstanding the specification, as the following:
>>> re.sub(r"\$foo\(x=3\)", "bar", "Hello")
'Hello'
>>>
is probably not the desired output.

For some kind of "processing" the matched text, you can use the
replace function instead of the replace pattern in re.sub too.
see
http://docs.python.org/library/re.html#re.sub
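
A minimal sketch of that idea: re.sub accepts a function as the replacement, calls it with each match object, and inserts whatever string it returns. The token syntax follows the question ($name with optional parenthesised arguments); the `expand` function and its angle-bracket output format are purely hypothetical stand-ins for the real processing.

```python
import re

# $name, optionally followed by (arguments) that contain no parentheses.
TOKEN = re.compile(r"\$(\w+)(?:\(([^()]*)\))?")

def expand(match):
    """Hypothetical processing step: build the replacement text for one token."""
    name, args = match.group(1), match.group(2)
    if args is None:  # the optional "(...)" part was absent
        return "<{}>".format(name)
    return "<{}:{}>".format(name, args)

result = TOKEN.sub(expand, "Hello $foo and $foo(x=3)!")
print(result)
```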

hth,
  vbr
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Regex Question

2012-08-18 Thread Peter Otten

Frank Koshti wrote:

> I need to match, process and replace $foo(x=3), knowing that (x=3) is
> optional, and the token might appear simply as $foo.
> 
> To do this, I decided to use:
> 
> re.compile('\$\w*\(?.*?\)').findall(mystring)
> 
> the issue with this is it doesn't match $foo by itself, and requires
> there to be () at the end.

>>> s = """
... $foo1
... $foo2()
... $foo3(anything could go here)
... """
>>> re.compile("(\$\w+(?:\(.*?\))?)").findall(s)
['$foo1', '$foo2()', '$foo3(anything could go here)']
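
To see why the pattern from the question misbehaves, it may help to run both patterns on the same input. This is only a sketch: the one-line sample string is made up, and the second pattern is the one above written as a raw string without the outer group.

```python
import re

sample = "$foo1 $foo2() $foo3(x)"

# Pattern from the question: the closing ")" is mandatory, so the lazy
# ".*?" keeps expanding until it finds one, swallowing the next token.
original_hits = re.findall(r"\$\w*\(?.*?\)", sample)

# The whole "(...)" part is wrapped in an optional non-capturing group.
optional_hits = re.findall(r"\$\w+(?:\(.*?\))?", sample)

print(original_hits)
print(optional_hits)
```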


-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Regex Question

2012-08-18 Thread Frank Koshti

Hey Steven,

Thank you for the detailed (and well-written) tutorial on this very
issue. I actually learned a few things! Though, I still have
unresolved questions.

The reason I don't want to use an XML parser is because the tokens are
not always placed in HTML, and even in HTML, they may appear in
strange places, such as Hello. My specific issue is
I need to match, process and replace $foo(x=3), knowing that (x=3) is
optional, and the token might appear simply as $foo.

To do this, I decided to use:

re.compile('\$\w*\(?.*?\)').findall(mystring)

the issue with this is it doesn't match $foo by itself, and requires
there to be () at the end.

Thanks,
Frank
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Regex Question

2012-08-18 Thread Steven D'Aprano

On Fri, 17 Aug 2012 21:41:07 -0700, Frank Koshti wrote:

> Hi,
> 
> I'm new to regular expressions. I want to be able to match for tokens
> with all their properties in the following examples. I would appreciate
> some direction on how to proceed.

Others have already given you excellent advice to NOT use regular 
expressions to parse HTML files, but to use a proper HTML parser instead.

However, since I remember how hard it was to get started with regexes, 
I'm going to ignore that advice and show you how to abuse regexes to 
search for text, and pretend that they aren't HTML tags.

Here's your string you want to search for:

> <h1>@foo1</h1>

You want to find a piece of text that starts with "<h1>@", followed by 
any alphanumeric characters, followed by "</h1>".

We start by compiling a regex:

import re
pattern = r"<h1>@\w+</h1>"
regex = re.compile(pattern, re.I)

First we import the re module. Then we define a pattern string. Note that 
I use a "raw string" instead of a regular string -- this is not 
compulsory, but it is very common.

The difference between a raw string and a regular string is how they 
handle backslashes. In Python, some (but not all!) backslashes are 
special. For example, the regular string "\n" is not two characters, 
backslash-n, but a single character, Newline. The Python string parser 
converts backslash combinations as special characters, e.g.:

\n => newline
\t => tab
\0 => ASCII Null character
\\ => a single backslash
etc.

We often call these "backslash escapes".

Regular expressions use a lot of backslashes, and so it is useful to 
disable the interpretation of backslash escapes when writing regex 
patterns. We do that with a "raw string" -- if you prefix the string with 
the letter r, the string is raw and backslash-escapes are ignored:

# ordinary "cooked" string:
"abc\n" => a b c newline

# raw string
r"abc\n" => a b c backslash n
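
A quick way to check the difference yourself (a small sketch; the variable names are arbitrary):

```python
cooked = "abc\n"   # ordinary string: \n is a single newline character
raw = r"abc\n"     # raw string: backslash and n stay as two characters

print(len(cooked))  # four characters
print(len(raw))     # five characters
print(repr(cooked), repr(raw))
```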

Here is our pattern again:

pattern = r"<h1>@\w+</h1>"

which is thirteen characters:

less-than h 1 greater-than at-sign backslash w plus-sign less-than slash 
h 1 greater-than

Most of the characters shown just match themselves. For example, the @ 
sign will only match another @ sign. But some have special meaning to the 
regex:

\w doesn't match "backslash w", but any alphanumeric character (or an underscore);

+ doesn't match a plus sign, but tells the regex to match the previous 
symbol one or more times. Since it immediately follows \w, this means 
"match at least one alphanumeric character".

Now we feed that string into the re.compile, to create a pre-compiled 
regex. (This step is optional: any function which takes a compiled regex 
will also accept a string pattern. But pre-compiling regexes which you 
are going to use repeatedly is a good idea.)

regex = re.compile(pattern, re.I)

The second argument to re.compile is a flag, re.I which is a special 
value that tells the regular expression to ignore case, so "h" will match 
both "h" and "H".

Now on to use the regex. Here's a bunch of text to search:

text = """Now is the time for all good men blah blah blah spam
and more text here blah blah blah
and some more <h1>@victory</h1> blah blah blah"""

And we search it this way:

mo = re.search(regex, text)

"mo" stands for "Match Object", which is returned if the regular 
expression finds something that matches your pattern. If nothing matches, 
then None is returned instead.

if mo is not None:
    print(mo.group(0))

=> prints <h1>@victory</h1>

So far so good. But we can do better. In this case, we don't really care 
about the tags <h1>, we only care about the "victory" part. Here's how to 
use grouping to extract substrings from the regex:

pattern = r"<h1>@(\w+)</h1>"  # notice the round brackets ()
regex = re.compile(pattern, re.I)
mo = re.search(regex, text)
if mo is not None:
    print(mo.group(0))
    print(mo.group(1))

This prints:

<h1>@victory</h1>
victory
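
For convenience, here are the steps of this walkthrough gathered into one runnable sketch. It assumes the pattern is the thirteen-character one spelled out above (an h1 tag around "@" plus a word); the names `whole` and `name` are introduced here only for illustration.

```python
import re

text = """Now is the time for all good men blah blah blah spam
and more text here blah blah blah
and some more <h1>@victory</h1> blah blah blah"""

# The round brackets capture just the word after the @ sign.
pattern = r"<h1>@(\w+)</h1>"
regex = re.compile(pattern, re.I)  # re.I: ignore case, so <H1> would match too

mo = regex.search(text)
whole = mo.group(0) if mo is not None else None  # the entire match
name = mo.group(1) if mo is not None else None   # the first group only

print(whole)
print(name)
```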

Hope this helps.

-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Regex Question

2012-08-18 Thread Frank Koshti

I think the point was missed. I don't want to use an XML parser. The
point is to pick up those tokens, and yes I've done my share of RTFM.
This is what I've come up with:

'\$\w*\(?.*?\)'

Which doesn't work well on the above example, which is partly why I
reached out to the group. Can anyone help me with the regex?

Thanks,
Frank
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Regex Question

2012-08-18 Thread Roy Smith

In article 
<385e732e-1c02-4dd0-ab12-b92890bbe...@o3g2000yqp.googlegroups.com>,
 Frank Koshti  wrote:

> I'm new to regular expressions. I want to be able to match for tokens
> with all their properties in the following examples. I would
> appreciate some direction on how to proceed.
> 
> 
> @foo1
> @foo2()
> @foo3(anything could go here)

Don't try to parse HTML with regexes.  Use a real HTML parser, such as 
lxml (http://lxml.de/).
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Regex Question

2012-08-18 Thread Mark Lawrence


On 18/08/2012 06:42, Chris Angelico wrote:
> On Sat, Aug 18, 2012 at 2:41 PM, Frank Koshti  wrote:
>> Hi,
>>
>> I'm new to regular expressions. I want to be able to match for tokens
>> with all their properties in the following examples. I would
>> appreciate some direction on how to proceed.
>>
>> @foo1
>> @foo2()
>> @foo3(anything could go here)
>
> You can find regular expression primers all over the internet - fire
> up your favorite search engine and type those three words in. But it
> may be that what you want here is a more flexible parser; have you
> looked at BeautifulSoup (so rich and green)?
>
> ChrisA



Totally agree with the sentiment.  There's a comparison of python 
parsers here http://nedbatchelder.com/text/python-parsers.html


--
Cheers.

Mark Lawrence.

--
http://mail.python.org/mailman/listinfo/python-list

Re: Regex Question

2012-08-17 Thread Chris Angelico

On Sat, Aug 18, 2012 at 2:41 PM, Frank Koshti  wrote:
> Hi,
>
> I'm new to regular expressions. I want to be able to match for tokens
> with all their properties in the following examples. I would
> appreciate some direction on how to proceed.
>
>
> @foo1
> @foo2()
> @foo3(anything could go here)

You can find regular expression primers all over the internet - fire
up your favorite search engine and type those three words in. But it
may be that what you want here is a more flexible parser; have you
looked at BeautifulSoup (so rich and green)?

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list

Regex Question

2012-08-17 Thread Frank Koshti

Hi,

I'm new to regular expressions. I want to be able to match for tokens
with all their properties in the following examples. I would
appreciate some direction on how to proceed.


@foo1
@foo2()
@foo3(anything could go here)


Thanks-
Frank
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2011-07-29 Thread Thomas Jollans

On 29/07/11 19:52, Rustom Mody wrote:
> MRAB wrote:
> > findall returns a list of tuples (what the groups captured) if there
> > is more than 1 group,
> > or a list of strings (what the group captured) if there is 1 group,
> > or a list of
> > strings (what the regex matched) if there are no groups.
>
> Thanks.
> It would be good to put this in the manual, don't you think?

It is in the manual.

> Also, the manual says in the 'match' section
>
> "Note If you want to locate a match anywhere in /string/, use search()
> instead."
>
> to guard against users using match when they should be using search.
>
> Likewise it would be helpful if the manual also said (in the
> match,search sections)
> "If more than one match/search is required use findall"

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2011-07-29 Thread Rustom Mody

MRAB wrote:
> findall returns a list of tuples (what the groups captured) if there is
> more than 1 group,
> or a list of strings (what the group captured) if there is 1 group, or a
> list of
> strings (what the regex matched) if there are no groups.

Thanks.
It would be good to put this in the manual, don't you think?

Also, the manual says in the 'match' section

"Note If you want to locate a match anywhere in *string*, use search() instead."

to guard against users using match when they should be using search.

Likewise it would be helpful if the manual also said (in the match,search
sections)
"If more than one match/search is required use findall"
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2011-07-29 Thread MRAB


On 29/07/2011 16:45, Thomas Jollans wrote:
> On 29/07/11 16:53, rusi wrote:
>> Can someone throw some light on this anomalous behavior?
>>
>> >>> import re
>> >>> r = re.search('a(b+)', 'ababbaaab')
>> >>> r.group(1)
>> 'b'
>> >>> r.group(0)
>> 'ab'
>> >>> r.group(2)
>> Traceback (most recent call last):
>>   File "", line 1, in
>> IndexError: no such group
>>
>> >>> re.findall('a(b+)', 'ababbaaab')
>> ['b', 'bb', 'b']
>>
>> So evidently group counts by number of '()'s and not by number of
>> matches (and this is the case whether one uses match or search). So
>> then whats the point of search-ing vs match-ing?
>>
>> Or equivalently how to move to the groups of the next match in?
>>
>> [Side note: The docstrings for this really suck:
>>
>> >>> help(r.group)
>> Help on built-in function group:
>>
>> group(...)
>
> Pretty standard regex behaviour: Group 1 is the first pair of brackets.
> Group 2 is the second, etc. pp. Group 0 is the whole match.
> The difference between matching and searching is that match assumes that
> the start of the regex coincides with the start of the string (and this
> is documented in the library docs IIRC). re.match(exp, s) is equivalent
> to re.search('^'+exp, s). (if not exp.startswith('^'))
>
> Apparently, findall() returns the content of the first group if there is
> one. I didn't check this, but I assume it is documented.

findall returns a list of tuples (what the groups captured) if there is
more than 1 group, or a list of strings (what the group captured) if
there is 1 group, or a list of strings (what the regex matched) if
there are no groups.
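
The three cases can be checked directly against the string from the original question. This is a small sketch; only the first and third patterns are new, chosen to have zero and two groups.

```python
import re

s = "ababbaaab"

# No groups: a list of the full matches.
no_groups = re.findall(r"ab+", s)

# One group: a list of what that group captured.
one_group = re.findall(r"a(b+)", s)

# Two groups: a list of tuples, one tuple per match.
two_groups = re.findall(r"(a)(b+)", s)

print(no_groups)
print(one_group)
print(two_groups)
```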
--
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2011-07-29 Thread Thomas Jollans

On 29/07/11 16:53, rusi wrote:
> Can someone throw some light on this anomalous behavior?
>
> >>> import re
> >>> r = re.search('a(b+)', 'ababbaaab')
> >>> r.group(1)
> 'b'
> >>> r.group(0)
> 'ab'
> >>> r.group(2)
> Traceback (most recent call last):
>   File "", line 1, in 
> IndexError: no such group
>
> >>> re.findall('a(b+)', 'ababbaaab')
> ['b', 'bb', 'b']
>
> So evidently group counts by number of '()'s and not by number of
> matches (and this is the case whether one uses match or search). So
> then whats the point of search-ing vs match-ing?
>
> Or equivalently how to move to the groups of the next match in?
>
> [Side note: The docstrings for this really suck:
>
> >>> help(r.group)
> Help on built-in function group:
>
> group(...)
>

Pretty standard regex behaviour: Group 1 is the first pair of brackets.
Group 2 is the second, etc. pp. Group 0 is the whole match.
The difference between matching and searching is that match assumes that
the start of the regex coincides with the start of the string (and this
is documented in the library docs IIRC). re.match(exp, s) is equivalent
to re.search('^'+exp, s). (if not exp.startswith('^'))

Apparently, findall() returns the content of the first group if there is
one. I didn't check this, but I assume it is documented.
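
As a sketch of the match/search difference, and of one way to reach the groups of every later match: re.finditer yields a match object per match, each with its own group() and span(). The pattern `b+` is chosen here just to show match() failing at the start of the string.

```python
import re

s = "ababbaaab"

# match() only succeeds if the pattern matches at position 0.
at_start = re.match(r"b+", s)

# search() scans forward and returns the first match anywhere.
anywhere = re.search(r"b+", s)

# finditer() gives one match object per match, so the groups of the
# second, third, ... match are all reachable.
hits = [(m.group(0), m.group(1), m.span()) for m in re.finditer(r"a(b+)", s)]

print(at_start)
print(anywhere.group(), anywhere.span())
print(hits)
```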

 - Thomas
-- 
http://mail.python.org/mailman/listinfo/python-list

regex question

2011-07-29 Thread rusi

Can someone throw some light on this anomalous behavior?

>>> import re
>>> r = re.search('a(b+)', 'ababbaaab')

>>> r.group(1)
'b'
>>> r.group(0)
'ab'
>>> r.group(2)
Traceback (most recent call last):
  File "", line 1, in 
IndexError: no such group

>>> re.findall('a(b+)', 'ababbaaab')
['b', 'bb', 'b']

So evidently group counts by number of '()'s and not by number of
matches (and this is the case whether one uses match or search). So
then whats the point of search-ing vs match-ing?

Or equivalently how to move to the groups of the next match in?

[Side note: The docstrings for this really suck:

>>> help(r.group)
Help on built-in function group:

group(...)

>>>
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: regex question on .findall and \b

2009-07-06 Thread Ethan Furman

Many thanks to all who replied!  And, yes, I will *definitely* use raw 
strings from now on.  :)


~Ethan~
--
http://mail.python.org/mailman/listinfo/python-list

Re: regex question on .findall and \b

2009-07-02 Thread Ethan Furman


Ethan Furman wrote:
> Greetings!
>
> My closest to successful attempt:
>
> Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit (Intel)]
> Type "copyright", "credits" or "license" for more information.
>
> IPython 0.9.1 -- An enhanced Interactive Python.
>
>   In [161]: re.findall('\d+','this is test a3 attempt 79')
>   Out[161]: ['3', '79']
>
> What I really want is just the 79, as a3 is not a decimal number, but
> when I add the \b word boundaries I get:
>
>   In [162]: re.findall('\b\d+\b','this is test a3 attempt 79')
>   Out[162]: []
>
> What am I missing?
>
> ~Ethan~



ARGH!!

Okay, I need two \\ so I'm not trying to match a backspace.  I knew 
(okay, hoped ;) I would figure it out once I posted the question and 
moved on.


*sheepish grin*

--
http://mail.python.org/mailman/listinfo/python-list

Re: regex question on .findall and \b

2009-07-02 Thread Nobody

On Thu, 02 Jul 2009 09:38:56 -0700, Ethan Furman wrote:

> Greetings!
> 
> My closest to successful attempt:
> 
> Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit (Intel)]
> Type "copyright", "credits" or "license" for more information.
> 
> IPython 0.9.1 -- An enhanced Interactive Python.
> 
>In [161]: re.findall('\d+','this is test a3 attempt 79')
>Out[161]: ['3', '79']
> 
> What I really want is just the 79, as a3 is not a decimal number, but 
> when I add the \b word boundaries I get:
> 
>In [162]: re.findall('\b\d+\b','this is test a3 attempt 79')
>Out[162]: []
> 
> What am I missing?

You need to use a raw string (r'...') to prevent \b from being interpreted
as a backspace:

re.findall(r'\b\d+\b','this is test a3 attempt 79')

\d isn't a recognised escape sequence, so it doesn't get interpreted:

> print '\b'
   ^H
> print '\d'
\d
> print r'\b'
\b

Try to get into the habit of using raw strings for regexps.
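
The effect can be reproduced without the interactive session. In this sketch the "cooked" pattern is spelled with an explicit `\\d` so that it contains exactly the characters the original non-raw literal produced (a backspace on each side) without relying on an unrecognised escape.

```python
import re

target = "this is test a3 attempt 79"

# In an ordinary string, \b is a backspace character (0x08).
cooked_pattern = "\b\\d+\b"
# In a raw string, \b reaches the regex engine as a word boundary.
raw_pattern = r"\b\d+\b"

cooked_hits = re.findall(cooked_pattern, target)
raw_hits = re.findall(raw_pattern, target)

print(repr(cooked_pattern), cooked_hits)
print(repr(raw_pattern), raw_hits)
```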

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: regex question on .findall and \b

2009-07-02 Thread Sjoerd Mullender


On 2009-07-02 18:38, Ethan Furman wrote:
> Greetings!
>
> My closest to successful attempt:
>
> Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit
> (Intel)]
> Type "copyright", "credits" or "license" for more information.
>
> IPython 0.9.1 -- An enhanced Interactive Python.
>
> In [161]: re.findall('\d+','this is test a3 attempt 79')
> Out[161]: ['3', '79']
>
> What I really want is just the 79, as a3 is not a decimal number, but
> when I add the \b word boundaries I get:
>
> In [162]: re.findall('\b\d+\b','this is test a3 attempt 79')
> Out[162]: []
>
> What am I missing?
>
> ~Ethan~


Try this:
>>> re.findall(r'\b\d+\b','this is test a3 attempt 79')
['79']

In an ordinary string, \b is a backspace; by using raw strings you get
an actual backslash and b.


--
Sjoerd Mullender
--
http://mail.python.org/mailman/listinfo/python-list

Re: regex question on .findall and \b

2009-07-02 Thread Tim Chase

Ethan Furman wrote:
> Greetings!
>
> My closest to successful attempt:
>
> Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit (Intel)]
> Type "copyright", "credits" or "license" for more information.
>
> IPython 0.9.1 -- An enhanced Interactive Python.
>
>    In [161]: re.findall('\d+','this is test a3 attempt 79')
>    Out[161]: ['3', '79']
>
> What I really want is just the 79, as a3 is not a decimal number, but
> when I add the \b word boundaries I get:
>
>    In [162]: re.findall('\b\d+\b','this is test a3 attempt 79')
>    Out[162]: []
>
> What am I missing?

The sneaky detail that the regexp should be in a raw string 
(always a good practice), not a cooked string:

  r'\b\d+\b'

The "\d" isn't a valid character-expansion, so python leaves it 
alone.  However, I believe the "\b" is a control character, so 
your actual string ends up something like:

  >>> print repr('\b\d+\b')
  '\x08\\d+\x08'
  >>> print repr(r'\b\d+\b')
  '\\b\\d+\\b'

the first of which doesn't match your target string, as you might 
imagine.

-tkc

--
http://mail.python.org/mailman/listinfo/python-list

regex question on .findall and \b

2009-07-02 Thread Ethan Furman


Greetings!

My closest to successful attempt:

Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit (Intel)]
Type "copyright", "credits" or "license" for more information.

IPython 0.9.1 -- An enhanced Interactive Python.

  In [161]: re.findall('\d+','this is test a3 attempt 79')
  Out[161]: ['3', '79']

What I really want is just the 79, as a3 is not a decimal number, but 
when I add the \b word boundaries I get:


  In [162]: re.findall('\b\d+\b','this is test a3 attempt 79')
  Out[162]: []

What am I missing?

~Ethan~
--
http://mail.python.org/mailman/listinfo/python-list

Re: Python Regex Question

2008-10-29 Thread Terry Reedy


MalteseUnderdog wrote:
> Hi there I just started python (but this question isn't that trivial
> since I couldn't find it in google :) )
>
> I have the following text file entries (simplified)
>
> start  #frag 1 start
> x=Dog # frag 1 end
> stop
> start# frag 2 start
> x=Cat # frag 2 end
> stop
> start #frag 3 start
> x=Dog #frag 3 end
> stop
>
> I need a regex expression which returns the start to the x=ANIMAL for
> only the x=Dog fragments so all my entries should be start ...
> (something here) ... x=Dog .  So I am really interested in fragments 1
> and 3 only.


As I understand the above, I would first write a generator that separates
the file into fragments and yields them one at a time.  Perhaps something like:


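def fragments(ifile):
    frag = ''
    for line in ifile:
        frag += line
        # assumption: the original test was lost; a 'stop' line ends a fragment
        if line.startswith('stop'):
            yield frag
            frag = ''

Then I would iterate through fragments, testing for the ones I want:

for frag in fragments(somefile):
    if 'x=Dog' in frag:
        pass  # process the fragment here (the original action was lost)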
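
A runnable sketch of this generator approach on the sample data from the question. Two things are assumptions rather than part of the original reply: that a line starting with "stop" ends a fragment, and that a fragment counts as a Dog fragment when one of its lines starts with "x=Dog".

```python
def fragments(lines):
    """Yield each fragment as a list of lines, ending at a 'stop' line."""
    frag = []
    for line in lines:
        frag.append(line)
        if line.startswith("stop"):  # assumed end-of-fragment marker
            yield frag
            frag = []

sample = """start  #frag 1 start
x=Dog # frag 1 end
stop
start# frag 2 start
x=Cat # frag 2 end
stop
start #frag 3 start
x=Dog #frag 3 end
stop
"""

# Keep only the fragments that contain an x=Dog line.
dog_frags = [
    frag for frag in fragments(sample.splitlines())
    if any(line.startswith("x=Dog") for line in frag)
]

for frag in dog_frags:
    print(frag)
```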


Terry Jan Reedy

--
http://mail.python.org/mailman/listinfo/python-list

Re: Python Regex Question

2008-10-29 Thread Arnaud Delobelle

On Oct 29, 7:01 pm, Tim Chase <[EMAIL PROTECTED]> wrote:
> > I need a regex expression which returns the start to the x=ANIMAL for
> > only the x=Dog fragments so all my entries should be start ...
> > (something here) ... x=Dog .  So I am really interested in fragments 1
> > and 3 only.
>
> > My idea (primitive) ^start.*?x=Dog doesn't work because clearly it
> > would return results
>
> > start
> > x=Dog  # (good)
>
> > and
>
> > start
> > x=Cat
> > stop
> > start
> > x=Dog # bad since I only want start ... x=Dog portion
>
> Looks like the following does the trick:
>
>  >>> s = """start      #frag 1 start
> ... x=Dog # frag 1 end
> ... stop
> ... start    # frag 2 start
> ... x=Cat # frag 2 end
> ... stop
> ... start     #frag 3 start
> ... x=Dog #frag 3 end
> ... stop"""
>  >>> import re
>  >>> r = re.compile(r'^start.*\nx=Dog.*\nstop.*', re.MULTILINE)
>  >>> for i, result in enumerate(r.findall(s)):
> ...     print i, repr(result)
> ...
> 0 'start      #frag 1 start\nx=Dog # frag 1 end\nstop'
> 1 'start     #frag 3 start\nx=Dog #frag 3 end\nstop'
>
> -tkc

This will only work if 'x=Dog' directly follows 'start' (which happens
in the given example).  If that's not necessarily the case, I would do
it in two steps (in fact I wouldn't use regexps probably but...):

>>> for chunk in re.split(r'\nstop', data):
...     m = re.search('^start.*^x=Dog', chunk, re.DOTALL | re.MULTILINE)
...     if m: print repr(m.group())
...
'start  #frag 1 start \nx=Dog'
'start #frag 3 start \nx=Dog'
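
The same two-step idea as a self-contained Python 3 sketch. The `data` string is the sample from the question, typed in here as an assumption about what the session above used.

```python
import re

data = """start  #frag 1 start
x=Dog # frag 1 end
stop
start# frag 2 start
x=Cat # frag 2 end
stop
start #frag 3 start
x=Dog #frag 3 end
stop"""

results = []
# Step 1: cut the text into chunks at each "stop" line.
for chunk in re.split(r"\nstop", data):
    # Step 2: within one chunk, look for "start ... x=Dog".
    # DOTALL lets "." cross newlines; MULTILINE lets "^" match at line starts.
    m = re.search(r"^start.*^x=Dog", chunk, re.DOTALL | re.MULTILINE)
    if m:
        results.append(m.group())

for r in results:
    print(repr(r))
```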

--
Arnaud

--
http://mail.python.org/mailman/listinfo/python-list

Re: Python Regex Question

2008-10-29 Thread Tim Chase


> I need a regex expression which returns the start to the x=ANIMAL for
> only the x=Dog fragments so all my entries should be start ...
> (something here) ... x=Dog .  So I am really interested in fragments 1
> and 3 only.
>
> My idea (primitive) ^start.*?x=Dog doesn't work because clearly it
> would return results
>
> start
> x=Dog  # (good)
>
> and
>
> start
> x=Cat
> stop
> start
> x=Dog # bad since I only want start ... x=Dog portion


Looks like the following does the trick:

>>> s = """start  #frag 1 start
... x=Dog # frag 1 end
... stop
... start# frag 2 start
... x=Cat # frag 2 end
... stop
... start #frag 3 start
... x=Dog #frag 3 end
... stop"""
>>> import re
>>> r = re.compile(r'^start.*\nx=Dog.*\nstop.*', re.MULTILINE)
>>> for i, result in enumerate(r.findall(s)):
...     print i, repr(result)
...
0 'start  #frag 1 start\nx=Dog # frag 1 end\nstop'
1 'start #frag 3 start\nx=Dog #frag 3 end\nstop'

-tkc







--
http://mail.python.org/mailman/listinfo/python-list

Python Regex Question

2008-10-29 Thread MalteseUnderdog


Hi there I just started python (but this question isn't that trivial
since I couldn't find it in google :) )

I have the following text file entries (simplified)

start  #frag 1 start
x=Dog # frag 1 end
stop
start# frag 2 start
x=Cat # frag 2 end
stop
start #frag 3 start
x=Dog #frag 3 end
stop


I need a regex expression which returns the start to the x=ANIMAL for
only the x=Dog fragments so all my entries should be start ...
(something here) ... x=Dog .  So I am really interested in fragments 1
and 3 only.

My idea (primitive) ^start.*?x=Dog doesn't work because clearly it
would return results

start
x=Dog  # (good)

and

start
x=Cat
stop
start
x=Dog # bad since I only want start ... x=Dog portion

Can you help me ?

Thanks
JP, Malta.
--
http://mail.python.org/mailman/listinfo/python-list

Re: Python regex question

2008-08-15 Thread Tim N. van der Leeuw


Hey Gerhard,


Gerhard Häring wrote:
> 
> Tim van der Leeuw wrote:
>> Hi,
>> 
>> I'm trying to create a regular expression for matching some particular 
>> XML strings. I want to extract the contents of a particular XML tag, 
>> only if it follows one tag, but not follows another tag. Complicating 
>> this, is that there can be any number of other tags in between. [...]
> 
> Sounds like this would be easier to implement using Python's SAX API.
> 
> Here's a short example that does something similar to what you want to 
> achieve:
> 
> [...]
> 

I so far forgot to say a "thank you" for the suggestion :-)

The sample code as you sent it doesn't do what I need to do, but I did look
at it for creating SAX handler code that does what I want.

It took me a while to implement, as it didn't fit in the parser-engine I had
and I was close to making a release.

But still: thanks!

--Tim


--
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2008-08-06 Thread Tobiah

On Tue, 05 Aug 2008 15:55:46 +0100, Fred Mangusta wrote:

> Chris wrote:
> 
>> Doesn't work for his use case as he wants to keep periods marking the
>> end of a sentence.

Doesn't it?  The period has to be surrounded by digits in the
example solution, so wouldn't periods followed by a space
(end of sentence) always make it through?

--
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2008-08-05 Thread MRAB

On Aug 5, 11:39 am, Fred Mangusta <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I would like to delete all the instances of a '.' within a number.
>
> In other words I'd like to replace all the instances of a '.' character
> with something (say nothing at all) when the '.' is representing a
> decimal separator. E.g.
>
> 500.675  >       500675
>
> but also
>
> 1.000.456.344 > 1000456344
>
> I don't care about the fact that the resulting number is difficult to
> read: as long as it remains a series of digits it's ok: the important
> thing is to get rid of the period, because I want to keep it only where
> it marks the end of a sentence.
>
> I was trying to do like this
>
> s=re.sub("[(\d+)(\.)(\d+)]","... ",s)
>
> but I don't know much about regular expressions, and don't know how to
> get the two groups of numbers and join them in the sub. Moreover doing
> like this I only match things like "345.000" and not "1.000.000".
>
> What's the correct approach?
>
I would use look-behind (is it preceded by a digit?) and look-ahead
(is it followed by a digit?):

s = re.sub(r'(?<=\d)\.(?=\d)', '', s)
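
A small sketch of how this lookaround pattern differs from the capturing-group version posted elsewhere in the thread. The sample sentence is made up for illustration and is not from the original post:

```python
import re

# Hypothetical sample text: grouped numbers, sentence-ending periods, and a
# run of single digits separated by periods.
text = "Totals: 1.000.456.344 and 500.675. Versions 1.2.3 too."

# Lookbehind/lookahead only assert the neighbouring digits without consuming
# them, so every period that sits between two digits is removed.
lookaround = re.sub(r'(?<=\d)\.(?=\d)', '', text)

# Capturing groups consume the digit on each side, so the digit after a
# removed period cannot also serve as the digit before the next period.
groups = re.sub(r'(\d)\.(\d)', r'\1\2', text)

print(lookaround)
print(groups)
```

Both versions leave a period that is followed by a space untouched, which is what the end-of-sentence requirement asks for.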
--
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2008-08-05 Thread Fred Mangusta


Chris wrote:


Doesn't work for his use case as he wants to keep periods marking the
end of a sentence.


Exactly. Thanks to all of you anyway, now I have a better understanding 
on how to go on :)


F.
--
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2008-08-05 Thread Chris

On Aug 5, 2:23 pm, Jeff <[EMAIL PROTECTED]> wrote:
> On Aug 5, 7:10 am, Marc 'BlackJack' Rintsch <[EMAIL PROTECTED]> wrote:
>
>
>
> > On Tue, 05 Aug 2008 11:39:36 +0100, Fred Mangusta wrote:
> > > In other words I'd like to replace all the instances of a '.' character
> > > with something (say nothing at all) when the '.' is representing a
> > > decimal separator. E.g.
>
> > > 500.675  >       500675
>
> > > but also
>
> > > 1.000.456.344 > 1000456344
>
> > > I don't care about the fact the the resulting number is difficult to
> > > read: as long as it remains a series of digits it's ok: the important
> > > thing is to get rid of the period, because I want to keep it only where
> > > it marks the end of a sentence.
>
> > > I was trying to do like this
>
> > > s=re.sub("[(\d+)(\.)(\d+)]","... ",s)
>
> > > but I don't know much about regular expressions, and don't know how to
> > > get the two groups of numbers and join them in the sub. Moreover doing
> > > like this I only match things like "345.000" and not "1.000.000".
>
> > > What's the correct approach?
>
> > In [13]: re.sub(r'(\d)\.(\d)', r'\1\2', '1.000.456.344')
> > Out[13]: '1000456344'
>
> > Ciao,
> >         Marc 'BlackJack' Rintsch
>
> Even faster:
>
> '1.000.456.344'.replace('.', '') => '1000456344'

Doesn't work for his use case as he wants to keep periods marking the
end of a sentence.
--
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2008-08-05 Thread Alexei Zankevich

=)
Indeed. But it will replace all dots, including those in ordinary text, instead
of only those inside numbers.

On Tue, Aug 5, 2008 at 3:23 PM, Jeff <[EMAIL PROTECTED]> wrote:

> On Aug 5, 7:10 am, Marc 'BlackJack' Rintsch <[EMAIL PROTECTED]> wrote:
> > On Tue, 05 Aug 2008 11:39:36 +0100, Fred Mangusta wrote:
> > > In other words I'd like to replace all the instances of a '.' character
> > > with something (say nothing at all) when the '.' is representing a
> > > decimal separator. E.g.
> >
> > > 500.675  >   500675
> >
> > > but also
> >
> > > 1.000.456.344 > 1000456344
> >
> > > I don't care about the fact the the resulting number is difficult to
> > > read: as long as it remains a series of digits it's ok: the important
> > > thing is to get rid of the period, because I want to keep it only where
> > > it marks the end of a sentence.
> >
> > > I was trying to do like this
> >
> > > s=re.sub("[(\d+)(\.)(\d+)]","... ",s)
> >
> > > but I don't know much about regular expressions, and don't know how to
> > > get the two groups of numbers and join them in the sub. Moreover doing
> > > like this I only match things like "345.000" and not "1.000.000".
> >
> > > What's the correct approach?
> >
> > In [13]: re.sub(r'(\d)\.(\d)', r'\1\2', '1.000.456.344')
> > Out[13]: '1000456344'
> >
> > Ciao,
> > Marc 'BlackJack' Rintsch
>
> Even faster:
>
> '1.000.456.344'.replace('.', '') => '1000456344'
--
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2008-08-05 Thread Jeff

On Aug 5, 7:10 am, Marc 'BlackJack' Rintsch <[EMAIL PROTECTED]> wrote:
> On Tue, 05 Aug 2008 11:39:36 +0100, Fred Mangusta wrote:
> > In other words I'd like to replace all the instances of a '.' character
> > with something (say nothing at all) when the '.' is representing a
> > decimal separator. E.g.
>
> > 500.675  >       500675
>
> > but also
>
> > 1.000.456.344 > 1000456344
>
> > I don't care about the fact the the resulting number is difficult to
> > read: as long as it remains a series of digits it's ok: the important
> > thing is to get rid of the period, because I want to keep it only where
> > it marks the end of a sentence.
>
> > I was trying to do like this
>
> > s=re.sub("[(\d+)(\.)(\d+)]","... ",s)
>
> > but I don't know much about regular expressions, and don't know how to
> > get the two groups of numbers and join them in the sub. Moreover doing
> > like this I only match things like "345.000" and not "1.000.000".
>
> > What's the correct approach?
>
> In [13]: re.sub(r'(\d)\.(\d)', r'\1\2', '1.000.456.344')
> Out[13]: '1000456344'
>
> Ciao,
>         Marc 'BlackJack' Rintsch

Even faster:

'1.000.456.344'.replace('.', '') => '1000456344'
--
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2008-08-05 Thread Alexei Zankevich

No, that is a bad way, because the example doesn't handle an arbitrary
number of ... blocks.
But the Python regexp engine supports lookahead (?=pattern) and
lookbehind (?<=pattern).
In those cases the patterns are not included in the replaced sequence of
characters:
>>> re.sub(r'(?<=\d)\.(?=\d)', '', '1234.324 abc.100.abc abc.abc')
'1234324 abc.100.abc abc.abc'

Alexey

On Tue, Aug 5, 2008 at 2:10 PM, Marc 'BlackJack' Rintsch <[EMAIL 
PROTECTED]>wrote:

> On Tue, 05 Aug 2008 11:39:36 +0100, Fred Mangusta wrote:
>
> > In other words I'd like to replace all the instances of a '.' character
> > with something (say nothing at all) when the '.' is representing a
> > decimal separator. E.g.
> >
> > 500.675  >   500675
> >
> > but also
> >
> > 1.000.456.344 > 1000456344
> >
> > I don't care about the fact the the resulting number is difficult to
> > read: as long as it remains a series of digits it's ok: the important
> > thing is to get rid of the period, because I want to keep it only where
> > it marks the end of a sentence.
> >
> > I was trying to do like this
> >
> > s=re.sub("[(\d+)(\.)(\d+)]","... ",s)
> >
> > but I don't know much about regular expressions, and don't know how to
> > get the two groups of numbers and join them in the sub. Moreover doing
> > like this I only match things like "345.000" and not "1.000.000".
> >
> > What's the correct approach?
>
> In [13]: re.sub(r'(\d)\.(\d)', r'\1\2', '1.000.456.344')
> Out[13]: '1000456344'
>
>
> Ciao,
> Marc 'BlackJack' Rintsch
--
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2008-08-05 Thread Marc 'BlackJack' Rintsch

On Tue, 05 Aug 2008 11:39:36 +0100, Fred Mangusta wrote:

> In other words I'd like to replace all the instances of a '.' character
> with something (say nothing at all) when the '.' is representing a
> decimal separator. E.g.
> 
> 500.675  >   500675
> 
> but also
> 
> 1.000.456.344 > 1000456344
> 
> I don't care about the fact the the resulting number is difficult to
> read: as long as it remains a series of digits it's ok: the important
> thing is to get rid of the period, because I want to keep it only where
> it marks the end of a sentence.
> 
> I was trying to do like this
> 
> s=re.sub("[(\d+)(\.)(\d+)]","... ",s)
> 
> but I don't know much about regular expressions, and don't know how to
> get the two groups of numbers and join them in the sub. Moreover doing
> like this I only match things like "345.000" and not "1.000.000".
> 
> What's the correct approach?

In [13]: re.sub(r'(\d)\.(\d)', r'\1\2', '1.000.456.344')
Out[13]: '1000456344'

Ciao,
Marc 'BlackJack' Rintsch
--
http://mail.python.org/mailman/listinfo/python-list

regex question

2008-08-05 Thread Fred Mangusta


Hi,

I would like to delete all the instances of a '.' inside a number.

In other words I'd like to replace all the instances of a '.' character 
with something (say nothing at all) when the '.' is representing a 
decimal separator. E.g.


500.675  >   500675

but also

1.000.456.344 > 1000456344

I don't care about the fact that the resulting number is difficult to 
read: as long as it remains a series of digits it's ok: the important 
thing is to get rid of the period, because I want to keep it only where 
it marks the end of a sentence.


I was trying to do like this

s=re.sub("[(\d+)(\.)(\d+)]","... ",s)

but I don't know much about regular expressions, and don't know how to 
get the two groups of numbers and join them in the sub. Moreover doing 
like this I only match things like "345.000" and not "1.000.000".


What's the correct approach?

Thanks
F.
--
http://mail.python.org/mailman/listinfo/python-list

Re: Python regex question

2008-06-11 Thread Gerhard Häring


Tim van der Leeuw wrote:

Hi,

I'm trying to create a regular expression for matching some particular 
XML strings. I want to extract the contents of a particular XML tag, 
only if it follows one tag, but not follows another tag. Complicating 
this, is that there can be any number of other tags in between. [...]


Sounds like this would be easier to implement using Python's SAX API.

Here's a short example that does something similar to what you want to 
achieve:


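import xml.sax
import xml.sax.handler

# The sample XML in the original post was stripped by the archive; this
# stand-in document is an assumption, kept only so the example runs.
test_str = """
<root>
    <ignore/>
    <foo x="1"/>
    <foo x="2"/>
</root>
"""

class MyHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.ignore_next = False

    def startElement(self, name, attrs):
        if name == "ignore":
            # Remember that the next element must be skipped.
            self.ignore_next = True
            return
        elif name == "foo":
            if not self.ignore_next:
                # handle the element you're interested in here
                print("MY ELEMENT", name, "with", dict(attrs))

        # Any element other than "ignore" clears the flag.
        self.ignore_next = False

xml.sax.parseString(test_str.encode("utf-8"), MyHandler())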

In this case, this looks much clearer and easier to understand to me 
than regular expressions.


-- Gerhard

--
http://mail.python.org/mailman/listinfo/python-list

Python regex question

2008-06-11 Thread Tim van der Leeuw

Hi,

I'm trying to create a regular expression for matching some particular XML
strings. I want to extract the contents of a particular XML tag, only if it
follows one tag, but not follows another tag. Complicating this, is that
there can be any number of other tags in between.

So basically, my regular expression should have 3 parts:
- first match
- any random text, that should not contain string '.*?(?P\d+)'

(hopefully without typos)

Here '' is my first match, and '(?P\d+)'
is my second match.

In this expression, I want to change the generic '.*?', which matches
everything, with something that matches every string that does not include
the substring '--
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2008-02-13 Thread Paul McGuire

On Feb 13, 6:53 am, mathieu <[EMAIL PROTECTED]> wrote:
> I do not understand what is wrong with the following regex expression.
> I clearly mark that the separator in between group 3 and group 4
> should contain at least 2 white space, but group 3 is actually reading
> 3 +4
>
> Thanks
> -Mathieu
>
> import re
>
> line = "      (0021,xx0A)   Siemens: Thorax/Multix FD Lab Settings
> Auto Window Width          SL   1 "
> patt = re.compile("^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+([A-Za-z0-9./:_
> -]+)\s\s+([A-Za-z0-9 ()._,/#>-]+)\s+([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s*
> $")


I love the smell of regex'es in the morning!

For more legible posting (and general maintainability), try breaking
up your quoted strings like this:

line = \
"  (0021,xx0A)   Siemens: Thorax/Multix FD Lab Settings  " \
"Auto Window Width  SL   1 "

patt = re.compile(
    r"^\s*"
    r"\("
    r"([0-9A-Z]+),"
    r"([0-9A-Zx]+)"
    r"\)\s+"
    r"([A-Za-z0-9./:_ -]+)\s\s+"
    r"([A-Za-z0-9 ()._,/#>-]+)\s+"
    r"([A-Z][A-Z]_?O?W?)\s+"
    r"([0-9n-]+)\s*$")


Of course, the problem is that you have a greedy match in the part of
the regex that is supposed to stop between "Settings" and "Auto".
Change patt to:

patt = re.compile(
    r"^\s*"
    r"\("
    r"([0-9A-Z]+),"
    r"([0-9A-Zx]+)"
    r"\)\s+"
    r"([A-Za-z0-9./:_ -]+?)\s\s+"
    r"([A-Za-z0-9 ()._,/#>-]+)\s+"
    r"([A-Z][A-Z]_?O?W?)\s+"
    r"([0-9n-]+)\s*$")

or if you prefer:

patt = re.compile("^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+([A-Za-z0-9./:_
-]+?)\s\s+([A-Za-z0-9 ()._,/#>-]+)\s+([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s*
$")
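
To make the greedy versus non-greedy difference concrete, here is a rough sketch that runs both variants of the pattern on the sample line. The exact spacing of the line is an assumption (the archive collapsed runs of spaces), and the names head, tail, greedy and lazy are only illustrative:

```python
import re

# Assumption: ten spaces before "SL"; the original spacing is not recoverable.
line = ("  (0021,xx0A)   Siemens: Thorax/Multix FD Lab Settings  "
        "Auto Window Width" + " " * 10 + "SL   1 ")

# Parts of the pattern that are identical in both variants.
head = r"^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+"
tail = r"\s\s+([A-Za-z0-9 ()._,/#>-]+)\s+([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s*$"

# Greedy group 3 grabs as much as it can and swallows the fourth field.
greedy = re.match(head + r"([A-Za-z0-9./:_ -]+)" + tail, line)
# Non-greedy group 3 stops at the first run of two or more whitespace characters.
lazy = re.match(head + r"([A-Za-z0-9./:_ -]+?)" + tail, line)

print(repr(greedy.group(3)), repr(greedy.group(4)))
print(repr(lazy.group(3)), repr(lazy.group(4)))
```

With this spacing the greedy version puts "Auto Window Width" inside group 3 and leaves only whitespace for group 4, while the non-greedy version splits the two fields where intended.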

It looks like you wrote this regex to process this specific input
string - it has a fragile feel to it, as if you will have to go back
and tweak it to handle other data that might come along, such as

  (xx42,xx0A)   Honeywell: Inverse Flitznoid (Kelvin)
80  SL   1


Just out of curiosity, I wondered what a pyparsing version of this
would look like.  See below:

from pyparsing import Word,hexnums,delimitedList,printables,\
White,Regex,nums

line = \
"  (0021,xx0A)   Siemens: Thorax/Multix FD Lab Settings  " \
"Auto Window Width  SL   1 "

# define fields
hexint = Word(hexnums+"x")
text = delimitedList(Word(printables),
delim=White(" ",exact=1), combine=True)
type_label = Regex("[A-Z][A-Z]_?O?W?")
int_label = Word(nums+"n-")

# define line structure - give each field a name
line_defn = "(" + hexint("x") + "," + hexint("y") + ")" + \
text("desc") + text("window") + type_label("type") + \
int_label("int")

line_parts = line_defn.parseString(line)
print line_parts.dump()
print line_parts.desc

Prints:
['(', '0021', ',', 'xx0A', ')', 'Siemens: Thorax/Multix FD Lab
Settings', 'Auto Window Width', 'SL', '1']
- desc: Siemens: Thorax/Multix FD Lab Settings
- int: 1
- type: SL
- window: Auto Window Width
- x: 0021
- y: xx0A
Siemens: Thorax/Multix FD Lab Settings

I was just guessing on the field names, but you can see where they are
defined and change them to the appropriate values.

-- Paul
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2008-02-13 Thread grflanagan

On Feb 13, 1:53 pm, mathieu <[EMAIL PROTECTED]> wrote:
> I do not understand what is wrong with the following regex expression.
> I clearly mark that the separator in between group 3 and group 4
> should contain at least 2 white space, but group 3 is actually reading
> 3 +4
>
> Thanks
> -Mathieu
>
> import re
>
> line = "  (0021,xx0A)   Siemens: Thorax/Multix FD Lab Settings
> Auto Window Width  SL   1 "
> patt = re.compile("^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+([A-Za-z0-9./:_
> -]+)\s\s+([A-Za-z0-9 ()._,/#>-]+)\s+([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s*
> $")
> m = patt.match(line)
> if m:
>   print m.group(3)
>   print m.group(4)


I don't know if it solves your problem, but if you want to match a
dash (-), then it must be either escaped or be the first or last element
in a character class.
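
A quick sketch of how Python's re module treats a hyphen inside a character class, as far as I know: it is literal when it comes first, last, or escaped, and it forms a range only when it sits unescaped between two characters. The test string is made up:

```python
import re

# Hyphen last in the class: a literal hyphen plus the letters a and c.
last = re.findall(r"[ac-]", "a-b-c")
# Hyphen first in the class: the same set of characters.
first = re.findall(r"[-ac]", "a-b-c")
# Escaped hyphen in the middle: still a literal hyphen.
escaped = re.findall(r"[a\-c]", "a-b-c")
# Unescaped hyphen between two characters: the range a..c, no literal hyphen.
as_range = re.findall(r"[a-c]", "a-b-c")

print(last, first, escaped, as_range)
```

The classes in the pattern under discussion end with the hyphen, so they fall into the first case.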

Gerard
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2008-02-13 Thread bearophileHUGS

mathieu, stop writing complex REs like obfuscated toys, use the
re.VERBOSE flag and split that RE into several commented and
*indented* lines (indented just like Python code), the indentation
level has to be used to denote nesting. With that you may be able to
solve the problem by yourself. If not, you can offer us a much more
readable thing to fix.
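
For what it's worth, a sketch of how the pattern from this thread might look with re.VERBOSE. The group descriptions in the comments are guesses, the spacing of the sample line is an assumption, and groups 3 and 4 are made non-greedy as suggested elsewhere in the thread:

```python
import re

# Assumption: ten spaces before "SL"; the archive collapsed the real spacing.
line = ("  (0021,xx0A)   Siemens: Thorax/Multix FD Lab Settings  "
        "Auto Window Width" + " " * 10 + "SL   1 ")

# In verbose mode, whitespace and "#" are only significant inside a
# character class, so the pattern can be indented and commented freely.
patt = re.compile(r"""
    ^\s*
    \(
        ([0-9A-Z]+)             # group 1: part before the comma
        ,
        ([0-9A-Zx]+)            # group 2: part after the comma
    \)
    \s+
    ([A-Za-z0-9./:_ -]+?)       # group 3: first text field (non-greedy)
    \s\s+                       # separator: two or more whitespace characters
    ([A-Za-z0-9 ()._,/#>-]+?)   # group 4: second text field (non-greedy)
    \s+
    ([A-Z][A-Z]_?O?W?)          # group 5: two capitals plus optional suffix
    \s+
    ([0-9n-]+)                  # group 6: trailing field
    \s*$
""", re.VERBOSE)

m = patt.match(line)
print(m.groups())
```

Laid out this way, the missing "?" after group 3 is much easier to spot than in the one-line version.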

Bye,
bearophile
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2008-02-13 Thread Wanja Chresta

Hey Mathieu

Due to word wrap I'm not sure what you want to do. What result do you
expect? I get:
>>> print m.groups()
('0021', 'xx0A', 'Siemens: Thorax/Multix FD Lab Settings Auto Window
Width  ', ' ', 'SL', '1')
But only when I insert a space in the 3rd char group (I'm not sure if
your original pattern has a space there or not). So the third group is:
([A-Za-z0-9./:_ -]+). If I do not insert the space, the pattern does not
match the line.

I also can't tell what the format of your line is. If it is like this:
line = "...Siemens: Thorax/Multix FD Lab Settings  Auto Window Width..."
where "Auto Window Width" should be the 4th group, you have to mark the
+ in the 3rd group as non-greedy (it's done with a "?"):
http://docs.python.org/lib/re-syntax.html
([A-Za-z0-9./:_ -]+?)
With that I get:
>>> patt.match(line).groups()
('0021', 'xx0A', 'Siemens: Thorax/Multix FD Lab Settings', 'Auto Window
Width ', 'SL', '1')
Which is probably what you want. You can also add the non-greedy marker
in the fourth group, to get rid of the trailing spaces.

HTH
Wanja


mathieu wrote:
> I clearly mark that the separator in between group 3 and group 4
> should contain at least 2 white space, but group 3 is actually reading
> 3 +4

-- 
http://mail.python.org/mailman/listinfo/python-list

regex question

2008-02-13 Thread mathieu

I do not understand what is wrong with the following regex expression.
I clearly mark that the separator in between group 3 and group 4
should contain at least 2 white space, but group 3 is actually reading
3 +4

Thanks
-Mathieu

import re

line = "  (0021,xx0A)   Siemens: Thorax/Multix FD Lab Settings
Auto Window Width  SL   1 "
patt = re.compile("^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+([A-Za-z0-9./:_
-]+)\s\s+([A-Za-z0-9 ()._,/#>-]+)\s+([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s*
$")
m = patt.match(line)
if m:
  print m.group(3)
  print m.group(4)
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: a newbie regex question

2008-01-25 Thread Max Erickson

"Dotan Cohen" <[EMAIL PROTECTED]> wrote:
> Maybe you mean:
> for match in re.finditer(r'\([A-Z].+[a-z])\', contents):
> Note the last backslash was in the wrong place.

The location of the backslash in the original reply is correct; it is 
there to escape the closing paren, which is a special character:

>>> import re
>>> s='Abcd\nabc (Ab), (ab)'
>>> re.findall(r'\([A-Z].+[a-z]\)', s)
['(Ab), (ab)']

Putting the backslash at the end of the string like you indicated 
results in a syntax error, as it escapes the closing single quote of 
the raw string literal: 

>>> re.findall(r'\([A-Z].+[a-z])\', s)
   
SyntaxError: EOL while scanning single-quoted string
>>> 


max


-- 
http://mail.python.org/mailman/listinfo/python-list

Re: a newbie regex question

2008-01-25 Thread Dotan Cohen

On 24/01/2008, Jonathan Gardner <[EMAIL PROTECTED]> wrote:
> On Jan 24, 12:14 pm, Shoryuken <[EMAIL PROTECTED]> wrote:
> > Given a regular expression pattern, for example, \([A-Z].+[a-z]\),
> >
> > print out all strings that match the pattern in a file
> >
> > Anyone tell me a way to do it? I know it's easy, but i'm completely
> > new to python
> >
> > thanks alot
>
> You may want to read the pages on regular expressions in the online
> documentation: http://www.python.org/doc/2.5/lib/module-re.html
>
> The simple approach works:
>
>   import re
>
>   # Open the file
>   f = file('/your/filename.txt')
>
>   # Read the file into a single string.
>   contents = f.read()
>
>   # Find all matches in the string of the regular expression and
> iterate through them.
>   for match in re.finditer(r'\([A-Z].+[a-z]\)', contents):
> # Print what was matched
> print match.group()

Maybe you mean:
for match in re.finditer(r'\([A-Z].+[a-z])\', contents):

Note the last backslash was in the wrong place.

Dotan Cohen

http://what-is-what.com
http://gibberish.co.il
א-ב-ג-ד-ה-ו-ז-ח-ט-י-ך-כ-ל-ם-מ-ן-נ-ס-ע-ף-פ-ץ-צ-ק-ר-ש-ת

A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: a newbie regex question

2008-01-24 Thread Jonathan Gardner

On Jan 24, 12:14 pm, Shoryuken <[EMAIL PROTECTED]> wrote:
> Given a regular expression pattern, for example, \([A-Z].+[a-z]\),
>
> print out all strings that match the pattern in a file
>
> Anyone tell me a way to do it? I know it's easy, but i'm completely
> new to python
>
> thanks alot

You may want to read the pages on regular expressions in the online
documentation: http://www.python.org/doc/2.5/lib/module-re.html

The simple approach works:

  import re

  # Open the file and read it into a single string.
  with open('/your/filename.txt') as f:
      contents = f.read()

  # Find all matches of the regular expression in the string and
  # iterate through them.
  for match in re.finditer(r'\([A-Z].+[a-z]\)', contents):
      # Print what was matched
      print(match.group())
-- 
http://mail.python.org/mailman/listinfo/python-list

a newbie regex question

2008-01-24 Thread Shoryuken

Given a regular expression pattern, for example, \([A-Z].+[a-z]\),

print out all strings that match the pattern in a file

Can anyone tell me a way to do it? I know it's easy, but I'm completely
new to Python.

Thanks a lot
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: python/regex question... hope someone can help

2007-12-09 Thread Gabriel Genellina

En Sun, 09 Dec 2007 16:45:53 -0300, charonzen <[EMAIL PROTECTED]>  
escribió:

>> [John Machin] Another suggestion is to ensure that the job  
>> specification is not
>> overly simplified. How did you parse the text into "words" in the
>> prior exercise that produced the list of bigrams? Won't you need to
>> use the same parsing method in the current exercise of tagging the
>> bigrams with an underscore?
>
> Thank you John, that definitely puts things in perspective!  I'm very
> new to both Python and text parsing, and I often feel that I can't see
> the forest for the trees.  If you're asking, I'm working on a project
> that utilizes Church's mutual information score.  I tokenize my text,
> split it into a list, derive some unigram and bigram dictionaries, and
> then calculate a pmi dictionary based on x,y from the bigrams and
> unigrams.  The bigrams that pass my threshold then get put into my
> list of x_y strings, and you know the rest.  By modifying the original
> text file, I can view 'x_y', z pairs as x,y and iterate it until I
> have some collocations that are worth playing with.  So I think that
> covers the question the same parsing method.  I'm sure there are more
> pythonic ways to do it, but I'm on deadline :)

Looks like you should work with the list of tokens, collapsing consecutive  
elements, not with the original text. Should be easier, and faster because  
you don't regenerate the text and tokenize it again and again.

-- 
Gabriel Genellina

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: python/regex question... hope someone can help

2007-12-09 Thread charonzen


> Another suggestion is to ensure that the job specification is not
> overly simplified. How did you parse the text into "words" in the
> prior exercise that produced the list of bigrams? Won't you need to
> use the same parsing method in the current exercise of tagging the
> bigrams with an underscore?
>
> Cheers,
> John

Thank you John, that definitely puts things in perspective!  I'm very
new to both Python and text parsing, and I often feel that I can't see
the forest for the trees.  If you're asking, I'm working on a project
that utilizes Church's mutual information score.  I tokenize my text,
split it into a list, derive some unigram and bigram dictionaries, and
then calculate a pmi dictionary based on x,y from the bigrams and
unigrams.  The bigrams that pass my threshold then get put into my
list of x_y strings, and you know the rest.  By modifying the original
text file, I can view 'x_y', z pairs as x,y and iterate it until I
have some collocations that are worth playing with.  So I think that
covers the question of the same parsing method.  I'm sure there are more
pythonic ways to do it, but I'm on deadline :)

Thanks again!

Brandon
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: python/regex question... hope someone can help

2007-12-09 Thread John Machin

On Dec 9, 6:13 pm, charonzen <[EMAIL PROTECTED]> wrote:

The following *may* come close to doing what your revised spec
requires:

import re
def ch_replace2(alist, text):
    for bigram in alist:
        pattern = r'\b' + bigram.replace('_', ' ') + r'\b'
        text = re.sub(pattern, bigram, text)
    return text
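
A self-contained sketch of the same idea, to show what the word boundaries buy you. The function name tag_bigrams is made up here, and the re.escape call is an addition of mine that only matters if a bigram ever contains regex metacharacters:

```python
import re

def tag_bigrams(bigrams, text):
    """Replace each 'x y' in text with 'x_y', matching whole words only."""
    for bigram in bigrams:
        # \b on both sides stops "red herring" matching inside "prepared herring".
        pattern = r'\b' + re.escape(bigram.replace('_', ' ')) + r'\b'
        text = re.sub(pattern, bigram, text)
    return text

print(tag_bigrams(['red_herring'], 'He prepared herring fillets.'))
print(tag_bigrams(['quick_brown', 'lazy_dogs', 'brown_fox'],
                  'The quick brown fox jumped over the lazy dogs.'))
```

Note that once "quick brown" has become "quick_brown", the later bigram "brown_fox" no longer matches, because the underscore counts as a word character and there is no boundary before "brown".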

Cheers,
John
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: python/regex question... hope someone can help

2007-12-09 Thread John Machin

On Dec 9, 6:13 pm, charonzen <[EMAIL PROTECTED]> wrote:
> I have a list of strings.  These strings are previously selected
> bigrams with underscores between them ('and_the', 'nothing_given', and
> so on).  I need to write a regex that will read another text string
> that this list was derived from and replace selections in this text
> string with those from my list.  So in my text string, '... and the...
> ' becomes ' ... and_the...'.   I can't figure out how to manipulate
>
> re.sub(r'([a-z]*) ([a-z]*)', r'()', textstring)
>
> Any suggestions?

The usual suggestion is: Don't bother with regexes when simple string
methods will do the job.

>>> def ch_replace(alist, text):
...     for bigram in alist:
...         original = bigram.replace('_', ' ')
...         text = text.replace(original, bigram)
...     return text
...
>>> print(ch_replace(
...     ['quick_brown', 'lazy_dogs', 'brown_fox'],
...     'The quick brown fox jumped over the lazy dogs.'
...     ))
The quick_brown_fox jumped over the lazy_dogs.
>>> print(ch_replace(['red_herring'], 'He prepared herring fillets.'))
He prepared_herring fillets.
>>>

Another suggestion is to ensure that the job specification is not
overly simplified. How did you parse the text into "words" in the
prior exercise that produced the list of bigrams? Won't you need to
use the same parsing method in the current exercise of tagging the
bigrams with an underscore?

Cheers,
John
-- 
http://mail.python.org/mailman/listinfo/python-list

python/regex question... hope someone can help

2007-12-08 Thread charonzen

I have a list of strings.  These strings are previously selected
bigrams with underscores between them ('and_the', 'nothing_given', and
so on).  I need to write a regex that will read another text string
that this list was derived from and replace selections in this text
string with those from my list.  So in my text string, '... and the...
' becomes ' ... and_the...'.   I can't figure out how to manipulate

re.sub(r'([a-z]*) ([a-z]*)', r'()', textstring)

Any suggestions?

Thank you if you can help!
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: RegEx question

2007-10-04 Thread John Masters

On 15:25 Thu 04 Oct , Robert Dailey wrote:
> I am not a regex expert, I simply assumed regex was standardized to follow
> specific guidelines.

There are as many different regex flavours as there are Linux distros.
Each follows the basic rules but implements them slightly differently
and adds their own 'extensions'. 

> I also made the assumption that this was a good place
> to pose the question since regular expressions are a feature of Python.

The best place to pose a regex question is in the sphere of usage, i.e.
Perl regexes differ hugely in implementation from OO langs like Python
or Java, while shells like bash or zsh use regexes slightly differently,
as do shell scripting languages like awk or sed. 

> The question concerned regular expressions in general, not really the
> application. However, now that I know that regex can be different, I'll try
> to contact the author directly to find out the dialect and then find the
> appropriate location for my question from there. I do appreciate everyone's
> help. I've tried the various suggestions offered here, however none of them
> work. I can only assume at this point that this regex is drastically
> different or the application reading the regex is just broken.

If you care to PM me with details of the language/context I will try to
help but I am no expert.

Regards, John
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: RegEx question

2007-10-04 Thread Robert Dailey

I am not a regex expert, I simply assumed regex was standardized to follow
specific guidelines. I also made the assumption that this was a good place
to pose the question since regular expressions are a feature of Python. The
question concerned regular expressions in general, not really the
application. However, now that I know that regex can be different, I'll try
to contact the author directly to find out the dialect and then find the
appropriate location for my question from there. I do appreciate everyone's
help. I've tried the various suggestions offered here, however none of them
work. I can only assume at this point that this regex is drastically
different or the application reading the regex is just broken.

Thanks again for everyones help!
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: RegEx question

2007-10-04 Thread Tim Chase

[sigh...replying to my own post]
> However, things to try:
> 
> - sometimes the grouping parens need to be escaped with "\"
> 
> - sometimes "\w" isn't a valid character class, so use the 
> long-hand variant of something like "[a-zA-Z0-9_]"
> 
> - sometimes the "+" is escaped with a "\"
> 
> - if you don't use raw strings, you'll need to escape your "\" 
> characters, making each instance "\\"

just to be clear...these are some variants you may find in 
non-python regexps (or in python regexps if you're not using raw 
strings)

-tkc




-- 
http://mail.python.org/mailman/listinfo/python-list

Re: RegEx question

2007-10-04 Thread Tim Chase

>>> try @param\[(in|out)\] \w+
>>>
>> This didn't work either :(
>>
>> The tool using this regular expression (Comment Reflower for VS2005) May be
>> broken...
> 
> How about @param\[[i|o][n|u]t*\]\w+ ?

...if you want to accept patterns like

   @param[iutt]xxx

...

The regexp at the top (Adam's original reply) would be the valid 
regexp in python and matches all the tests thrown at it, assuming 
it's placed in a raw string:

   r = re.compile(r"@param\[(in|out)\] \w+")
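
A rough check of the three patterns from this thread against Python's re module (this says nothing about how Comment Reflower interprets them). The variable names are illustrative only:

```python
import re

samples = [
    "@param[in] variable ",
    "@param[out] state Some description of this variable",
]

# Original pattern: "|" splits the whole expression into two alternatives,
# "@param\[in" and "out\] \w+ ".
broken = re.compile(r"@param\[in|out\] \w+ ")
# Grouped alternation, as suggested in the first reply.
fixed = re.compile(r"@param\[(in|out)\] \w+")
# The character-class variant: inside [...] the "|" is a literal character.
loose = re.compile(r"@param\[[i|o][n|u]t*\]\w+")

broken_hits = [broken.search(s).group() for s in samples]
fixed_hits = [fixed.match(s).group() for s in samples]

print(broken_hits)
print(fixed_hits)
print(loose.match("@param[iutt]xxx") is not None)
print(loose.match("@param[in] variable") is None)
```

The ungrouped pattern still finds something in each line, but only a fragment of it, which may be why it looks like it half works.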

If it's not a python regexp, this isn't really the list for the 
question, is it? ;)

However, things to try:

- sometimes the grouping parens need to be escaped with "\"

- sometimes "\w" isn't a valid character class, so use the 
long-hand variant of something like "[a-zA-Z0-9_]"

- sometimes the "+" is escaped with a "\"

- if you don't use raw strings, you'll need to escape your "\" 
characters, making each instance "\\"
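
On the last point, a minimal sketch of what raw strings change in Python. The sample text is made up:

```python
import re

# A raw string keeps the backslash; a normal string needs it doubled.
same_pattern = (r"\w+" == "\\w+")

# Without the r prefix, "\b" is a single backspace character, not the
# two-character regex token for a word boundary.
lengths = (len("\b"), len(r"\b"))

# So the non-raw pattern looks for backspace characters and finds nothing.
not_raw = re.search("\bfoo\b", "a foo b")
raw = re.search(r"\bfoo\b", "a foo b")

print(same_pattern, lengths, not_raw, raw.group())
```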

HTH,

-tkc


-- 
http://mail.python.org/mailman/listinfo/python-list

Re: RegEx question

2007-10-04 Thread Manu Hack

On 10/4/07, Robert Dailey <[EMAIL PROTECTED]> wrote:
> On 10/4/07, Adam Lanier <[EMAIL PROTECTED]> wrote:
> >
> > try @param\[(in|out)\] \w+
> >
>
> This didn't work either :(
>
> The tool using this regular expression (Comment Reflower for VS2005) May be
> broken...
>

How about @param\[[i|o][n|u]t*\]\w+ ?
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: RegEx question

2007-10-04 Thread Jerry Hill

> As far as the dialect, I can't be sure. I am unable to find documentation
> for Comment Reflower and thus cannot figure out what type of regex it is
> using. What exactly do you mean by your question, "are you using raw
> strings?". Thanks for your response and I apologize for the lack of detail.

Comment Reflower appears to be a plugin for Visual Studio written in
C#.  As far as I can tell, it has nothing to do with Python at all.

A quick look at their sourceforge page
(http://sourceforge.net/projects/commentreflower/) doesn't show any
mailing lists or discussion groups.  Maybe try emailing the author
directly, or asking a C# language group about whatever the standard C#
regular expression library is.

-- 
Jerry
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: RegEx question

2007-10-04 Thread Robert Dailey

On 10/4/07, J. Clifford Dyer <[EMAIL PROTECTED]> wrote:
>
> You *are* talking about python regular expressions, right?  There are a
> number of different dialects.  Also, there could be issues with the quoting
> method (are you using raw strings?)
>
> The more specific you can get, the more we can help you.


As far as the dialect, I can't be sure. I am unable to find documentation
for Comment Reflower and thus cannot figure out what type of regex it is
using. What exactly do you mean by your question, "are you using raw
strings?". Thanks for your response and I apologize for the lack of detail.
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: RegEx question

2007-10-04 Thread J. Clifford Dyer

You *are* talking about python regular expressions, right?  There are a number 
of different dialects.  Also, there could be issues with the quoting method 
(are you using raw strings?)  

The more specific you can get, the more we can help you.

Cheers,
Cliff
On Thu, Oct 04, 2007 at 11:54:32AM -0500, Robert Dailey wrote regarding Re: 
RegEx question:
> 
>On 10/4/07, Adam Lanier <[EMAIL PROTECTED]> wrote:
> 
>  try @param\[(in|out)\] \w+
> 
>This didn't work either :(
>The tool using this regular expression (Comment Reflower for VS2005)
>May be broken...
> 
> References
> 
>1. mailto:[EMAIL PROTECTED]

> -- 
> http://mail.python.org/mailman/listinfo/python-list
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: RegEx question

2007-10-04 Thread Robert Dailey

On 10/4/07, Adam Lanier <[EMAIL PROTECTED]> wrote:
>
>
> try @param\[(in|out)\] \w+
>

This didn't work either :(

The tool using this regular expression (Comment Reflower for VS2005) may be
broken...
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: RegEx question

2007-10-04 Thread Adam Lanier

On Thu, 2007-10-04 at 10:58 -0500, Robert Dailey wrote:
> It should also match:
> 
> @param[out] state Some description of this variable
> 
> 
> On 10/4/07, Robert Dailey <[EMAIL PROTECTED]> wrote:
> Hi,
> 
> The following regex (Not including the end quotes):
> 
> "@param\[in|out\] \w+ "
> 
> Should match any of the following:
> 
> @param[in] variable 
> @param[out] state 
> @param[in] foo 
> @param[out] bar 
> 
> 
> Correct? (Note the trailing whitespace in the regex as well as
> in the examples)
> 
> -- 
> http://mail.python.org/mailman/listinfo/python-list

try @param\[(in|out)\] \w+ 
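
Why the parentheses matter can be sketched in Python's re module. This
is only an illustration of alternation precedence; the dialect used by
Comment Reflower may behave differently, and the variable names here are
made up:

```python
import re

# Alternation has the lowest precedence, so the ungrouped pattern reads
# as "@param\[in"  OR  "out\] \w+ " instead of choosing between in/out.
ungrouped = re.compile(r"@param\[in|out\] \w+ ")
# Parentheses confine the alternation to the two keywords.
grouped = re.compile(r"@param\[(in|out)\] \w+ ")

samples = ["@param[in] variable ", "@param[out] state "]

ungrouped_full = [ungrouped.fullmatch(s) is not None for s in samples]
grouped_full = [grouped.fullmatch(s) is not None for s in samples]
# search() still finds something with the ungrouped pattern, but only
# the fragment covered by one side of the alternation.
partial = ungrouped.search(samples[0]).group(0)
print(ungrouped_full, grouped_full, repr(partial))
```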

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: RegEx question

2007-10-04 Thread Robert Dailey

It should also match:

@param[out] state Some description of this variable


On 10/4/07, Robert Dailey <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> The following regex (Not including the end quotes):
>
> "@param\[in|out\] \w+ "
>
> Should match any of the following:
>
> @param[in] variable
> @param[out] state
> @param[in] foo
> @param[out] bar
>
>
> Correct? (Note the trailing whitespace in the regex as well as in the
> examples)
>
-- 
http://mail.python.org/mailman/listinfo/python-list

RegEx question

2007-10-04 Thread Robert Dailey

Hi,

The following regex (Not including the end quotes):

"@param\[in|out\] \w+ "

Should match any of the following:

@param[in] variable
@param[out] state
@param[in] foo
@param[out] bar


Correct? (Note the trailing whitespace in the regex as well as in the
examples)
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Python Regex Question

2007-09-21 Thread David

> re.search(expr, string) compiles and searches every time. This can
> potentially be more expensive in calculating power. especially if you
> have to use the expression a lot of times.

The re module-level helper functions cache expressions and their
compiled form in a dict. They are only compiled once. The main
overhead would be for repeated dict lookups.

See sre.py (included from re.py) for more details. /usr/lib/python2.4/sre.py
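
A minimal sketch of the two call styles side by side (the expression and
sample text are made up for illustration). Both give the same result;
the difference is only in who holds on to the compiled pattern:

```python
import re

expr = r"[0-9]+\.[0-9]+"
text = "price: 49.950"

# Explicit compile: you keep the pattern object yourself.
compiled = re.compile(expr)
via_compiled = compiled.search(text).group(0)

# Module-level helper: the re module compiles and caches on your behalf.
via_module = re.search(expr, text).group(0)

print(via_compiled, via_module)
```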
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Python Regex Question

2007-09-21 Thread Ivo

crybaby wrote:
> On Sep 20, 4:12 pm, Tobiah <[EMAIL PROTECTED]> wrote:
>> [EMAIL PROTECTED] wrote:
>>> I need to extract the number on each >> i.e 49.950 from the following:
>>>  49.950 
>>> The actual number between:  49.950  can be any number of
>>> digits before decimal and after decimal.
>>>  ##. 
>>> How can I just extract the real/integer number using regex?
>> '[0-9]*\.[0-9]*'
>>
>> --
>> Posted via a free Usenet account fromhttp://www.teranews.com
> 
> I am trying to use BeautifulSoup:
> 
> soup = BeautifulSoup(page)
> 
> td_tags = soup.findAll('td')
> i=0
> for td in td_tags:
>     i = i+1
>     print "td: ", td
>     # re.search('[0-9]*\.[0-9]*', td)
>     price = re.compile('[0-9]*\.[0-9]*').search(td)
> 
> I am getting an error:
> 
>price= re.compile('[0-9]*\.[0-9]*').search(td)
> TypeError: expected string or buffer
> 
> Does beautiful soup returns array of objects? If so, how do I pass
> "td" instance as string to re.search?  What is the different between
> re.search vs re.compile().search?
> 

I don't know anything about BeautifulSoup, but to the other questions:

var=re.compile(regexpr) compiles the expression and after that you can 
use var as the reference to that compiled expression (costs less)

re.search(expr, string) compiles and searches every time. This can 
potentially be more expensive in calculating power. especially if you 
have to use the expression a lot of times.

The way you use it it doesn't matter.

do:
pattern = re.compile(r'[0-9]*\.[0-9]*')
result = pattern.findall(your_text_here)

Now you can reuse pattern.
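
On the TypeError itself, here is a rough sketch. It assumes the objects
returned by findAll are not plain strings and that str() on them yields
their markup; that is an assumption about BeautifulSoup, not something
checked here. FakeTag is a hypothetical stand-in, not a BeautifulSoup
class:

```python
import re

class FakeTag(object):
    """Hypothetical stand-in for a parsed <td> element."""
    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        return self.markup

pattern = re.compile(r"[0-9]+\.[0-9]+")
td = FakeTag("<td> 49.950 </td>")

# Passing a non-string object to search() raises TypeError.
try:
    pattern.search(td)
    raised = False
except TypeError:
    raised = True

# Converting to a string first lets the search run.
price = pattern.search(str(td)).group(0)
print(raised, price)
```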

Cheers,
Ivo.
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Python Regex Question

2007-09-20 Thread crybaby

On Sep 20, 4:12 pm, Tobiah <[EMAIL PROTECTED]> wrote:
> [EMAIL PROTECTED] wrote:
> > I need to extract the number on each 
> > i.e 49.950 from the following:
>
> >  49.950 
>
> > The actual number between:  49.950  can be any number of
> > digits before decimal and after decimal.
>
> >  ##. 
>
> > How can I just extract the real/integer number using regex?
>
> '[0-9]*\.[0-9]*'
>
> --
> Posted via a free Usenet account fromhttp://www.teranews.com

I am trying to use BeautifulSoup:

soup = BeautifulSoup(page)

td_tags = soup.findAll('td')
i=0
for td in td_tags:
    i = i+1
    print "td: ", td
    # re.search('[0-9]*\.[0-9]*', td)
    price = re.compile('[0-9]*\.[0-9]*').search(td)

I am getting an error:

   price= re.compile('[0-9]*\.[0-9]*').search(td)
TypeError: expected string or buffer

Does BeautifulSoup return an array of objects? If so, how do I pass the
"td" instance as a string to re.search?  What is the difference between
re.search and re.compile().search?

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Python Regex Question

2007-09-20 Thread Gerardo Herzig

[EMAIL PROTECTED] wrote:

>I need to extract the number on each 
>i.e 49.950 from the following:
>
> 49.950 
>
>The actual number between:  49.950  can be any number of
>digits before decimal and after decimal.
>
> ##. 
>
>How can I just extract the real/integer number using regex?
>
>  
>
If all the td's content has the  [value_to_extract]  pattern, 
things get simpler

[untested]

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Python Regex Question

2007-09-20 Thread Tobiah

[EMAIL PROTECTED] wrote:
> I need to extract the number on each  
> i.e 49.950 from the following:
> 
>  49.950 
> 
> The actual number between:  49.950  can be any number of
> digits before decimal and after decimal.
> 
>  ##. 
> 
> How can I just extract the real/integer number using regex?
> 


'[0-9]*\.[0-9]*'

-- 
Posted via a free Usenet account from http://www.teranews.com

-- 
http://mail.python.org/mailman/listinfo/python-list

Python Regex Question

2007-09-20 Thread joemystery123

I need to extract the number on each  49.950 

The actual number between:  49.950  can be any number of
digits before decimal and after decimal.

 ##. 

How can I just extract the real/integer number using regex?

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Simple Python REGEX Question

2007-05-12 Thread James T. Dennis

johnny <[EMAIL PROTECTED]> wrote:
> I need to get the content inside the bracket.

> eg. some characters before bracket (3.12345).
> I need to get whatever inside the (), in this case 3.12345.
> How do you do this with python regular expression?

 I'm going to presume that you mean something like:

I want to extract floating point numerics from parentheses
embedded in other, arbitrary, text.

 Something like:

>>> given='adfasdfafd(3.14159265)asdfasdfadsfasf'
>>> import re
>>> mymatch = re.search(r'\(([0-9.]+)\)', given).groups()[0]
>>> mymatch
'3.14159265'
>>> 

 Of course, as with any time you're contemplating the use of regular
 expressions, there are lots of questions to consider about the exact
 requirements here.  What if there is more than one such pattern?  Do you
 only want the first match per line (or other string)?  (That's all my
 example will give you).  What if there are no matches?  My example
 will raise an AttributeError (since the re.search will return the
 "None" object rather than a match object; and naturally the None
 object has no ".groups()" method).

 The following might work better:

>>> mymatches = re.findall(r'\(([0-9.]+)\)', given)
>>> if len(mymatches):
...     ...

 ... and, of course, you might be better off with a compiled regexp if
 you're going to repeat the search on many strings:

num_extractor = re.compile(r'\(([0-9.]+)\)')
for line in myfile:
    for num in num_extractor.findall(line):
        pass
        # do whatever with all these numbers
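
 The "no match" and "many matches" cases can be sketched like this.
 The function names are made up for illustration; only re.compile,
 search and findall come from the standard library:

```python
import re

NUM_IN_PARENS = re.compile(r'\(([0-9.]+)\)')

def first_number(text):
    """Return the first parenthesised number in text, or None if absent."""
    match = NUM_IN_PARENS.search(text)
    # search() returns None on failure, so check before calling group().
    return match.group(1) if match else None

def all_numbers(text):
    """Return every parenthesised number in text (empty list if none)."""
    return NUM_IN_PARENS.findall(text)

print(first_number('adfasdfafd(3.14159265)asdfasdfadsfasf'))
print(first_number('no parentheses here'))
print(all_numbers('a(1.5)b(2)c'))
```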

-- 
Jim Dennis,
Starshine: Signed, Sealed, Delivered

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Simple Python REGEX Question

2007-05-11 Thread Steven D'Aprano

On Fri, 11 May 2007 08:54:31 -0700, johnny wrote:

> I need to get the content inside the bracket.
> 
> eg. some characters before bracket (3.12345).
> 
> I need to get whatever inside the (), in this case 3.12345.
> 
> How do you do this with python regular expression?

Why would you bother? If you know your string is a bracketed expression,
all you need is:

s = "(3.12345)"
contents = s[1:-1] # ignore the first and last characters

If your string is more complex:

s = "lots of things here (3.12345) and some more things here"

then the task is harder. In general, you can't use regular expressions for
that, you need a proper parser, because brackets can be nested.

But if you don't care about nested brackets, then something like this is
easy:

def get_bracket(s):
    p, q = s.find('('), s.find(')')
    if p == -1 or q == -1: raise ValueError("Missing bracket")
    if p > q: raise ValueError("Close bracket before open bracket")
    return s[p+1:q]  # the slice end is exclusive, so q already omits ')'

Or as a one liner with no error checking:

s[s.find('(')+1:s.find(')')]
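
For the nested case mentioned earlier, a depth counter is one simple way
to go beyond what a single regular expression handles. This is a rough
sketch, not a full parser; outer_bracket is a made-up name:

```python
def outer_bracket(s):
    """Return the contents of the first balanced (...) group in s."""
    start = s.find('(')
    if start == -1:
        raise ValueError("Missing bracket")
    depth = 0
    for i in range(start, len(s)):
        if s[i] == '(':
            depth += 1
        elif s[i] == ')':
            depth -= 1
            if depth == 0:
                # i is the ')' that closes the bracket opened at start.
                return s[start + 1:i]
    raise ValueError("Unbalanced brackets")

print(outer_bracket("lots of things (3.12345) and more"))
print(outer_bracket("a (b (c) d) e"))
```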

-- 
Steven.

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Simple Python REGEX Question

2007-05-11 Thread John Machin

On May 12, 2:21 am, Gary Herron <[EMAIL PROTECTED]> wrote:
> johnny wrote:
> > I need to get the content inside the bracket.
>
> > eg. some characters before bracket (3.12345).
>
> > I need to get whatever inside the (), in this case 3.12345.
>
> > How do you do this with python regular expression?
>
> >>> import re
> >>> x = re.search("[0-9.]+", "(3.12345)")
> >>> print x.group(0)
>
> 3.12345
>
> There's a lot more to the re module, of course.  I'd suggest reading the
> manual, but this should get you started.
>

>>> s = "some chars like 987 before the bracket (3.12345) etc"
>>> x = re.search("[0-9.]+", s)
>>> x.group(0)
'987'

OP sez: "I need to get the content inside the bracket"
OP sez: "I need to get whatever inside the ()"

My interpretation:

>>> for s in ['foo(123)bar', 'foo(123))bar', 'foo()bar', 'foobar']:
...     x = re.search(r"\([^)]*\)", s)
...     print repr(x and x.group(0)[1:-1])
...
'123'
'123'
''
None




-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Simple Python REGEX Question

2007-05-11 Thread Gary Herron

johnny wrote:
> I need to get the content inside the bracket.
>
> eg. some characters before bracket (3.12345).
>
> I need to get whatever inside the (), in this case 3.12345.
>
> How do you do this with python regular expression?
>   

>>> import re
>>> x = re.search("[0-9.]+", "(3.12345)")
>>> print x.group(0)
3.12345

There's a lot more to the re module, of course.  I'd suggest reading the
manual, but this should get you started.


Gary Herron

-- 
http://mail.python.org/mailman/listinfo/python-list

Simple Python REGEX Question

2007-05-11 Thread johnny

I need to get the content inside the bracket.

eg. some characters before bracket (3.12345).

I need to get whatever inside the (), in this case 3.12345.

How do you do this with python regular expression?

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2007-04-28 Thread proctor

On Apr 27, 8:26 am, Michael Hoffman <[EMAIL PROTECTED]> wrote:
> proctor wrote:
> > On Apr 27, 1:33 am, Paul McGuire <[EMAIL PROTECTED]> wrote:
> >> On Apr 27, 1:33 am, proctor <[EMAIL PROTECTED]> wrote:
> >>> rx_test = re.compile('/x([^x])*x/')
> >>> s = '/xabcx/'
> >>> if rx_test.findall(s):
> >>> print rx_test.findall(s)
> >>> 
> >>> i expect the output to be ['abc'] however it gives me only the last
> >>> single character in the group: ['c']
>
> >> As Josiah already pointed out, the * needs to be inside the grouping
> >> parens.
> > so my question remains, why doesn't the star quantifier seem to grab
> > all the data.
>
> Because you didn't use it *inside* the group, as has been said twice.
> Let's take a simpler example:
>
>  >>> import re
>  >>> text = "xabc"
>  >>> re_test1 = re.compile("x([^x])*")
>  >>> re_test2 = re.compile("x([^x]*)")
>  >>> re_test1.match(text).groups()
> ('c',)
>  >>> re_test2.match(text).groups()
> ('abc',)
>
> There are three places that match ([^x]) in text. But each time you find
> one you overwrite the previous example.
>
> > isn't findall() intended to return all matches?
>
> It returns all matches of the WHOLE pattern, /x([^x])*x/. Since you used
> a grouping parenthesis in there, it only returns one group from each
> pattern.
>
> Back to my example:
>
>  >>> re_test1.findall("xabcxaaaxabc")
> ['c', 'a', 'c']
>
> Here it finds multiple matches, but only because the x occurs multiple
> times as well. In your example there is only one match.
>
> > i would expect either 'abc' or 'a', 'b', 'c' or at least just
> > 'a' (because that would be the first match).
>
> You are essentially doing this:
>
> group1 = "a"
> group1 = "b"
> group1 = "c"
>
> After those three statements, you wouldn't expect group1 to be "abc" or
> "a". You'd expect it to be "c".
> --
> Michael Hoffman

thank you all again for helping to clarify this for me.  of course you
were exactly right, and the problem lay not with python or the text,
but with me.  i mistakenly understood the text to be attempting to
capture the C style comment, when in fact it was merely matching it.

apologies.

sincerely,
proctor

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2007-04-27 Thread proctor

On Apr 27, 8:50 am, Paul McGuire <[EMAIL PROTECTED]> wrote:
> On Apr 27, 9:10 am, proctor <[EMAIL PROTECTED]> wrote:
>
>
>
> > On Apr 27, 1:33 am, Paul McGuire <[EMAIL PROTECTED]> wrote:
>
> > > On Apr 27, 1:33 am, proctor <[EMAIL PROTECTED]> wrote:
>
> > > > hello,
>
> > > > i have a regex:  rx_test = re.compile('/x([^x])*x/')
>
> > > > which is part of this test program:
>
> > > > 
>
> > > > import re
>
> > > > rx_test = re.compile('/x([^x])*x/')
>
> > > > s = '/xabcx/'
>
> > > > if rx_test.findall(s):
> > > > print rx_test.findall(s)
>
> > > > 
>
> > > > i expect the output to be ['abc'] however it gives me only the last
> > > > single character in the group: ['c']
>
> > > > C:\test>python retest.py
> > > > ['c']
>
> > > > can anyone point out why this is occurring?  i can capture the entire
> > > > group by doing this:
>
> > > > rx_test = re.compile('/x([^x]+)*x/')
> > > > but why isn't the 'star' grabbing the whole group?  and why isn't each
> > > > letter 'a', 'b', and 'c' present, either individually, or as a group
> > > > (group is expected)?
>
> > > > any clarification is appreciated!
>
> > > > sincerely,
> > > > proctor
>
> > > As Josiah already pointed out, the * needs to be inside the grouping
> > > parens.
>
> > > Since re's do lookahead/backtracking, you can also write:
>
> > > rx_test = re.compile('/x(.*?)x/')
>
> > > The '?' is there to make sure the .* repetition stops at the first
> > > occurrence of x/.
>
> > > -- Paul
>
> > i am working through an example from the oreilly book mastering
> > regular expressions (2nd edition) by jeffrey friedl.  my post was a
> > snippet from a regex to match C comments.   every 'x' in the regex
> > represents a 'star' in actual usage, so that backslash escaping is not
> > needed in the example (on page 275).  it looks like this:
>
> > ===
>
> > /x([^x]|x+[^/x])*x+/
>
> > it is supposed to match '/x', the opening delimiter, then
>
> > (
> > either anything that is 'not x',
>
> > or,
>
> > 'x' one or more times, 'not followed by a slash or an x'
> > ) any number of times (the 'star')
>
> > followed finally by the closing delimiter.
>
> > ===
>
> > this does not seem to work in python the way i understand it should
> > from the book, and i simplified the example in my first post to
> > concentrate on just one part of the alternation that i felt was not
> > acting as expected.
>
> > so my question remains, why doesn't the star quantifier seem to grab
> > all the data.  isn't findall() intended to return all matches?  i
> > would expect either 'abc' or 'a', 'b', 'c' or at least just
> > 'a' (because that would be the first match).  why does it give only
> > one letter, and at that, the /last/ letter in the sequence??
>
> > thanks again for replying!
>
> > sincerely,
> > proctor
>
> Again, I'll repeat some earlier advice:  you need to move the '*'
> inside the parens - you are still leaving it outside.  Also, get in
> the habit of using raw literal notation (that is r"slkjdfljf" instead
> of "lsjdlfkjs") when defining re strings - you don't have backslash
> issues yet, but you will as soon as you start putting real '*'
> characters in your expression.
>
> However, when I test this,
>
> restr = r'/x(([^x]|x+[^/])*)x+/'
> re_ = re.compile(restr)
> print re_.findall("/xabxxcx/ /x123xxx/")
>
> findall now starts to give a tuple for each "comment",
>
> [('abxxc', 'xxc'), ('123xx', 'xx')]
>
> so you have gone beyond my limited re skill, and will need help from
> someone else.
>
> But I suggest you add some tests with multiple consecutive 'x'
> characters in the middle of your comment, and multiple consecutive 'x'
> characters before the trailing comment.  In fact, from my
> recollections of trying to implement this type of comment recognizer
> by hand a long time ago in a job far, far away, test with both even
> and odd numbers of 'x' characters.
>
> -- Paul

thanks paul,

the reason the regex now gives tuples is that there are now 2 groups,
the inner and outer parens.  so group 1 (the outer parens) matches with
the star inside it, and group 2 (the inner parens) matches without the star.
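
one way to get a single string per comment is to make the inner parens
non-capturing with (?:...).  this is a sketch using the pattern as given
in the book example earlier in the thread (with 'x' standing in for the
star); the variable names are made up:

```python
import re

# (?:...) groups without capturing, so the only capturing group left is
# the outer one that holds the whole comment body.
comment = re.compile(r'/x((?:[^x]|x+[^/x])*)x+/')
bodies = comment.findall("/xabxxcx/ /x123xxx/")
print(bodies)
```

with only one capturing group, findall returns a flat list of strings
instead of a list of tuples.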

sincerely,
proctor

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2007-04-27 Thread Duncan Booth

proctor <[EMAIL PROTECTED]> wrote:

>> >>> re.findall('(.)*', 'abc')
>> ['c', '']

> thank you this is interesting.  in the second example, where does the
> 'nothingness' match, at the end?  why does the regex 'run again' when
> it has already matched everything?  and if it reports an empty match
> along with a non-empty match, why only the two?
> 

There are 4 possible starting points for a regular expression to match in a 
three character string. The regular expression would match at any starting 
point so in theory you could find 4 possible matches in the string. In this 
case they would be 'abc', 'bc', 'c', ''.

However findall won't get any overlapping matches, so there are only two 
possible matches and it returns both of them: 'abc' and '' (or rather it 
returns the matching group within the match so you only see the 'c' 
although it matched 'abc').

If you use a regex which doesn't match an empty string (e.g. '/x(.*?)x/') 
then you won't get the empty match.
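
To see where each match sits, finditer exposes the match objects that
findall hides. A small sketch (the spans noted in the comments are what
a current Python 3 produces; older versions may differ in how they treat
empty matches):

```python
import re

matches = list(re.finditer(r'(.)*', 'abc'))

# Span of the whole match: the first covers 'abc', the second is the
# empty match at the very end of the string.
spans = [m.span() for m in matches]

# Group 1 holds only the last repetition; in the empty match the group
# never took part, so it is None (findall reports that as '').
groups = [m.group(1) for m in matches]

print(spans, groups)
```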
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2007-04-27 Thread proctor

On Apr 27, 8:37 am, Duncan Booth <[EMAIL PROTECTED]> wrote:
> proctor <[EMAIL PROTECTED]> wrote:
> > so my question remains, why doesn't the star quantifier seem to grab
> > all the data.  isn't findall() intended to return all matches?  i
> > would expect either 'abc' or 'a', 'b', 'c' or at least just
> > 'a' (because that would be the first match).  why does it give only
> > one letter, and at that, the /last/ letter in the sequence??
>
> findall returns the matched groups. You get one group for each
> parenthesised sub-expression, and (the important bit) if a single
> parenthesised expression matches more than once the group only contains
> the last string which matched it.
>
> Putting a star after a subexpression means that subexpression can match
> zero or more times, but each time it only matches a single character
> which is why your findall only returned the last character it matched.
>
> You need to move the * inside the parentheses used to define the group,
> then the group will match only once but will include everything that it
> matched.
>
> Consider:
>
> >>> re.findall('(.)', 'abc')
> ['a', 'b', 'c']
> >>> re.findall('(.)*', 'abc')
> ['c', '']
> >>> re.findall('(.*)', 'abc')
>
> ['abc', '']
>
> The first pattern finds a single character which findall manages to
> match 3 times.
>
> The second pattern finds a group with a single character zero or more
> times in the pattern, so the first time it matches each of a,b,c in turn
> and returns the c, and then next time around we get an empty string when
> group matched zero times.
>
> In the third pattern we are looking for a group with any number of
> characters in it. First time we get all of the string, then we get
> another empty match.

thank you this is interesting.  in the second example, where does the
'nothingness' match, at the end?  why does the regex 'run again' when
it has already matched everything?  and if it reports an empty match
along with a non-empty match, why only the two?

sincerely,
proctor

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2007-04-27 Thread proctor

On Apr 27, 8:26 am, Michael Hoffman <[EMAIL PROTECTED]> wrote:
> proctor wrote:
> > On Apr 27, 1:33 am, Paul McGuire <[EMAIL PROTECTED]> wrote:
> >> On Apr 27, 1:33 am, proctor <[EMAIL PROTECTED]> wrote:
> >>> rx_test = re.compile('/x([^x])*x/')
> >>> s = '/xabcx/'
> >>> if rx_test.findall(s):
> >>> print rx_test.findall(s)
> >>> 
> >>> i expect the output to be ['abc'] however it gives me only the last
> >>> single character in the group: ['c']
>
> >> As Josiah already pointed out, the * needs to be inside the grouping
> >> parens.
> > so my question remains, why doesn't the star quantifier seem to grab
> > all the data.
>
> Because you didn't use it *inside* the group, as has been said twice.
> Let's take a simpler example:
>
>  >>> import re
>  >>> text = "xabc"
>  >>> re_test1 = re.compile("x([^x])*")
>  >>> re_test2 = re.compile("x([^x]*)")
>  >>> re_test1.match(text).groups()
> ('c',)
>  >>> re_test2.match(text).groups()
> ('abc',)
>
> There are three places that match ([^x]) in text. But each time you find
> one you overwrite the previous example.
>
> > isn't findall() intended to return all matches?
>
> It returns all matches of the WHOLE pattern, /x([^x])*x/. Since you used
> a grouping parenthesis in there, it only returns one group from each
> pattern.
>
> Back to my example:
>
>  >>> re_test1.findall("xabcxaaaxabc")
> ['c', 'a', 'c']
>
> Here it finds multiple matches, but only because the x occurs multiple
> times as well. In your example there is only one match.
>
> > i would expect either 'abc' or 'a', 'b', 'c' or at least just
> > 'a' (because that would be the first match).
>
> You are essentially doing this:
>
> group1 = "a"
> group1 = "b"
> group1 = "c"
>
> After those three statements, you wouldn't expect group1 to be "abc" or
> "a". You'd expect it to be "c".
> --
> Michael Hoffman

ok, thanks michael.

so i am now assuming that either the book's example assumes perl, and
perl is different from python in this regard, or, that the book's
example is faulty.  i understand all the examples given since my
question, and i know what i need to do to make it work.  i am raising
the question because the book says one thing, but the example is not
working for me.  i am searching for the source of the discrepancy.

i will try to research the differences between perl's and python's
regex engines.

thanks again,

sincerely,
proctor

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2007-04-27 Thread Paul McGuire

On Apr 27, 9:10 am, proctor <[EMAIL PROTECTED]> wrote:
> On Apr 27, 1:33 am, Paul McGuire <[EMAIL PROTECTED]> wrote:
>
>
>
>
>
> > On Apr 27, 1:33 am, proctor <[EMAIL PROTECTED]> wrote:
>
> > > hello,
>
> > > i have a regex:  rx_test = re.compile('/x([^x])*x/')
>
> > > which is part of this test program:
>
> > > 
>
> > > import re
>
> > > rx_test = re.compile('/x([^x])*x/')
>
> > > s = '/xabcx/'
>
> > > if rx_test.findall(s):
> > > print rx_test.findall(s)
>
> > > 
>
> > > i expect the output to be ['abc'] however it gives me only the last
> > > single character in the group: ['c']
>
> > > C:\test>python retest.py
> > > ['c']
>
> > > can anyone point out why this is occurring?  i can capture the entire
> > > group by doing this:
>
> > > rx_test = re.compile('/x([^x]+)*x/')
> > > but why isn't the 'star' grabbing the whole group?  and why isn't each
> > > letter 'a', 'b', and 'c' present, either individually, or as a group
> > > (group is expected)?
>
> > > any clarification is appreciated!
>
> > > sincerely,
> > > proctor
>
> > As Josiah already pointed out, the * needs to be inside the grouping
> > parens.
>
> > Since re's do lookahead/backtracking, you can also write:
>
> > rx_test = re.compile('/x(.*?)x/')
>
> > The '?' is there to make sure the .* repetition stops at the first
> > occurrence of x/.
>
> > -- Paul
>
> i am working through an example from the oreilly book mastering
> regular expressions (2nd edition) by jeffrey friedl.  my post was a
> snippet from a regex to match C comments.   every 'x' in the regex
> represents a 'star' in actual usage, so that backslash escaping is not
> needed in the example (on page 275).  it looks like this:
>
> ===
>
> /x([^x]|x+[^/x])*x+/
>
> it is supposed to match '/x', the opening delimiter, then
>
> (
> either anything that is 'not x',
>
> or,
>
> 'x' one or more times, 'not followed by a slash or an x'
> ) any number of times (the 'star')
>
> followed finally by the closing delimiter.
>
> ===
>
> this does not seem to work in python the way i understand it should
> from the book, and i simplified the example in my first post to
> concentrate on just one part of the alternation that i felt was not
> acting as expected.
>
> so my question remains, why doesn't the star quantifier seem to grab
> all the data.  isn't findall() intended to return all matches?  i
> would expect either 'abc' or 'a', 'b', 'c' or at least just
> 'a' (because that would be the first match).  why does it give only
> one letter, and at that, the /last/ letter in the sequence??
>
> thanks again for replying!
>
> sincerely,
> proctor

Again, I'll repeat some earlier advice:  you need to move the '*'
inside the parens - you are still leaving it outside.  Also, get in
the habit of using raw literal notation (that is r"slkjdfljf" instead
of "lsjdlfkjs") when defining re strings - you don't have backslash
issues yet, but you will as soon as you start putting real '*'
characters in your expression.

However, when I test this,

restr = r'/x(([^x]|x+[^/])*)x+/'
re_ = re.compile(restr)
print re_.findall("/xabxxcx/ /x123xxx/")

findall now starts to give a tuple for each "comment",

[('abxxc', 'xxc'), ('123xx', 'xx')]

so you have gone beyond my limited re skill, and will need help from
someone else.

But I suggest you add some tests with multiple consecutive 'x'
characters in the middle of your comment, and multiple consecutive 'x'
characters before the trailing comment.  In fact, from my
recollections of trying to implement this type of comment recognizer
by hand a long time ago in a job far, far away, test with both even
and odd numbers of 'x' characters.

-- Paul

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2007-04-27 Thread Duncan Booth

proctor <[EMAIL PROTECTED]> wrote:

> so my question remains, why doesn't the star quantifier seem to grab
> all the data.  isn't findall() intended to return all matches?  i
> would expect either 'abc' or 'a', 'b', 'c' or at least just
> 'a' (because that would be the first match).  why does it give only
> one letter, and at that, the /last/ letter in the sequence??
> 
findall returns the matched groups. You get one group for each 
parenthesised sub-expression, and (the important bit) if a single 
parenthesised expression matches more than once the group only contains 
the last string which matched it.

Putting a star after a subexpression means that subexpression can match 
zero or more times, but each time it only matches a single character 
which is why your findall only returned the last character it matched.

You need to move the * inside the parentheses used to define the group, 
then the group will match only once but will include everything that it 
matched.

Consider:

>>> re.findall('(.)', 'abc')
['a', 'b', 'c']
>>> re.findall('(.)*', 'abc')
['c', '']
>>> re.findall('(.*)', 'abc')
['abc', '']

The first pattern finds a single character which findall manages to 
match 3 times.

The second pattern finds a group with a single character zero or more 
times in the string, so the first time it matches each of a,b,c in turn 
and returns the c, and then next time around we get an empty string when 
the group matched zero times.

In the third pattern we are looking for a group with any number of 
characters in it. First time we get all of the string, then we get 
another empty match.
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2007-04-27 Thread Michael Hoffman

proctor wrote:
> On Apr 27, 1:33 am, Paul McGuire <[EMAIL PROTECTED]> wrote:
>> On Apr 27, 1:33 am, proctor <[EMAIL PROTECTED]> wrote:

>>> rx_test = re.compile('/x([^x])*x/')
>>> s = '/xabcx/'
>>> if rx_test.findall(s):
>>> print rx_test.findall(s)
>>> 
>>> i expect the output to be ['abc'] however it gives me only the last
>>> single character in the group: ['c']
>
>> As Josiah already pointed out, the * needs to be inside the grouping
>> parens.

> so my question remains, why doesn't the star quantifier seem to grab
> all the data.

Because you didn't use it *inside* the group, as has been said twice. 
Let's take a simpler example:

 >>> import re
 >>> text = "xabc"
 >>> re_test1 = re.compile("x([^x])*")
 >>> re_test2 = re.compile("x([^x]*)")
 >>> re_test1.match(text).groups()
('c',)
 >>> re_test2.match(text).groups()
('abc',)

There are three places that match ([^x]) in text. But each time you find 
one you overwrite the previous one.
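
The same point as a short script, with a third option added for
collecting each character separately (the variable names are made up
for illustration):

```python
import re

text = "xabc"

# Star outside the group: the group is re-filled on every repetition,
# so only the last character survives.
last_only = re.match(r"x([^x])*", text).group(1)

# Star inside the group: the group matches once and holds the whole run.
whole_run = re.match(r"x([^x]*)", text).group(1)

# To get each character on its own, let findall do the repeating.
each_char = re.findall(r"[^x]", text)

print(last_only, whole_run, each_char)
```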

> isn't findall() intended to return all matches?

It returns all matches of the WHOLE pattern, /x([^x])*x/. Since you used 
a grouping parenthesis in there, it only returns one group from each 
match.

Back to my example:

 >>> re_test1.findall("xabcxaaaxabc")
['c', 'a', 'c']

Here it finds multiple matches, but only because the x occurs multiple 
times as well. In your example there is only one match.

> i would expect either 'abc' or 'a', 'b', 'c' or at least just
> 'a' (because that would be the first match).

You are essentially doing this:

group1 = "a"
group1 = "b"
group1 = "c"

After those three statements, you wouldn't expect group1 to be "abc" or 
"a". You'd expect it to be "c".
-- 
Michael Hoffman
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2007-04-27 Thread proctor

On Apr 27, 1:33 am, Paul McGuire <[EMAIL PROTECTED]> wrote:
> On Apr 27, 1:33 am, proctor <[EMAIL PROTECTED]> wrote:
>
>
>
> > hello,
>
> > i have a regex:  rx_test = re.compile('/x([^x])*x/')
>
> > which is part of this test program:
>
> > 
>
> > import re
>
> > rx_test = re.compile('/x([^x])*x/')
>
> > s = '/xabcx/'
>
> > if rx_test.findall(s):
> >     print(rx_test.findall(s))
>
> > 
>
> > i expect the output to be ['abc'] however it gives me only the last
> > single character in the group: ['c']
>
> > C:\test>python retest.py
> > ['c']
>
> > can anyone point out why this is occurring?  i can capture the entire
> > group by doing this:
>
> > rx_test = re.compile('/x([^x]+)*x/')
> > but why isn't the 'star' grabbing the whole group?  and why isn't each
> > letter 'a', 'b', and 'c' present, either individually, or as a group
> > (group is expected)?
>
> > any clarification is appreciated!
>
> > sincerely,
> > proctor
>
> As Josiah already pointed out, the * needs to be inside the grouping
> parens.
>
> Since re's do lookahead/backtracking, you can also write:
>
> rx_test = re.compile('/x(.*?)x/')
>
> The '?' is there to make sure the .* repetition stops at the first
> occurrence of x/.
>
> -- Paul

i am working through an example from the oreilly book mastering
regular expressions (2nd edition) by jeffrey friedl.  my post was a
snippet from a regex to match C comments.   every 'x' in the regex
represents a 'star' in actual usage, so that backslash escaping is not
needed in the example (on page 275).  it looks like this:

===

/x([^x]|x+[^/x])*x+/

it is supposed to match '/x', the opening delimiter, then

(
either anything that is 'not x',

or,

'x' one or more times, 'not followed by a slash or an x'
) any number of times (the 'star')

followed finally by the closing delimiter.

===

this does not seem to work in python the way i understand it should
from the book, and i simplified the example in my first post to
concentrate on just one part of the alternation that i felt was not
acting as expected.

so my question remains, why doesn't the star quantifier seem to grab
all the data.  isn't findall() intended to return all matches?  i
would expect either 'abc' or 'a', 'b', 'c' or at least just
'a' (because that would be the first match).  why does it give only
one letter, and at that, the /last/ letter in the sequence??
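
For reference, a hedged sketch of how the book's pattern might be written in Python with real stars (escaped as \*) instead of the 'x' placeholders. Making the group non-capturing with (?:...) is an assumption about what is wanted here: with a capturing group, findall() returns only the last repetition of that group rather than the whole comment. The sample strings are invented.

```python
import re

# The 'x' version from the post, with a capturing group: findall() reports
# only what the group held on its final repetition.
with_group = re.findall(r'/x([^x]|x+[^/x])*x+/', '/xabcx/')

# The same pattern with real stars and a non-capturing group, so findall()
# returns each whole match.
C_COMMENT = re.compile(r'/\*(?:[^*]|\*+[^/*])*\*+/')

source = "int a; /* one */ int b; /** two **/"
comments = C_COMMENT.findall(source)

print(with_group)  # ['c']
print(comments)    # ['/* one */', '/** two **/']
```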

thanks again for replying!

sincerely,
proctor

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2007-04-27 Thread Paul McGuire

On Apr 27, 1:33 am, proctor <[EMAIL PROTECTED]> wrote:
> hello,
>
> i have a regex:  rx_test = re.compile('/x([^x])*x/')
>
> which is part of this test program:
>
> 
>
> import re
>
> rx_test = re.compile('/x([^x])*x/')
>
> s = '/xabcx/'
>
> if rx_test.findall(s):
>     print(rx_test.findall(s))
>
> 
>
> i expect the output to be ['abc'] however it gives me only the last
> single character in the group: ['c']
>
> C:\test>python retest.py
> ['c']
>
> can anyone point out why this is occurring?  i can capture the entire
> group by doing this:
>
> rx_test = re.compile('/x([^x]+)*x/')
> but why isn't the 'star' grabbing the whole group?  and why isn't each
> letter 'a', 'b', and 'c' present, either individually, or as a group
> (group is expected)?
>
> any clarification is appreciated!
>
> sincerely,
> proctor

As Josiah already pointed out, the * needs to be inside the grouping
parens.

Since re's do lookahead/backtracking, you can also write:

rx_test = re.compile('/x(.*?)x/')

The '?' is there to make sure the .* repetition stops at the first
occurrence of x/.
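
A quick illustration of the difference the '?' makes, as a sketch with an invented test string that contains two delimited sections:

```python
import re

s = '/xabcx/ and /xdefx/'

# Non-greedy: .*? stops at the first place where 'x/' can follow.
lazy = re.findall(r'/x(.*?)x/', s)

# Greedy: .* runs to the end and backtracks to the LAST 'x/'.
greedy = re.findall(r'/x(.*)x/', s)

print(lazy)    # ['abc', 'def']
print(greedy)  # ['abcx/ and /xdef']
```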

-- Paul

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2007-04-27 Thread Josiah Carlson

proctor wrote:
> i have a regex:  rx_test = re.compile('/x([^x])*x/')

You probably want...

rx_test = re.compile('/x([^x]*)x/')


  - Josiah
-- 
http://mail.python.org/mailman/listinfo/python-list

regex question

2007-04-26 Thread proctor

hello,

i have a regex:  rx_test = re.compile('/x([^x])*x/')

which is part of this test program:



import re

rx_test = re.compile('/x([^x])*x/')

s = '/xabcx/'

if rx_test.findall(s):
    print(rx_test.findall(s))



i expect the output to be ['abc'] however it gives me only the last
single character in the group: ['c']

C:\test>python retest.py
['c']

can anyone point out why this is occurring?  i can capture the entire
group by doing this:

rx_test = re.compile('/x([^x]+)*x/')
but why isn't the 'star' grabbing the whole group?  and why isn't each
letter 'a', 'b', and 'c' present, either individually, or as a group
(group is expected)?

any clarification is appreciated!

sincerely,
proctor

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Regex Question

2007-01-18 Thread Bill Mill

Gabriel Genellina wrote:
> At Tuesday 16/1/2007 16:36, Bill  Mill wrote:
>
> > > py> import re
> > > py> rgx = re.compile('(1)?')
> > > py> rgx.search('a1').groups()
> > > (None,)
> > > py> rgx = re.compile('(1)+')
> > > py> rgx.search('a1').groups()
> >
> >But shouldn't the ? be greedy, and thus prefer the one match to the
> >zero? This is my sticking point - I've seen that plus works, and this
> >just confuses me more.
>
> Perhaps you have misunderstood what search does.
> search( pattern, string[, flags])
>  Scan through string looking for a location where the regular
> expression pattern produces a match
>
> '1?' means 0 or 1 times '1', i.e., nothing or a single '1'.
> At the start of the target string, 'a1', we have nothing, so the re
> matches, and returns that occurrence. It doesn't matter that a few
> characters later there is *another* match, even if it is longer; once
> a match is found, the scan is done.
> If you want "the longest match of all possible matches along the
> string", you should use findall() instead of search() and pick the
> longest of the results.
>

That is exactly what I misunderstood. Thank you very much.

-Bill Mill
bill.mill at gmail.com

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Regex Question

2007-01-17 Thread Gabriel Genellina

At Tuesday 16/1/2007 16:36, Bill  Mill wrote:

> py> import re
> py> rgx = re.compile('(1)?')
> py> rgx.search('a1').groups()
> (None,)
> py> rgx = re.compile('(1)+')
> py> rgx.search('a1').groups()

> But shouldn't the ? be greedy, and thus prefer the one match to the
> zero? This is my sticking point - I've seen that plus works, and this
> just confuses me more.

Perhaps you have misunderstood what search does.
search( pattern, string[, flags])
Scan through string looking for a location where the regular 
expression pattern produces a match

'1?' means 0 or 1 times '1', i.e., nothing or a single '1'.
At the start of the target string, 'a1', we have nothing, so the re 
matches, and returns that occurrence. It doesn't matter that a few 
characters later there is *another* match, even if it is longer; once 
a match is found, the scan is done.
If you want "the longest match of all possible matches along the 
string", you should use findall() instead of search() and pick the 
longest of the results.
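
A minimal sketch of that behaviour, assuming the pattern under discussion was meant to be '(1)?' with a group. search() stops at the first position where the pattern can match, which here is an empty match at index 0; collecting every match and taking the longest is one possible way to get the '1'. The variable names are illustrative only.

```python
import re

rgx = re.compile('(1)?')

# search() returns the leftmost match, which is empty because '(1)?' is
# allowed to match zero times in front of the 'a'.
first = rgx.search('a1')
first_span = first.span()
first_groups = first.groups()

# Collect every non-overlapping match and pick the longest one by hand.
all_matches = [m.group(0) for m in rgx.finditer('a1')]
longest = max(all_matches, key=len)

print(first_span)    # (0, 0)
print(first_groups)  # (None,)
print(longest)       # 1
```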

--
Gabriel Genellina
Softlab SRL 

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Regex Question

2007-01-16 Thread Bill Mill

James Stroud wrote:
> Bill Mill wrote:
> > Hello all,
> >
> > I've got a test script:
> >
> > ==== start python code ====
> >
> > import re
> >
> > tests2 = ["item1: alpha; item2: beta. item3 - gamma--",
> >           "item1: alpha; item3 - gamma--"]
> >
> > def test_re(regex):
> >     r = re.compile(regex, re.MULTILINE)
> >     for test in tests2:
> >         res = r.search(test)
> >         if res:
> >             print(res.groups())
> >         else:
> >             print("Failed")
> >
> > ==== end python code ====
> >
> > And a simple question:
> >
> > Why does the first regex that follows successfully grab "beta", while
> > the second one doesn't?
> >
> > In [131]: test_re(r"(?:item2: (.*?)\.)")
> > ('beta',)
> > Failed
> >
> > In [132]: test_re(r"(?:item2: (.*?)\.)?")
> > (None,)
> > (None,)
> >
> > Shouldn't the '?' greedily grab the group match?
> >
> > Thanks
> > Bill Mill
> > bill.mill at gmail.com
>
> The question mark matches zero or one occurrence. The first match will be
> an empty match at the start of the string, with nothing in the group, which
> satisfies the zero condition. Perhaps you mean "+"?
>
> e.g.
>
> py> import re
> py> rgx = re.compile('(1)?')
> py> rgx.search('a1').groups()
> (None,)
> py> rgx = re.compile('(1)+')
> py> rgx.search('a1').groups()

But shouldn't the ? be greedy, and thus prefer the one match to the
zero? This is my sticking point - I've seen that plus works, and this
just confuses me more.

-Bill Mill
bill.mill at gmail.com

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Regex Question

2007-01-12 Thread James Stroud

Bill Mill wrote:
> Hello all,
> 
> I've got a test script:
> 
> ==== start python code ====
> 
> import re
> 
> tests2 = ["item1: alpha; item2: beta. item3 - gamma--",
>           "item1: alpha; item3 - gamma--"]
> 
> def test_re(regex):
>     r = re.compile(regex, re.MULTILINE)
>     for test in tests2:
>         res = r.search(test)
>         if res:
>             print(res.groups())
>         else:
>             print("Failed")
> 
> ==== end python code ====
> 
> And a simple question:
> 
> Why does the first regex that follows successfully grab "beta", while
> the second one doesn't?
> 
> In [131]: test_re(r"(?:item2: (.*?)\.)")
> ('beta',)
> Failed
> 
> In [132]: test_re(r"(?:item2: (.*?)\.)?")
> (None,)
> (None,)
> 
> Shouldn't the '?' greedily grab the group match?
> 
> Thanks
> Bill Mill
> bill.mill at gmail.com

The question mark matches zero or one occurrence. The first match will be 
an empty match at the start of the string, with nothing in the group, which 
satisfies the zero condition. Perhaps you mean "+"?

e.g.

py> import re
py> rgx = re.compile('(1)?')
py> rgx.search('a1').groups()
(None,)
py> rgx = re.compile('(1)+')
py> rgx.search('a1').groups()
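
The same comparison as a standalone sketch, assuming the first pattern was meant to be '(1)?' with a group (without a group, groups() would return an empty tuple). The spans show where each search() match actually lands.

```python
import re

# Optional group: search() is satisfied by an empty match at index 0,
# so the group never takes part in the match.
optional = re.compile('(1)?').search('a1')

# One-or-more: an empty match is not allowed, so the scan has to move
# on to the '1' at index 1.
required = re.compile('(1)+').search('a1')

print(optional.span(), optional.groups())  # (0, 0) (None,)
print(required.span(), required.groups())  # (1, 2) ('1',)
```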

James
-- 
http://mail.python.org/mailman/listinfo/python-list

Regex Question

2007-01-10 Thread Bill Mill

Hello all,

I've got a test script:

==== start python code ====

import re

tests2 = ["item1: alpha; item2: beta. item3 - gamma--",
          "item1: alpha; item3 - gamma--"]

def test_re(regex):
    r = re.compile(regex, re.MULTILINE)
    for test in tests2:
        res = r.search(test)
        if res:
            print(res.groups())
        else:
            print("Failed")

==== end python code ====

And a simple question:

Why does the first regex that follows successfully grab "beta", while
the second one doesn't?

In [131]: test_re(r"(?:item2: (.*?)\.)")
('beta',)
Failed

In [132]: test_re(r"(?:item2: (.*?)\.)?")
(None,)
(None,)

Shouldn't the '?' greedily grab the group match?

Thanks
Bill Mill
bill.mill at gmail.com
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2007-01-08 Thread Mark Peters

> yes, i suppose you are right.  i can't think of a reason i would NEED a
> raw string in this situation.

It looks from your code that you are trying to remove every character of
one string from the other.  A simple regex way would be to use re.sub():

>>> import re
>>> a = "abc"
>>> b = "debcabbde"
>>> re.sub("[" + a + "]","",b)
'dede'
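
One caveat with building a character class by string concatenation is that characters such as ']', '^' or '-' in the first string would change the meaning of the class. A hedged variant using re.escape() is sketched below; the helper name remove_chars is made up for illustration.

```python
import re

def remove_chars(chars, text):
    """Delete every character of `chars` from `text` using a character class."""
    if not chars:
        return text  # "[]" on its own is not a valid pattern
    # re.escape() keeps characters such as ']', '^', '-' and '\' literal.
    return re.sub("[" + re.escape(chars) + "]", "", text)

print(remove_chars("abc", "debcabbde"))  # dede
print(remove_chars("^-]", "a^b-c]d"))    # abcd
```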

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: regex question

2007-01-08 Thread proctor


Paul McGuire wrote:
> "proctor" <[EMAIL PROTECTED]> wrote in message
> news:[EMAIL PROTECTED]
> >
> >
> > it does work now...however, one more question:  when i type:
> >
> > rx_a = re.compile(r'a|b|c')
> > it works correctly!
> >
>
> Do you see the difference between:
>
> rx_a = re.compile(r'a|b|c')
>
> and
>
> rx_a = re.compile("r'a|b|c'")
>
> There is no difference in the variable datatype between "string" and "raw
> string".  Raw strings are just a notational helper when creating string
> literals that have lots of backslashes in them (as happens a lot with
> regexps).
>
> r'a|b|c'  is the same as 'a|b|c'
> r'\d' is the same as '\\d'
>
> There is no reason to "add raw strings" to your makeRE method, since you
> don't have a single backslash anywhere.  And even if there were a backslash
> in the 'w' argument, it is just a string - no need to treat it differently.
> 
> -- Paul
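
A short sketch of the point about raw strings, with an invented test string: the r prefix only affects how the literal is written in source code, while putting r'...' inside another string makes the r and the quotes part of the pattern itself.

```python
import re

# A raw-string literal and a plain literal both produce ordinary str objects.
assert r'a|b|c' == 'a|b|c'
assert r'\d' == '\\d'
assert len(r'\d') == 2

good = re.compile(r'a|b|c')

# Here the r and the quotes are characters INSIDE the pattern, so the
# three alternatives are "r'a", "b" and "c'".
odd = re.compile("r'a|b|c'")

good_hits = good.findall('abc')
odd_hits = odd.findall('abc')
odd_full = odd.findall("r'a b c'")

print(good_hits)  # ['a', 'b', 'c']
print(odd_hits)   # ['b']
print(odd_full)   # ["r'a", 'b', "c'"]
```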

thanks paul.  this helps.

proctor.

-- 
http://mail.python.org/mailman/listinfo/python-list
