Re: Regexes: How to handle escaped characters

2007-05-18 Thread Torsten Bronger
Hallöchen!

John Machin writes:

 On May 18, 6:00 am, Torsten Bronger [EMAIL PROTECTED]
 wrote:

 [...]

 Example string: u"Hollo", escaped positions: [4].  Thus, the
 second "o" is escaped and must not be found by the regexp
 searches.

 Instead of re.search, I call the function guarded_search(pattern,
 text, offset) which takes care of escaped characters.  Thus, while

 re.search("o$", string)

 will find the second "o",

 guarded_search("o$", string, 0)

 Huh? Did you mean 4 instead of zero?

No, the offset parameter is like the pos parameter in the search
method of regular expression objects.  It's like

guarded_search("o$", string[offset:])

Actually, my real guarded_search has an endpos parameter, too.

 [...]

 Quite apart from the confusing use of escape, your requirements are
 still as clear as mud. Try writing up docs for your guarded_search
 function.

Note that I don't want to add functionality to the stdlib, I just
want to solve my tiny annoying problem.  Okay, here is a more
complete story:

I've specified a simple text document syntax, like reStructuredText,
Wikimedia, LaTeX or whatever.  I already have a preprocessor for it;
now I am trying to implement the parser.  A section heading looks like
this:

Introduction
============

Thus, my parser searches (among many other things) for
r"\n\s*={4,}\s*$".  However, the author can escape any character
with a backslash:

Introduction      or      Introduction
\===========              ====\=======

This means the first (or fifth) equals sign is a literal equals sign
and not part of a heading underline.  This must not be
interpreted as the start of a section.  The preprocessor generates
u"===========" with escaped_positions=[0].  (Or [4], in the
right-hand case.)

This is why I cannot use normal search methods.
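
A minimal sketch of the setup described above, under stated assumptions: the helper names strip_escapes and find_headings are hypothetical, escaped positions are counted relative to the whole text rather than to the underline alone, and re.MULTILINE is assumed so that "$" can match at the end of a line. It only illustrates how a preprocessor could record escaped positions and how heading matches touching them could be discarded; it is not the actual parser.

```python
import re

def strip_escapes(raw):
    """Remove backslash escapes; return (clean_text, escaped_positions)."""
    clean = []
    escaped = []
    i = 0
    while i < len(raw):
        if raw[i] == "\\" and i + 1 < len(raw):
            # Record where the escaped character lands in the clean text.
            escaped.append(len(clean))
            clean.append(raw[i + 1])
            i += 2
        else:
            clean.append(raw[i])
            i += 1
    return "".join(clean), escaped

# The heading pattern from the message; MULTILINE is an assumption.
HEADING = re.compile(r"\n\s*={4,}\s*$", re.MULTILINE)

def find_headings(raw):
    """Return heading-underline matches that contain no escaped character."""
    text, escaped = strip_escapes(raw)
    return [m for m in HEADING.finditer(text)
            if not any(m.start() <= e < m.end() for e in escaped)]

plain = find_headings("Introduction\n============\n")       # one match
spoiled = find_headings("Introduction\n====\\=======\n")    # no match
```

The unescaped underline is reported as a heading, while the underline containing an escaped equals sign is not.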

 [...]

 Whatever your exact requirement, it would seem unlikely to be so
 wildly popularly demanded as to warrant inclusion in the regexp
 machine. You would have to write your own wrapper, something like
 the following totally-untested example of one possible
 implementation of one possible guess at what you mean:

 import re
 def guarded_search(pattern, text, forbidden_offsets, overlap=False):
     regex = re.compile(pattern)
     pos = 0
     while True:
         m = regex.search(text, pos)
         if not m:
             return
         start, end = m.span()
         for bad_pos in forbidden_offsets:
             if start <= bad_pos < end:
                 break
         else:
             yield m
         if overlap:
             pos = start + 1
         else:
             pos = end
 8<---

This is similar to my current approach, however, it also finds too
many ^a patterns because it starts a fresh search at different
positions.

Tschö,
Torsten.

--
Torsten Bronger, aquisgrana, europa vetus
  Jabber ID: [EMAIL PROTECTED]
  (See http://ime.webhop.org for ICQ, MSN, etc.)
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Regexes: How to handle escaped characters

2007-05-18 Thread Charles Sanders
Torsten Bronger wrote:
 Hallöchen!
[...]

 Example string: u"Hollo", escaped positions: [4].  Thus, the
 second "o" is escaped and must not be found by the regexp
 searches.

 Instead of re.search, I call the function guarded_search(pattern,
 text, offset) which takes care of escaped characters.  Thus, while

 
 Tschö,
 Torsten.

I'm still pretty much a beginner, and I am not sure
of the exact requirements, but the following seems to work
for at least simple cases when overlapping matches are not
considered.

import re

def guarded_search( pattern, text, exclude ):
    # keep only the matches that cover none of the excluded positions
    return [ m for m in re.finditer(pattern,text)
             if not [ e for e in exclude if m.start() <= e < m.end() ] ]

txt = "axbycz"
exc = [ 3 ]  # "y"
pat = "[xyz]"
mtch = guarded_search(pat,txt,exc)
print("Guarded search text='%s' excluding %s %s matches" %
      (txt,exc,len(mtch)))
for m in mtch:
    print("%s at %s" % (m.group(), m.start()))

txt = "Hollo"
exc = [ 4 ]  # Final "o"
pat = "o$"
mtch = guarded_search(pat,txt,exc)
print("Guarded search text='%s' excluding %s %s matches" %
      (txt,exc,len(mtch)))
for m in mtch:
    print("%s at %s" % (m.group(), m.start()))

Guarded search text='axbycz' excluding [3] 2 matches
x at 1
z at 5
Guarded search text='Hollo' excluding [4] 0 matches


Simply finds all the (non-overlapping) matches and rejects any
that include one of the excluded columns (the y in the first
case and the final o in the second).
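
One caveat of this filter approach may be worth checking against the patterns in use. The sketch below (same function, written with any(); the sample pattern and string are made up for illustration) shows that when a greedy match is rejected, finditer does not go back and try a shorter match that would have avoided the excluded position.

```python
import re

def guarded_search(pattern, text, exclude):
    """Filter approach: keep only matches that touch no excluded index."""
    return [m for m in re.finditer(pattern, text)
            if not any(m.start() <= e < m.end() for e in exclude)]

# Greedy "a+" first matches "aa" at span (0, 2). That match is rejected
# because it covers index 0, and finditer resumes at index 2, so the
# shorter "a" at index 1 is never offered as a candidate.
spans = [m.span() for m in guarded_search("a+", "aab", [0])]
print(spans)  # prints []
```

Whether this matters depends on whether an escaped character should merely spoil the match that contains it or should act as a barrier that the pattern matches around.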

Charles


Re: Regexes: How to handle escaped characters

2007-05-18 Thread Torsten Bronger
Hallöchen!

Charles Sanders writes:

 Torsten Bronger wrote:

 [...]

 Example string: u"Hollo", escaped positions: [4].  Thus, the
 second "o" is escaped and must not be found by the regexp
 searches.

 Instead of re.search, I call the function guarded_search(pattern,
 text, offset) which takes care of escaped characters.  Thus, while

   I'm still pretty much a beginner, and I am not sure
 of the exact requirements, but the following seems to work
 for at least simple cases when overlapping matches are not
 considered.

 def guarded_search( pattern, text, exclude ):
     return [ m for m in re.finditer(pattern,text)
              if not [ e for e in exclude if m.start() <= e < m.end() ] ]

Yes, this seems to do the trick, thank you!

Tschö,
Torsten.



Regexes: How to handle escaped characters

2007-05-17 Thread Torsten Bronger
Hallöchen!

I need some help with finding matches in a string that has some
characters which are marked as escaped (in a separate list of
indices).  Escaped means that they must not be part of any match.

My current approach is to look for matches in substrings with the
escaped characters as boundaries between the substrings.  However,
then ^ and $ in the patterns are treated wrongly.  (Although I use
startpos and endpos parameters for this and no slicing.)
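
A minimal sketch of the anchor behaviour described above, assuming the standard re module (where the two parameters are called pos and endpos); the sample string and patterns are only illustrative.

```python
import re

text = "Hollo"

# endpos makes the search behave as if the string ended there,
# so "$" matches at endpos even though the real string goes on.
end_match = re.compile("o$").search(text, 0, 2)    # looks at "Ho" only

# pos does not move the start anchor: "^" still refers to the real
# beginning of the string (or a position just after a newline).
start_match = re.compile("^l").search(text, 2)

# Slicing, by contrast, makes both anchors relative to the slice.
slice_match = re.search("^l", text[2:])

print(end_match.span())     # prints (1, 2)
print(start_match)          # prints None
print(slice_match.span())   # prints (0, 1)
```

So splitting the text at escaped positions changes what "$" means whichever way the pieces are delimited, and slicing changes "^" as well.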

Another idea was to have a special unicode character that never
takes part in a match.  The docs are not very promising regarding
such a thing, or did I miss something?

Any other ideas?

Tschö,
Torsten.



Re: Regexes: How to handle escaped characters

2007-05-17 Thread James Stroud
Torsten Bronger wrote:
 Hallöchen!
 
 I need some help with finding matches in a string that has some
 characters which are marked as escaped (in a separate list of
 indices).  Escaped means that they must not be part of any match.
 
 My current approach is to look for matches in substrings with the
 escaped characters as boundaries between the substrings.  However,
 then ^ and $ in the patterns are treated wrongly.  (Although I use
 startpos and endpos parameters for this and no slicing.)
 
 Another idea was to have a special unicode character that never
 takes part in a match.  The docs are not very promising regarding
 such a thing, or did I miss something?
 
 Any other ideas?
 
 Tschö,
 Torsten.
 

You should probably provide examples of what you are trying to do or you 
will likely get a lot of irrelevant answers.

James


Re: Regexes: How to handle escaped characters

2007-05-17 Thread Torsten Bronger
Hallöchen!

James Stroud writes:

 Torsten Bronger wrote:

 I need some help with finding matches in a string that has some
 characters which are marked as escaped (in a separate list of
 indices).  Escaped means that they must not be part of any match.

 [...]

 You should probably provide examples of what you are trying to do
 or you will likely get a lot of irrelevant answers.

Example string: u"Hollo", escaped positions: [4].  Thus, the second
"o" is escaped and must not be found by the regexp searches.

Instead of re.search, I call the function guarded_search(pattern,
text, offset) which takes care of escaped characters.  Thus, while

re.search("o$", string)

will find the second "o",

guarded_search("o$", string, 0)

won't find anything.  But how to program guarded_search?
Actually, it is about changing the semantics of the regexp syntax:
"." no longer means "any character except newline" but "any
character except newline and characters marked as escaped".  And so
on, for all syntax elements of regular expressions.  Escaped
characters must spoil any match; however, the regexp machine should
continue to search for other matches.

Tschö,
Torsten.



Re: Regexes: How to handle escaped characters

2007-05-17 Thread James Stroud
Torsten Bronger wrote:
 Hallöchen!
 
 James Stroud writes:
 
 
Torsten Bronger wrote:


I need some help with finding matches in a string that has some
characters which are marked as escaped (in a separate list of
indices).  Escaped means that they must not be part of any match.

[...]

You should probably provide examples of what you are trying to do
or you will likely get a lot of irrelevant answers.
 
 
 Example string: u"Hollo", escaped positions: [4].  Thus, the second
 "o" is escaped and must not be found by the regexp searches.

 Instead of re.search, I call the function guarded_search(pattern,
 text, offset) which takes care of escaped characters.  Thus, while

 re.search("o$", string)

 will find the second "o",

 guarded_search("o$", string, 0)

 won't find anything.  But how to program guarded_search?
 Actually, it is about changing the semantics of the regexp syntax:
 "." no longer means "any character except newline" but "any
 character except newline and characters marked as escaped".  And so
 on, for all syntax elements of regular expressions.  Escaped
 characters must spoil any match; however, the regexp machine should
 continue to search for other matches.
 
 Tschö,
 Torsten.
 

You will probably need to implement your own findall, etc., but this 
seems to do it for search:

def guarded_search(rgx, astring, escaped):
    m = re.search(rgx, astring)
    if m:
        s = m.start()
        e = m.end()
        for i in escaped:
            if s <= i <= e:
                m = None
                break
    return m


Here it is in use:

py> def guarded_search(rgx, astring, escaped):
...     m = re.search(rgx, astring)
...     if m:
...         s = m.start()
...         e = m.end()
...         for i in escaped:
...             if s <= i <= e:
...                 m = None
...                 break
...     return m
...
py> import re
py> escaped = [1, 5, 15]
py> print guarded_search('abc', 'xyzabcxyz', escaped)
None
py> print guarded_search('abc', 'xyzxyzabcxyz', escaped)
<_sre.SRE_Match object at 0x40379720>

James


Re: Regexes: How to handle escaped characters

2007-05-17 Thread John Machin
On May 18, 6:00 am, Torsten Bronger [EMAIL PROTECTED]
wrote:
 Hallöchen!

 James Stroud writes:
  Torsten Bronger wrote:

  I need some help with finding matches in a string that has some
  characters which are marked as escaped (in a separate list of
  indices).  Escaped means that they must not be part of any match.

  [...]

  You should probably provide examples of what you are trying to do
  or you will likely get a lot of irrelevant answers.

 Example string: u"Hollo", escaped positions: [4].  Thus, the second
 "o" is escaped and must not be found by the regexp searches.

 Instead of re.search, I call the function guarded_search(pattern,
 text, offset) which takes care of escaped characters.  Thus, while

 re.search("o$", string)

 will find the second "o",

 guarded_search("o$", string, 0)

Huh? Did you mean 4 instead of zero?


 won't find anything.

Quite apart from the confusing use of escape, your requirements are
still as clear as mud. Try writing up docs for your guarded_search
function. Supply test cases showing what you expect to match and what
you don't expect to match. Is offset the offset in the text? If so,
don't you really want a set of forbidden offsets, not just one?

  But how to program guarded_search?
 Actually, it is about changing the semantics of the regexp syntax:
 "." no longer means "any character except newline" but "any
 character except newline and characters marked as escaped".

Make up your mind whether you are escaping characters [likely to be
interpreted by many people as position-independent] or escaping
positions within the text.

  And so
 on, for all syntax elements of regular expressions.  Escaped
 characters must spoil any match, however, the regexp machine should
 continue to search for other matches.


Whatever your exact requirement, it would seem unlikely to be so
wildly popularly demanded as to warrant inclusion in the regexp
machine. You would have to write your own wrapper, something like the
following totally-untested example of one possible implementation of
one possible guess at what you mean:

import re
def guarded_search(pattern, text, forbidden_offsets, overlap=False):
    regex = re.compile(pattern)
    pos = 0
    while True:
        m = regex.search(text, pos)
        if not m:
            return
        start, end = m.span()
        for bad_pos in forbidden_offsets:
            if start <= bad_pos < end:
                break
        else:
            yield m
        if overlap:
            pos = start + 1
        else:
            pos = end
8<---

HTH,
John



Re: Regexes: How to handle escaped characters

2007-05-17 Thread John Machin
On May 18, 6:50 am, James Stroud [EMAIL PROTECTED] wrote:

 def guarded_search(rgx, astring, escaped):
     m = re.search(rgx, astring)
     if m:
         s = m.start()
         e = m.end()
         for i in escaped:
             if s <= i <= e:

Did you mean to write

             if s <= i < e:

?


                 m = None
                 break
     return m


Your guarded search fails if there is a match after the rightmost bad
position i.e. it gives up at the first bad position.

My guarded_search (see my separate post) needs the following done to
it:
1. make a copy
2. change name of copy to guarded_searchall or something similar
3. change yield to return in the original
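
The steps above can be sketched roughly as follows. This is one possible reading, not the code from the other post: it keeps the generator under the name guarded_searchall and, instead of a second copy that uses return, builds guarded_search on top of it with next(). The guard against empty matches is an addition that the original did not have.

```python
import re

def guarded_searchall(pattern, text, forbidden_offsets, overlap=False):
    """Yield each match whose span contains no forbidden offset."""
    regex = re.compile(pattern)
    forbidden = set(forbidden_offsets)
    pos = 0
    while pos <= len(text):
        m = regex.search(text, pos)
        if not m:
            return
        start, end = m.span()
        if not any(start <= bad < end for bad in forbidden):
            yield m
        # Always advance, so an empty match cannot loop forever.
        pos = start + 1 if overlap or end == start else end

def guarded_search(pattern, text, forbidden_offsets):
    """Return the first acceptable match, or None if there is none."""
    return next(guarded_searchall(pattern, text, forbidden_offsets), None)

# A match after a bad position is still found: the "o" at index 1 is
# rejected, the "o" at index 4 is returned.
later = guarded_search("o", "Hollo", [1])
print(later.span())  # prints (4, 5)
```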

Cheers,
John



Re: Regexes: How to handle escaped characters

2007-05-17 Thread Paul McGuire
On May 17, 4:06 pm, John Machin [EMAIL PROTECTED] wrote:
 On May 18, 6:00 am, Torsten Bronger [EMAIL PROTECTED]
 wrote:





  Hallöchen!

  James Stroud writes:
   Torsten Bronger wrote:

   I need some help with finding matches in a string that has some
   characters which are marked as escaped (in a separate list of
   indices).  Escaped means that they must not be part of any match.

   [...]

   You should probably provide examples of what you are trying to do
   or you will likely get a lot of irrelevant answers.

  Example string: u"Hollo", escaped positions: [4].  Thus, the second
  "o" is escaped and must not be found by the regexp searches.

  Instead of re.search, I call the function guarded_search(pattern,
  text, offset) which takes care of escaped characters.  Thus, while

  re.search("o$", string)

  will find the second "o",

  guarded_search("o$", string, 0)

 Huh? Did you mean 4 instead of zero?



  won't find anything.

 Quite apart from the confusing use of escape, your requirements are
 still as clear as mud. Try writing up docs for your guarded_search
 function. Supply test cases showing what you expect to match and what
 you don't expect to match. Is offset the offset in the text? If so,
 don't you really want a set of forbidden offsets, not just one?

   But how to program guarded_search?
  Actually, it is about changing the semantics of the regexp syntax:
  "." no longer means "any character except newline" but "any
  character except newline and characters marked as escaped".

 Make up your mind whether you are escaping characters [likely to be
 interpreted by many people as position-independent] or escaping
 positions within the text.

   And so
  on, for all syntax elements of regular expressions.  Escaped
  characters must spoil any match, however, the regexp machine should
  continue to search for other matches.

 Whatever your exact requirement, it would seem unlikely to be so
 wildly popularly demanded as to warrant inclusion in the regexp
 machine. You would have to write your own wrapper, something like the
 following totally-untested example of one possible implementation of
 one possible guess at what you mean:

 import re
 def guarded_search(pattern, text, forbidden_offsets, overlap=False):
     regex = re.compile(pattern)
     pos = 0
     while True:
         m = regex.search(text, pos)
         if not m:
             return
         start, end = m.span()
         for bad_pos in forbidden_offsets:
             if start <= bad_pos < end:
                 break
         else:
             yield m
         if overlap:
             pos = start + 1
         else:
             pos = end
 8<---

 HTH,
 John

Here are two pyparsing-based routines, guardedSearch and
guardedSearchByColumn.  The first uses a pyparsing parse action to
reject matches at a given string location, and returns a list of
tuples containing the string location and matched text.  The second
uses an enhanced version of guardedSearch that uses the pyparsing
built-ins col and lineno to filter matches by column instead of by raw
string location, and returns a list of tuples of line and column of
the match location, and the matching text.  (Note that string
locations are zero-based, while line and column numbers are 1-based.)

-- Paul


from pyparsing import Regex,ParseException,col,lineno

def guardedSearch(pattern, text, forbidden_offsets):

    def offsetValidator(strng,locn,tokens):
        if locn in forbidden_offsets:
            raise ParseException, "can't match at offset %d" % locn

    regex = Regex(pattern).setParseAction(offsetValidator)
    return [ (tokStart,toks[0]) for toks,tokStart,tokEnd in
             regex.scanString(text) ]

print guardedSearch(u"o", u"Hollo how are you", [4,])


def guardedSearchByColumn(pattern, text, forbidden_columns):

    def offsetValidator(strng,locn,tokens):
        if col(locn,strng) in forbidden_columns:
            raise ParseException, "can't match at offset %d" % locn

    regex = Regex(pattern).setParseAction(offsetValidator)
    return [ (lineno(tokStart,text),col(tokStart,text),toks[0])
             for toks,tokStart,tokEnd in regex.scanString(text) ]

text = """\
alksjdflasjf;sa
a;sljflsjlaj
;asjflasfja;sf
aslfj;asfj;dsf
aslf;lajdf;ajsf
aslfj;afsj;sd
"""

print guardedSearchByColumn(";", text, [1,6,11,])

Prints:
[(1, 'o'), (7, 'o'), (15, 'o')]
[(1, 13, ';'), (2, 2, ';'), (3, 12, ';'), (5, 5, ';')]



Re: Regexes: How to handle escaped characters

2007-05-17 Thread John Machin
On May 18, 8:16 am, Paul McGuire [EMAIL PROTECTED] wrote:
 On May 17, 4:06 pm, John Machin [EMAIL PROTECTED] wrote:



  On May 18, 6:00 am, Torsten Bronger [EMAIL PROTECTED]
  wrote:

   Hallöchen!

   James Stroud writes:
Torsten Bronger wrote:

I need some help with finding matches in a string that has some
characters which are marked as escaped (in a separate list of
indices).  Escaped means that they must not be part of any match.

Note: must not be *part of* any match [my emphasis]

[big snip]

 Here are two pyparsing-based routines, guardedSearch and
 guardedSearchByColumn.  The first uses a pyparsing parse action to
 reject matches at a given string location

Seems to be somewhat less like what the OP might have in mind ...

While we're waiting for clarification from the OP, there's a chicken-
and-egg thought that's been nagging me: if the OP knows so much about
the searched string that he can specify offsets which search patterns
should not span, why does he still need to search it?

Cheers,
John



Re: Regexes: How to handle escaped characters

2007-05-17 Thread Paul McGuire
On May 17, 6:12 pm, John Machin [EMAIL PROTECTED] wrote:

 Note: must not be *part of* any match [my emphasis]

Ooops, my bad.  See this version:

from pyparsing import Regex,ParseException,col,lineno,getTokensEndLoc

# fake (and inefficient) version of any if not yet upgraded to Py2.5
any = lambda lst : sum(list(lst)) > 0

def guardedSearch(pattern, text, forbidden_offsets):

    def offsetValidator(strng,locn,tokens):
        start,end = locn,getTokensEndLoc()-1
        if any( start <= i <= end for i in forbidden_offsets ):
            raise ParseException, "can't match at offset %d" % locn

    regex = Regex(pattern).setParseAction(offsetValidator)
    return [ (tokStart,toks[0]) for toks,tokStart,tokEnd in
             regex.scanString(text) ]

print guardedSearch(ur"o\S", u"Hollo how are you", [8,])


def guardedSearchByColumn(pattern, text, forbidden_columns):

    def offsetValidator(strng,locn,tokens):
        start,end = col(locn,strng), col(getTokensEndLoc(),strng)-1
        if any( start <= i <= end for i in forbidden_columns ):
            raise ParseException, "can't match at col %d" % start

    regex = Regex(pattern).setParseAction(offsetValidator)
    return [ (lineno(tokStart,text),col(tokStart,text),toks[0])
             for toks,tokStart,tokEnd in regex.scanString(text) ]

text = """\
alksjdflasjf;sa
a;sljflsjlaj
;asjflasfja;sf
aslfj;asfj;dsf
aslf;lajdf;ajsf
aslfj;afsj;sd
"""

print guardedSearchByColumn("[fa];", text, [4,12,13,])

Prints:
[(1, 'ol'), (15, 'ou')]
[(2, 1, 'a;'), (5, 10, 'f;')]


 While we're waiting for clarification from the OP, there's a chicken-
 and-egg thought that's been nagging me: if the OP knows so much about
 the searched string that he can specify offsets which search patterns
 should not span, why does he still need to search it?

I suspect that this is column/tabular data (a log file perhaps?), and
some columns are not interesting, but produce many false hits for the
search pattern.

-- Paul



Re: Regexes: How to handle escaped characters

2007-05-17 Thread John Machin
On May 18, 9:46 am, Paul McGuire [EMAIL PROTECTED] wrote:
 On May 17, 6:12 pm, John Machin [EMAIL PROTECTED] wrote:

  Note: must not be *part of* any match [my emphasis]


  While we're waiting for clarification from the OP, there's a chicken-
  and-egg thought that's been nagging me: if the OP knows so much about
  the searched string that he can specify offsets which search patterns
  should not span, why does he still need to search it?

 I suspect that this is column/tabular data (a log file perhaps?), and
 some columns are not interesting, but produce many false hits for the
 search pattern.


If so, why not split the record into fields and look only at the
interesting fields? Smells to me of yet another case of re abuse/
misuse ...

