Re: Regexes: How to handle escaped characters
Hi there!

John Machin writes:

> On May 18, 6:00 am, Torsten Bronger [EMAIL PROTECTED] wrote:
>
>> [...]
>>
>> Example string: u"Hollo", escaped positions: [4]. Thus, the second "o" is escaped and must not be found by the regexp searches.
>>
>> Instead of re.search, I call the function guarded_search(pattern, text, offset) which takes care of escaped characters. Thus, while
>>
>>     re.search("o$", string)
>>
>> will find the second "o",
>>
>>     guarded_search("o$", string, 0)
>
> Huh? Did you mean 4 instead of zero?

No, the offset parameter is like the pos parameter in the search method of regular expression objects. It's like

    guarded_search("o$", string[offset:])

Actually, my real guarded_search even has an endpos parameter, too.

> [...]
>
> Quite apart from the confusing use of "escape", your requirements are still as clear as mud. Try writing up docs for your guarded_search function.

Note that I don't want to add functionality to the stdlib, I just want to solve my tiny annoying problem. Okay, here is a more complete story:

I've specified a simple text document syntax, like reStructuredText, Wikimedia, LaTeX or whatever. I already have a preprocessor for it, now I try to implement the parser. A sectioning heading looks like this:

    Introduction
    ============

Thus, my parser searches (among many other things) for r"\n\s*={4,}\s*$".

However, the author can escape any character with a backslash:

    Introduction          or          Introduction
    \============                     ====\========

This means the first (or fifth) equals sign is a literal equals sign and not part of a heading underline. This must not be interpreted as the beginning of a section. The preprocessor generates u"============" with escaped_positions=[0]. (Or [4], in the righthand case.) This is why I cannot use normal search methods.

> [...]
>
> Whatever your exact requirement, it would seem unlikely to be so wildly popularly demanded as to warrant inclusion in the regexp machine. You would have to write your own wrapper, something like the following totally-untested example of one possible implementation of one possible guess at what you mean:
>
> import re
> def guarded_search(pattern, text, forbidden_offsets, overlap=False):
>     regex = re.compile(pattern)
>     pos = 0
>     while True:
>         m = regex.search(text, pos)
>         if not m:
>             return
>         start, end = m.span()
>         for bad_pos in forbidden_offsets:
>             if start <= bad_pos < end:
>                 break
>         else:
>             yield m
>         if overlap:
>             pos = start + 1
>         else:
>             pos = end
> 8<---

This is similar to my current approach, however, it also finds too many "^a" patterns because it starts a fresh search at different positions.

Bye,
Torsten.

--
Torsten Bronger, aquisgrana, europa vetus
Jabber ID: [EMAIL PROTECTED]
(See http://ime.webhop.org for ICQ, MSN, etc.)
--
http://mail.python.org/mailman/listinfo/python-list
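The heading case described in this message can be sketched as follows. This is only an illustration under stated assumptions: Python 3, escaped positions given as offsets into the very string being searched, and `re.MULTILINE` switched on so that `$` can match at the end of the underline line (the thread does not say which flags the real parser uses). The names `HEADING_UNDERLINE` and `find_underlines` are invented for this example.

```python
import re

# Pattern quoted in the message; MULTILINE is an assumption of this sketch.
HEADING_UNDERLINE = re.compile(r"\n\s*={4,}\s*$", re.MULTILINE)

def find_underlines(text, escaped_positions):
    """Return spans of heading underlines that contain no escaped offset."""
    escaped = set(escaped_positions)
    hits = []
    for m in HEADING_UNDERLINE.finditer(text):
        # Reject the match if any escaped offset lies inside its span.
        if not any(m.start() <= pos < m.end() for pos in escaped):
            hits.append(m.span())
    return hits

plain = "Introduction\n============\nBody text"
print(find_underlines(plain, []))    # underline found
print(find_underlines(plain, [13]))  # first "=" escaped: nothing found
```

Here offset 13 is the first equals sign of the whole string, so the offsets differ from the `[0]` in the message, which counts from the start of the underline alone.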
Re: Regexes: How to handle escaped characters
Torsten Bronger wrote:
> Hi there!
>
> [...]
>
> Example string: u"Hollo", escaped positions: [4]. Thus, the second "o" is escaped and must not be found by the regexp searches.
>
> Instead of re.search, I call the function guarded_search(pattern, text, offset) which takes care of escaped characters. Thus, while
>
> Bye,
> Torsten.

I'm still pretty much a beginner, and I am not sure of the exact requirements, but the following seems to work for at least simple cases when overlapping matches are not considered.

    import re

    def guarded_search( pattern, text, exclude ):
        # keep only the matches whose span contains no excluded offset
        return [ m for m in re.finditer(pattern,text)
                 if not [ e for e in exclude if m.start() <= e < m.end() ] ]

    txt = "axbycz"
    exc = [ 3 ]  # "y"
    pat = "[xyz]"
    mtch = guarded_search(pat,txt,exc)
    print "Guarded search text='%s' excluding %s %s matches" % (txt,exc,len(mtch))
    for m in mtch:
        print m.group(), 'at', m.start()

    txt = "Hollo"
    exc = [ 4 ]  # Final "o"
    pat = "o$"
    mtch = guarded_search(pat,txt,exc)
    print "Guarded search text='%s' excluding %s %s matches" % (txt,exc,len(mtch))
    for m in mtch:
        print m.group(), 'at', m.start()

This prints:

    Guarded search text='axbycz' excluding [3] 2 matches
    x at 1
    z at 5
    Guarded search text='Hollo' excluding [4] 0 matches

It simply finds all the (non-overlapping) matches and rejects any that include one of the excluded columns (the "y" in the first case and the final "o" in the second).

Charles
Re: Regexes: How to handle escaped characters
Hi there!

Charles Sanders writes:

> Torsten Bronger wrote:
>
>> [...]
>>
>> Example string: u"Hollo", escaped positions: [4]. Thus, the second "o" is escaped and must not be found by the regexp searches.
>>
>> Instead of re.search, I call the function guarded_search(pattern, text, offset) which takes care of escaped characters. Thus, while
>
> I'm still pretty much a beginner, and I am not sure of the exact requirements, but the following seems to work for at least simple cases when overlapping matches are not considered.
>
> def guarded_search( pattern, text, exclude ):
>     return [ m for m in re.finditer(pattern,text)
>              if not [ e for e in exclude if m.start() <= e < m.end() ] ]

Yes, this seems to do the trick, thank you!

Bye,
Torsten.
Regexes: How to handle escaped characters
Hi there!

I need some help with finding matches in a string that has some characters which are marked as escaped (in a separate list of indices). Escaped means that they must not be part of any match.

My current approach is to look for matches in substrings with the escaped characters as boundaries between the substrings. However, then ^ and $ in the patterns are treated wrongly. (Although I use startpos and endpos parameters for this and no slicing.)

Another idea was to have a special unicode character that never takes part in a match. The docs are not very promising regarding such a thing, or did I miss something?

Any other ideas?

Bye,
Torsten.
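The trouble with ^ and $ mentioned in this message probably comes from how compiled patterns treat the `pos` and `endpos` arguments of `search`. The short Python 3 sketch below shows that asymmetry, assuming the substrings are delimited with `pos`/`endpos` as the message says; the variable names are made up for the example.

```python
import re

text = "Hollo"

# endpos makes the string behave as if it ended there, so "$" matches
# just before endpos even though more text follows.
dollar = re.compile("o$")
print(dollar.search(text, 0, 2))   # matches the "o" at index 1

# pos does not move the real start of the string, so "^" does not match
# at pos (unless pos is 0, or follows a newline in MULTILINE mode).
caret = re.compile("^l")
print(caret.search(text, 2))       # None

# Slicing, by contrast, creates a new string whose start "^" does match.
print(caret.search(text[2:]))      # matches the first "l" of "llo"
```

So a search limited by `endpos` can report a `$` match in the middle of the real text, while a search started at `pos` never reports a `^` match there.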
Re: Regexes: How to handle escaped characters
Torsten Bronger wrote:
> Hi there!
>
> I need some help with finding matches in a string that has some characters which are marked as escaped (in a separate list of indices). Escaped means that they must not be part of any match.
>
> My current approach is to look for matches in substrings with the escaped characters as boundaries between the substrings. However, then ^ and $ in the patterns are treated wrongly. (Although I use startpos and endpos parameters for this and no slicing.)
>
> Another idea was to have a special unicode character that never takes part in a match. The docs are not very promising regarding such a thing, or did I miss something?
>
> Any other ideas?
>
> Bye,
> Torsten.

You should probably provide examples of what you are trying to do or you will likely get a lot of irrelevant answers.

James
Re: Regexes: How to handle escaped characters
Hi there!

James Stroud writes:

> Torsten Bronger wrote:
>
>> I need some help with finding matches in a string that has some characters which are marked as escaped (in a separate list of indices). Escaped means that they must not be part of any match.
>>
>> [...]
>
> You should probably provide examples of what you are trying to do or you will likely get a lot of irrelevant answers.

Example string: u"Hollo", escaped positions: [4]. Thus, the second "o" is escaped and must not be found by the regexp searches.

Instead of re.search, I call the function guarded_search(pattern, text, offset) which takes care of escaped characters. Thus, while

    re.search("o$", string)

will find the second "o",

    guarded_search("o$", string, 0)

won't find anything. But how to program "guarded_search"?

Actually, it is about changing the semantics of the regexp syntax: "." no longer means "any character except newline" but "any character except newline and characters marked as escaped". And so on, for all syntax elements of regular expressions. Escaped characters must spoil any match; however, the regexp machine should continue to search for other matches.

Bye,
Torsten.
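The "special character" idea from the original question, and the reason "." would need new semantics, can be illustrated with a small Python 3 sketch. It assumes a private-use code point (U+E000) never occurs in real input; `SENTINEL` and `mask_escaped` are names invented here, not anything from the thread.

```python
import re

SENTINEL = "\ue000"  # private-use code point, assumed absent from real text

def mask_escaped(text, escaped_positions):
    """Replace each escaped character by the sentinel, keeping offsets."""
    escaped = set(escaped_positions)
    return "".join(SENTINEL if i in escaped else ch
                   for i, ch in enumerate(text))

masked = mask_escaped("Hollo", [4])

# Literal patterns can no longer hit the escaped "o" ...
print(re.search("o$", masked))          # None
print(re.search("o", masked).start())   # the unescaped "o" is still found

# ... but "." (and classes such as \S or [^x]) still match the sentinel,
# which is exactly the change of semantics the message asks for.
print(re.search("l.$", masked))
```

Masking therefore only helps for patterns that name their characters explicitly; anything with wildcards still needs the matches to be filtered afterwards.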
Re: Regexes: How to handle escaped characters
Torsten Bronger wrote:
> Hi there!
>
> James Stroud writes:
>
>> Torsten Bronger wrote:
>>
>>> I need some help with finding matches in a string that has some characters which are marked as escaped (in a separate list of indices). Escaped means that they must not be part of any match.
>>>
>>> [...]
>>
>> You should probably provide examples of what you are trying to do or you will likely get a lot of irrelevant answers.
>
> Example string: u"Hollo", escaped positions: [4]. Thus, the second "o" is escaped and must not be found by the regexp searches.
>
> Instead of re.search, I call the function guarded_search(pattern, text, offset) which takes care of escaped characters. Thus, while
>
>     re.search("o$", string)
>
> will find the second "o",
>
>     guarded_search("o$", string, 0)
>
> won't find anything. But how to program "guarded_search"?
>
> Actually, it is about changing the semantics of the regexp syntax: "." no longer means "any character except newline" but "any character except newline and characters marked as escaped". And so on, for all syntax elements of regular expressions. Escaped characters must spoil any match; however, the regexp machine should continue to search for other matches.
>
> Bye,
> Torsten.

You will probably need to implement your own findall, etc., but this seems to do it for search:

    def guarded_search(rgx, astring, escaped):
      m = re.search(rgx, astring)
      if m:
        s = m.start()
        e = m.end()
        for i in escaped:
          if s <= i <= e:
            m = None
            break
      return m

Here it is in use:

    py> def guarded_search(rgx, astring, escaped):
    ...   m = re.search(rgx, astring)
    ...   if m:
    ...     s = m.start()
    ...     e = m.end()
    ...     for i in escaped:
    ...       if s <= i <= e:
    ...         m = None
    ...         break
    ...   return m
    ...
    py> import re
    py> escaped = [1, 5, 15]
    py> print guarded_search('abc', 'xyzabcxyz', escaped)
    None
    py> print guarded_search('abc', 'xyzxyzabcxyz', escaped)
    <_sre.SRE_Match object at 0x40379720>

James
Re: Regexes: How to handle escaped characters
On May 18, 6:00 am, Torsten Bronger [EMAIL PROTECTED] wrote:
> Hi there!
>
> James Stroud writes:
>
>> Torsten Bronger wrote:
>>
>>> I need some help with finding matches in a string that has some characters which are marked as escaped (in a separate list of indices). Escaped means that they must not be part of any match.
>>>
>>> [...]
>>
>> You should probably provide examples of what you are trying to do or you will likely get a lot of irrelevant answers.
>
> Example string: u"Hollo", escaped positions: [4]. Thus, the second "o" is escaped and must not be found by the regexp searches.
>
> Instead of re.search, I call the function guarded_search(pattern, text, offset) which takes care of escaped characters. Thus, while
>
>     re.search("o$", string)
>
> will find the second "o",
>
>     guarded_search("o$", string, 0)

Huh? Did you mean 4 instead of zero?

> won't find anything.

Quite apart from the confusing use of "escape", your requirements are still as clear as mud. Try writing up docs for your guarded_search function. Supply test cases showing what you expect to match and what you don't expect to match. Is "offset" the offset in the text? If so, don't you really want a set of "forbidden" offsets, not just one?

> But how to program "guarded_search"? Actually, it is about changing the semantics of the regexp syntax: "." no longer means "any character except newline" but "any character except newline and characters marked as escaped".

Make up your mind whether you are "escaping" characters [likely to be interpreted by many people as position-independent] or "escaping" positions within the text.

> And so on, for all syntax elements of regular expressions. Escaped characters must spoil any match; however, the regexp machine should continue to search for other matches.

Whatever your exact requirement, it would seem unlikely to be so wildly popularly demanded as to warrant inclusion in the regexp machine. You would have to write your own wrapper, something like the following totally-untested example of one possible implementation of one possible guess at what you mean:

    import re
    def guarded_search(pattern, text, forbidden_offsets, overlap=False):
        regex = re.compile(pattern)
        pos = 0
        while True:
            m = regex.search(text, pos)
            if not m:
                return
            start, end = m.span()
            for bad_pos in forbidden_offsets:
                if start <= bad_pos < end:
                    break
            else:
                yield m
            if overlap:
                pos = start + 1
            else:
                pos = end
    8<---

HTH,
John
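For readers on Python 3, the generator in this message might be adapted as in the sketch below. It is an adaptation, not the posted code: the guard against zero-length matches (which would otherwise leave `pos` unchanged and loop forever) and the conversion of the offsets to a set are additions made here.

```python
import re

def guarded_search(pattern, text, forbidden_offsets, overlap=False):
    """Yield matches whose span contains none of the forbidden offsets."""
    regex = re.compile(pattern)
    forbidden = set(forbidden_offsets)
    pos = 0
    while pos <= len(text):
        m = regex.search(text, pos)
        if not m:
            return
        start, end = m.span()
        if not any(start <= bad < end for bad in forbidden):
            yield m
        # Step one character when overlapping matches are wanted, or when
        # the match was empty (added guard, not in the posted code).
        if overlap or end == start:
            pos = start + 1
        else:
            pos = end

spans = [m.span() for m in guarded_search("o", "Hollo how are you", [4])]
print(spans)

# With overlap=False a rejected match still consumes its text:
print([m.span() for m in guarded_search("aa", "aaaa", [0])])
print([m.span() for m in guarded_search("aa", "aaaa", [0], overlap=True)])
```

The last two calls show why the `overlap` flag matters: without it, the match at offsets 1 to 3 is never tried, because the rejected match at 0 to 2 has already moved the search position past it.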
Re: Regexes: How to handle escaped characters
On May 18, 6:50 am, James Stroud [EMAIL PROTECTED] wrote:
> def guarded_search(rgx, astring, escaped):
>   m = re.search(rgx, astring)
>   if m:
>     s = m.start()
>     e = m.end()
>     for i in escaped:
>       if s <= i <= e:

Did you mean to write

    if s <= i < e:

?

>         m = None
>         break
>   return m

Your guarded search fails if there is a match after the rightmost bad position, i.e. it gives up at the first bad position.

My guarded_search (see separate post) needs the following done to it:

1. make a copy
2. change the name of the copy to guarded_searchall or something similar
3. change yield to return in the original

Cheers,
John
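Both objections in this message can be checked with a small Python 3 comparison. `first_match_only` is a port of the quoted logic and `first_clean_match` is one possible fix; both names are invented here for illustration and neither appears in the thread.

```python
import re

def first_match_only(rgx, astring, escaped):
    """Port of the quoted logic: looks at the first match only and uses
    the closed interval s <= i <= e."""
    m = re.search(rgx, astring)
    if m and any(m.start() <= i <= m.end() for i in escaped):
        return None
    return m

def first_clean_match(rgx, astring, escaped):
    """Possible fix: half-open interval, and keep scanning after a
    rejected match."""
    for m in re.finditer(rgx, astring):
        if not any(m.start() <= i < m.end() for i in escaped):
            return m
    return None

# Gives up at the first bad match although a clean one follows.
print(first_match_only("abc", "abcabc", [1]))            # None
print(first_clean_match("abc", "abcabc", [1]).span())    # (3, 6)

# Off-by-one: offset 1 equals m.end(), i.e. it lies just past the match.
print(first_match_only("x", "xabc", [1]))                # None
print(first_clean_match("x", "xabc", [1]).span())        # (0, 1)
```

Because `m.end()` is the offset just after the match, the closed comparison rejects matches that merely sit next to an escaped position.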
Re: Regexes: How to handle escaped characters
On May 17, 4:06 pm, John Machin [EMAIL PROTECTED] wrote:
> On May 18, 6:00 am, Torsten Bronger [EMAIL PROTECTED] wrote:
>
> [...]
>
> Whatever your exact requirement, it would seem unlikely to be so wildly popularly demanded as to warrant inclusion in the regexp machine. You would have to write your own wrapper, something like the following totally-untested example of one possible implementation of one possible guess at what you mean:
>
> [...]
>
> HTH,
> John

Here are two pyparsing-based routines, guardedSearch and guardedSearchByColumn. The first uses a pyparsing parse action to reject matches at a given string location, and returns a list of tuples containing the string location and matched text. The second uses an enhanced version of guardedSearch that uses the pyparsing built-ins col and lineno to filter matches by column instead of by raw string location, and returns a list of tuples of line and column of the match location, and the matching text. (Note that string locations are zero-based, while line and column numbers are 1-based.)

-- Paul

    from pyparsing import Regex,ParseException,col,lineno

    def guardedSearch(pattern, text, forbidden_offsets):
        def offsetValidator(strng,locn,tokens):
            # reject any match that starts at a forbidden offset
            if locn in forbidden_offsets:
                raise ParseException, "can't match at offset %d" % locn
        regex = Regex(pattern).setParseAction(offsetValidator)
        return [ (tokStart,toks[0]) for toks,tokStart,tokEnd in
                 regex.scanString(text) ]

    print guardedSearch(u"o", u"Hollo how are you", [4,])

    def guardedSearchByColumn(pattern, text, forbidden_columns):
        def offsetValidator(strng,locn,tokens):
            # reject any match that starts in a forbidden column
            if col(locn,strng) in forbidden_columns:
                raise ParseException, "can't match at offset %d" % locn
        regex = Regex(pattern).setParseAction(offsetValidator)
        return [ (lineno(tokStart,text),col(tokStart,text),toks[0])
                 for toks,tokStart,tokEnd in regex.scanString(text) ]

    text = """\
    alksjdflasjf;sa
    a;sljflsjlaj
    ;asjflasfja;sf
    aslfj;asfj;dsf
    aslf;lajdf;ajsf
    aslfj;afsj;sd
    """
    print guardedSearchByColumn(";", text, [1,6,11,])

Prints:

    [(1, 'o'), (7, 'o'), (15, 'o')]
    [(1, 13, ';'), (2, 2, ';'), (3, 12, ';'), (5, 5, ';')]
Re: Regexes: How to handle escaped characters
On May 18, 8:16 am, Paul McGuire [EMAIL PROTECTED] wrote:
> On May 17, 4:06 pm, John Machin [EMAIL PROTECTED] wrote:
>> On May 18, 6:00 am, Torsten Bronger [EMAIL PROTECTED] wrote:
>>> Hi there!
>>>
>>> James Stroud writes:
>>>> Torsten Bronger wrote:
>>>>> I need some help with finding matches in a string that has some characters which are marked as escaped (in a separate list of indices). Escaped means that they must not be part of any match.

Note: "must not be *part of* any match" [my emphasis]

[big snip]

> Here are two pyparsing-based routines, guardedSearch and guardedSearchByColumn. The first uses a pyparsing parse action to reject matches at a given string location

Seems to be somewhat less like what the OP might have in mind ...

While we're waiting for clarification from the OP, there's a chicken-and-egg thought that's been nagging me: if the OP knows so much about the searched string that he can specify offsets which search patterns should not span, why does he still need to search it?

Cheers,
John
Re: Regexes: How to handle escaped characters
On May 17, 6:12 pm, John Machin [EMAIL PROTECTED] wrote:
> Note: "must not be *part of* any match" [my emphasis]

Ooops, my bad. See this version:

    from pyparsing import Regex,ParseException,col,lineno,getTokensEndLoc

    # fake (and inefficient) version of any if not yet upgraded to Py2.5
    any = lambda lst : sum(list(lst)) > 0

    def guardedSearch(pattern, text, forbidden_offsets):
        def offsetValidator(strng,locn,tokens):
            # reject any match whose span includes a forbidden offset
            start,end = locn,getTokensEndLoc()-1
            if any( start <= i <= end for i in forbidden_offsets ):
                raise ParseException, "can't match at offset %d" % locn
        regex = Regex(pattern).setParseAction(offsetValidator)
        return [ (tokStart,toks[0]) for toks,tokStart,tokEnd in
                 regex.scanString(text) ]

    print guardedSearch(ur"o\S", u"Hollo how are you", [8,])

    def guardedSearchByColumn(pattern, text, forbidden_columns):
        def offsetValidator(strng,locn,tokens):
            # reject any match whose columns include a forbidden column
            start,end = col(locn,strng), col(getTokensEndLoc(),strng)-1
            if any( start <= i <= end for i in forbidden_columns ):
                raise ParseException, "can't match at col %d" % start
        regex = Regex(pattern).setParseAction(offsetValidator)
        return [ (lineno(tokStart,text),col(tokStart,text),toks[0])
                 for toks,tokStart,tokEnd in regex.scanString(text) ]

    text = """\
    alksjdflasjf;sa
    a;sljflsjlaj
    ;asjflasfja;sf
    aslfj;asfj;dsf
    aslf;lajdf;ajsf
    aslfj;afsj;sd
    """
    print guardedSearchByColumn("[fa];", text, [4,12,13,])

Prints:

    [(1, 'ol'), (15, 'ou')]
    [(2, 1, 'a;'), (5, 10, 'f;')]

> While we're waiting for clarification from the OP, there's a chicken-and-egg thought that's been nagging me: if the OP knows so much about the searched string that he can specify offsets which search patterns should not span, why does he still need to search it?

I suspect that this is column/tabular data (a log file perhaps?), and some columns are not interesting, but produce many false hits for the search pattern.

-- Paul
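The column-based filtering in this message can also be approximated with the standard `re` module alone. The Python 3 sketch below is a swapped-in technique, not the pyparsing code above: it searches each line separately, so patterns cannot span line breaks, and the name `guarded_search_by_column` is chosen here for illustration. Columns are 1-based, as in the message.

```python
import re

def guarded_search_by_column(pattern, text, forbidden_columns):
    """Return (line, column, matched text) for every match that does not
    touch a forbidden column; lines and columns are 1-based."""
    regex = re.compile(pattern)
    forbidden = set(forbidden_columns)
    results = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for m in regex.finditer(line):
            # Columns covered by the match, 1-based and inclusive.
            first_col, last_col = m.start() + 1, m.end()
            if not any(first_col <= c <= last_col for c in forbidden):
                results.append((line_no, first_col, m.group()))
    return results

text = """\
alksjdflasjf;sa
a;sljflsjlaj
;asjflasfja;sf
aslfj;asfj;dsf
aslf;lajdf;ajsf
aslfj;afsj;sd
"""
print(guarded_search_by_column("[fa];", text, [4, 12, 13]))
```

On the sample text from the message this reports the same two hits as the pyparsing version, `(2, 1, 'a;')` and `(5, 10, 'f;')`.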
Re: Regexes: How to handle escaped characters
On May 18, 9:46 am, Paul McGuire [EMAIL PROTECTED] wrote:
> On May 17, 6:12 pm, John Machin [EMAIL PROTECTED] wrote:
>> Note: "must not be *part of* any match" [my emphasis]
>>
>> While we're waiting for clarification from the OP, there's a chicken-and-egg thought that's been nagging me: if the OP knows so much about the searched string that he can specify offsets which search patterns should not span, why does he still need to search it?
>
> I suspect that this is column/tabular data (a log file perhaps?), and some columns are not interesting, but produce many false hits for the search pattern.

If so, why not split the record into fields and look only at the interesting fields? Smells to me of yet another case of re abuse/misuse ...
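The suggestion to split the record into fields could look roughly like the Python 3 sketch below. Everything specific in it is an assumption made for illustration: the semicolon separator, the sample record, and the name `search_fields` do not come from the thread, and the original poster's data turned out to be markup text rather than tabular records.

```python
import re

def search_fields(pattern, record, interesting, sep=";"):
    """Search only the fields whose indexes are listed in `interesting`
    and return (field index, matched text) pairs."""
    regex = re.compile(pattern)
    fields = record.split(sep)
    hits = []
    for index in interesting:
        m = regex.search(fields[index])
        if m:
            hits.append((index, m.group()))
    return hits

# Made-up record: field 1 would produce a false hit for "error".
record = "2007-05-18;error;disk error on sda;ok"
print(search_fields("error", record, [2]))
```

Compared with filtering matches by offset, this keeps the pattern simple, but it only applies when the data really has a field structure.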