superpollo wrote:
hi.
what is the most pythonic way to substitute substrings?
eg: i want to apply:
foo --> bar
baz --> quux
quuux --> foo
so that:
fooxxxbazyyyquuux --> barxxxquuxyyyfoo
bye
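Try the code below the dotted line. It applies any number of
substitutions in a single pass, so the output of one substitution is
never picked up by another, and it handles overlapping targets
correctly (the longer target wins over the shorter one).
Your case: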
>>> substitutions = (('foo', 'bar'), ('baz', 'quux'), ('quuux', 'foo')) # Sequence of doublets
>>> T = Translator (substitutions) # Compile substitutions -> translator
>>> s = 'fooxxxbazyyyquuux' # Your source string
>>> d = 'barxxxquuxyyyfoo' # Your destination string
>>> print (T (s))
barxxxquuxyyyfoo
>>> print (T (s) == d)
True
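Why "long over short" matters can be seen in a minimal sketch that is independent of the class. The targets and the names `table`, `build` and `substitute` are made up for illustration. A regex alternation takes the first alternative that matches, so a short target listed before a longer one that starts the same way hides the longer one. Sorting the targets in reverse, as the class does, puts the longer one first.

```python
import re

# Hypothetical overlapping targets: 'a' is a prefix of 'ab', 'ab' of 'abc'.
table = {'a': 'A', 'ab': 'AB', 'abc': 'XYZ'}

def build(targets):
    # Join the escaped targets into one alternation, in the order given.
    return re.compile('|'.join(re.escape(t) for t in targets))

short_first = build(sorted(table))                # a|ab|abc
long_first = build(sorted(table, reverse=True))   # abc|ab|a

def substitute(regex, s):
    # Single pass: every match is looked up in the table exactly once.
    return regex.sub(lambda m: table[m.group(0)], s)

short_result = substitute(short_first, 'abc ab a')   # 'Abc Ab A'
long_result = substitute(long_first, 'abc ab a')     # 'XYZ AB A'
```

With the short target first, 'ab' and 'abc' are never reached. With the reversed order each target gets its own substitute.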
Frederic
-------------------------------------------------------------
import re
from functools import reduce   # Builtin in Python 2; must be imported on Python 3

class Translator:
    r"""
    Will translate any number of targets, handling them correctly if
    some overlap.

    Making Translator
        T = Translator (definitions, [eat = 1])
        'definitions' is a sequence of pairs: ((target, substitute), (t2, s2), ...)
        'eat' says whether untargeted sections pass (translator) or
            are skipped (extractor).
            Makes a translator by default (eat = False)
            T.eat is an instance attribute that can be changed at any time.
        Definitions example:
            (('a','A'),('b','B'),('ab','ab'),('abc','xyz'),   # ('ab','ab') see Tricks.
             ('\x0c', 'page break'), ('\r\n','\n'), (' ','\t'))
        Order doesn't matter.

    Running
        translation = T (source)

    Tricks
        Deletion:  ('target', '')
        Exception: (('\n',''), ('\n\n','\n\n'))        # Eat LF except paragraph breaks.
        Exception: (('\n', '\r\n'), ('\r\n','\r\n'))   # Unix to DOS, would leave DOS unchanged
        Translation cascade:
            # Rejoin text lines per paragraph Unix or DOS, inserting inter-word space if missing
            Mark_LF = Translator ((('\n','+LF+'),('\r\n','+LF+'),('\r\n\r\n','\r\n\r\n'),('\n\n','\n\n')))
            # Pick positively identifiable mark for Unix and DOS end of lines
            Single_Space_Mark = Translator (((' +LF+', ' '),('+LF+', ' '),('-+LF+', '')))
            no_lf_text = Single_Space_Mark (Mark_LF (text))
        Translation cascade:
            # Nesting calls
            reptiles = T_latin_english (T_german_latin (reptilien))

    Limitations
        1. The number of substitutions and the maximum size of input
           depend on the respective capabilities of the Python re module.
        2. Targets are escaped and matched literally, so regular
           expressions will not work as such.

    Author:
        Frederic Rentsch (anthra.nor...@bluewin.ch).
    """

    def __init__ (self, definitions, eat = 0):
        r'''
        definitions: a sequence of pairs of strings. ((target, substitute), (t, s), ...)
        eat: False (0) means translate: unaffected data passes unaltered.
             True (1) means extract: unaffected data doesn't pass (gets eaten).
             Extraction filters typically require substitutes to end with
             some separator, else they fuse together. (E.g. ' ', '\t' or '\n')
        'eat' is an attribute that can be switched anytime.
        '''
        self.eat = eat
        self.compile_sequence_of_pairs (definitions)

    def compile_sequence_of_pairs (self, definitions):
        '''
        Argument 'definitions' is a sequence of pairs:
        (('target 1', 'substitute 1'), ('t2', 's2'), ...)
        Order doesn't matter.
        '''
        self.definitions = definitions
        targets, substitutes = zip (*definitions)
        re_targets = [re.escape (item) for item in targets]
        # Reverse sort puts a longer target ahead of any shorter target it starts with
        re_targets.sort (reverse = True)
        self.targets_set = set (targets)
        self.table = dict (definitions)
        regex_string = '|'.join (re_targets)
        self.regex = re.compile (regex_string, re.DOTALL)

    def __call__ (self, s):
        hits = self.regex.findall (s)     # The matched targets, in order
        nohits = self.regex.split (s)     # The text between them, one item more than hits
        valid_hits = set (hits) & self.targets_set   # Ignore targets with illegal re modifiers.
        if valid_hits:
            substitutes = [self.table [item] for item in hits if item in valid_hits]
            if self.eat:
                return ''.join (substitutes)
            else:
                # zip stops at the shorter list, so the last unmatched piece is appended separately
                zipped = zip (nohits, substitutes)
                return ''.join (reduce (lambda a, b: a + b, zipped)) + nohits [-1]
        else:
            if self.eat:
                return ''
            else:
                return s
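
For comparison, the same single-pass idea can be sketched more compactly with `re.sub` and a replacement function. This is only an illustration and not a replacement for the class above: the name `translate` is hypothetical, targets are assumed to be non-empty strings, and the targets are sorted by length (longest first) instead of in reverse order, which gives the longer target priority in the same way.

```python
import re

def translate(s, pairs, eat=False):
    """Hypothetical helper, not part of the Translator class above."""
    table = dict(pairs)
    # Longest targets first, so a longer target beats any shorter one it starts with.
    targets = sorted(table, key=len, reverse=True)
    regex = re.compile('|'.join(re.escape(t) for t in targets))
    if eat:
        # Extractor: keep only the substitutes, drop the text in between.
        return ''.join(table[m.group(0)] for m in regex.finditer(s))
    # Translator: text that matches no target passes through unchanged.
    return regex.sub(lambda m: table[m.group(0)], s)

pairs = (('foo', 'bar'), ('baz', 'quux'), ('quuux', 'foo'))
result = translate('fooxxxbazyyyquuux', pairs)               # 'barxxxquuxyyyfoo'
extracted = translate('fooxxxbazyyyquuux', pairs, eat=True)  # 'barquuxfoo'
```

Because every match is replaced once and the result is never rescanned, 'quuux' becomes 'foo' without that 'foo' being turned into 'bar'.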