Re: substitution

Anthra Norell Mon, 18 Jan 2010 08:44:36 -0800

superpollo wrote:

hi.


what is the most pythonic way to substitute substrings?

eg: i want to apply:

foo --> bar
baz --> quux
quuux --> foo

so that:

fooxxxbazyyyquuux --> barxxxquuxyyyfoo

bye

Third attempt. Clearly something doesn't work right. My code gets clipped on the way up. I have to send it as an attachment. Here again is what it does:

>>> substitutions = (('foo', 'bar'), ('baz', 'quux'), ('quuux','foo')) # Sequence of doublets

>>> T = Translator (substitutions)   # Compile substitutions -> translator
>>> s = 'fooxxxbazyyyquuux'   # Your source string
>>> d = 'barxxxquuxyyyfoo'    # Your destination string
>>> print T (s)
barxxxquuxyyyfoo
>>> print T (s) == d
True
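
For comparison, here is a minimal sketch of why a single-pass translator is useful for this request. It assumes the naive alternative is a loop of str.replace calls; the helper name chained_replace and the reordered list are hypothetical and not part of the attached code. With chained replacement the result depends on the order of the pairs, because the output of one substitution can be picked up as a target by a later one.

```python
def chained_replace(s, pairs):
    """Apply each (target, substitute) pair in turn with str.replace."""
    for target, substitute in pairs:
        s = s.replace(target, substitute)
    return s

s = "fooxxxbazyyyquuux"
as_posted = [("foo", "bar"), ("baz", "quux"), ("quuux", "foo")]
reordered = [("quuux", "foo"), ("foo", "bar"), ("baz", "quux")]

print(chained_replace(s, as_posted))   # barxxxquuxyyyfoo
print(chained_replace(s, reordered))   # barxxxquuxyyybar: the new 'foo' was replaced again
```

The first ordering happens to give the requested result; the second does not, which is the kind of surprise a translator that scans the source only once avoids.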


Code attached

Regards

Frederic

class Translator:                                    

        r"""
                Will translate any number of targets, handling them correctly if some overlap.

                Making Translator
                        T = Translator (definitions, [eat = 1])
                        'definitions' is a sequence of pairs: ((target, substitute),(t2, s2), ...)
                        'eat = True' will make an extraction filter that lets only the replaced targets pass.
                        Definitions example: (('a','A'),('b','B'),('ab','ab'),('abc','xyz'),
                           ('\x0c', 'page break'), ('\r\n','\n'), ('   ','\t'))   # ('ab','ab') see Tricks.
                        Order doesn't matter.

                Running
                        translation = T (source)

                Tricks
                        Deletion:  ('target', '')
                        Exception: (('\n',''), ('\n\n','\n\n'))     # Eat LF except paragraph breaks.
                        Exception: (('\n', '\r\n'), ('\r\n','\r\n')) # Unix to DOS, would leave DOS unchanged
                        Translation cascade:
                                # Unwrap paragraphs, Unix or DOS, restoring inter-word space if missing,
                                Mark_LF = Translator ((('\n','+LF+'),('\r\n','+LF+'),('\n\n','\n\n'),('\r\n\r\n','\r\n\r\n')))
                                # Pick any positively identifiable mark for end of lines in either Unix or MS-DOS.
                                Single_Space_Mark = Translator (((' +LF+', ' '),('+LF+', ' '),('-+LF+', '')))
                                no_lf_text = Single_Space_Mark (Mark_LF (text))
                        Translation cascade:
                                # Nested calls
                                reptiles = T_latin_english (T_german_latin (reptilien))

                Limitations
                        1. The number of substitutions and the maximum size of input depend on the respective
                                capabilities of the Python re module.
                        2. Targets are escaped and matched literally, so regular expressions will not work as such.

                Author:
                        Frederic Rentsch (i...@anthra-norell.ch).
                         
        """

        def __init__ (self, definitions, eat = 0):

                '''
                        definitions: a sequence of pairs of strings. ((target, substitute), (t, s), ...)
                        eat: False (0) means translate: unaffected data passes unaltered.
                             True  (1) means extract:   unaffected data doesn't pass (gets eaten).
                             Extraction filters typically require substitutes to end with some separator,
                             else they fuse together. (E.g. ' ', '\t' or '\n')
                        'eat' is an attribute that can be switched anytime.

                '''                     
                self.eat = eat
                self.compile_sequence_of_pairs (definitions)
                
        
        def compile_sequence_of_pairs (self, definitions):

                '''
                        Argument 'definitions' is a sequence of pairs:
                        (('target 1', 'substitute 1'), ('t2', 's2'), ...)
                        Order doesn't matter.         

                '''
                                        
                import re
                self.definitions = definitions
                targets, substitutes = zip (*definitions)
                re_targets = [re.escape (item) for item in targets]
                re_targets.sort (reverse = True)
                self.targets_set = set (targets)                           
                self.table = dict (definitions)
                regex_string = '|'.join (re_targets)
                self.regex = re.compile (regex_string, re.DOTALL)
                        
        
        def __call__ (self, s):
                hits = self.regex.findall (s)
                nohits = self.regex.split (s)
                valid_hits = set (hits) & self.targets_set  # Ignore targets with illegal re modifiers.
                if valid_hits:
                        substitutes = [self.table [item] for item in hits if item in valid_hits]
                        if self.eat:
                                return ''.join (substitutes)
                        else:
                                # split () yields one more piece than there are hits: pair each
                                # leading piece with its substitute, then append the last piece.
                                zipped = zip (nohits, substitutes)
                                return ''.join ([a + b for a, b in zipped]) + nohits [-1]
                else:
                        if self.eat:
                                return ''
                        else:
                                return s
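
The single-pass idea in the class above can also be sketched more compactly. This is only an illustration under stated assumptions, not the attached implementation: the function name make_translator is hypothetical, all targets are assumed to be non-empty strings, and targets are ordered longest first (rather than by the reverse alphabetical sort used above) so that a longer target is tried before any shorter target it starts with. It relies on re.escape, re.finditer and a function passed as the replacement to the compiled pattern's sub method.

```python
import re

def make_translator(definitions, eat=False):
    """Return a function that applies all substitutions in a single pass."""
    table = dict(definitions)
    # Longest targets first, so 'abc' is tried before its prefix 'ab'.
    targets = sorted(table, key=len, reverse=True)
    regex = re.compile("|".join(re.escape(t) for t in targets))

    def translate(s):
        if eat:
            # Extraction: keep only the substitutes, drop everything else.
            return "".join(table[m.group(0)] for m in regex.finditer(s))
        # Translation: text between matches passes through unaltered.
        return regex.sub(lambda m: table[m.group(0)], s)

    return translate

T = make_translator((("foo", "bar"), ("baz", "quux"), ("quuux", "foo")))
print(T("fooxxxbazyyyquuux"))

# The 'exception' trick from the docstring: map a longer target to itself
# so that it is protected from the shorter rule.
unwrap = make_translator((("\n", ""), ("\n\n", "\n\n")))
print(repr(unwrap("one\ntwo\n\nthree")))
```

Because each match is looked up once in the table and never rescanned, a substitute such as 'foo' is not translated again, and the identity pair ('\n\n', '\n\n') keeps paragraph breaks intact while single line feeds are removed.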

-- 
http://mail.python.org/mailman/listinfo/python-list
