gargonx wrote:
let's take the word "dogs"

   ext = {"D": "V1", "O": "M1", "G": "S1"}
   std = {"S": "H"}

encode("DOGS") # proc()
we'll get: "V1M1S1H"

let's say i want to do just the opposite
word: "V1M1S1H"
decode("V1M1S1H")
    #how do i decode "V1" to "D", how do i keep the "V1" together?
and get: "DOGS"

If you can make some assumptions about the right-hand sides of your dicts, you can probably tokenize your string with a simple regular expression:


py> import re
py> charmatcher = re.compile(r'[A-Z][\d]?')
py>
py> ext = dict(D="V1", O="M1", G="S1")
py> std = dict(S="H")
py>
py> decode_replacements = {}
py> decode_replacements.update([(std[key], key) for key in std])
py> decode_replacements.update([(ext[key], key) for key in ext])
py>
py> def decode(text):
...     return ''.join([decode_replacements.get(c, c)
...                     for c in charmatcher.findall(text)])
...
py>
py> decode("V1M1S1H")
'DOGS'

So, instead of using
for c in text
I use
for c in charmatcher.findall(text)
That gives me the correct tokenization, and I can just use the inverted dicts to map each token back. Note, however, that I've written the regular expression to depend on the fact that the values in std and ext are either a single uppercase character or a single uppercase character followed by a single digit.
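
If you can't make that assumption about the values, one possible variation is to build the pattern from the codes themselves rather than hard-coding their shape. This is only a sketch: the names token_re and decode_by_codes are made up for illustration, and it assumes that no two letters share the same code and that the codes can be split unambiguously when the longest ones are tried first.

```python
import re

ext = dict(D="V1", O="M1", G="S1")
std = dict(S="H")

# Invert both dicts: code -> original letter.
decode_replacements = {}
for table in (std, ext):
    for letter, code in table.items():
        decode_replacements[code] = letter

# Try longer codes first, so "V1" wins over a hypothetical one-character "V".
codes = sorted(decode_replacements, key=len, reverse=True)

# Alternation of the escaped codes, plus "." as a fallback so that
# characters belonging to no code are passed through unchanged.
token_re = re.compile("|".join(re.escape(code) for code in codes) + "|.",
                      re.DOTALL)

def decode_by_codes(text):
    # Hypothetical helper: tokenize, then map each token back if it is a code.
    return "".join(decode_replacements.get(tok, tok)
                   for tok in token_re.findall(text))

print(token_re.findall("V1M1S1H"))   # the tokens the decoder sees
print(decode_by_codes("V1M1S1H"))
```

The trade-off is that the pattern has to be rebuilt whenever ext or std changes, but the decoder no longer cares whether a code is one character, a letter plus digit, or something longer.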


Steve


-- http://mail.python.org/mailman/listinfo/python-list
