2007/11/28, Victor Subervi <[EMAIL PROTECTED]>:
>
> Hi;
> I am trying to find words in a document that are identical to any word in
> a vocabulary list, to replace that word with special markup. Let's say the
> word is "dharma". I don't want to replace the first few letters of, say
> "dharmawuhirfuhi". Also, to make matters more difficult, if the word
> "adharma" is found in the document, I need to replace that with special
> markup, too. (In Sanskrit, "a" preceding a word negates the word.) But I
> don't want to replace "adharma" and then go off and replace the "dharma" in
> "adharma", thus having nested markup. Now, I tried separating out all the
> words in the line (I go through the doc line by line), but then, of course,
> I lost all the punctuation! So now I have this code:
>
> for word in vocab:
>     aword = "a" + word
>     try:
>         line = re.sub(aword, pu_four + aword +
>                       pu_five + aword + pu_six, line)
>     except:
>         pass
>     try:
>         line = re.sub(word, pu_one + word + pu_two
>                       + word + pu_three, line)
>     except:
>         pass
>
> which, of course, ends up breaking all the above! Can someone send me a
> shovel to dig my way out of this mess?
> TIA,
> Victor
>
> --
> http://mail.python.org/mailman/listinfo/python-list
Hi,
I'm not quite sure what the try/except clauses are expected to catch. If there is no match, re.sub simply returns the original string; the only exceptions I would expect are from invalid patterns. But if I understand the problem correctly, I would use the usual regexp means to match whole words only: the \b metacharacter, which indicates a word boundary. E.g. the pattern \bdharma\b should only match the complete word "dharma", but not "adharma" or "dharmawuhirfuhi".

See http://docs.python.org/lib/re-syntax.html for details about the re patterns (in particular, \b can be unicode or locale dependent, but it seems that you are using a basic Latin transcription, so it shouldn't matter).

However, I don't know what the content of pu_one, pu_two ... pu_five, pu_six ... in your code is; are there maybe some inflexion affixes? In that case the pattern for such a compound word could be \bPrefixStemSuffix1Suffix2Ending1\b etc. However, it could get quite complicated if you try to deal with some specificities of natural languages with such straightforward approaches.

Hope this helps a bit;
Vlasta
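
A minimal sketch of the word-boundary approach described above, assuming the goal is one kind of markup for a plain vocabulary word and another for its "a"-prefixed negation. The function name mark_vocab and the two template strings are hypothetical stand-ins (the real contents of pu_one ... pu_six are not known). It makes a single pass over the line with one combined pattern, so "adharma" is consumed as a whole word and the "dharma" inside it is never visited again, which should avoid the nested markup.

```python
import re

# Hypothetical markup templates; the real pu_one ... pu_six strings are unknown.
PLAIN_MARKUP = "[[{word}]]"
NEGATED_MARKUP = "[[neg:{word}]]"


def mark_vocab(line, vocab, plain=PLAIN_MARKUP, negated=NEGATED_MARKUP):
    """Wrap whole-word vocabulary matches (optionally 'a'-prefixed) in markup."""
    if not vocab:
        return line  # an empty alternation would match at every word boundary

    # re.escape guards against vocabulary entries containing regex metacharacters.
    alternatives = "|".join(re.escape(word) for word in vocab)

    # Raw string: in a normal Python string literal "\b" is a backspace character.
    # Group 1 is the optional negating prefix, group 2 the vocabulary word.
    pattern = re.compile(r"\b(a?)(" + alternatives + r")\b")

    def replace(match):
        prefix, word = match.group(1), match.group(2)
        template = negated if prefix else plain
        return template.format(word=prefix + word)

    # One pass over the line: text that has been replaced is not scanned again.
    return pattern.sub(replace, line)


if __name__ == "__main__":
    sample = "The dharma, adharma and dharmawuhirfuhi."
    print(mark_vocab(sample, ["dharma"]))
```

Punctuation is preserved because the line is never split into words; only the matched spans are rewritten. Case-insensitive matching or inflected forms would need further changes to the pattern and are not covered here.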