Finally...I'm Lost :-}

Vlastimil Brom Thu, 29 Nov 2007 03:20:27 -0800

2007/11/28, Victor Subervi <[EMAIL PROTECTED]>:
>
> Hi;
> I am trying to find words in a document that are identical to any word in
> a vocabulary list, to replace that word with special markup. Let's say the
> word is "dharma". I don't want to replace the first few letters of, say
> "dharmawuhirfuhi". Also, to make matters more difficult, if the word
> "adharma" is found in the document, I need to replace that with special
> markup, too. (In Sanskrit, "a" preceding a word negates the word.) But I
> don't want to replace "adharma" and then go off and replace the "dharma" in
> "adharma", thus having nested markup. Now, I tried separating out all the
> words in the line (I go through the doc line by line), but then, of course,
> I lost all the punctuation! So now I have this code:
>
>                 for word in vocab:
>                         aword = "a" + word
>                         try:
>                                 line = re.sub(aword, pu_four + aword +
>                                               pu_five + aword + pu_six, line)
>                         except:
>                                 pass
>                         try:
>                                 line = re.sub(word, pu_one + word + pu_two
>                                               + word + pu_three, line)
>                         except:
>                                 pass
>
> which, of course, ends up breaking all the above! Can someone send me a
> shovel to dig my way out of this mess?
> TIA,
> Victor
>
> --
> http://mail.python.org/mailman/listinfo/python-list




Hi,
I'm not quite sure what the try - except clauses are expected to catch: if
there is no match, re.sub simply returns the original string, so the only
exceptions I would expect are those raised by invalid patterns.
But if I understand the problem correctly, I would use the usual regexp
means to match whole words only.
That is the \b metacharacter, which indicates a word boundary; e.g. the
pattern \bdharma\b should match only the complete word "dharma", but not
"adharma" or "dharmawuhirfuhi".
See http://docs.python.org/lib/re-syntax.html for details about the re
patterns (in particular, \b can be Unicode or locale dependent, but it seems
that you are using a basic Latin transcription, so it wouldn't matter).
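
A minimal sketch of the whole-word match described above. The sample
sentence and the "<<...>>" markup are made up for illustration only; they are
not taken from your code.

```python
import re

# Hypothetical sample text, only for illustration.
text = "dharma, adharma and dharmawuhirfuhi; Dharma."

# A raw string is used so that \b reaches the re module as a word
# boundary instead of being read by Python as a backspace character.
whole_word = re.compile(r"\bdharma\b")

# Only the standalone, lower-case "dharma" is found.
matches = whole_word.findall(text)

# The punctuation stays in place because the line is never split.
marked = whole_word.sub("<<dharma>>", text)
```

If the vocabulary words may contain characters that are special in regular
expressions, passing each word through re.escape() before building the
pattern should keep them literal.
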
However, I don't know what pu_one, pu_two ... pu_five, pu_six in your code
contain; are they perhaps inflectional affixes? In that case the pattern
for such a compound word could be
\bPrefixStemSuffix1Suffix2Ending1\b etc.
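
Applied to the optional "a" prefix from the original question, such a
compound pattern might look like the sketch below. It is only one possible
approach, not a tested solution for your data: mark_plain, mark_negated and
mark_vocab are hypothetical names, and the bracket markup merely stands in
for whatever pu_one ... pu_six really contain.

```python
import re

# Hypothetical stand-ins for the real markup (pu_one ... pu_six are unknown).
def mark_plain(word):
    return "[" + word + "]"

def mark_negated(word):
    return "{" + word + "}"

def mark_vocab(line, vocab):
    """Mark whole vocabulary words, with or without the "a" prefix."""
    if not vocab:
        return line
    # re.escape keeps any special characters in the words literal.
    stems = "|".join(re.escape(word) for word in vocab)
    # \b on both sides: whole words only; (a?) is the optional prefix.
    pattern = re.compile(r"\b(a?)(" + stems + r")\b")

    def replace(match):
        # Group 1 is non-empty only when the negating "a" was present.
        if match.group(1):
            return mark_negated(match.group(0))
        return mark_plain(match.group(0))

    # A single pass over the line: each stretch of text is matched at most
    # once, so "adharma" is never marked up a second time as "dharma".
    return pattern.sub(replace, line)

result = mark_vocab("The dharma, not adharma; dharmawuhirfuhi.", ["dharma"])
```

Because all words go into one pattern and the line is processed in one call,
the order of the replacements no longer matters.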


However, it could get quite complicated if you try to deal with some of
the specifics of natural languages using such straightforward approaches.

Hope this helps a bit;
 Vlasta

-- 
http://mail.python.org/mailman/listinfo/python-list
