In response to Danny's question about tokenising first: there are reasons why I don't want to do this. The initial problem was that filenames in my test data were being tokenised as separate words. E.g., DataMarchAccounts.txt would be tokenised as two words, neither of which is a real word that can be found in an English dictionary. (Often, filenames are not proper words, which is why I needed to delete the whole string - and by 'string' I mean any consecutive run of non-whitespace characters.) I don't want to subsequently analyse any 'non-words', only real words that will then be automatically checked against a lexicon.
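For anyone following along, here is a minimal sketch of that idea: delete every whole run of non-whitespace characters that looks like a filename (i.e. contains a dot followed by an extension) before any tokenising happens. The exact RE Debbie is tweaking isn't shown in the thread, so the pattern below is an illustrative assumption, not her actual code, and the sample sentence is made up.

```python
import re

text = "Please see DataMarchAccounts.txt for the March figures."

# Remove any whole run of non-whitespace characters that contains a dot
# followed by more non-whitespace (a filename-like token), so that no
# fragment of it survives to be tokenised as a 'word' later.
cleaned = re.sub(r'\S*\.\S+', '', text)

# Collapse the double space left behind by the deletion.
cleaned = re.sub(r'\s+', ' ', cleaned).strip()

print(cleaned)  # Please see for the March figures.
```

Note that `\.\S+` requires characters *after* the dot, so ordinary sentence-final full stops (as in "figures.") are left alone.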
Well - my code is all done now, apart from the tweaking of this one RE. BTW, I am new to Python and had never done any programming before this, so you may see some more questions from me in the future...
Cheers again,

Debbie

--
***************************************************
Debbie Elliott
Computer Vision and Language Research Group,
School of Computing,
University of Leeds,
Leeds LS2 9JT
United Kingdom.

Tel: 0113 3437288
Email: [EMAIL PROTECTED]
***************************************************
_______________________________________________
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor