In article <mailman.5748.1390216721.18130.python-l...@python.org>, Ben Finney <ben+pyt...@benfinney.id.au> wrote:
> With a little experimenting I get:
>
> >>> p = re.compile('((?:CAA)+)?((?:TCT)+)?((?:TA)+)?')
> >>> p.findall('CAACAACAATCTTCTTCTTCTTATATA')
> [('CAACAACAA', 'TCTTCTTCTTCT', 'TATATA'), ('', '', '')]

Perhaps a matter of style, but I would have left off the ?: markers and
done this:

import re

p = re.compile('((CAA)+)((TCT)+)((TA)+)')
m = p.match('CAACAACAATCTTCTTCTTCTTATATA')
print(m.groups())

$ python r.py
('CAACAACAA', 'CAA', 'TCTTCTTCTTCT', 'TCT', 'TATATA', 'TA')

The ?: says, "match this group, but don't save it". The advantage of
that is you don't get unwanted groups in your match object. The
disadvantage is that the markers make the pattern more difficult to
read. My personal opinion is I'd rather make the pattern easier to read
and just ignore the extra matches in the output (in this case, I want
elements 0, 2, and 4 of m.groups(), i.e. groups 1, 3, and 5).

I also left off the outer ?s, because I think this better represents the
intent. The pattern '((CAA)+)?((TCT)+)?((TA)+)?' matches, for example,
an empty string; I suspect that's not what was intended.

> Be aware that regex is not the solution to all parsing problems; for
> many parsing problems it is an attractive but inappropriate tool. You
> may need to construct a more specific parser for your needs. Even if
> it's possible with regex, the resulting pattern may be so complex that
> it's better to write it out more explicitly.

Oh, posh. You are correct; regex is not the solution to all parsing
problems, but it is a powerful tool which people should be encouraged to
learn. For some problems, it is indeed the correct tool, and this seems
like one of them. Discouraging people from learning about regexes is an
educational anti-pattern which I see distressingly often on this
newsgroup.

Several lives ago, I worked in a molecular biology lab writing programs
to analyze DNA sequences. Here's a common problem: "Find all the places
where GACGTC or TTCGAA (or any of a similar set of 100 or so short
patterns) appear".
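
A minimal sketch of that search using Python's re module, assuming a plain alternation of the sites is enough. The two sequences come from the problem statement above; the sample DNA string and the names SITES, site_re and find_sites are made up for illustration.

```python
import re

# The two recognition sequences named in the problem statement; a real
# list would have 100 or so entries.
SITES = ["GACGTC", "TTCGAA"]

# One alternation covering every site.
site_re = re.compile("|".join(SITES))

def find_sites(dna):
    """Return (offset, matched_site) pairs for every non-overlapping hit."""
    return [(m.start(), m.group()) for m in site_re.finditer(dna)]

# Made-up sample sequence, for illustration only.
sample = "AAGACGTCTTTTCGAAGG"
print(find_sites(sample))   # [(2, 'GACGTC'), (10, 'TTCGAA')]
```

Note that finditer reports non-overlapping matches only; sites that overlap one another would need a different approach.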
I can't think of an easier way to represent that in code than a regex.
Sure, it'll be a huge regex, which may take a long time to compile, but
one of the nice things about these sorts of problems is that the
patterns you are looking for tend not to change very often. For example,
the problem I mention in the preceding paragraph is finding restriction
sites, i.e. the locations where restriction enzymes will cut a strand of
DNA. There's a finite set of commercially available restriction enzymes,
and that list doesn't change from month to month (at this point, maybe
even from year to year).

For more details, see

http://bioinformatics.oxfordjournals.org/content/4/4/459.abstract

Executive summary: I wrote my own regex compiler which was optimized for
the types of patterns this problem required.
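
On the compile-time point: because the site list rarely changes, the big pattern can be built and compiled once and then reused for every sequence. The sketch below assumes Python's standard re module, not the custom compiler mentioned above. The labels site_a and site_b are hypothetical placeholders, not real enzyme names, and the input strings are invented.

```python
import re

# Hypothetical labels mapped to recognition sequences. The two sequences
# are from the post; the labels are placeholders, not real enzyme names.
SITE_TABLE = {"site_a": "GACGTC", "site_b": "TTCGAA"}

# Built and compiled once at import time, then reused for every sequence.
# Each site gets its own named group so a match can report which one hit.
SITE_RE = re.compile(
    "|".join("(?P<%s>%s)" % (label, seq)
             for label, seq in sorted(SITE_TABLE.items()))
)

def label_sites(dna):
    """Return (offset, label) for each non-overlapping match in dna."""
    return [(m.start(), m.lastgroup) for m in SITE_RE.finditer(dna)]

# Invented input sequences, for illustration only.
for dna in ["GACGTCAAA", "CCTTCGAACCGACGTC"]:
    print(dna, label_sites(dna))
```

Here the named groups are wanted, which is the opposite trade-off from the ?: discussion: each one identifies which site matched, via m.lastgroup.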