On Jun 4, 1:52 pm, Gerard Flanagan <[EMAIL PROTECTED]> wrote:
> On Jun 2, 10:47 pm, Raymond Hettinger <[EMAIL PROTECTED]> wrote:
>
> > On Jun 2, 10:19 am, Steve Howell <[EMAIL PROTECTED]> wrote:
> >
> > > George Sakkis produced the following cookbook recipe,
> > > which addresses a common problem that comes up on this
> > > mailing list:
> >
> > ISTM, this is a common mailing list problem because it is fun
> > to solve, not because people actually need it on a day-to-day basis.
> >
> > In that spirit, it would be fun to compare several different
> > approaches to the same problem using re.finditer, itertools.groupby,
> > or the tokenize module. To get the ball rolling, here is one variant:
> >
> > from itertools import groupby
> >
> > def blocks(s, start, end):
> >     def classify(c, ingroup=[0], delim={start:2, end:3}):
> >         result = delim.get(c, ingroup[0])
> >         ingroup[0] = result in (1, 2)
> >         return result
> >     return [tuple(g) for k, g in groupby(s, classify) if k == 1]
> >
> > print blocks('the <quick> brown <fox> jumped', start='<', end='>')
> >
> > One observation is that groupby() is an enormously flexible tool.
> > Given a well crafted key= function, it makes short work of almost
> > any data partitioning problem.
>
> Can anyone suggest a function that will split text by paragraphs, but
> NOT if the paragraphs are contained within a [quote]...[/quote]
> construct. In other words, the following text should yield 3 blocks
> not 6:
>
> TEXT = '''
> Lorem ipsum dolor sit amet, consectetuer adipiscing elit.
> Pellentesque dolor quam, dignissim ornare, porta et,
> auctor eu, leo. Phasellus malesuada metus id magna.
>
> [quote]
> Only when flight shall soar
> not for its own sake only
> up into heaven's lonely
> silence, and be no more
>
> merely the lightly profiling,
> proudly successful tool,
> playmate of winds, beguiling
> time there, careless and cool:
>
> only when some pure Whither
> outweighs boyish insistence
> on the achieved machine
>
> will who has journeyed thither
> be, in that fading distance,
> all that his flight has been.
> [/quote]
>
> Integer urna nulla, tempus sit amet, ultrices interdum,
> rhoncus eget, ipsum. Cum sociis natoque penatibus et
> magnis dis parturient montes, nascetur ridiculus mus.
> '''
>
> Other info:
>
> * don't worry about nesting
> * the [quote] and [/quote] musn't be stripped.
>
> Gerard

(Sorry if I ruined the parent thread.) FWIW, I didn't get a groupby
solution, but with some help from the Python Cookbook (O'Reilly) I came
up with the following:

import re

# A line consisting only of an opening tag such as [quote],
# or only of a closing tag such as [/quote].
RE_START_BLOCK = re.compile(r'^\[[\w|\s]*\]$')
RE_END_BLOCK = re.compile(r'^\[/[\w|\s]*\]$')

def iter_blocks(lines):
    block = []
    inblock = False
    for line in lines:
        if line.isspace():
            if inblock:
                # blank line inside [quote]...[/quote]: keep it
                block.append(line)
            elif block:
                # blank line outside a quote ends the paragraph
                yield block
                block = []
        else:
            if RE_START_BLOCK.match(line):
                inblock = True
            elif RE_END_BLOCK.match(line):
                inblock = False
            block.append(line.lstrip())
    if block:
        yield block

-- 
http://mail.python.org/mailman/listinfo/python-list
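
For anyone who wants to try iter_blocks, here is a rough sketch of how it might be called. The function is repeated so the snippet runs on its own under Python 3 (raw strings are used for the patterns). SAMPLE is a shortened stand-in for the TEXT in the question, and the names SAMPLE and blocks are only illustrative assumptions, not part of the original post.

```python
import re

# Same patterns as in the post, written as raw strings for Python 3.
RE_START_BLOCK = re.compile(r'^\[[\w|\s]*\]$')
RE_END_BLOCK = re.compile(r'^\[/[\w|\s]*\]$')

def iter_blocks(lines):
    block = []
    inblock = False
    for line in lines:
        if line.isspace():
            if inblock:
                block.append(line)      # blank line inside a quote is kept
            elif block:
                yield block             # blank line outside a quote ends a block
                block = []
        else:
            if RE_START_BLOCK.match(line):
                inblock = True
            elif RE_END_BLOCK.match(line):
                inblock = False
            block.append(line.lstrip())
    if block:
        yield block

# Shortened sample (hypothetical): two paragraphs around a quote that
# itself contains a blank line.
SAMPLE = """
Lorem ipsum dolor sit amet, consectetuer adipiscing elit.
Pellentesque dolor quam, dignissim ornare, porta et,

[quote]
Only when flight shall soar
not for its own sake only

merely the lightly profiling,
proudly successful tool,
[/quote]

Integer urna nulla, tempus sit amet, ultrices interdum,
rhoncus eget, ipsum.
"""

# splitlines(True) keeps the line endings, which iter_blocks relies on.
blocks = list(iter_blocks(SAMPLE.splitlines(True)))

for number, block in enumerate(blocks, 1):
    print("--- block %d (%d lines) ---" % (number, len(block)))
    print("".join(block), end="")
```

With this sample the quote comes back as a single block, tags included, so there are three blocks in total. One thing to watch: the blank-line test is line.isspace(), and an empty string is not "space", so the lines need to keep their newlines (a file object or splitlines(True) does that; plain splitlines() does not).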