Re: Parsing text

Tim Chase Wed, 06 May 2009 12:00:04 -0700

I'm trying to write a fairly basic text parser to split up scenes and
acts in plays to put them into XML. I've managed to get the text split
into the blocks of scenes and acts and returned correctly but I'm
trying to refine this and get the relevant scene number when the split
is made but I keep getting a NoneType error trying to read the block
inside the for loop and nothing is being returned. I'd be grateful for
some suggestions as to how to get this working.


for scene in text.split('Scene'):
    num = re.compile("^\s\[0-9, i{1,4}, v]", re.I)

The first thing that occurs to me is that this should likely be a raw string to get those backslashes into the regexp. Compare:


  print "^\s\[0-9, i{1,4}, v]"
  print r"^\s\[0-9, i{1,4}, v]"
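For this particular pattern the two prints come out the same, because `\s` and `\[` are not escapes that Python itself recognizes (newer Python 3 releases warn about such unrecognized escapes). The difference shows up with a sequence Python does interpret, such as `\b`. A minimal sketch in Python 3 syntax; the pattern and the sample sentence are made up for illustration and are not from the original question:

```python
import re

# In an ordinary string, "\b" is a single backspace character.
plain = "\bcat\b"
# In a raw string, the backslash and the "b" both reach the regex engine,
# where \b means "word boundary".
raw = r"\bcat\b"

print(len(plain), len(raw))             # the two strings differ in length

print(re.search(plain, "a cat sat"))    # looks for backspace characters
print(re.search(raw, "a cat sat"))      # looks for the whole word "cat"
```

Writing every regexp as a raw string avoids having to remember which backslash sequences Python interprets on its own.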

Without an excerpt of the actual text (or at least the lead-in for each scene), it's hard to tell whether this regex finds what you expect. It doesn't look like your regexp finds what you may think it does (it looks like you're using commas as separators, but inside a regexp they match literal commas).
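To make that concrete, here is a small Python 3 sketch of what the pattern from the question matches literally. Since `\[` is an escaped bracket, nothing in it is a character class; the commas, the spaces, and the text `0-9` are all matched as-is. The `intended` pattern is only a guess at what was wanted (optional whitespace followed by digits or a roman numeral) and is an assumption, as are the sample strings:

```python
import re

# The pattern from the question, written as a raw string.
original = re.compile(r"^\s\[0-9, i{1,4}, v]", re.I)

# A scene number after a space does not match: the pattern demands a literal "[".
no_match = original.match(" 3")
print(no_match)

# This odd literal text is what the pattern does match.
odd_match = original.match(" [0-9, ii, v]")
print(odd_match.group())

# Hypothetical replacement: digits or a roman numeral, captured in group 1.
intended = re.compile(r"^\s*(\d+|[ivx]+)\b", re.I)
print(intended.match(" 3\nA street").group(1))
print(intended.match(" IV. A hall").group(1))
```

When `match()` finds nothing it returns `None`, so calling `.group()` on the result fails with an error that mentions `NoneType`. That is one plausible source of the error described in the question, though without the full loop it cannot be confirmed.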

Just so you're aware, your split is a bit fragile too, in case any lines contain "Scene". However, with a proper regexp, you can even use it to split the scenes *and* tag the scene-number. Something like


  >>> import re
  >>> s = """Scene [42]
  ... this is stuff in the 42nd scene
  ... Scene [IIV]
  ... stuff in the other scene
  ... """
  >>> r = re.compile(r"Scene\s+\[(\d+|[ivx]+)]", re.I)
  >>> r.split(s)[1:]
  ['42', '\nthis is stuff in the 42nd scene\n', 'IIV', '\nstuff in the other scene\n']
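Two things are going on in that result: the regexp only splits where "Scene" is followed by a bracketed number, and because the number sits in a capturing group, `split` keeps it in the returned list. A short Python 3 sketch contrasting this with a plain string split; the sample text is invented for illustration:

```python
import re

text = (
    "Scene [1]\n"
    "The Scene opens on a quiet street.\n"
    "Scene [ii]\n"
    "A hall.\n"
)

# A plain string split cuts at every "Scene", including the one in the prose.
naive = text.split("Scene")
print(len(naive))

# The regexp split cuts only at "Scene [number]" and keeps the captured number.
pattern = re.compile(r"Scene\s+\[(\d+|[ivx]+)]", re.I)
parts = pattern.split(text)
print(parts)
```

The first element of `parts` is whatever precedes the first scene heading (an empty string here), which is why the session above slices with `[1:]`.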

  >>> def grouper(iterable, groupby):
  ...     iterable = iter(iterable)
  ...     while True:
  ...             yield [iterable.next() for _ in range(groupby)]
  ...

  >>> for scene, content in grouper(r.split(s)[1:], 2):
  ...     print "<div class='scene'><h1>%s</h1><p>%s</p></div>" % (scene, content)
  ...
  <div class='scene'><h1>42</h1><p>
  this is stuff in the 42nd scene
  </p></div>
  <div class='scene'><h1>IIV</h1><p>
  stuff in the other scene
  </p></div>
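The session above is Python 2. In Python 3, `iterable.next()` becomes `next(iterable)`, `print` is a function, and a `StopIteration` escaping inside a generator is turned into a `RuntimeError`, so the `while True` grouper would need a different way to stop. A possible Python 3 sketch of the same idea follows. It pairs numbers and bodies with slices and `zip` instead of a grouper, and it additionally strips and escapes the text; the function name `scenes_to_markup`, the stripping, and the escaping are assumptions added here, not part of the original reply:

```python
import re
from html import escape

SCENE_RE = re.compile(r"Scene\s+\[(\d+|[ivx]+)]", re.I)

def scenes_to_markup(text):
    """Return one <div> string per scene heading found in text."""
    # Drop whatever precedes the first heading; the rest alternates
    # between captured scene numbers and scene bodies.
    parts = SCENE_RE.split(text)[1:]
    numbers = parts[0::2]
    bodies = parts[1::2]
    template = "<div class='scene'><h1>%s</h1><p>%s</p></div>"
    return [
        # escape() protects characters such as "&" and "<" in the play text.
        template % (escape(num), escape(body.strip()))
        for num, body in zip(numbers, bodies)
    ]

sample = (
    "Scene [42]\n"
    "this is stuff in the 42nd scene\n"
    "Scene [IIV]\n"
    "stuff in the other scene\n"
)
for div in scenes_to_markup(sample):
    print(div)
```

Unlike the session above, this version removes the leading and trailing newlines from each body; drop the `strip()` call if the whitespace matters.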

Play accordingly.

-tkc



