I'm trying to write a fairly basic text parser to split up scenes and
acts in plays to put them into XML. I've managed to get the text split
into blocks of scenes and acts and returned correctly. I'm now
trying to refine this to get the relevant scene number when the split
is made, but I keep getting a NoneType error when trying to read the
block inside the for loop, and nothing is being returned. I'd be
grateful for some suggestions as to how to get this working.

    for scene in text.split('Scene'):
        num = re.compile("^\s\[0-9, i{1,4}, v]", re.I)

The first thing that occurs to me is that this should likely be a
raw string to get those backslashes into the regexp. Compare:

    print "^\s\[0-9, i{1,4}, v]"
    print r"^\s\[0-9, i{1,4}, v]"

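For this particular pattern the two forms happen to print the same thing, since "\s" and "\[" are not escapes that Python itself recognizes. The difference bites with sequences such as "\b". A minimal sketch of that (the pattern and sample text are made up for illustration, not taken from your play):

```python
import re

# In a plain string, "\b" is a single backspace character.
plain = "\bScene\b"
# In a raw string, "\b" stays as backslash + "b", which the
# regexp engine reads as a word boundary.
raw = r"\bScene\b"

print(len(plain), len(raw))

# The plain version looks for literal backspaces and finds nothing;
# the raw version matches the word "Scene".
print(re.search(plain, "Scene 1"))
print(re.search(raw, "Scene 1").group(0))
```

Using raw strings for every regexp avoids having to remember which escapes Python consumes before the regexp engine sees them.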
Without an excerpt of the actual text (or at least the lead-in
for each scene), it's hard to tell whether this regex finds what
you expect. It doesn't look like your regexp finds what you may
think it does (it looks like you're using commas to separate
alternatives, but a regexp matches them literally).
Just so you're aware, your split is a bit fragile too, in case
any lines contain "Scene". However, with a proper regexp, you
can even use it to split the scenes *and* tag the scene-number.
Something like
>>> import re
>>> s = """Scene [42]
... this is stuff in the 42nd scene
... Scene [IIV]
... stuff in the other scene
... """
>>> r = re.compile(r"Scene\s+\[(\d+|[ivx]+)]", re.I)
>>> r.split(s)[1:]
['42', '\nthis is stuff in the 42nd scene\n', 'IIV', '\nstuff in the other scene\n']
>>> def grouper(iterable, groupby):
...     iterable = iter(iterable)
...     while True:
...         yield [iterable.next() for _ in range(groupby)]
...
>>> for scene, content in grouper(r.split(s)[1:], 2):
...     print "<div class='scene'><h1>%s</h1><p>%s</p></div>" % (scene, content)
...
<div class='scene'><h1>42</h1><p>
this is stuff in the 42nd scene
</p></div>
<div class='scene'><h1>IIV</h1><p>
stuff in the other scene
</p></div>
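
The session above is Python 2 (print statement, iterable.next()). On Python 3 the grouper would need reworking, because a StopIteration escaping inside a generator is turned into a RuntimeError there. A rough Python 3 sketch of the same split-and-pair idea follows. The names SCENE_RE and split_scenes are my own, and anchoring the pattern to the start of a line with "^" and re.M is an added assumption about how your scene headings are laid out, meant to make the split less fragile:

```python
import re

# Assumed heading format: "Scene [N]" at the start of a line, where N is
# digits or roman-numeral letters. Adjust to match the real text.
SCENE_RE = re.compile(r"^Scene\s+\[(\d+|[ivx]+)\]", re.I | re.M)

def split_scenes(text):
    """Return a list of (scene_number, content) pairs."""
    # Because the pattern has a capturing group, split() keeps the scene
    # numbers in the result; [1:] drops whatever precedes the first heading.
    parts = SCENE_RE.split(text)[1:]
    # Even positions hold scene numbers, odd positions hold scene bodies.
    return list(zip(parts[0::2], parts[1::2]))

sample = """Scene [42]
this is stuff in the 42nd scene
Scene [IIV]
stuff in the other scene
"""

for number, content in split_scenes(sample):
    print("<div class='scene'><h1>%s</h1><p>%s</p></div>" % (number, content))
```

Pairing with slices and zip() sidesteps the generator entirely, and the line-start anchor means a line that merely mentions "Scene [2]" in passing is left alone.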
Play accordingly.
-tkc