> Hi,
> I'm trying to write a fairly basic text parser to split up scenes and
> acts in plays to put them into XML. I've managed to get the text split
> into the blocks of scenes and acts and returned correctly but I'm
> trying to refine this and get the relevant scene number when the split
> is made but I keep getting an NoneType error trying to read the block
> inside the for loop and nothing is being returned. I'd be grateful for
> some suggestions as to how to get this working.
>
> for scene in text.split('Scene'):
>     num = re.compile("^\s\[0-9, i{1,4}, v]", re.I)
>     textNum = num.match(scene)
>     if textNum:
>         print textNum
>     else:
>         print "No scene number"
>     m = '<div type="scene>'
>     m += scene
>     m += '<\div>'
>     print m
>
> Thanks, Iain
>
Don't forget that when you split the text, the first piece you get is what came
*before* the thing you split on, so there won't be a scene number in the first
piece.
###
>>> print 'this foo 1 and that foo 2 and the end'.split('foo')
['this ', ' 1 and that ', ' 2 and the end']
###
If you have material before the first occurrence of the word 'Scene' you will
want to print that out without decoration.
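
Here is a minimal sketch of that idea. The sample string and the helper name
wrap_scenes are made up for illustration; the only point is that the first
piece is passed through untouched and only the later pieces get wrapped:

```python
def wrap_scenes(text):
    """Split on the literal word 'Scene' and wrap every piece but the first."""
    pieces = text.split('Scene')
    # pieces[0] is whatever came before the first 'Scene' (possibly '').
    out = [pieces[0]]
    for piece in pieces[1:]:
        # Note the closed attribute quote and the forward slash in </div>.
        out.append('<div type="scene">' + piece + '</div>')
    return out

# Hypothetical sample text, not taken from a real play.
sample = 'Front matter Scene 1 Enter A. Scene 2 Enter B.'
result = wrap_scenes(sample)
for chunk in result:
    print(chunk)
```

Note that the code in your message opens the tag with '<div type="scene>'
(the closing quote is missing) and closes it with '<\div>'; the sketch uses
'<div type="scene">' and '</div>' instead.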
Also, it looks like your regex is trying to say that the scene number comes
after some whitespace and is a digit followed by a roman numeral of some
kind(?). If the number looks like 1iii or 2iv, then you could split your text
with a regex rather than with the string split method:
###
>>> import re
>>> scene = re.compile(r'Scene\s+([0-9iIvV]+)')
>>> scene.split('The front matter Scene 1i The beginning was the best. '
...             'Scene 1ii And then came the next act.')
['The front matter ', '1i', ' The beginning was the best. ', '1ii', ' And then
came the next act.']
###
The \s+ matches one or more whitespace characters; sooner or later someone will
type more than one space after the word Scene, and \s+ allows for that.
The [0-9iIvV]+ lists the characters that might be part of your scene number.
Since it's unlikely that a word appearing after Scene will match that pattern,
it isn't written to be exact about what should come next (though a word that
starts with i or v, as in 'Scene in', would match on its first letter). [1]
The parentheses tell split what (besides the pieces left by removing the split
target) should be returned. In this case, the parentheses were put around the
pattern that (maybe) represents your scene number, and so those numbers are
interspersed with the list of pieces.
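
Since the captured numbers and the pieces of text alternate in the list, you
can pair them up with slices. This is only a sketch: the function name
scenes_to_xml and the n="..." attribute are my own assumptions about what you
want in the XML, and nothing here escapes characters like & or < in the text:

```python
import re

scene_re = re.compile(r'Scene\s+([0-9iIvV]+)')

def scenes_to_xml(text):
    """Return (front_matter, list_of_div_strings) for a play text."""
    parts = scene_re.split(text)
    front = parts[0]        # text before the first scene heading
    numbers = parts[1::2]   # captured scene numbers
    bodies = parts[2::2]    # text following each scene heading
    # The n attribute is an assumed place to store the scene number.
    divs = ['<div type="scene" n="%s">%s</div>' % (n, body.strip())
            for n, body in zip(numbers, bodies)]
    return front, divs

front, divs = scenes_to_xml(
    'The front matter Scene 1i The beginning was the best. '
    'Scene 1ii And then came the next act.')
print(front)
for d in divs:
    print(d)
```

With one capturing group, the list always has the front matter first and then
number/text pairs, which is why the slices start at 1 and 2 and step by 2.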
/chris
[1] If it were more precise it might be '([1-9][0-9]*(iv|v?i{0,3}))', which
says that a number should start with a digit from 1 to 9, perhaps be followed
by more digits (including 0), and then come the roman numeral possibilities
(up to viii) [2]. The "|" means "or", and the parentheses go around the roman
numeral part so that the "or" doesn't extend back to the decimal digits. That
extra set of parentheses also means that the split will now contain TWO
captured pieces between each piece of script. If you put a ? after the scene
number group, meaning that it may or may not be there, None will be returned
for the groups that did not match:
###
>>> scene = re.compile(r'Scene\s+([1-9][0-9]*(iv|v?i{0,3}))?')
>>> scene.split('The front matter Scene 1i The beginning was the best. '
...             'Scene 1ii And then came the next act. '
...             'Scene The last one has no number.')
['The front matter ', '1i', 'i', ' The beginning was the best. ', '1ii', 'ii',
' And then came the next act. ', None, None, 'The last one has no number.']
###
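
With two groups in the pattern, each scene contributes three items to the list
(the whole number, the roman numeral part, and the text), so you would walk the
list in steps of three and check for None. A rough sketch, where the name
numbered_scenes and the 'unnumbered' placeholder are just my choices for the
example:

```python
import re

scene_re = re.compile(r'Scene\s+([1-9][0-9]*(iv|v?i{0,3}))?')

def numbered_scenes(text):
    """Return (front_matter, [(scene_number, scene_text), ...])."""
    parts = scene_re.split(text)
    front, rest = parts[0], parts[1:]
    scenes = []
    # Two capture groups, so every heading yields three items:
    # whole number, roman numeral part, and the text that follows.
    for i in range(0, len(rest), 3):
        number, _roman, body = rest[i:i + 3]
        # number is None when the heading had no scene number.
        scenes.append((number or 'unnumbered', body.strip()))
    return front, scenes

front, scenes = numbered_scenes(
    'The front matter Scene 1i The beginning was the best. '
    'Scene 1ii And then came the next act. '
    'Scene The last one has no number.')
for number, body in scenes:
    print('%s: %s' % (number, body))
```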
[2] http://diveintopython.org/regular_expressions/roman_numerals.html
--
http://mail.python.org/mailman/listinfo/python-list