Re: Parsing text

C or L Smith Wed, 06 May 2009 21:07:49 -0700

> Hi,
> I'm trying to write a fairly basic text parser to split up scenes and
> acts in plays to put them into XML. I've managed to get the text split
> into the blocks of scenes and acts and returned correctly but I'm
> trying to refine this and get the relevant scene number when the split
> is made but I keep getting an NoneType error trying to read the block
> inside the for loop and nothing is being returned. I'd be grateful for
> some suggestions as to how to get this working.
> 
> for scene in text.split('Scene'):
>     num = re.compile("^\s\[0-9, i{1,4}, v]", re.I)
>     textNum = num.match(scene)
>     if textNum:
>         print textNum
>     else:
>         print "No scene number"
>     m = '<div type="scene>'
>     m += scene
>     m += '<\div>'
>     print m
> 
> Thanks, Iain
>


Don't forget that when you split the text, the first piece you get is what came 
*before* the thing you split on, so there won't be a scene number in the first 
piece.

###
>>> 'this foo 1 and that foo 2 and the end'.split('foo')
['this ', ' 1 and that ', ' 2 and the end']
###

If you have material before the first occurrence of the word 'Scene' you will 
want to print that out without decoration.
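
Here is a minimal sketch of that idea, written in Python 3 syntax. The sample 
text and the variable names are made up for illustration; the first piece is 
treated as front matter and only the later pieces are wrapped.

```python
# Hypothetical sample text; the real play would be read from a file.
text = 'Dramatis personae. Scene 1 Enter the king. Scene 2 Exit the king.'

pieces = text.split('Scene')
front_matter, scenes = pieces[0], pieces[1:]

output = []
if front_matter.strip():
    # Whatever precedes the first 'Scene' has no scene number; emit it as is.
    output.append(front_matter)
for scene in scenes:
    output.append('<div type="scene">' + scene + '</div>')

print('\n'.join(output))
```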

Also, it looks like you are trying to say with your regex that the scene number 
will come after some space and be a digit followed by a roman numeral of some 
kind(?). If the number looks like 1iii or 2iv, then you could split your 
text with a regex rather than with the string's split method:

###
>>> import re
>>> scene = re.compile(r'Scene\s+([0-9iIvV]+)')
>>> scene.split('The front matter Scene 1i The beginning was the best. '
...             'Scene 1ii And then came the next act.')
['The front matter ', '1i', ' The beginning was the best. ', '1ii', ' And then came the next act.']
>>> 
###

The \s+ indicates that there will be at least one whitespace character and 
maybe more; the human error factor predicts that somewhere you will use more 
than one space after the word Scene, so \s+ just allows for that possibility.

The 0-9iIvV in the brackets lists the possible characters that might be part of 
your scene number. Since it's unlikely that you will have any word appearing 
after Scene that matches that pattern, it isn't written to be exact in 
specifying what should come next. [1] The parentheses tell what (besides the 
pieces left by removing the split target) should be presented. In this case, 
the parentheses were put around the pattern that (maybe) represented your scene 
number and so those are interspersed with the list of pieces.
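
To connect this back to the original goal, here is a rough sketch (Python 3 
syntax) of walking the split result two items at a time to wrap each scene. 
The n attribute and the sample text are my own assumptions, not something from 
the original script.

```python
import re

scene_re = re.compile(r'Scene\s+([0-9iIvV]+)')
text = ('The front matter Scene 1i The beginning was the best. '
        'Scene 1ii And then came the next act.')

pieces = scene_re.split(text)
# pieces[0] is the front matter; after that the list alternates
# scene number, scene text, scene number, scene text, ...
front_matter = pieces[0]
divs = []
for number, body in zip(pieces[1::2], pieces[2::2]):
    # The n attribute is a hypothetical way to record the scene number.
    divs.append('<div type="scene" n="%s">%s</div>' % (number, body.strip()))

print(front_matter.strip())
for div in divs:
    print(div)
```

Real play text may contain & or <, which would need escaping before it goes 
into XML; the sketch leaves that out.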

/chris

[1] If it were more precise it might be '([1-9][0-9]*(iv|v?i{0,3}))' which 
recognizes that a number should start with 1 or above and perhaps be followed 
by 0 or more digits (including 0) and then come the roman numeral possibilities 
(for up to viii) [2].  That "|" indicates "or" and the parentheses go around 
the roman numeral part to indicate that the "or" doesn't extend back to the 
decimal digits. That extra set of parentheses also means that the split will 
now contain TWO captured pieces between each piece of script. If you put a ? 
after the scene number group, meaning that it may or may not be there, None 
will be returned for each group that did not match:

###
>>> scene = re.compile(r'Scene\s+([1-9][0-9]*(iv|v?i{0,3}))?')
>>> scene.split('The front matter Scene 1i The beginning was the best. '
...             'Scene 1ii And then came the next act. '
...             'Scene The last one has no number.')
['The front matter ', '1i', 'i', ' The beginning was the best. ', '1ii', 'ii', ' And then came the next act. ', None, None, 'The last one has no number.']
>>> 
###
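
If the extra captured piece is not wanted, one option (my suggestion, not part 
of the pattern above) is to make the inner group non-capturing with (?:...), so 
that split returns only one captured piece per scene. A sketch in Python 3 
syntax, with None replaced by an empty string for unnumbered scenes:

```python
import re

# (?:...) groups the alternation without capturing it.
scene_re = re.compile(r'Scene\s+([1-9][0-9]*(?:iv|v?i{0,3}))?')
text = ('The front matter Scene 1i The beginning was the best. '
        'Scene 1ii And then came the next act. '
        'Scene The last one has no number.')

pieces = scene_re.split(text)
numbers = [n or '' for n in pieces[1::2]]   # None becomes ''
bodies = pieces[2::2]
print(numbers)
print(bodies)
```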

[2] http://diveintopython.org/regular_expressions/roman_numerals.html
--
http://mail.python.org/mailman/listinfo/python-list
