Re: Text parsing via regex

Shane Geiger Mon, 08 Dec 2008 12:04:26 -0800

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/148061



def wrap(text, width):
   """
   A word-wrap function that preserves existing line breaks
   and most spaces in the text. Expects that existing line
   breaks are posix newlines (\n).
   """
   return reduce(lambda line, word, width=width: '%s%s%s' %
                 (line,
                  ' \n'[(len(line)-line.rfind('\n')-1
                        + len(word.split('\n',1)[0]
                             ) >= width)],
                  word),
                 text.split(' ')
                )

# 2 very long lines separated by a blank line
msg = """Arthur:  "The Lady of the Lake, her arm clad in the purest \
shimmering samite, held aloft Excalibur from the bosom of the water, \
signifying by Divine Providence that I, Arthur, was to carry \
Excalibur. That is why I am your king!"

Dennis:  "Listen. Strange women lying in ponds distributing swords is \
no basis for a system of government. Supreme executive power derives \
from a mandate from the masses, not from some farcical aquatic \
ceremony!\""""

# example: make it fit in 40 columns
print(wrap(msg,40))

# result is below
"""
Arthur:  "The Lady of the Lake, her arm
"""


Robocop wrote:

I'm having a little text parsing problem that i think would be really
quick to troubleshoot for someone more versed in python and Regexes.
I need to write a simple script that parses some arbitrarily long
string every 50 characters, and does not parse text in the middle of
words (but ultimately every parsed string should be 50 characters, so
adding in white spaces is necessary).  So i immediately came up with
something along the lines of:

string = "a bunch of nonsense that could be really long, or really
short depending on the situation"
r = re.compile(r".{50}")
m = r.match(string)

then i started to realize that i didn't know how to do exactly what i
wanted.  At this point i wanted to find a way to simply use something
like:

parsed_1, parsed_2,...parsed_n = m.groups()

However i'm having several problems.  I know that playskool regular
expression i wrote above will only parse every 50 characters, and will
blindly cut words in half if the parsed string doesn't end with a
whitespace.  I'm relatively new to regexes and i don't know how to
have it take that into account, or even what type of logic i would
need to fill in the extra whitespaces to make the string the proper
length when avoiding cutting words up.  So that's problem #1.  Problem
#2 is that because the string is of arbitrary length, i never know how
many parsed strings i'll have, and thus do not immediately know how
many variables need to be created to accompany them.  It's easy enough
with each pass of the function to find how many i will have by doing:
mag = len(string)
upper_lim = mag/50 + 1
But i'm not sure how to declare and set them to my parsed strings.
Now problem #1 isn't as pressing, i can technically get away with
cutting up the words, i'd just prefer not to.  The most pressing
problem right now is #2.  Any help, or suggestions would be great,
anything to get me thinking differently is helpful.
--
http://mail.python.org/mailman/listinfo/python-list



--
Shane Geiger
IT Director
National Council on Economic Education
[EMAIL PROTECTED]  |  402-438-8958  |  http://www.ncee.net

Leading the Campaign for Economic and Financial Literacy

--
http://mail.python.org/mailman/listinfo/python-list

Re: Text parsing via regex

Reply via email to