On Aug 2, 12:34 pm, John Nagle <na...@animats.com> wrote:
> The regular expression "split" behaves slightly differently than string
> split:

I'm going to argue that it's the string split that's behaving oddly. To see why, let's first look at some simple CSV values:

cat,dog
,missing,,values,

How many fields are on each line, and what are they? Here's what re.split(',') says:

>>> re.split(',', 'cat,dog')
['cat', 'dog']
>>> re.split(',', ',missing,,values,')
['', 'missing', '', 'values', '']

Note that missing values are clearly flagged by empty strings in the results. Now let's look at string split:

>>> 'cat,dog'.split(',')
['cat', 'dog']
>>> ',missing,,values,'.split(',')
['', 'missing', '', 'values', '']

The results are the same. Let's try it again, but replacing the commas with spaces (note the two consecutive spaces between "missing" and "values"):

>>> re.split(' ', 'cat dog')
['cat', 'dog']
>>> re.split(' ', ' missing  values ')
['', 'missing', '', 'values', '']
>>> 'cat dog'.split(' ')
['cat', 'dog']
>>> ' missing  values '.split(' ')
['', 'missing', '', 'values', '']

Again the results are the same. However, many people don't like these results, because they feel that whitespace occupies a privileged role. People generally agree that a run of consecutive commas means missing values, but a run of consecutive spaces just means someone held the space bar down too long. To accommodate this viewpoint, the string split is special-cased to behave differently when None is passed as the separator. First, it splits on runs of one or more whitespace characters, like this:

>>> re.split(r'\s+', ' missing  values ')
['', 'missing', 'values', '']
>>> re.split(r'\s+', 'cat dog')
['cat', 'dog']

But it also eliminates any empty strings from the head and tail of the list, because that's what people generally expect when splitting on whitespace:

>>> 'cat dog'.split(None)
['cat', 'dog']
>>> ' missing  values '.split(None)
['missing', 'values']
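
To make the two-step description of split(None) concrete, here is a minimal sketch that reproduces it with re.split. The helper name split_whitespace is hypothetical and is not part of the standard library. It assumes ordinary whitespace (spaces, tabs, newlines) and is only meant to illustrate the behaviour, not to show how str.split is actually implemented.

```python
import re


def split_whitespace(text):
    """Roughly emulate text.split(None) using re.split (hypothetical helper)."""
    # Step 1: split on runs of one or more whitespace characters.
    fields = re.split(r'\s+', text)
    # Step 2: leading or trailing whitespace leaves an empty string at the
    # head or tail of the list; drop those, as split(None) does.
    if fields and fields[0] == '':
        fields = fields[1:]
    if fields and fields[-1] == '':
        fields = fields[:-1]
    return fields


print(split_whitespace(' missing  values '))
print(split_whitespace('cat dog'))
```

For the inputs used in this post, the helper should agree with the built-in: split_whitespace(s) == s.split(None).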