Pauli Virtanen wrote:
> Mon, 2010-01-04 at 17:05 -0800, Christopher Barker wrote:
>> it also does odd things with spaces
>> embedded in the separator:
>>
>> ", $ #" matches all of: ",$#" ", $#" ",$ #"
> That's a documented feature:

Fair enough.

OK, I've written a patch that allows newlines to be interpreted as
separators in addition to whatever is specified in sep.

In the process of testing, I ran into these issues again, which are
still marked as "needs decision":

http://projects.scipy.org/numpy/ticket/883

In short: what to do with missing values?

I'd like to address this bug, but I need a decision to do so.

My proposal:

Raise a ValueError on missing values.

Justification:

No function should EVER return data that is not there. Period. It is
simply asking for hard-to-find bugs. Therefore:

    fromstring("3, 4,,5", sep=",")

should never, ever, return:

    array([ 3., 4., 0., 5.])

which is what it does now. bad. bad. bad.

Alternatives:

A) Raising a ValueError is the easiest way to get "proper" behavior.
   Folks can use a more sophisticated file reader if they want missing
   values handled. I'm willing to contribute this patch.

B) If the dtype is a floating point type, NaN could fill in the missing
   values -- a fine idea, but you can't use it for integers, and zero
   is a really bad replacement!

C) The user could specify what they want filled in for missing values.
   This is a fine idea, though I'm not sure I want to take the time to
   implement it.

Oh, and this is a bug too, with probably the same solution:

    In [20]: np.fromstring("hjba", sep=',')
    Out[20]: array([ 0.])

    In [26]: np.fromstring("34gytf39", sep=',')
    Out[26]: array([ 34.])

One more unresolved question: what should

    np.fromstring("3, 4, 5,", sep=",")

return? It currently returns:

    array([ 3., 4., 5.])

which seems a bit inconsistent with the missing value handling.

I also found a bug:

    In [6]: np.fromstring("3, 4, 5 , ", sep=",")
    Out[6]: array([ 3., 4., 5., 0.])

So if there is some extra whitespace after the trailing separator, it
does return a missing value. With my proposal, that wouldn't happen,
but you might get an exception. I think you should, but it'll be easier
to implement my "allow newlines" code if not.

So, should I do (A)?
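To make the alternatives concrete, here is a minimal pure-Python sketch of the proposed behavior. It is not the C patch and not NumPy code: the name parse_separated and its missing parameter are hypothetical, chosen only for illustration, and it returns a plain list rather than an array. With no fill value it raises ValueError on an empty or unparsable field (alternative A); passing a fill value covers alternatives B and C.

```python
_RAISE = object()  # sentinel: no fill value given, so missing fields are errors


def parse_separated(text, sep=",", missing=_RAISE):
    """Parse sep-delimited floats without inventing data for empty fields.

    Hypothetical helper for illustration only; not part of NumPy.
    """
    values = []
    for index, field in enumerate(text.split(sep)):
        token = field.strip()
        if not token:
            if missing is _RAISE:
                # Alternative (A): refuse to guess a value.
                raise ValueError(
                    f"missing value in field {index} of {text!r}"
                )
            # Alternatives (B)/(C): caller chose the fill value.
            values.append(missing)
            continue
        # float() itself raises ValueError on junk like "hjba" or "34gytf39".
        values.append(float(token))
    return values
```

Under this sketch, "3, 4,,5" raises unless a fill value such as float("nan") is supplied, and a trailing separator ("3, 4, 5," or "3, 4, 5 , ") is treated as a missing final field in both cases, so the two inputs behave consistently.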
Another question:

I've got a patch mostly working (except for the above issues) that will
allow fromfile/fromstring to read multiline, non-whitespace-separated
data in one shot:

    In [15]: str
    Out[15]: '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12'

    In [16]: np.fromstring(str, sep=',', allow_newlines=True)
    Out[16]:
    array([ 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11., 12.])

I think this is a very helpful enhancement, and, as it is a new kwarg,
backward compatible:

1) Might it be accepted for inclusion?

2) Is the name for the flag OK: "allow_newlines"? It's pretty explicit,
   but also long -- I used it for the flag name in the C code, too.

3) What C datatype should I use for a boolean flag? I used a char, but I
   don't know what the numpy standard is.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA 98115        (206) 526-6317   main reception

chris.bar...@noaa.gov
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion