On Fri, 2012-07-13 at 12:13 -0400, Tom Aldcroft wrote:
> On Fri, Jul 13, 2012 at 11:15 AM, Paul Natsuo Kishimoto
> <m...@paul.kishimoto.name> wrote:
> > Hello everyone,
> >
> > I am a longtime NumPy user, and I just filed my first contribution to
> > the code as a pull request to fix what I felt was a bug in the
> > behaviour of genfromtxt(): https://github.com/numpy/numpy/pull/351
> >
> > It turns out this alters existing behaviour that some people may depend
> > on, so I was encouraged to raise the issue on this list to see what the
> > consensus was.
> >
> > This behaviour happens in the specific situation where:
> >   * Comments are used in the file (the default comment character is
> >     '#', which I'll use here), AND
> >   * The kwarg names=True is given. In this case, genfromtxt() is
> >     supposed to read an initial row containing the names of the
> >     columns and return an array with a structured dtype.
> >
> > Currently, these options work with a file like (Example #1):
> >
> >     # gender age weight
> >     M 21 72.100000
> >     F 35 58.330000
> >     M 33 21.99
> >
> > …but NOT with a file like (Example #2):
> >
> >     # here is a general file comment
> >     # it is spread over multiple lines
> >     gender age weight
> >     M 21 72.100000
> >     F 35 58.330000
> >     M 33 21.99
> >
> > …genfromtxt() believes the column names are 'here', 'is', 'a', etc., and
> > thinks all of the columns are strings because 'gender', 'age' and
> > 'weight' are not numbers.
> >
> > This is because genfromtxt() (after skipping a number of lines as
> > specified in the optional kwarg skip_header) will use the *first* line
> > it encounters to produce column names. If that line contains a comment
> > character, genfromtxt() discards everything *up to and including* the
> > comment character, and tries to use the content *after* the comment
> > character as headers (Example 3):
> >
> >     gender age weight # wrong column names
> >     M 21 72.100000
> >     F 35 58.330000
> >     M 33 21.99
> >
> > …the resulting column names are 'wrong', 'column' and 'names'.
> >
> > My proposed change was that, if the first (or any subsequent) line
> > contains a comment character, it should be treated as an *actual
> > comment*, and discarded along with anything that follows it on the line.
> >
> > In Example 2, the result would be that the first two lines appear empty
> > (no text before '#'), and the third line ("gender age weight") is used
> > for column names.
> >
> > In Example 3, the result would be that "gender age weight" is used for
> > column names while "# wrong column names" is ignored.
> >
> > BUT!
> >
> > In Example 1, the result would be that the first line appears empty,
> > and "M 21 72.100000" is used for column names.
> >
> > In other words, this change would do away with the previous behaviour
> > where the very first commented line was (magically?) treated not as a
> > comment but instead as column headers. This might break some existing
> > code. On the positive side, it would allow the user to be more liberal
> > with the format of input files (Example 4):
> >
> >     # here is a general file comment
> >     # the columns in this table are
> >     gender age weight # here is a comment on the header line
> >     # following this line are the data
> >     M 21 72.100000
> >     F 35 58.330000 # here is a comment on a data line
> >     M 33 21.99
> >
> > I feel that this is a better/more flexible behaviour for genfromtxt(),
> > but, as stated, I am interested in your thoughts.
> >
> > Cheers,
> > --
> > Paul Natsuo Kishimoto
> >
> > SM candidate, Technology & Policy Program (2012)
> > Research assistant, http://globalchange.mit.edu
> > https://paul.kishimoto.name +1 617 302 6105
> >
> > _______________________________________________
> > NumPy-Discussion mailing list
> > NumPy-Discussion@scipy.org
> > http://mail.scipy.org/mailman/listinfo/numpy-discussion
> >
>
> Hi Paul,
>
> At least in astronomy, tabular files with the column definitions in the
> first commented line are reasonably common. This is driven in part by
> wide use of legacy packages like supermongo etc. that don't have
> intelligent table readers, so users document the column names as a
> comment line. I think making this break might be unfortunate for
> users in astronomy.
>
> Dealing with commented header definitions is annoying. Not that it
> matters specifically for your genfromtxt() proposal, but in the
> asciitable reader this case is handled with a particular reader class
> that expects the first comment line to contain the column definitions:
>
> http://cxc.harvard.edu/contrib/asciitable/#asciitable.CommentedHeader
>
> Cheers,
> Tom
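To make the two behaviours under discussion easier to compare, here is a minimal sketch in plain Python. It is only a simplified model of the header-selection rule as described in the quoted message, not NumPy's actual code; the function names `split_comment`, `names_current` and `names_proposed` are hypothetical and exist only for this illustration.

```python
def split_comment(line, comments="#"):
    """Return (text before the comment character, text after it or None)."""
    before, sep, after = line.partition(comments)
    return before.strip(), (after.strip() if sep else None)


def names_current(lines, comments="#"):
    """Model of the existing rule described in the thread: the first
    non-empty line supplies the names, and if it holds a comment
    character, only the text AFTER that character is used."""
    for line in lines:
        before, after = split_comment(line, comments)
        text = after if after is not None else before
        if text:
            return text.split()
    return []


def names_proposed(lines, comments="#"):
    """Model of the proposed rule: comments are always discarded, and the
    first line with text BEFORE any comment character supplies the names."""
    for line in lines:
        before, _ = split_comment(line, comments)
        if before:
            return before.split()
    return []


# The first lines of Examples 1 to 3 from the message above.
example1 = ["# gender age weight", "M 21 72.100000"]
example2 = ["# here is a general file comment",
            "# it is spread over multiple lines",
            "gender age weight"]
example3 = ["gender age weight # wrong column names", "M 21 72.100000"]

for label, lines in [("1", example1), ("2", example2), ("3", example3)]:
    print(label, names_current(lines), names_proposed(lines))
```

In this model, Example 1 yields the intended names only under the existing rule, while Examples 2 and 3 yield them only under the proposed rule, which is the trade-off being debated.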
Tom,

Thanks for this information. In thinking about how people would work
around this, I figured it would be fairly easy to discard a comment
character that occurred as the very first character in a file, e.g.:

    from io import StringIO
    import numpy

    raw = StringIO(open('example.txt').read()[1:])
    data = numpy.genfromtxt(raw, comments='#', names=True)

…but I realize that making this change in many places would still be an
annoyance.

I should perhaps also add that my view of 'proper' table formats is
partly influenced by another plotting package, namely pgfplots for LaTeX
(http://pgfplots.sourceforge.net/ ,
http://pgfplots.sourceforge.net/gallery.html) which uses uncommented
headers. To the extent NumPy users are also LaTeX users, similar
semantics could be more friendly.

Looking forward to more input from other users,
-- 
Paul Natsuo Kishimoto

SM candidate, Technology & Policy Program (2012)
Research assistant, http://globalchange.mit.edu
https://paul.kishimoto.name +1 617 302 6105
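A slightly more defensive variant of the workaround above might look like the sketch below. It removes the comment marker only when the file actually starts with one, so files with uncommented headers pass through unchanged. The helper name `drop_leading_comment` is hypothetical and not part of NumPy.

```python
import io


def drop_leading_comment(text, comments="#"):
    """Return a file-like object over `text`, with a comment marker that
    opens the very first line removed (if there is one)."""
    if text.startswith(comments):
        text = text[len(comments):]
    return io.StringIO(text)


# Example 1 from the thread: the header sits on a commented first line.
raw = drop_leading_comment("# gender age weight\nM 21 72.100000\n")
print(raw.getvalue())
```

The returned object could then be handed to numpy.genfromtxt(raw, comments='#', names=True) in place of a filename. This would only be needed if the proposed behaviour were adopted, since the existing behaviour already reads names from a commented first line.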