Hi Pierre,

On Mon, 2012-07-16 at 01:54 -0500, Travis Oliphant wrote:
> On Jul 16, 2012, at 1:52 AM, Pierre GM wrote:
>
> > Hello,
> > I'm siding w/ Tom, Nathaniel and Travis. I don't think the change as
> > it is is advisable. It's a regression, and breaking=bad.
> > Now, I can understand your frustration, so, what about a trade-off?
> > The first line w/ a comment after the first 'skip_header' ones
> > should be parsed for column titles (and we call it
> > 'first_commented_line'). We split it along the comment character,
> > say, #. If there's some non-space character before the #, we keep
> > this part of 'first_commented_line' as titles: that should work for
> > your case. If the first non-space character was #, then what comes
> > after are the titles (that's Tom's case and the current default).
> > I'm not looking forward to introducing yet another keyword,
> > genfromtxt is enough of a mess as it is (unless we add a
> > 'need_coffee' one).
> > What y'all think?
>
> That seems like an acceptable proposal --- it is consistent with
> current behavior but also satisfies the use-case (without another
> keyword which is a bonus).
>
> So,
>
> +1 from me.
>
> -Travis

Thanks for jumping in, and for offering a compromise solution. I agree
that genfromtxt() has too many kwargs; it took me several minutes of
reading the docs to realize why it wasn't behaving as expected!

To be ultra clear (since I want to code this), you are suggesting that
'first_commented_line' be a *new* accepted value for the kwarg 'names',
to invoke the behaviour you suggest?

---

If this IS what you mean, I'd counter-propose something in the same
spirit, but a bit simpler… we let the kwarg 'skip_header' take some
additional value, say int(0), int(-1), str('auto'), or True. In this
case, instead of skipping a fixed number of lines, it will skip any
number of consecutive empty OR commented lines; THEN apply the
behaviour you describe.

The semantics of this are more intuitive, because this is what I am
really after: to *skip* a commented *header* of arbitrary length. So my
four examples below could be parsed with:

     1. genfromtxt(..., names=True)
     2. genfromtxt(..., names=True, skip_header=True)
     3. genfromtxt(..., names=True)
     4. genfromtxt(..., names=True, skip_header=True)

…crucially #1 avoids the regression.

Does this seem good to everyone?

---

But if this is NOT what you mean, then what you say does not actually
work with the simple use-case of my Example #2 below. The first
commented line is "# here is a..." with # as the first non-space
character, so the part after becomes the names 'here', 'is', 'a' etc.

In short, the code can't resolve the ambiguity without some extra
information from the user.
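
To make the combined rule concrete, here is a minimal pure-Python sketch
of how the column titles would be picked under it. Everything named here
is an assumption for illustration only: header_names() and its
skip_commented_header flag are hypothetical stand-ins for the proposed
skip_header=True, and none of it is existing NumPy API. It also assumes
that blank lines are always passed over.

```python
def header_names(lines, comments="#", skip_commented_header=False):
    """Return (column_names, index_of_first_line_after_the_header).

    Hypothetical helper, not part of NumPy. It only illustrates the rule
    discussed in this thread.
    """
    for index, line in enumerate(lines):
        before, _, after = line.partition(comments)
        if before.strip():
            # Text precedes the comment character: that text holds the titles.
            return before.split(), index + 1
        if after.strip() and not skip_commented_header:
            # Fully commented line and header skipping is off:
            # the titles are whatever follows the comment character.
            return after.split(), index + 1
        # Otherwise the line is blank, or a comment we were asked to skip.
    raise ValueError("no header line found")


example_1 = ["# gender age weight", "M 21 72.100000"]
example_2 = [
    "# here is a general file comment",
    "# it is spread over multiple lines",
    "gender age weight",
    "M 21 72.100000",
]
example_3 = ["gender age weight # wrong column names", "M 21 72.100000"]
example_4 = [
    "# here is a general file comment",
    "# the columns in this table are",
    "gender age weight # here is a comment on the header line",
    "# following this line are the data",
    "M 21 72.100000",
]

for label, lines, skip in [
    ("Example 1", example_1, False),
    ("Example 2", example_2, True),
    ("Example 3", example_3, False),
    ("Example 4", example_4, True),
]:
    print(label, header_names(lines, skip_commented_header=skip))
```

Calling header_names(example_2) without the flag returns 'here', 'is',
'a', ... as the names, which is the ambiguity described above: only the
user can say whether a leading comment is a header or a titles line.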

> > > On Jul 13, 2012 7:29 PM, "Paul Natsuo Kishimoto"
> > > <m...@paul.kishimoto.name> wrote:
> > > On Fri, 2012-07-13 at 12:13 -0400, Tom Aldcroft wrote:
> > > > On Fri, Jul 13, 2012 at 11:15 AM, Paul Natsuo Kishimoto
> > > > <m...@paul.kishimoto.name> wrote:
> > > > > Hello everyone,
> > > > >
> > > > > I am a longtime NumPy user, and I just filed my first
> > > > > contribution to the code as pull request to fix what I felt
> > > > > was a bug in the behaviour of genfromtxt()
> > > > > https://github.com/numpy/numpy/pull/351
> > > > > It turns out this alters existing behaviour that some people
> > > > > may depend on, so I was encouraged to raise the issue on this
> > > > > list to see what the consensus was.
> > > > >
> > > > > This behaviour happens in the specific situation where:
> > > > >       * Comments are used in the file (the default comment
> > > > >         character is '#', which I'll use here), AND
> > > > >       * The kwarg names=True is given. In this case,
> > > > >         genfromtxt() is supposed to read an initial row
> > > > >         containing the names of the columns and return an
> > > > >         array with a structured dtype.
> > > > >
> > > > > Currently, these options work with a file like (Example #1):
> > > > >
> > > > >     # gender age weight
> > > > >     M 21 72.100000
> > > > >     F 35 58.330000
> > > > >     M 33 21.99
> > > > >
> > > > > …but NOT with a file like (Example #2):
> > > > >
> > > > >     # here is a general file comment
> > > > >     # it is spread over multiple lines
> > > > >     gender age weight
> > > > >     M 21 72.100000
> > > > >     F 35 58.330000
> > > > >     M 33 21.99
> > > > >
> > > > > …genfromtxt() believes the column names are 'here', 'is',
> > > > > 'a', etc., and thinks all of the columns are strings because
> > > > > 'gender', 'age' and 'weight' are not numbers.
> > > > >
> > > > > This is because genfromtxt() (after skipping a number of
> > > > > lines as specified in the optional kwarg skip_header) will
> > > > > use the *first* line it encounters to produce column names.
> > > > > If that line contains a comment character, genfromtxt()
> > > > > discards everything *up to and including* the comment
> > > > > character, and tries to use the content *after* the comment
> > > > > character as headers (Example 3):
> > > > >
> > > > >     gender age weight # wrong column names
> > > > >     M 21 72.100000
> > > > >     F 35 58.330000
> > > > >     M 33 21.99
> > > > >
> > > > > …the resulting column names are 'wrong', 'column' and
> > > > > 'names'.
> > > > >
> > > > > My proposed change was that, if the first (or any subsequent)
> > > > > line contains a comment character, it should be treated as an
> > > > > *actual comment*, and discarded along with anything that
> > > > > follows it on the line.
> > > > >
> > > > > In Example 2, the result would be that the first two lines
> > > > > appear empty (no text before '#'), and the third line
> > > > > ("gender age weight") is used for column names.
> > > > >
> > > > > In Example 3, the result would be that "gender age weight" is
> > > > > used for column names while "# wrong column names" is
> > > > > ignored.
> > > > >
> > > > > BUT!
> > > > >
> > > > > In Example 1, the result would be that the first line appears
> > > > > empty, and "M 21 72.100000" are used for column names.
> > > > >
> > > > > In other words, this change would do away with the previous
> > > > > behaviour where the very first commented line was
> > > > > (magically?) treated not as a comment but instead as column
> > > > > headers. This might break some existing code.
> > > > > On the positive side, it would allow the user to be more
> > > > > liberal with the format of input files (Example 4):
> > > > >
> > > > >     # here is a general file comment
> > > > >     # the columns in this table are
> > > > >     gender age weight # here is a comment on the header line
> > > > >     # following this line are the data
> > > > >     M 21 72.100000
> > > > >     F 35 58.330000 # here is a comment on a data line
> > > > >     M 33 21.99
> > > > >
> > > > > I feel that this is a better/more flexible behaviour for
> > > > > genfromtxt(), but, as stated, I am interested in your
> > > > > thoughts.
> > > > >
> > > > > Cheers,
> > > > > --
> > > > > Paul Natsuo Kishimoto
> > > > >
> > > > > SM candidate, Technology & Policy Program (2012)
> > > > > Research assistant, http://globalchange.mit.edu
> > > > > https://paul.kishimoto.name +1 617 302 6105
> > > > >
> > > > > _______________________________________________
> > > > > NumPy-Discussion mailing list
> > > > > NumPy-Discussion@scipy.org
> > > > > http://mail.scipy.org/mailman/listinfo/numpy-discussion
> > > >
> > > > Hi Paul,
> > > >
> > > > At least in astronomy tabular files with the column definitions
> > > > in the first commented line are reasonably common. This is
> > > > driven in part by wide use of legacy packages like supermongo
> > > > etc that don't have intelligent table readers, so users
> > > > document the column names as a comment line. I think making
> > > > this break might be unfortunate for users in astronomy.
> > > >
> > > > Dealing with commented header definitions is annoying. Not that
> > > > it matters specifically for your genfromtxt() proposal, but in
> > > > the asciitable reader this case is handled with a particular
> > > > reader class that expects the first comment line to contain the
> > > > column definitions:
> > > >
> > > > http://cxc.harvard.edu/contrib/asciitable/#asciitable.CommentedHeader
> > > >
> > > > Cheers,
> > > > Tom
> > >
> > > Tom,
> > >
> > > Thanks for this information.
> > > In thinking about how people would work around this, I figured it
> > > would be fairly easy to discard a comment character that occurred
> > > as the very first character in a file, e.g.:
> > >
> > >     from io import StringIO
> > >     import numpy
> > >     raw = StringIO(open('example.txt').read()[1:])
> > >     data = numpy.genfromtxt(raw, comments='#', names=True)
> > >
> > > …but I realize that making this change in many places would still
> > > be an annoyance.
> > >
> > > I should perhaps also add that my view of 'proper' table formats
> > > is partly influenced by another plotting package, namely pgfplots
> > > for LaTeX (http://pgfplots.sourceforge.net/ ,
> > > http://pgfplots.sourceforge.net/gallery.html) which uses
> > > uncommented headers. To the extent NumPy users are also LaTeX
> > > users, similar semantics could be more friendly.
> > >
> > > Looking forward to more input from other users,
> > > --
> > > Paul Natsuo Kishimoto
> > >
> > > SM candidate, Technology & Policy Program (2012)
> > > Research assistant, http://globalchange.mit.edu
> > > https://paul.kishimoto.name +1 617 302 6105

--
Paul Natsuo Kishimoto

SM candidate, Technology & Policy Program (2012)
Research assistant, http://globalchange.mit.edu
http://paul.kishimoto.name +1 617 302 6105
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion