On Fri, Jan 17, 2014 at 1:38 AM, Julian Taylor <jtaylor.deb...@googlemail.com> wrote:
> This thread is getting a little out of hand which is my fault for
> initially mixing different topics in one mail,

still a bit mixed ;-) -- but I think the loadtxt issue requires a lot less discussion, so we're OK there.

There have been a lot of notes here since I last commented, so I'm going to stick with the loadtxt issues in this note:

> - no possibility to specify the encoding of a file in loadtxt
> this is a missing feature, currently it uses the system default which is
> good and should stay that way.

I disagree -- I think using the "system encoding" is a bad idea for a default -- I certainly am far more likely to get data files from some other system than my own -- and really unlikely to use the "system encoding" for any data files I write, either. And I'm not being English-centered here -- my data files commonly do have non-ASCII content in them, though frankly, they are either a mess or I know the encoding.

What should be the default?

latin-1

Why? Despite our desire to be non-English-focused, most of what loadtxt does is parse files for numbers, maybe with a bit of text. Numbers are virtually always ASCII-compatible (am I wrong about that? -- if so you'd damn well better know your encoding!). So it should be an ASCII-compatible encoding.

Why not ascii? -- because then it would barf on non-ASCII text in the file -- really bad idea there.

Why not utf-8? -- this is being *nix-centric -- and utf-8 will work fine on ASCII, but corrupt non-ASCII, non-utf-8 data (i.e. any other encoding) and may barf on some of it too (not sure about that).

latin-1 will never barf on any binary data, will successfully parse any numeric data (plus spaces, commas, etc.), and will preserve the bytes of any non-ASCII content in the file.

If you can set the encoding it's not a huge deal what the default is, but I will recommend that everyone always either sets it to a known encoding or uses latin-1 -- never the system encoding.
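A minimal sketch of the latin-1 argument, in plain Python with no loadtxt involved. The sample bytes and the `try_decode` helper are made up for illustration; only `bytes.decode` and `str.encode` are real library calls. It assumes a line from a data file in some unknown 8-bit encoding, with ASCII numbers and one non-ASCII byte:

```python
# Bytes as they might appear in a data file written in some unknown
# 8-bit encoding: ASCII numbers plus one non-ASCII byte (0xE9).
# This sample line is hypothetical.
raw = b"1.5, 2.25, caf\xe9\n"


def try_decode(data, encoding):
    """Hypothetical helper: return the decoded text, or None on failure."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


ascii_text = try_decode(raw, "ascii")      # fails on the byte 0xE9
utf8_text = try_decode(raw, "utf-8")       # fails: 0xE9 is not followed by continuation bytes
latin1_text = try_decode(raw, "latin-1")   # every byte value is valid latin-1

# latin-1 maps each byte to the code point with the same value, so the
# original bytes can be recovered exactly.
round_trip = latin1_text.encode("latin-1")

# The numeric fields parse no matter what the text field "really" means.
numbers = [float(field) for field in latin1_text.split(",")[:2]]
```

With this particular input the strict ascii and utf-8 decodes both raise, while latin-1 decodes and round-trips to the same bytes. Other non-utf-8 inputs may happen to decode as utf-8 without an error and simply come out as the wrong characters.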
One more point: on my system right now:

    In [15]: sys.getdefaultencoding()
    Out[15]: 'ascii'

please don't make loadtxt start barfing on files I've been reading just fine for years....

> It is only missing an option to tell it to treat it differently.
> There should be little debate about changing the default, especially not
> using latin1. The system default exists for a good reason.

Maybe, maybe not, but I submit that whatever that "good reason" is, it does not apply here! This is kind of like datetime64 using the local timezone -- makes it useless!

> Note on linux it is UTF-8 which is a good choice. I'm not familiar with
> windows but all programs should at least have the option to use UTF-8 as
> output too.

should, yes, so, maybe, but:

a) not all text data files are written recently or by recently updated software.

b) This is kind of like saying we should have loadtxt default to utf-8, which wouldn't be the worst idea -- better than system default, but still not as good as latin-1

This is a simple question: Should the exact same file read fine with the exact same code on one machine, but not another?

I don't think so.

> This has nothing to do with indexing or any kind of processing of the numpy
> arrays.

agreed.

-Chris

--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion