> Hi folks,
>
> I'm taking a look once again at fromfile() for reading text files. I
> often have the need to read a LOT of numbers from a text file, and it
> can actually be pretty darn slow to do it the normal Python way:
>
>     data = []
>     for line in infile:
>         data.extend(map(float, line.split()))
>     a = np.array(data)
>
> or various other versions that are similar. It really does take longer
> to read the text, split it up, and convert it to numbers one at a time
> than it does to simply read the numbers straight into the array.
>
> However, as it stands, fromfile() turns out to be next to useless for
> anything but whitespace-separated text. Full set of ideas here:
>
> http://projects.scipy.org/numpy/ticket/909
>
> For the moment, though, I'm digging into the code to address one
> particular problem -- reading files like this:
>
>     123, 65.6, 789
>     23, 3.2, 34
>     ...
>
> That is, comma- (or whatever-) separated text -- pretty common stuff.
>
> The problem with the current code is that you can't read more than one
> line at a time with fromfile:
>
>     a = np.fromfile(infile, sep=",")
>
> will read until it doesn't find a comma, and thus reads only one line,
> as there is no comma after each line. As this is a really typical case,
> I think it should be supported.
>
> Here is the question:
>
> The work of finding the separator is done in:
>
>     multiarray/ctors.c: fromfile_skip_separator()
>
> It looks like it wouldn't be too hard to add some code in there to look
> for a newline and consider that a valid separator. However, that would
> break backward compatibility, so maybe a flag could be passed in saying
> you want to support newlines. The problem is that the flag would have
> to get passed all the way down to this function (and likewise for
> fromstring).
>
> I also notice that it supports separators of arbitrary length, and I
> wonder how useful that is. It also does odd things with spaces embedded
> in the separator:
>
>     ", $ #" matches all of: ",$#"  ", $#"  ",$ #"
>
> Is it worth trying to fix that?
>
> In the longer term, it would be really nice to support comments as
> well, though that would require more of a refactoring of the code, I
> think (though maybe not -- I suppose a call to
> fromfile_skip_separator() could look for a comment character and, if it
> found one, skip to where the comment ends -- hmmm).
>
> thanks for any feedback,
>
> -Chris
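Pending a fix in fromfile() itself, one workaround for files like the one
above is to flatten the newlines in Python and let fromstring() do the
number parsing at C speed. A minimal sketch -- fromfile_any_sep and
'data.txt' are illustrative names here, not numpy API:

    import numpy as np

    def fromfile_any_sep(infile, sep=',', dtype=float):
        # Hypothetical helper, not part of numpy: slurp the whole file,
        # turn newlines into ordinary separators, then hand the text to
        # fromstring(), which does the parsing in C.  fromstring()
        # ignores extra whitespace around the separator, so ", " in the
        # file is fine; strip() drops the trailing newline so we don't
        # end with a dangling separator.
        text = infile.read().strip().replace('\n', sep)
        return np.fromstring(text, dtype=dtype, sep=sep)

    # For the three-column file shown above:
    a = fromfile_any_sep(open('data.txt'), sep=',').reshape(-1, 3)

This is essentially what a newline-aware separator would do inside
fromfile(), just paying for an extra in-memory copy of the text.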
I agree. I've tried using it, and usually find that it doesn't quite get
there. I rather like the R command(s) for reading text files -- except
then I have to use R, which is painful after using Python and numpy.
Although ggplot2 is awfully nice too... but that is for a later post.

    read.table(file, header = FALSE, sep = "", quote = "\"'", dec = ".",
               row.names, col.names, as.is = !stringsAsFactors,
               na.strings = "NA", colClasses = NA, nrows = -1, skip = 0,
               check.names = TRUE, fill = !blank.lines.skip,
               strip.white = FALSE, blank.lines.skip = TRUE,
               comment.char = "#", allowEscapes = FALSE, flush = FALSE,
               stringsAsFactors = default.stringsAsFactors(),
               fileEncoding = "", encoding = "unknown")

    read.csv(file, header = TRUE, sep = ",", quote = "\"", dec = ".",
             fill = TRUE, comment.char = "", ...)

    read.csv2(file, header = TRUE, sep = ";", quote = "\"", dec = ",",
              fill = TRUE, comment.char = "", ...)

    read.delim(file, header = TRUE, sep = "\t", quote = "\"", dec = ".",
               fill = TRUE, comment.char = "", ...)

    read.delim2(file, header = TRUE, sep = "\t", quote = "\"", dec = ",",
                fill = TRUE, comment.char = "", ...)

There is really only read.table; the others are just aliases with
different defaults. But the flexibility is great, as you can see.

--
------------------------------------------------------------------
| Alan K. Jackson    | To see a World in a Grain of Sand         |
| a...@ajackson.org  | And a Heaven in a Wild Flower,            |
| www.ajackson.org   | Hold Infinity in the palm of your hand    |
| Houston, Texas     | And Eternity in an hour. - Blake          |
------------------------------------------------------------------
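For comparison on the numpy side, the nearest existing counterpart to
read.table is probably loadtxt, which already covers some of those
options (delimiter, comment character, header skipping, column
selection) -- though it parses in pure Python, so it doesn't give the
speed a C-level fromfile() could. A minimal sketch, assuming a
hypothetical 'data.csv' with one header line and '#' comments:

    import numpy as np

    # loadtxt understands a delimiter, a comment character, header
    # skipping, and column selection; here it reads columns 0 and 2
    # of every non-comment line after the first.
    a = np.loadtxt('data.csv', delimiter=',', comments='#',
                   skiprows=1, usecols=(0, 2))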