On Mon, Jan 4, 2010 at 10:39 PM,  <a...@ajackson.org> wrote:
>> Hi folks,
>>
>> I'm taking a look once again at fromfile() for reading text files. I
>> often have the need to read a LOT of numbers from a text file, and it
>> can actually be pretty darn slow to do it the normal Python way:
>>
>> for line in file:
>>     data = map(float, line.strip().split())
>>
>> or various other versions that are similar. It really does take longer
>> to read the text, split it up, convert each piece to a number, and put
>> that number into a numpy array than it does to simply read it straight
>> into the array.
>>
>> However, as it stands, fromfile() turns out to be next to useless for
>> anything but whitespace-separated text. Full set of ideas here:
>>
>> http://projects.scipy.org/numpy/ticket/909
>>
>> For the moment, though, I'm digging into the code to address a
>> particular problem -- reading files like this:
>>
>> 123, 65.6, 789
>> 23, 3.2, 34
>> ...
>>
>> That is, comma (or whatever) separated text -- pretty common stuff.
>>
>> The problem with the current code is that you can't read more than one
>> line at a time with fromfile:
>>
>> a = np.fromfile(infile, sep=",")
>>
>> will read until it doesn't find a comma, and thus only one line, as
>> there is no comma after each line. As this is a really typical case, I
>> think it should be supported.
>>
>> Here is the question:
>>
>> The work of finding the separator is done in
>> multiarray/ctors.c: fromfile_skip_separator()
>>
>> It looks like it wouldn't be too hard to add some code in there to
>> look for a newline and consider that a valid separator. However, that
>> would break backward compatibility, so maybe a flag could be passed
>> in, saying you want to support newlines. The problem is that the flag
>> would have to be passed all the way through to this function (and
>> likewise for fromstring).
>>
>> I also notice that it supports separators of arbitrary length, and I
>> wonder how useful that is.
>> But it also does odd things with spaces embedded in the separator:
>>
>> ", $ #" matches all of:  ",$#"  ", $#"  ",$ #"
>>
>> Is it worth trying to fix that?
>>
>> In the longer term, it would be really nice to support comments as
>> well, though that would require more of a refactoring of the code, I
>> think (though maybe not -- I suppose a call to
>> fromfile_skip_separator() could look for a comment character and, if
>> it found one, skip to where the comment ends -- hmmm).
>>
>> thanks for any feedback,
>>
>> -Chris
>
> I agree. I've tried using it, and usually find that it doesn't quite
> get there.
>
> I rather like the R command(s) for reading text files -- except then I
> have to use R, which is painful after using python and numpy. Although
> ggplot2 is awfully nice too ... but that is a later post.
>
> read.table(file, header = FALSE, sep = "", quote = "\"'",
>            dec = ".", row.names, col.names,
>            as.is = !stringsAsFactors,
>            na.strings = "NA", colClasses = NA, nrows = -1,
>            skip = 0, check.names = TRUE, fill = !blank.lines.skip,
>            strip.white = FALSE, blank.lines.skip = TRUE,
>            comment.char = "#",
>            allowEscapes = FALSE, flush = FALSE,
>            stringsAsFactors = default.stringsAsFactors(),
>            fileEncoding = "", encoding = "unknown")
>
> read.csv(file, header = TRUE, sep = ",", quote = "\"", dec = ".",
>          fill = TRUE, comment.char = "", ...)
>
> read.csv2(file, header = TRUE, sep = ";", quote = "\"", dec = ",",
>           fill = TRUE, comment.char = "", ...)
>
> read.delim(file, header = TRUE, sep = "\t", quote = "\"", dec = ".",
>            fill = TRUE, comment.char = "", ...)
>
> read.delim2(file, header = TRUE, sep = "\t", quote = "\"", dec = ",",
>             fill = TRUE, comment.char = "", ...)
>
> There is really only read.table; the others are just aliases with
> different defaults. But the flexibility is great, as you can see.
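The single-line limitation Chris describes can be reproduced with a short sketch. The file contents and temp-file handling here are illustrative, not from the thread; the workaround of flattening newlines into the separator is one obvious stopgap, not a proposed fix:

```python
# Sketch of the fromfile() limitation discussed above: sep="," does not
# treat the newline as a separator, so parsing stops after line one.
import os
import tempfile

import numpy as np

text = "123, 65.6, 789\n23, 3.2, 34\n"

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write(text)
    path = f.name

a = np.fromfile(path, sep=",")  # only the first line's 3 values

# One workaround: replace newlines with the separator before parsing.
flat = open(path).read().replace("\n", ",")
b = np.fromstring(flat, sep=",")  # all 6 values

os.remove(path)
```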
Aren't the newly improved

numpy.genfromtxt(fname, dtype=<type 'float'>, comments='#',
                 delimiter=None, skiprows=0, converters=None,
                 missing='', missing_values=None, usecols=None,
                 names=None, excludelist=None, deletechars=None,
                 case_sensitive=True, unpack=None, usemask=False,
                 loose=True)

and friends intended to handle all this?

Josef

> --
> -----------------------------------------------------------------------
> | Alan K. Jackson            | To see a World in a Grain of Sand      |
> | a...@ajackson.org          | And a Heaven in a Wild Flower,         |
> | www.ajackson.org           | Hold Infinity in the palm of your hand |
> | Houston, Texas             | And Eternity in an hour. - Blake       |
> -----------------------------------------------------------------------
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
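For reference, a small sketch of genfromtxt on the thread's example data, extended with a comment line and a missing value (the data is hypothetical; `delimiter` and `comments` are as in the signature quoted above):

```python
# genfromtxt handles the cases discussed in the thread: a comma
# delimiter, comment lines, and missing values (nan for floats).
import io

import numpy as np

data = io.StringIO(
    "# a comment line\n"
    "123, 65.6, 789\n"
    "23, , 34\n"          # second field is missing
)

a = np.genfromtxt(data, delimiter=",", comments="#")
# a is a (2, 3) float array; the missing field comes back as nan.
```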