Hi folks,
I'm taking a look once again at fromfile() for reading text files. I
often have the need to read a LOT of numbers from a text file, and it
can actually be pretty darn slow to do it the normal Python way:
for line in file:
    data = map(float, line.strip().split())
or various other versions that are similar. It really does take longer
to read the text, split it up, convert to a number, then put that number
into a numpy array, than it does to simply read it straight into the array.
However, as it stands, fromfile() turns out to be next to useless for
anything but whitespace-separated text. Full set of ideas here:
http://projects.scipy.org/numpy/ticket/909
However, for the moment, I'm digging into the code to address a
particular problem -- reading files like this:
123, 65.6, 789
23, 3.2, 34
...
That is comma (or whatever) separated text -- pretty common stuff.
The problem with the current code is that you can't read more than one
line at a time with fromfile:
a = np.fromfile(infile, sep=",")
will read until it doesn't find a comma, and thus only one line, as
there is no comma after each line. As this is a really typical case, I
think it should be supported.
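
To make the desired result concrete, here is a minimal pure-Python sketch that treats each line break as one more separator. The helper name parse_separated is hypothetical, and this is the slow Python path, not what fromfile() does today; it only shows the output one would hope to get from the two-line example above.

```python
def parse_separated(text, sep=","):
    """Parse sep-separated numbers, treating line breaks as separators too.

    Hypothetical helper for illustration only; not part of numpy.
    """
    # Join the lines with the separator so the whole text is one flat record.
    flat = sep.join(text.splitlines())
    # float() tolerates surrounding spaces; skip empty tokens from blank lines.
    return [float(tok) for tok in flat.split(sep) if tok.strip()]


sample = "123, 65.6, 789\n23, 3.2, 34\n"
print(parse_separated(sample))
# prints [123.0, 65.6, 789.0, 23.0, 3.2, 34.0]
```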
Here is the question:
The work of finding the separator is done in:
multiarray/ctors.c: fromfile_skip_separator()
It looks like it wouldn't be too hard to add some code in there to look
for a newline, and consider that a valid separator. However, that would
break backward compatibility. So maybe a flag could be passed in, saying
you wanted to support newlines. The problem is that flag would have to
get passed all the way through to this function (and also for fromstring).
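
A rough Python model of the flag idea, just to show the logic. This is a sketch, not the C code in ctors.c: the names skip_separator, parse_numbers and allow_newline are made up for illustration, and it assumes the separator is not itself whitespace. With the flag off it stops at the end of the first line, as described above; with the flag on, a line break counts as a separator.

```python
def skip_separator(text, pos, sep=",", allow_newline=False):
    """Return the index just past the separator at pos, or -1 if there is none.

    Illustrative model only; allow_newline is a hypothetical flag.
    """
    n = len(text)
    # Skip blanks (spaces and tabs) in front of the separator.
    while pos < n and text[pos] in " \t":
        pos += 1
    if text.startswith(sep, pos):
        pos += len(sep)
    elif allow_newline and pos < n and text[pos] in "\r\n":
        # A run of line-break characters stands in for one separator.
        while pos < n and text[pos] in "\r\n":
            pos += 1
    else:
        return -1
    # Skip blanks after the separator so the next number starts at pos.
    while pos < n and text[pos] in " \t":
        pos += 1
    return pos


def parse_numbers(text, sep=",", allow_newline=False):
    """Read numbers until no valid separator follows the last one."""
    values, pos, n = [], 0, len(text)
    while pos < n:
        start = pos
        # A number token runs until whitespace or the separator.
        while pos < n and text[pos] not in " \t\r\n" and not text.startswith(sep, pos):
            pos += 1
        if pos == start:
            break
        values.append(float(text[start:pos]))
        pos = skip_separator(text, pos, sep, allow_newline)
        if pos < 0:
            break
    return values


sample = "123, 65.6, 789\n23, 3.2, 34\n"
print(parse_numbers(sample))                      # prints [123.0, 65.6, 789.0]
print(parse_numbers(sample, allow_newline=True))  # prints all six values
```

Keeping the default at False is what would preserve the current behavior for existing callers.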
I also notice that it supports separators of arbitrary length, though I
wonder how useful that is. But it also does odd things with spaces
embedded in the separator:
", $ #" matches all of: ",$#" ", $#" ",$ #"
Is it worth trying to fix that?
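
One way to model the matching seen above is to assume that each space in the separator matches zero or more whitespace characters in the input, while every other character must match literally. That rule is an assumption inferred from the three examples, not a statement about what the C code does, and separator_pattern is a hypothetical helper built on regular expressions:

```python
import re


def separator_pattern(sep):
    """Build a regex in which each space in sep means 'optional whitespace'.

    Assumed model of the behavior; other characters are matched literally.
    """
    literal_parts = [re.escape(part) for part in sep.split(" ")]
    return re.compile(r"\s*".join(literal_parts))


pattern = separator_pattern(", $ #")
for candidate in [",$#", ", $#", ",$ #", ",x#"]:
    print(repr(candidate), bool(pattern.fullmatch(candidate)))
# The first three candidates match; ",x#" does not.
```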
In the longer term, it would be really nice to support comments as well,
though that would require more of a re-factoring of the code, I think
(though maybe not -- I suppose a call to fromfile_skip_separator() could
look for a comment character, then if it found one, skip to where the
comment ends -- hmmm).
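
The comment idea could look roughly like this in Python. It is a sketch only: skip_comment and the "#" comment character are assumptions for illustration, and it assumes a comment always runs to the end of its line.

```python
def skip_comment(text, pos, comment="#"):
    """If a comment starts at pos, return the index where it ends.

    The comment ends at the next line break (or at the end of the text).
    If no comment starts at pos, pos is returned unchanged.
    """
    if text.startswith(comment, pos):
        end = text.find("\n", pos)
        return len(text) if end == -1 else end
    return pos


text = "1, 2 # note\n3, 4"
print(skip_comment(text, 5))  # prints 11, the index of the line break
print(skip_comment(text, 0))  # prints 0, no comment starts here
```

The separator-skipping step would call this first and then carry on from the returned position, so the line break after a comment is handled like any other.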
thanks for any feedback,
-Chris
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception