[Numpy-discussion] fromfile() improvements (was: planning for numpy 1.3.0 release)

Christopher Barker Tue, 09 Sep 2008 22:43:46 -0700

Stéfan van der Walt wrote:
> 2008/9/9 Christopher Barker <[EMAIL PROTECTED]>:


>> Anyone want to help with improvements to fromfile() for text files?
> 
> This is low hanging fruit for anyone with some experience in C.  We
> can definitely get it done for 1.3.  Chris, would you file a ticket
> and add the detail from your mailing list posts, if that hasn't
> already been done?

Done:

http://scipy.org/scipy/numpy/ticket/909

( By the way, is there a way to fix the typo in the ticket title? --oops!)

There are a few fromfile() related tickets that I referenced as well.

It's not totally straightforward what should be done, so I've included 
the text of the ticket here to start a discussion:


Proposed Enhancements and bug fixes for fromfile() and fromstring() text 
handling:

Motivation:

The goal of the fromfile() text file handling capability is to enable 
users to write code that can read a lot of numbers from a text file into 
an array. Python provides a lot of nifty text processing capabilities, 
and there are a number of higher level facilities for reading blocks of 
data (including numpy.loadtxt). These are very capable, but there really 
is a significant performance hit, at least when loading tens of thousands 
of numbers from a file.

We don't want to write all of loadtxt() and friends in C. Rather, the 
goal is to allow the simple cases to be done very efficiently, and 
hopefully fancier text reading packages can build on it to add more 
features.

Unfortunately, the current version (numpy 1.2) has a few bugs 
and limitations that keep it from being nearly as useful as it could be.

Possible features:

     * Create fromtextfile() and fromtextstring() functions, distinct from 
fromfile() and fromstring(). Reading text really is different functionality. 
fromfile() could still call fromtextfile() for backward compatibility.

     * Allow more than one separator? For example, a comma or 
whitespace? In the general case, the user could perhaps specify any 
number of separators, though I doubt that would be useful in practice. 
At the very least, however, fromtextfile() should support reading files 
that look like:

       43.5, 345.6, 123.456, 234.33
       34.5, 22.57, 2345,  2345, 252
       ...

That is, comma separated, but being able to read multiple lines in one shot.

The easiest way to support that would probably be to always allow 
whitespace as a separator, and add the one passed in. I can't think of a 
reason not to do this, but maybe I'm not very imaginative.

     * Allow the user to specify a shape for the output array. There may 
be little point, as all this does is save a call to reshape(), but it 
may be another way to support the above. i.e. you could read that data 
with:

     a = np.fromtextfile(infile, dtype=np.float, sep=',', shape=(-1, 4))

     Then it would know to skip the newlines every 4 elements.

     * Allow the user to specify a comment string. The reader would then 
skip everything in the file between the comment string and a newline. 
Maybe Universal newline -- any of \r, \n or \r\n. Or simply expect that 
the user has opened the file with mode 'U' if they want that. This could 
also be extended to support C-style comments with an opening and closing 
character sequence, but that's a lot less common.

     * Allow the user to specify a locale. It may be best to be able to 
specify a locale, rather than relying on the system one (whether '.' or 
',' is the decimal separator, for instance). (ticket #884)

     * Parsing of "Inf" and the like that doesn't depend on the system 
(ticket #510). This would be nice, but maybe too difficult -- would we 
need to write our own scanf?
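
To make the separator and comment proposals above concrete, here is a rough 
pure-Python sketch of the intended parsing rules. It is only an illustration 
under stated assumptions, not the C implementation: parse_numbers() and its 
parameters are hypothetical names, whitespace is always accepted in addition 
to sep, and a malformed token raises ValueError.

```python
import re


def parse_numbers(text, sep=",", comments=None):
    """Hypothetical reference parser for the proposed text-reading rules.

    Returns a flat list of floats. Whitespace (including newlines) always
    separates values; `sep` is accepted as an additional separator.
    """
    if comments:
        # Drop everything from the comment string to the end of each line.
        # str.splitlines() treats \r, \n and \r\n as line endings.
        text = "\n".join(line.split(comments, 1)[0]
                         for line in text.splitlines())

    stripped = text.strip()
    if not stripped:
        return []

    if sep.strip():
        # Either the separator (with optional surrounding whitespace)
        # or a run of whitespace ends a value.
        pattern = r"\s*" + re.escape(sep.strip()) + r"\s*|\s+"
    else:
        pattern = r"\s+"

    # float() raises ValueError on an empty or malformed token, and it
    # accepts "inf" and "nan" in any letter case.
    return [float(token) for token in re.split(pattern, stripped)]


data = "43.5, 345.6, 123.456, 234.33  # first row\n34.5, 22.57, 2345, 252\n"
values = parse_numbers(data, sep=",", comments="#")
```

The flat result could then be given a shape with 
np.array(values).reshape(-1, 4), which is what a shape argument would save.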

Bugs to be fixed:

     * fromfile() and fromstring() handle malformed data poorly: ticket 
#883

     * Any others?


NOTE: my C is pretty lame, or I'd do some of this. I could help out with 
writing tests, etc. though.

Thanks all,

-Chris


-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

[EMAIL PROTECTED]

