Re: [Numpy-discussion] Fast Reading of ASCII files

2011-12-14 Thread Ralf Gommers
On Wed, Dec 14, 2011 at 4:11 PM, Bruce Southey bsout...@gmail.com wrote:

 On 12/14/2011 01:03 AM, Chris Barker wrote:



 On Tue, Dec 13, 2011 at 1:21 PM, Ralf Gommers ralf.gomm...@googlemail.com
  wrote:


   genfromtxt sure looks close for an API


 This I don't agree with. It has a huge number of keywords that just
 confuse or intimidate a beginning user. There should be a dead-simple
 interface; even the loadtxt API is on the heavy side.


 well, yes, though it does do a lot -- do you have a simpler one in mind?

 Just looking at what I normally wouldn't need for simple data files and/or
what a beginning user won't understand at once, the `unpack` and `ndmin`
keywords could certainly be left out. `converters` is also questionable.
That's probably as simple as it can get.
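
For illustration, a minimal sketch of that kind of stripped-down interface (a
hypothetical loadcsv wrapper, not a concrete proposal):

    import numpy as np

    def loadcsv(fname, delimiter=None, skiprows=0, comments='#'):
        # Only the arguments a beginner is likely to need; everything else
        # stays at loadtxt's defaults.
        return np.loadtxt(fname, delimiter=delimiter, skiprows=skiprows,
                          comments=comments)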

Note that I don't think this should be changed now; that's not worth the
trouble.

  But anyway, the really simple cases are really simple, even with
 genfromtxt.

 I guess it's a matter of debate about what is a better API:

 a few functions, each adding a layer of sophistication

 or

 one function, with layers of sophistication added with an array of keyword
 arguments.

 There's always a trade-off, but looking at the docstring for genfromtxt
should make it an easy call in this case.

  In either case, though, I wish the multiple functionality were built on the
 same, well-optimized core code.

 I wish that too, but I'm fairly certain that you can't write that core
code with the ability to handle missing and irregular data and make it
close to the same speed as an optimized reader for regular data.

  I am not sure that you can even create a simple API here, as even Python's
 csv module is rather complex, especially when it just reads data as strings.
 It also 'hides' many arguments in the Dialect class, although these are just
 the collection of 7 'fmtparam' arguments. It also provides the Sniffer
 class, which tries to find the correct format that can then be passed to the
 reader function. Then you still have to convert the data into the required
 types - another set of arguments, as well as yet another pass through the
 data.

 In comparison, genfromtxt can perform sniffing


I assume you mean the ``dtype=None`` example in the docstring? That works
to some extent, but you still need to specify the delimiter. I commented on
that on the loadtable PR.
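
For reference, a minimal sketch of that auto-detection (illustrative data;
note the delimiter still has to be spelled out):

    import numpy as np

    lines = ["x,y,label", "1.0,2.0,a", "3.5,4.5,b"]
    arr = np.genfromtxt(lines, delimiter=',', names=True, dtype=None)
    # Column types are inferred per column (floats plus a string field),
    # but delimiter=',' had to be given explicitly.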


 and both genfromtxt and loadtxt can read and convert the data. These also
 add some useful features like skipping rows (start, end and commented) and
 columns. However, it could be possible to create a sniffer function and a
 single data reader function, leading to a 'simple' reader function, but that
 probably would not change the API of the underlying data reader function.


Better auto-detection of things like delimiters would indeed be quite
useful.
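
For comparison, the stdlib's csv.Sniffer can already detect a delimiter from
a small sample (illustrative data):

    import csv

    sample = "1.5;2.5;foo\n3.5;4.5;bar\n"
    dialect = csv.Sniffer().sniff(sample)
    # dialect.delimiter is now ';' -- a reader could sniff a short sample
    # like this and hand the result to the fast parser.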

Ralf


Re: [Numpy-discussion] Fast Reading of ASCII files

2011-12-14 Thread Benjamin Root
On Wed, Dec 14, 2011 at 1:22 PM, Ralf Gommers
ralf.gomm...@googlemail.com wrote:



 On Wed, Dec 14, 2011 at 4:11 PM, Bruce Southey bsout...@gmail.com wrote:

 On 12/14/2011 01:03 AM, Chris Barker wrote:



 On Tue, Dec 13, 2011 at 1:21 PM, Ralf Gommers 
 ralf.gomm...@googlemail.com wrote:


   genfromtxt sure looks close for an API


 This I don't agree with. It has a huge number of keywords that just
 confuse or intimidate a beginning user. There should be a dead-simple
 interface; even the loadtxt API is on the heavy side.


 well, yes, though it does do a lot -- do you have a simpler one in mind?

 Just looking at what I normally wouldn't need for simple data files
 and/or what a beginning user won't understand at once, the `unpack` and
 `ndmin` keywords could certainly be left out. `converters` is also
 questionable. That's probably as simple as it can get.


Just my two cents (and I was one of those who championed its inclusion):
the ndmin feature is designed to prevent unexpected results that users
(particularly beginners) may encounter with their datasets.  It may be
difficult to tell a beginner *why* they might need to be aware of it, but
it is very easy to describe *how* to use it.  How many dimensions is
your data? Two? Ok, just set ndmin=2 and you are good to go!
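
A quick sketch of the surprise this avoids (hypothetical one-row file):

    import numpy as np

    one_row = ["1.0 2.0 3.0"]
    np.loadtxt(one_row).shape           # (3,)  -- squeezed down to 1-D
    np.loadtxt(one_row, ndmin=2).shape  # (1, 3) -- 2-D, as the file looks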

Cheers!
Ben Root


Re: [Numpy-discussion] Fast Reading of ASCII files

2011-12-14 Thread Chris Barker
On Wed, Dec 14, 2011 at 11:36 AM, Benjamin Root ben.r...@ou.edu wrote:
 well, yes, though it does do a lot -- do you have a simpler one in mind?

 Just looking at what I normally wouldn't need for simple data files and/or
 what a beginning user won't understand at once, the `unpack` and `ndmin`
 keywords could certainly be left out. `converters` is also questionable.
 That's probably as simple as it can get.

this may be a function of a well-written docstring -- if it is clear
to the newbie that all the rest of this you don't need unless you
have a weird data file, then extra keyword arguments don't really
hurt.

A few examples of the basic use-cases go a long way.

And yes, the core reader for the complex cases isn't going to be fast
(it's going to be complex C code...), but we could still have a core
reader that handles most cases.

Anyway, I think it's time to write code and see if it can be rolled in somehow...

-Chris

--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Numpy-discussion] Fast Reading of ASCII files

2011-12-14 Thread Ralf Gommers
On Wed, Dec 14, 2011 at 9:54 PM, Chris Barker chris.bar...@noaa.gov wrote:

 On Wed, Dec 14, 2011 at 11:36 AM, Benjamin Root ben.r...@ou.edu wrote:
  well, yes, though it does do a lot -- do you have a simpler one in mind?
 
  Just looking at what I normally wouldn't need for simple data files
 and/or
  what a beginning user won't understand at once, the `unpack` and `ndmin`
  keywords could certainly be left out. `converters` is also questionable.
  That's probably as simple as it can get.

 this may be a function of a well-written docstring -- if it is clear
 to the newbie that all the rest of this you don't need unless you
 have a weird data file, then extra keyword arguments don't really
 hurt.

 A few examples of the basic use-cases go a long way.

 And yes, the core reader for the complex cases isn't going to be fast
 (it's going to be complex C code...), but we could still have a core
 reader that handles most cases.

 Okay, now we're on the same page I think.


 Anyway, I think it's time to write code and see if it can be rolled in
 somehow...

 Agreed.

Ralf


Re: [Numpy-discussion] Fast Reading of ASCII files

2011-12-13 Thread Chris Barker
NOTE:

Let's keep this on the list.

On Tue, Dec 13, 2011 at 9:19 AM, denis denis-bz...@t-online.de wrote:

 Chris,
  unified, consistent save / load is a nice goal

 1) header lines with date, pwd etc.: where'd this come from ?

# (5, 5)  svm.py  bz/py/ml/svm  2011-12-13 Dec 11:56  -- automatic
# 80.6 % correct -- user info
  24539 4 526
...

I'm not sure I understand what you are expecting here: what would be
automatic? If it parses a datetime in the header, what would it do with it?
But anyway, this seems to me:
  - very application-specific -- this is for the user's code to write
  - not what we are talking about at this point anyway -- I think this
discussion is about a lower-level, does-the-simple-things-fast reader --
one that may or may not be able to form the basis of a higher-level,
fuller-featured reader.


 2) read any CSVs: comma or blank-delimited, with/without column names,
a la loadcsv() below


yup -- though the column name reading would be part of a higher-level
reader as far as I'm concerned.


 3) sparse or masked arrays ?

 sparse probably not, that seems pretty domain-dependent to me -- though
hopefully one could build such a thing on top of the lower-level reader.
 Masked support would be good -- once we're convinced what the future of
masked arrays is in numpy. I was thinking that the masked-array issue
would really be a higher-level feature -- it certainly could be if you need
to mask special-value style files (i.e. ), but we may have to build
it into the lower-level reader for cases where the mask is specified by
non-numerical values -- i.e. there are some met files that use MM or some
other text, so you can't put it into a numerical array first.
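
genfromtxt can already do a version of this at the Python level; a small
sketch with made-up data, assuming 'MM' is the missing-value marker:

    import numpy as np

    lines = ["1.0,MM,3.0", "4.0,5.0,MM"]
    arr = np.genfromtxt(lines, delimiter=',', missing_values='MM',
                        usemask=True)
    # arr is a masked array; the 'MM' entries are masked instead of being
    # forced through float conversion.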


 Longterm wishes: beyond the scope of one file - one array
 but essential for larger projects:
 1) dicts / dotdicts:
    Dotdict( A=anysizearray, N=scalar ... ) - a directory of little files
    is easy, better than np.savez
    (Haven't used hdf5; I believe Matlab v7 does.)

 2) workflows: has anyone there used visTrails ?


outside of the spec of this thread...


 Anyway it seems to me (old grey cynic) that Numpy/scipy developers
 prefer to code first, spec and doc later. Too pessimistic ?


Well, I think many of us believe in a more agile-style approach --
incremental development. But as an open source project, it's really
about scratching an itch -- so there is usually a spec in mind for the itch
at hand. In this case, however, that has been a weakness -- clearly a
number of us have written small solutions to our particular problem at hand,
but we haven't arrived at a more general-purpose solution yet. So a bit
of spec-ing ahead of time may be called for.

On that:

I've been thinking from the bottom up -- imagining what I need for the simple
case, and how it might apply to more complex cases -- but maybe we should
think about this another way:

What we're talking about here is really about core software engineering --
optimization. It's easy to write a pure-Python simple file parser, and
reasonable to write a complex one (genfromtxt) -- the issue is performance
-- we need some more C (or Cython) code to really speed it up, but none of
us wants to write the complex-case code in C. So:

genfromtxt is really nice for many of the complex cases. So perhaps
another approach is to look at genfromtxt, and see what
high-performance lower-level functionality we could develop that could make
it fast -- then we are done.

This actually mirrors exactly what we all usually recommend for Python
development in general -- write it in Python, then, if it's really not fast
enough, write the bottleneck in C.

So where are the bottlenecks in genfromtxt? Are there self-contained
portions that could be re-written in C/Cython?
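
One way to start answering that: a quick profiling sketch (made-up data, not
a benchmark):

    import cProfile
    import numpy as np

    lines = ["%d,%d,%d" % (i, i + 1, i + 2) for i in range(100000)]
    cProfile.run("np.genfromtxt(lines, delimiter=',')", sort='cumulative')
    # Run at module level so cProfile.run can see `lines` and `np`; the
    # per-line splitting and per-field converter calls are likely places
    # to look for self-contained hot spots.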

-Chris






-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Numpy-discussion] Fast Reading of ASCII files

2011-12-13 Thread Bruce Southey

On 12/13/2011 12:08 PM, Chris Barker wrote:

<snip>

So where are the bottlenecks in genfromtxt? Are there self-contained
portions that could be re-written in C/Cython?


-Chris



Reading data is hard and writing code that suits the diversity in the 
Numerical Python community is even harder!


Both loadtxt and genfromtxt functions (other functions are perhaps less
important) perhaps need an upgrade to incorporate the new NA object. I
think that adding the NA object will simplify some of the process, because
invalid data (missing, or a string in a numerical field) can be set to
NA without requiring the creation of a new masked array or returning an
error.


Here I think loadtxt is a better target than genfromtxt because, as I
understand it, it assumes the user really knows the data, whereas
genfromtxt can ask the data for the appropriate format.


So I agree that a new 'superfast custom CSV reader for well-behaved data'
function would be rather useful, especially as a replacement for
loadtxt. By that I mean reading data using a user-specified format that
essentially follows the CSV format
(http://en.wikipedia.org/wiki/Comma-separated_values) - its needs are to
allow for the NA object, skipping lines and user-defined delimiters.
Re: [Numpy-discussion] Fast Reading of ASCII files

2011-12-13 Thread Chris Barker
On Tue, Dec 13, 2011 at 11:29 AM, Bruce Southey bsout...@gmail.com wrote:

 Reading data is hard and writing code that suits the diversity in the
 Numerical Python community is even harder!


yup

Both loadtxt and genfromtxt functions (other functions are perhaps less
 important) perhaps need an upgrade to incorporate the new NA object.


yes, if we are satisfied that the new NA object is, in fact, the way of the
future.


  Here I think loadtxt is a better target than genfromtxt because, as I
  understand it, it assumes the user really knows the data, whereas
  genfromtxt can ask the data for the appropriate format.

  So I agree that a new 'superfast custom CSV reader for well-behaved data'
  function would be rather useful, especially as a replacement for loadtxt.
  By that I mean reading data using a user-specified format that essentially
  follows the CSV format (
  http://en.wikipedia.org/wiki/Comma-separated_values) - its needs are to
  allow for the NA object, skipping lines and user-defined delimiters.


I think that ideally, there could be one interface to reading tabular data
-- hopefully, it would be easy for the user to specify what they want, and
if they don't, the code tries to figure it out. Also, under the hood, the
easy cases are special-cased to high-performing versions.

genfromtxt sure looks close for an API -- it just needs the
high-performance special cases under the hood. It may be that the way it's
designed makes it very difficult to do that, though -- I haven't looked
closely enough to tell.

At least that's what I'm thinking at the moment.

-Chris



-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Numpy-discussion] Fast Reading of ASCII files

2011-12-13 Thread Ralf Gommers
On Tue, Dec 13, 2011 at 10:07 PM, Chris Barker chris.bar...@noaa.gov wrote:

 On Tue, Dec 13, 2011 at 11:29 AM, Bruce Southey bsout...@gmail.com wrote:

 Reading data is hard and writing code that suits the diversity in the
 Numerical Python community is even harder!


 yup

 Both loadtxt and genfromtxt functions (other functions are perhaps less
 important) perhaps need an upgrade to incorporate the new NA object.


 yes, if we are satisfied that the new NA object is, in fact, the way of the
 future.


  Here I think loadtxt is a better target than genfromtxt because, as I
  understand it, it assumes the user really knows the data, whereas
  genfromtxt can ask the data for the appropriate format.

  So I agree that a new 'superfast custom CSV reader for well-behaved data'
  function would be rather useful, especially as a replacement for loadtxt.
  By that I mean reading data using a user-specified format that essentially
  follows the CSV format (
  http://en.wikipedia.org/wiki/Comma-separated_values) - its needs are to
  allow for the NA object, skipping lines and user-defined delimiters.


  I think that ideally, there could be one interface to reading tabular data
  -- hopefully, it would be easy for the user to specify what they want, and
  if they don't, the code tries to figure it out. Also, under the hood, the
  easy cases are special-cased to high-performing versions.

 genfromtxt sure looks close for an API


This I don't agree with. It has a huge number of keywords that just confuse
or intimidate a beginning user. There should be a dead-simple interface;
even the loadtxt API is on the heavy side.

Ralf



  -- it just needs the high-performance special cases under the hood. It
 may be that the way it's designed makes it very difficult to do that,
 though -- I haven't looked closely enough to tell.

 At least that's what I'm thinking at the moment.

 -Chris





Re: [Numpy-discussion] Fast Reading of ASCII files

2011-12-13 Thread Wes McKinney
On Mon, Dec 12, 2011 at 12:34 PM, Warren Weckesser
warren.weckes...@enthought.com wrote:


 On Mon, Dec 12, 2011 at 10:22 AM, Chris.Barker chris.bar...@noaa.gov
 wrote:

  <snip>


 Would anyone like to join/form a small working group to work on this?

 Wes, I'd like to see your Cython version -- maybe a starting point?

 -Chris



 I'm also working on a faster text file reader, so count me in.  I've been
 experimenting in both C and Cython.   I'll put it on github as soon as I
 can.

 Warren







Cool, Warren, I look forward to seeing it. I'm hopeful we can craft a
performant tool that will meet the needs of many projects (NumPy,
pandas, etc.)...


Re: [Numpy-discussion] Fast Reading of ASCII files

2011-12-12 Thread Chris.Barker
On 12/11/11 8:40 AM, Ralf Gommers wrote:
 On Wed, Dec 7, 2011 at 7:50 PM, Chris.Barker chris.bar...@noaa.gov
 * If we have a good, fast ascii (or unicode?) to array reader, hopefully
 it could be leveraged for use in the more complex cases. So that rather
 than genfromtxt() being written from scratch, it would be a wrapper
 around the lower-level reader.

 You seem to be contradicting yourself here. The more complex cases are
 Wes' 10% and why genfromtxt is so hairy internally. There's always a
 trade-off between speed and handling complex corner cases. You want both.

I don't think the version in my mind is contradictory (Not quite).

What I'm imagining is that a good, fast ascii to numpy array reader
could read a whole table in at once (the common, easy, fast case), but
it could also be used to read snippets of a file at a time, which
could be leveraged to handle many of the more complex cases.
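
A rough sketch of that snippets-at-a-time idea, with genfromtxt standing in
for the fast low-level reader (hypothetical iter_chunks helper):

    import itertools
    import numpy as np

    def iter_chunks(f, chunk_lines, **kwargs):
        # Hand the reader one slice of the file at a time; a higher-level
        # wrapper could assemble or post-process the resulting arrays.
        while True:
            chunk = list(itertools.islice(f, chunk_lines))
            if not chunk:
                break
            yield np.genfromtxt(chunk, **kwargs)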

I suppose there will always be cases where the user needs to write their
own converter from string to dtype, and there is simply no way to
leverage what I'm imagining to support that.

Hmm, maybe there is -- for instance, if a record consisted of mostly
standard, easy-to-parse numbers, but one field was some weird text that
needed custom parsing, we could read it as a dtype, with a string for 
that one weird field, and that could be converted in a post-processing step.

Maybe that wouldn't be any faster or easier, but it could be done...
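
A sketch of that two-pass idea, with a made-up record format whose last
field is a weird time string:

    import numpy as np

    lines = ["1.0 2.0 03h15", "3.0 4.0 12h30"]
    raw = np.loadtxt(lines, dtype=[('x', 'f8'), ('y', 'f8'), ('t', 'S5')])
    # Post-processing step: parse the odd field after the fast read.
    hours = np.array([float(h) + float(m) / 60.0
                      for h, m in (s.decode().split('h') for s in raw['t'])])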

Anyway, whether you can leverage it for the full-featured version or 
not, I do think there is call for a good, fast, 90% case text file parser.


Would anyone like to join/form a small working group to work on this?

Wes, I'd like to see your Cython version -- maybe a starting point?

-Chris



-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Numpy-discussion] Fast Reading of ASCII files

2011-12-12 Thread Warren Weckesser
On Mon, Dec 12, 2011 at 10:22 AM, Chris.Barker chris.bar...@noaa.gov wrote:

 <snip>


 Would anyone like to join/form a small working group to work on this?

 Wes, I'd like to see your Cython version -- maybe a starting point?

 -Chris



I'm also working on a faster text file reader, so count me in.  I've been
experimenting in both C and Cython.   I'll put it on github as soon as I
can.

Warren







Re: [Numpy-discussion] Fast Reading of ASCII files

2011-12-11 Thread Ralf Gommers
On Wed, Dec 7, 2011 at 7:50 PM, Chris.Barker chris.bar...@noaa.gov wrote:

 Hi folks,

 This is a continuation of a conversation already started, but I gave it
 a new, more appropriate, thread and subject.

 On 12/6/11 2:13 PM, Wes McKinney wrote:
  we should start talking
  about building a *high performance* flat file loading solution with
  good column type inference and sensible defaults, etc.
 ...

   I personally don't
  believe in sacrificing an order of magnitude of performance in the 90%
  case for the 10% case-- so maybe it makes sense to have two functions
  around: a superfast custom CSV reader for well-behaved data, and a
  slower, but highly flexible, function like loadtable to fall back on.

 I've wanted this for ages, and have done some work towards it, but like
 others, only had the time for a my-use-case-specific solution. A few
 thoughts:

 * If we have a good, fast ascii (or unicode?) to array reader, hopefully
 it could be leveraged for use in the more complex cases. So that rather
 than genfromtxt() being written from scratch, it would be a wrapper
 around the lower-level reader.


You seem to be contradicting yourself here. The more complex cases are Wes'
10% and why genfromtxt is so hairy internally. There's always a trade-off
between speed and handling complex corner cases. You want both.

A very fast reader for well-behaved files would be very welcome, but I see
it as a separate topic from genfromtxt/loadtable. The question for the
loadtable pull request is whether it is different enough from genfromtxt
that we need/want both, or whether loadtable should replace genfromtxt.

Cheers,
Ralf




[Numpy-discussion] Fast Reading of ASCII files

2011-12-07 Thread Chris.Barker
Hi folks,

This is a continuation of a conversation already started, but I gave it
a new, more appropriate, thread and subject.

On 12/6/11 2:13 PM, Wes McKinney wrote:
 we should start talking
 about building a *high performance* flat file loading solution with
 good column type inference and sensible defaults, etc.
...

  I personally don't
 believe in sacrificing an order of magnitude of performance in the 90%
 case for the 10% case-- so maybe it makes sense to have two functions
 around: a superfast custom CSV reader for well-behaved data, and a
 slower, but highly flexible, function like loadtable to fall back on.

I've wanted this for ages, and have done some work towards it, but like 
others, only had the time for a my-use-case-specific solution. A few 
thoughts:

* If we have a good, fast ascii (or unicode?) to array reader, hopefully 
it could be leveraged for use in the more complex cases. So that rather 
than genfromtxt() being written from scratch, it would be a wrapper 
around the lower-level reader.

* Key to performance is to have the text-to-number-to-numpy-type conversion
happening in C -- if you read the text with Python, then convert to
numbers, then to numpy arrays, it's simply going to be slow.

* I think we want a solution that can be adapted to arbitrary text files 
-- not just tabular, CSV-style data. I have a lot of those to read - and 
some thoughts about how.

Efforts I have made so far, and what I've learned from them:

1) fromfile():
 fromfile (for text) is nice and fast, but buggy, and a bit too
limited. I've posted various notes about this in the past (and, I'm
pretty sure, a couple of tickets). The key missing features are:
   a) no support for commented lines (this is a lesser need, I think)
   b) there can be only one delimiter, and newlines are treated as
generic whitespace. What this means is that if you have a
whitespace-delimited file, you can read multiple lines, but if it is,
for instance, comma-delimited, then you can only read one line at a
time, killing performance.
   c) there are various bugs if the text is malformed, or doesn't quite
match what you're asking for (i.e. reading integers, but the text is
float) -- mostly really limited error checking.
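
A small demonstration of point (b), as I understand the behavior (throwaway
files):

    import numpy as np

    np.savetxt('ws.txt', np.arange(6.0).reshape(2, 3), delimiter=' ')
    np.fromfile('ws.txt', sep=' ')   # all 6 values -- newlines are whitespace

    np.savetxt('c.txt', np.arange(6.0).reshape(2, 3), delimiter=',')
    np.fromfile('c.txt', sep=',')    # stops at the first newline -- 3 values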

I spent some time digging into the code, and found it to be really
hard-to-follow C code, and very hard to update. The core idea is pretty nice
-- each dtype should know how to read itself from a text file -- but the
implementation is painful. The key issue is that for floats and ints,
anyway, it relies on the C atoi and atof functions. However, there have
been patches to these that handle NaN better, etc., for numpy, and I
think a Python patch as well. So the code calls the numpy atoi, which
does some checks, then calls the python atoi, which then calls the C lib
atoi (I think all that...). In any case, the core bugs are due to the
fact that atoi and friends don't return an error code, so you have to
check whether the pointer has been incremented to see if the read was
successful -- this error checking is not propagated through all those
levels of calls. It got really ugly to try to fix! Also, the use of the
C atoi() means that locales may only be handled in the default way --
i.e. no way to read European-style floats on a system with a US locale.

My conclusion -- the current code is too much of a mess to try to deal with
and fix!

I also think it's a mistake to have text file reading be a special case of
fromfile(); it really should be separate, though that's a minor
API question.

2) FileScanner:

FileScanner is some code I wrote years ago as a C extension - it's
limited, but does the job and is pretty fast. It essentially calls
fscanf() as many times as it gets a successful scan, skipping all
invalid text, then returning a numpy array. You can also specify how
many numbers you want read from the file. It only supports floats.
Travis O. asked if it could be included in Scipy way back when, but I
suspect none of my code actually made it in.

If I had to do it again, I might write something similar in Cython, 
though I am still using it.
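
A rough pure-Python sketch of the same idea (hypothetical fromtextscan; the
real FileScanner does this in C with fscanf, which is what makes it fast):

    import re
    import numpy as np

    def fromtextscan(f, count=-1):
        # Scan for anything float-like, skipping invalid text in between,
        # and return the matches as an array -- floats only, like FileScanner.
        numbers = re.findall(r'[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?', f.read())
        if count >= 0:
            numbers = numbers[:count]
        return np.array(numbers, dtype=np.float64)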


My Conclusions:

I think what we need is something similar to MATLAB's fscanf():

what it does is take a C-style format string, and apply it to your file
over and over again as many times as it can, returning an array. What's
nice about this is that it can be repurposed to efficiently read a wide
variety of text files fast.

For numpy, I imagine something like:

fromtextfile(f, dtype=np.float64, comment=None, shape=None):

read data from a text file, returning a numpy array

f: is a filename or file-like object

comment: is a string of the comment signifier. Anything on a line
 after this string will be ignored.

dtype: is a numpy dtype that you want read from the file

shape: is the shape of the resulting array. If shape==None, the
   file will be read until EOF or until there is a read error.
   By default,