[Numpy-discussion] Fast Reading of ASCII files

2011-12-07 Thread Chris.Barker
Hi folks,

This is a continuation of a conversation already started, but I gave it 
a new, more appropriate thread and subject.

On 12/6/11 2:13 PM, Wes McKinney wrote:
> we should start talking
> about building a *high performance* flat file loading solution with
> good column type inference and sensible defaults, etc.
...

>  I personally don't
> believe in sacrificing an order of magnitude of performance in the 90%
> case for the 10% case-- so maybe it makes sense to have two functions
> around: a superfast custom CSV reader for well-behaved data, and a
> slower, but highly flexible, function like loadtable to fall back on.

I've wanted this for ages, and have done some work towards it, but like 
others, only had time for a my-use-case-specific solution. A few 
thoughts:

* If we have a good, fast ascii (or unicode?) to array reader, hopefully 
it could be leveraged for use in the more complex cases. So that rather 
than genfromtxt() being written from scratch, it would be a wrapper 
around the lower-level reader.

* key to performance is to have the text-to-number-to-numpy-type 
conversion happening in C -- if you read the text with python, then 
convert to numbers, then to numpy arrays, it's simply going to be slow 
(see the timing sketch after this list).

* I think we want a solution that can be adapted to arbitrary text files 
-- not just tabular, CSV-style data. I have a lot of those to read - and 
some thoughts about how.
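The timing sketch promised above: a rough illustration of the
python-vs-C parsing gap (np.fromstring's text mode stands in here for
"parsing in C"; exact numbers will vary, but the gap is typically
large):

import timeit

setup = "import numpy as np; text = ','.join(['1.5'] * 100000)"

# pure-Python path: split the text, convert each token, build the array
py_way = "np.array([float(x) for x in text.split(',')])"

# C path: let numpy parse the text directly
c_way = "np.fromstring(text, sep=',')"

print('pure python:', timeit.timeit(py_way, setup, number=10))
print('numpy in C: ', timeit.timeit(c_way, setup, number=10))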

Efforts I have made so far, and what I've learned from them:

1) fromfile():
 fromfile (for text) is nice and fast, but buggy, and a bit too 
limited. I've posted various notes about this in the past (and, I'm 
pretty sure, a couple of tickets). The key missing features are:
   a) no support for commented lines (this is a lesser need, I think)
   b) there can be only one delimiter, and newlines are treated as 
generic whitespace. What this means is that if you have a 
whitespace-delimited file, you can read multiple lines, but if it is, 
for instance, comma-delimited, then you can only read one line at a 
time, killing performance.
   c) there are various bugs if the text is malformed, or doesn't quite 
match what you're asking for (i.e. reading integers, but the text is 
float) -- mostly really limited error checking.

I spent some time digging into the code, and found the C code really 
hard to follow, and very hard to update. The core idea is pretty nice 
-- each dtype should know how to read itself from a text file, but the 
implementation is painful. The key issue is that for floats and ints, 
anyway, it relies on the C atoi and atof functions. However, there have 
been patches to these that handle NaN better, etc, for numpy, and I 
think a python patch as well. So the code calls the numpy atoi, which 
does some checks, then calls the python atoi, which then calls the C lib 
atoi (I think all that...). In any case, the core bugs are due to the 
fact that atoi and friends don't return an error code, so you have to 
check if the pointer has been incremented to see if the read was 
successful -- this error checking is not propagated through all those 
levels of calls. It got really ugly to try to fix! Also, the use of the 
C atoi() means that locales may only be handled in the default way -- 
i.e. no way to read european-style floats on a system with a US locale.

My conclusion -- the current code is too much a mess to try to deal with 
and fix!

I also think it's a mistake to have text file reading be a special case 
of fromfile(); it really should be handled separately, though that's a 
minor API question.

2) FileScanner:

FileScanner is some code I wrote years ago as a C extension - it's 
limited, but does the job and is pretty fast. It essentially calls 
fscanf() as many times as it gets a successful scan, skipping all 
invalid text, then returning a numpy array. You can also specify how 
many numbers you want read from the file. It only supports floats. 
Travis O. asked if it could be included in Scipy way back when, but I 
suspect none of my code actually made it in.

If I had to do it again, I might write something similar in Cython, 
though I am still using it.
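For flavor, here is a rough pure-Python stand-in for the FileScanner
idea (this is not the original code -- just a sketch of the
grab-every-float, skip-invalid-text behavior):

import re
import numpy as np

FLOAT_PAT = re.compile(r'[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?')

def scan_floats(f, count=-1):
    # pull every float-like token out of the file, skipping invalid text
    tokens = FLOAT_PAT.findall(f.read())
    if count >= 0:
        tokens = tokens[:count]  # optionally stop after `count` numbers
    return np.array(tokens, dtype=np.float64)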


My Conclusions:

I think what we need is something similar to MATLAB's fscanf():

what it does is take a C-style format string, and apply it to your file 
over and over again as many times as it can, returning an array. What's 
nice about this is that it can be purposed to efficiently read a wide 
variety of text files fast.

For numpy, I imagine something like:

fromtextfile(f, dtype=np.float64, comment=None, shape=None):
"""
read data from a text file, returning a numpy array

f: is a filename or file-like object

comment: is a string of the comment signifier. Anything on a line
 after this string will be ignored.

dtype: is a numpy dtype that you want read from the file

shape: is the shape of the resulting array. If shape==None, the
   file will be read until EOF or until there is a read error.
"""
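A minimal pure-Python reference for that proposal might look like the
following (a sketch only -- fromtextfile is hypothetical, and the whole
point of the proposal is that the real version would do this parsing
in C):

import numpy as np

def fromtextfile(f, dtype=np.float64, comment=None, shape=None):
    # slow reference version of the proposed reader
    values = []
    for line in f:
        if comment is not None:
            line = line.split(comment, 1)[0]  # strip trailing comments
        # treat commas and whitespace as delimiters (loose approximation)
        values.extend(line.replace(',', ' ').split())
    arr = np.array(values, dtype=dtype)
    if shape is not None:
        arr = arr.reshape(shape)
    return arr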

Re: [Numpy-discussion] Fast Reading of ASCII files

2011-12-11 Thread Ralf Gommers
On Wed, Dec 7, 2011 at 7:50 PM, Chris.Barker wrote:

> Hi folks,
>
> This is a continuation of a conversation already started, but I gave it
> a new, more appropriate thread and subject.
>
> On 12/6/11 2:13 PM, Wes McKinney wrote:
> > we should start talking
> > about building a *high performance* flat file loading solution with
> > good column type inference and sensible defaults, etc.
> ...
>
> >  I personally don't
> > believe in sacrificing an order of magnitude of performance in the 90%
> > case for the 10% case-- so maybe it makes sense to have two functions
> > around: a superfast custom CSV reader for well-behaved data, and a
> > slower, but highly flexible, function like loadtable to fall back on.
>
> I've wanted this for ages, and have done some work towards it, but like
> others, only had time for a my-use-case-specific solution. A few
> thoughts:
>
> * If we have a good, fast ascii (or unicode?) to array reader, hopefully
> it could be leveraged for use in the more complex cases. So that rather
> than genfromtxt() being written from scratch, it would be a wrapper
> around the lower-level reader.
>

You seem to be contradicting yourself here. The more complex cases are Wes'
10% and why genfromtxt is so hairy internally. There's always a trade-off
between speed and handling complex corner cases. You want both.

A very fast reader for well-behaved files would be very welcome, but I see
it as a separate topic from genfromtxt/loadtable. The question for the
loadtable pull request is whether it is different enough from genfromtxt
that we need/want both, or whether loadtable should replace genfromtxt.

Cheers,
Ralf




Re: [Numpy-discussion] Fast Reading of ASCII files

2011-12-12 Thread Chris.Barker
On 12/11/11 8:40 AM, Ralf Gommers wrote:
> On Wed, Dec 7, 2011 at 7:50 PM, Chris.Barker wrote:
> > * If we have a good, fast ascii (or unicode?) to array reader, hopefully
> > it could be leveraged for use in the more complex cases. So that rather
> > than genfromtxt() being written from scratch, it would be a wrapper
> > around the lower-level reader.
>
> You seem to be contradicting yourself here. The more complex cases are
> Wes' 10% and why genfromtxt is so hairy internally. There's always a
> trade-off between speed and handling complex corner cases. You want both.

I don't think the version in my mind is contradictory (Not quite).

What I'm imagining is that a good, fast ascii to numpy array reader 
could read a whole table in at once (the common, easy, fast, case), but 
it could also be used to read snippets of a file at a time, which 
could be leveraged to handle many of the more complex cases.

I suppose there will always be cases where the user needs to write their 
own converter from string to dtype, and there is simply no way to 
leverage what I'm imagining to support that.

Hmm, maybe there is -- for instance, if a "record" consisted of mostly 
standard, easy-to-parse, numbers, but one field was some weird text that 
needed custom parsing, we could read it as a dtype, with a string for 
that one weird field, and that could be converted in a post-processing step.

Maybe that wouldn't be any faster or easier, but it could be done...
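Something along those lines, say (a hedged sketch -- the "12h30m"
duration field is just an invented example of a weird column):

import io
import numpy as np

raw = io.StringIO("1.0, 2.0, 12h30m\n3.0, 4.0, 01h15m\n")

# parse the easy numeric fields directly; read the weird one as a string
rec = np.genfromtxt(raw, delimiter=',', autostrip=True,
                    dtype=[('x', 'f8'), ('y', 'f8'), ('t', 'U16')])

def to_hours(s):
    # custom parsing for the weird field, done as a post-processing step
    h, m = s.rstrip('m').split('h')
    return float(h) + float(m) / 60.0

hours = np.array([to_hours(s) for s in rec['t']])  # [12.5, 1.25]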

Anyway, whether you can leverage it for the full-featured version or 
not, I do think there is call for a good, fast, 90% case text file parser.


Would anyone like to join/form a small working group to work on this?

Wes, I'd like to see your Cython version -- maybe a starting point?

-Chris



-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Numpy-discussion] Fast Reading of ASCII files

2011-12-12 Thread Warren Weckesser
On Mon, Dec 12, 2011 at 10:22 AM, Chris.Barker wrote:

> Would anyone like to join/form a small working group to work on this?
>
> Wes, I'd like to see your Cython version -- maybe a starting point?
>
> -Chris


I'm also working on a faster text file reader, so count me in.  I've been
experimenting in both C and Cython.   I'll put it on github as soon as I
can.

Warren





Re: [Numpy-discussion] Fast Reading of ASCII files

2011-12-13 Thread Chris Barker
NOTE:

Let's keep this on the list.

On Tue, Dec 13, 2011 at 9:19 AM, denis wrote:

> Chris,
>  unified, consistent save / load is a nice goal
>
> 1) header lines with date, pwd etc.: "where'd this come from ?"
>
> # (5, 5)  svm.py  bz/py/ml/svm  2011-12-13 Dec 11:56  -- automatic
> # 80.6 % correct -- user info
>  24539 4 526
> ...
>
I'm not sure I understand what you are expecting here: What would be
automatic? If it parses a datetime in the header, what would it do with it?
But anyway, this seems to me:
  - very application specific -- this is for the user's code to write
  - not what we are talking about at this point anyway -- I think this
discussion is about a lower-level, does-the-simple-things-fast reader --
that may or may not be able to form the basis of a higher-level fuller
featured reader.


> 2) read any CSVs: comma or blank-delimited, with/without column names,
>a la loadcsv() below
>

yup -- though the column name reading would be part of a higher-level
reader as far as I'm concerned.


> 3) sparse or masked arrays ?
>
sparse probably not, that seems pretty domain dependent to me -- though
hopefully one could build such a thing on top of the lower level reader.
Masked support would be good -- once we're convinced what the future of
masked arrays is in numpy. I was thinking that the masked array issue
would really be a higher-level feature -- it certainly could be if you need
to mask "special value" style files (i.e. ), but we may have to build
it into the lower level reader for cases where the mask is specified by
non-numerical values -- i.e. there are some met files that use "MM" or some
other text, so you can't put it into a numerical array first.

>
> Longterm wishes: beyond the scope of one file <-> one array
> but essential for larger projects:
> 1) dicts / dotdicts:
>Dotdict( A=anysizearray, N=scalar ... ) <-> a directory of little
> files
>is easy, better than np.savez
>(Haven't used hdf5, I believe Matlabv7  does.)
>
> 2) workflows: has anyone there used visTrails ?
>

outside of the spec of this thread...

>
> Anyway it seems to me (old grey cynic) that Numpy/scipy developers
> prefer to code first, spec and doc later. Too pessimistic ?
>
>
Well, I think many of us believe in a more agile style approach --
incremental development. But really, as an open source project, it's really
about scratching an itch -- so there is usually a spec in mind for the itch
at hand. In this case, however, that has been a weakness -- clearly a
number of us have written small solutions to our particular problems at
hand, but we haven't arrived at a more general purpose solution yet. So a
bit of spec-ing ahead of time may be called for.

On that:

I"ve been thinking from teh botom-up -- imaging what I need for the simple
case, and how it might apply to more complex cases -- but maybe we should
think about this another way:

What we're talking about here is really about core software engineering --
optimization. It's easy to write a pure-python simple file parser, and
reasonable to write a complex one (genfromtxt) -- the issue is performance
-- we need some more C (or Cython) code to really speed it up, but none of
us wants to write the complex case code in C. So:

genfromtxt is really nice for many of the complex cases. So perhaps
another approach is to look at genfromtxt, and see what
high performance lower-level functionality we could develop that could make
it fast -- then we are done.

This actually mirrors exactly what we all usually recommend for python
development in general -- write it in Python, then, if it's really not fast
enough, write the bottleneck in C.

So where are the bottlenecks in genfromtxt? Are there self-contained 
portions that could be re-written in C/Cython?
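One way to start answering that (a sketch): profile genfromtxt on a
synthetic file and sort by cumulative time:

import cProfile
import io
import numpy as np

data = io.StringIO("\n".join(["1.0,2.0,3.0"] * 100000))

# run as a script so `data` and `np` are visible in __main__'s namespace
cProfile.run("np.genfromtxt(data, delimiter=',')", sort='cumulative')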

-Chris








Re: [Numpy-discussion] Fast Reading of ASCII files

2011-12-13 Thread Bruce Southey

On 12/13/2011 12:08 PM, Chris Barker wrote:

> So where are the bottlenecks in genfromtxt? Are there self-contained
> portions that could be re-written in C/Cython?
>
> -Chris



Reading data is hard and writing code that suits the diversity in the 
Numerical Python community is even harder!


Both loadtxt and genfromtxt functions (other functions are perhaps less 
important) perhaps need an upgrade to incorporate the new NA object. I 
think that adding the NA object will simplify some of the process because 
invalid data (missing or a string in a numerical format) can be set to 
NA without requiring the creation of a new masked array or returning an 
error.


Here I think loadtxt is a better target than genfromtxt because, as I 
understand it, it assumes the user really knows the data. Whereas 
genfromtxt can ask the data for the appropriate format.


So I agree that a new 'superfast custom CSV reader for well-behaved data' 
function would be rather useful, especially as a replacement for 
loadtxt. By that I mean reading data using a user-specified format that 
essentially follows the CSV format 
(http://en.wikipedia.org/wiki/Comma-separated_values) - its needs are to 
allow for the NA object, skipping lines and user-defined delimiters.

Re: [Numpy-discussion] Fast Reading of ASCII files

2011-12-13 Thread Chris Barker
On Tue, Dec 13, 2011 at 11:29 AM, Bruce Southey wrote:

> Reading data is hard and writing code that suits the diversity in the
> Numerical Python community is even harder!
>
>
yup

> Both loadtxt and genfromtxt functions (other functions are perhaps less
> important) perhaps need an upgrade to incorporate the new NA object.
>

yes, if we are satisfied that the new NA object is, in fact, the way of the
future.


> Here I think loadtxt is a better target than genfromtxt because, as I
> understand it, it assumes the user really knows the data. Whereas
> genfromtxt can ask the data for the appropriate format.
>
> So I agree that a new 'superfast custom CSV reader for well-behaved data'
> function would be rather useful, especially as a replacement for loadtxt.
> By that I mean reading data using a user-specified format that essentially
> follows the CSV format (
> http://en.wikipedia.org/wiki/Comma-separated_values) - its needs are to
> allow for the NA object, skipping lines and user-defined delimiters.
>
>
I think that ideally, there could be one interface to reading tabular data
-- hopefully, it would be easy for the user to specify what they want, and
if they don't, the code tries to figure it out. Also, under the hood, the
"easy" cases are special-cased to high-performing versions.

genfromtxt sure looks close for an API -- it just needs the "high
performance special cases" under the hood. It may be that the way it's
designed makes it very difficult to do that, though -- I haven't looked
closely enough to tell.

At least that's what I'm thinking at the moment.

-Chris





Re: [Numpy-discussion] Fast Reading of ASCII files

2011-12-13 Thread Ralf Gommers
On Tue, Dec 13, 2011 at 10:07 PM, Chris Barker wrote:

> genfromtxt sure looks close for an API
>

This I don't agree with. It has a huge number of keywords that just confuse
or intimidate a beginning user. There should be a dead simple interface;
even the loadtxt API is on the heavy side.

Ralf





Re: [Numpy-discussion] Fast Reading of ASCII files

2011-12-13 Thread Wes McKinney
On Mon, Dec 12, 2011 at 12:34 PM, Warren Weckesser wrote:
>
> I'm also working on a faster text file reader, so count me in.  I've been
> experimenting in both C and Cython.   I'll put it on github as soon as I
> can.
>
> Warren

Cool, Warren, I look forward to seeing it. I'm hopeful we can craft a
performant tool that will meet the needs of many projects (NumPy,
pandas, etc.)...


Re: [Numpy-discussion] Fast Reading of ASCII files

2011-12-13 Thread Chris Barker
On Tue, Dec 13, 2011 at 1:21 PM, Ralf Gommers wrote:

> > genfromtxt sure looks close for an API
>
> This I don't agree with. It has a huge number of keywords that just
> confuse or intimidate a beginning user. There should be a dead simple
> interface; even the loadtxt API is on the heavy side.

well, yes, though it does do a lot -- do you have a simpler one in mind?

But anyway, the really simple cases are really simple, even with
genfromtxt.

I guess it's a matter of debate about what is a better API:

a few functions, each adding a layer of sophistication

or

one function, with layers of sophistication added with an array of keyword
arguments.

In either case, though, I wish the multiple functionality were built on the
same, well-optimized core code.

-Chris




Re: [Numpy-discussion] Fast Reading of ASCII files

2011-12-14 Thread Bruce Southey

On 12/14/2011 01:03 AM, Chris Barker wrote:



> In either case, though, I wish the multiple functionality were built on
> the same, well-optimized core code.
>
> -Chris



I am not sure that you can even create a simple API here as even 
Python's csv module is rather complex, especially when it just reads data 
as strings. It also 'hides' many arguments in the Dialect class, although 
these are just the collection of 7 'fmtparam' arguments. It also 
provides the Sniffer class that tries to find the correct format that can 
then be passed to the reader function. Then you still have to convert 
the data into the required types - another set of arguments as well as 
yet another pass through the data.
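For reference, that machinery in use (strings in, strings out):

import csv
import io

sample = "x;y;z\n1;2;3\n4;5;6\n"

dialect = csv.Sniffer().sniff(sample)    # infers ';' as the delimiter
print(csv.Sniffer().has_header(sample))  # a heuristic; likely True here
rows = list(csv.reader(io.StringIO(sample), dialect))
# rows == [['x', 'y', 'z'], ['1', '2', '3'], ['4', '5', '6']]
# everything is still a string -- converting to numbers is yet another pass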


In comparison, genfromtxt can perform sniffing and both genfromtxt and 
loadtxt can read and convert the data. These also add some useful 
features like skipping rows (start, end and commented) and columns. 
However, it could be possible to create a sniffer function and a single 
data reader function leading to a 'simple' reader function but that 
probably would not change the API of the underlying data reader function.


Bruce




Re: [Numpy-discussion] Fast Reading of ASCII files

2011-12-14 Thread Ralf Gommers
On Wed, Dec 14, 2011 at 4:11 PM, Bruce Southey wrote:

> well, yes, though it does do a lot -- do you have a simpler one in mind?
>
Just looking at what I normally wouldn't need for simple data files and/or
what a beginning user won't understand at once, the `unpack` and `ndmin`
keywords could certainly be left out. `converters` is also questionable.
That's probably as simple as it can get.

Note that I don't think this should be changed now, that's not worth the
trouble.

> But anyway, the really simple cases are really simple, even with
> genfromtxt.
>
> I guess it's a matter of debate about what is a better API:
>
> a few functions, each adding a layer of sophistication
>
> or
>
> one function, with layers of sophistication added with an array of keyword
> arguments.
>
There's always a trade-off, but looking at the docstring for genfromtxt
should make it an easy call in this case.

> In either case, though, I wish the multiple functionality were built on
> the same, well-optimized core code.
>
I wish that too, but I'm fairly certain that you can't write that core
code with the ability to handle missing and irregular data and make it
close to the same speed as an optimized reader for regular data.

> I am not sure that you can even create a simple API here as even Python's
> csv module is rather complex especially when it just reads data as strings.
> It also 'hides' many arguments in the Dialect class although these are just
> the collection of 7 'fmtparam' arguments. It also provides the Sniffer
> class that tries to find the correct format that can then be passed to the
> reader function. Then you still have to convert the data into the required
> types - another set of arguments as well as yet another pass through the
> data.
>
> In comparison, genfromtxt can perform sniffing
>

I assume you mean the ``dtype=None`` example in the docstring? That works
to some extent, but you still need to specify the delimiter. I commented on
that on the loadtable PR.


> and both genfromtxt and loadtxt can read and convert the data. These also
> add some useful features like skipping rows (start, end and commented) and
> columns. However, it could be possible to create a sniffer function and a
> single data reader function leading to a 'simple' reader function but that
> probably would not change the API of the underlying data reader function.
>

Better auto-detection of things like delimiters would indeed be quite
useful.

Ralf


Re: [Numpy-discussion] Fast Reading of ASCII files

2011-12-14 Thread Benjamin Root
On Wed, Dec 14, 2011 at 1:22 PM, Ralf Gommers wrote:

> Just looking at what I normally wouldn't need for simple data files
> and/or what a beginning user won't understand at once, the `unpack` and
> `ndmin` keywords could certainly be left out. `converters` is also
> questionable. That's probably as simple as it can get.
>
>
Just my two cents (and I was one of those who championed its inclusion),
the ndmin feature is designed to prevent unexpected results that users
(particularly beginners) may encounter with their datasets.  Now, maybe it
might be difficult to tell a beginner *why* they might need to be aware of
it, but it is very easy to describe *how* to use.  "How many dimensions is
your data? Two? Ok, just set ndmin=2 and you are good to go!"
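For instance, a small illustration of the surprise ndmin guards against:

import io
import numpy as np

one_row = io.StringIO("1.0 2.0 3.0\n")
a = np.loadtxt(one_row)           # shape (3,) -- the 2-D table collapsed
one_row.seek(0)
b = np.loadtxt(one_row, ndmin=2)  # shape (1, 3) -- always a 2-D table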

Cheers!
Ben Root


Re: [Numpy-discussion] Fast Reading of ASCII files

2011-12-14 Thread Chris Barker
On Wed, Dec 14, 2011 at 11:36 AM, Benjamin Root wrote:
>>> well, yes, though it does do a lot -- do you have a simpler one in mind?
>>>
>> Just looking at what I normally wouldn't need for simple data files and/or
>> what a beginning user won't understand at once, the `unpack` and `ndmin`
>> keywords could certainly be left out. `converters` is also questionable.
>> That's probably as simple as it can get.

this may be a function of a well-written doc string -- if it is clear
to the newbie that "all the rest of this you don't need unless you
have a weird data file", then extra keyword arguments don't really
hurt.

A few examples of the basic use-cases go a long way.

And yes, the core reader for the complex cases isn't going to be fast
(it's going to be complex C code...), but we could still have a core
reader that handled most cases.

Anyway, I think it's time to write code, and see if it can be rolled in somehow...

-Chris



Re: [Numpy-discussion] Fast Reading of ASCII files

2011-12-14 Thread Ralf Gommers
On Wed, Dec 14, 2011 at 9:54 PM, Chris Barker wrote:

> this may be a function of a well-written doc string -- if it is clear
> to the newbie that "all the rest of this you don't need unless you
> have a weird data file", then extra keyword arguments don't really
> hurt.
>
> A few examples of the basic use-cases go a long way.
>
> And yes, the core reader for the complex cases isn't going to be fast
> (it's going to be complex C code...), but we could still have a core
> reader that handled most cases.
>
Okay, now we're on the same page I think.


> Anyway, I think it's time to write code, and see if it can be rolled in
> somehow...
>
Agreed.

Ralf