Re: [Numpy-discussion] fromfile() for reading text (one more time!)

2010-01-23 Thread alan
On Mon, Jan 4, 2010 at 10:39 PM,  a...@ajackson.org wrote:
Hi folks,

I'm taking a look once again at fromfile() for reading text files. I
often have the need to read a LOT of numbers from a text file, and it
can actually be pretty darn slow to do it the normal python way:

.big snip

 I agree. I've tried using it, and usually find that it doesn't quite get 
 there.

 I rather like the R command(s) for reading text files - except then I have to
 use R which is painful after using python and numpy. Although ggplot2 is
 awfully nice too ... but that is a later post.

    read.table(file, header = FALSE, sep = "", quote = "\"'",
               dec = ".", row.names, col.names,
               as.is = !stringsAsFactors,
               na.strings = "NA", colClasses = NA, nrows = -1,
               skip = 0, check.names = TRUE, fill = !blank.lines.skip,
               strip.white = FALSE, blank.lines.skip = TRUE,
               comment.char = "#",
               allowEscapes = FALSE, flush = FALSE,
               stringsAsFactors = default.stringsAsFactors(),
               fileEncoding = "", encoding = "unknown")
... big snip


Aren't the newly improved

numpy.genfromtxt(fname, dtype=<type 'float'>, comments='#',
delimiter=None, skiprows=0, converters=None, missing='',
missing_values=None, usecols=None, names=None, excludelist=None,
deletechars=None, case_sensitive=True, unpack=None, usemask=False,
loose=True)

and friends intended to handle all this?

Josef


Reopening an old thread...

genfromtxt is a big step forward. Something I'm fiddling with is trying to work
through the book Using R for Data Analysis and Graphics, Introduction, Code,
and Commentary by J H Maindonald (available online), in python. So I am trying
to see what it takes in python/numpy to work his examples and problems, sort of
a learning exercise for me. So anyway, with that introduction, here is a case
that I believe genfromtxt fails on, because it doesn't support the reasonable
(IMHO) behavior of treating quote-delimited strings in the input file as a
single field.

Below is the example from the book... So we have 2 issues. The header for the
first field is quote-blank-quote, and various values for field one have 1 to 3
blank delimited strings, but encapsulated in quotes. I'm putting something
together to read it using shlex.split, since it honors strings protected by
quote pairs.

I'm not an Excel person, but I think it might export data like this, in a format
similar to what is shown below.

  " " distance climb time
"Greenmantle" 2.5 650 16.083
"Carnethy" 6 2500 48.35
"Craig Dunain" 6 900 33.65
"Ben Rha" 7.5 800 45.6
"Ben Lomond" 8 3070 62.267
"Goatfell" 8 2866 73.217
"Bens of Jura" 16 7500 204.617
"Cairnpapple" 6 800 36.367
"Scolty" 5 800 29.75
"Traprain" 6 650 39.75
"Lairig Ghru" 28 2100 192.667
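[Editor's sketch of the shlex approach described above, with a few of the
table's rows inlined as a string rather than read from a file:]

```python
import shlex

text = '''" " distance climb time
"Greenmantle" 2.5 650 16.083
"Craig Dunain" 6 900 33.65
"Bens of Jura" 16 7500 204.617'''

rows = []
for line in text.splitlines()[1:]:   # skip the header row
    fields = shlex.split(line)       # keeps quote-protected strings whole
    name = fields[0]
    distance, climb, time = (float(x) for x in fields[1:])
    rows.append((name, distance, climb, time))

print(rows[1])  # ('Craig Dunain', 6.0, 900.0, 33.65)
```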


-- 
---
| Alan K. Jackson| To see a World in a Grain of Sand  |
| a...@ajackson.org  | And a Heaven in a Wild Flower, |
| www.ajackson.org   | Hold Infinity in the palm of your hand |
| Houston, Texas | And Eternity in an hour. - Blake   |
---
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] fromfile() for reading text (one more time!)

2010-01-23 Thread Christopher Barker
a...@ajackson.org wrote:
 it doesn't support the reasonable
 (IMHO) behavior of treating quote delimited strings in the input file as a
 single field. 

I'd use the csv module for that.

Which makes me wonder if it might make sense to build some of the numpy 
table-reading stuff on top of it...
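[Editor's note: a minimal sketch of that idea — let csv do the tokenizing,
quotes and all, and hand the numeric columns to NumPy; sample rows taken
from Alan's example above:]

```python
import csv
import io

import numpy as np

text = '"Craig Dunain" 6 900 33.65\n"Ben Rha" 7.5 800 45.6\n'

# csv handles the quote-protected names; numpy gets the numbers.
rows = list(csv.reader(io.StringIO(text), delimiter=' ', quotechar='"'))
names = [row[0] for row in rows]
values = np.array([[float(x) for x in row[1:]] for row in rows])

print(names)         # ['Craig Dunain', 'Ben Rha']
print(values.shape)  # (2, 3)
```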

-Chris


-- 
Christopher Barker, Ph.D.
Oceanographer

NOAA/ORR/HAZMAT (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception


Re: [Numpy-discussion] fromfile() for reading text (one more time!)

2010-01-23 Thread josef . pktd
On Sat, Jan 23, 2010 at 1:53 PM, Christopher Barker
chris.bar...@noaa.gov wrote:
 a...@ajackson.org wrote:
 it doesn't support the reasonable
 (IMHO) behavior of treating quote delimited strings in the input file as a
 single field.

 I'd use the csv module for that.

 Which makes me wonder if it might make sense to build some of the numpy
 table-reading stuff on top of it...

 -Chris

csv was also my standard module for this, it handles csv dialects and
unicode (with some detour), but having automatic conversion in
genfromtxt is nicer.

>>> import csv
>>> reader = csv.reader(open(r'C:\Josef\work-oth\testdata.csv', 'rb'),
...                     delimiter=' ')
>>> for line in reader:
...     print line
...
['Greenmantle', '2.5', '650', '16.083']
['Carnethy', '6', '2500', '48.35']
['Craig Dunain', '6', '900', '33.65']
['Ben Rha', '7.5', '800', '45.6']
['Ben Lomond', '8', '3070', '62.267']
['Goatfell', '8', '2866', '73.217']
['Bens of Jura', '16', '7500', '204.617']
['Cairnpapple', '6', '800', '36.367']
['Scolty', '5', '800', '29.75']
['Traprain', '6', '650', '39.75']
['Lairig Ghru', '28', '2100', '192.667']

Josef






Re: [Numpy-discussion] fromfile() for reading text (one more time!)

2010-01-09 Thread Pauli Virtanen
On Fri, 2010-01-08 at 15:12 -0800, Christopher Barker wrote:
 1) optionally allow newlines to serve as a delimiter, so large tables 
 can be read.

I don't really like handling newlines specially. For instance, I could
have data like

1, 2, 3;
4, 5, 6;
7, 8, 9;

Allowing an alternative separator would sound better to me. The above
data could then be read like

fromfile('foo.txt', sep=' , ', sep2=' ; ')

or perhaps

fromfile('foo.txt', sep=[' , ', ' ; '])

Since whitespace also matches newlines, this would work.
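[Editor's note: no sep2 or list-valued sep exists in fromfile; a pure-Python
sketch of what two-separator reading would amount to, with ';' as the record
separator and ',' as the field separator:]

```python
import numpy as np

text = "1, 2, 3;\n4, 5, 6;\n7, 8, 9;"

# Split on the record separator first, then on the field separator;
# float() tolerates the surrounding whitespace, including newlines.
records = [rec for rec in text.split(';') if rec.strip()]
arr = np.array([[float(x) for x in rec.split(',')] for rec in records])

print(arr.shape)  # (3, 3)
```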

Pauli





Re: [Numpy-discussion] fromfile() for reading text (one more time!)

2010-01-09 Thread Christopher Barker
Pauli Virtanen wrote:
 I don't really like handling newlines specially. For instance, I could
 have data like
 
   1, 2, 3;
   4, 5, 6;
   7, 8, 9;
 
 Allowing an alternative separator would sound better to me. The above
 data could then be read like
 
   fromfile('foo.txt', sep=' , ', sep2=' ; ')
 
 or perhaps
 
   fromfile('foo.txt', sep=[' , ', ' ; '])

I like this syntax better, but:

1) Yes you could have data like that, but do you? I've never seen it. 
Maybe others have.

2) if you did, it would probably indicate something the user would want 
preserved, like the shape of the array.

And newlines really are a special case -- they have a special meaning, 
and they are very, very common (universal, even)!

So, it's just more code than I'm probably going to write.

If someone does want to write more code than I do, it would probably 
make sense to do what someone suggested in the ticket: write an optimized 
version of loadtxt in C.

Anyway. I'll think about it when I poke at the code more.

-Chris





Re: [Numpy-discussion] fromfile() for reading text (one more time!)

2010-01-08 Thread Bruce Southey
On Fri, Jan 8, 2010 at 5:12 PM, Christopher Barker
chris.bar...@noaa.gov wrote:
 Bruce Southey wrote:
 Also a user has to check for missing
 values or numpy has to warn a user

 I think warnings are next to useless for all but interactive work -- so
 I don't want to rely on them

 that missing values are present
 immediately after reading the data so the appropriate action can be
 taken (like using functions that handle missing values appropriately).
 That is my second problem with using codes (NaN, -9 etc)  for
 missing values.

 But I think you're right -- if someone writes code, tests with good
 input, then later runs it with missing-valued input, they are likely to
 have not ever bothered to test for missing values.

 So I think missing values should only be replaced by something if the
 user specifically asks for it.

 And the principle of fromfile() is that it is fast and simple; if you
 want masked arrays, use slower but more full-featured methods.

 So in that case it should fail with missing data.

 Well, I'm not so sure -- the point is performance, no reason not to have
 high performing code that handles missing data.

 What about '\r' and '\n\r'?

 I have thought about that -- I'm hoping that python's text file reading
 will just take care of it, but as we're working with C file handles here
 (I think), I guess not. '\n\r' is easy -- the '\r' is just extra
 whitespace. '\r' alone is another case to handle.


 My problem with this is that you are reading one huge 1-D array  (that
 you can resize later) rather than a 2-D array with rows and columns
 (which is what I deal with).

 That's because fromfile() is not designed to be row-oriented at all,
 and the binary read certainly isn't. I'm just trying to make this easy
 -- though it's not turning out that way!

   But I agree that you can have an option
 to say treat '\n' or '\r' as a delimiter but I think it should be
 turned off by default.

 that's what I've done.

 You should have a corresponding value for ints because raising an
 exception would be inconsistent with allowing floats to have a value.

 I'm not sure I care, really -- but I think having the user specify the
 fill value is the best option, anyway.

 josef.p...@gmail.com wrote:
 none -- exactly why I think \n is a special case.
 What about '\r' and '\n\r'?

 Yes, I forgot about this, and it will be the most common case for
 Windows users like myself.

 I think \r should be stripped automatically, like in non-binary
 reading of files in python.

 except for folks like me that have old mac files lying around... so I
 want this to behave like universal newlines support.

 A warning would be good, but doing np.any(np.isnan(x)) or
 np.isnan(x).sum() on the result is always a good idea for a user when
 missing values are possibility.

 right, but the issue is the user has to know that they are possible, and
 we all know how carefully we all read docs!

 Thanks for your input -- I think I know what I'd like to do, but it's
 proving less than trivial to do it, so we'll see.

 In short:

 1) optionally allow newlines to serve as a delimiter, so large tables
 can be read.

 2) raise an exception for missing values, unless:
   3) the user specifies a fill value of their choice (compatible with
 the chosen data type).


 -Chris



I fully agree with your approach!
Thanks for considering my thoughts!

Bruce


Re: [Numpy-discussion] fromfile() for reading text (one more time!)

2010-01-07 Thread Christopher Barker
Pauli Virtanen wrote:
 On Mon, 2010-01-04 at 17:05 -0800, Christopher Barker wrote:
 it also does odd things with spaces 
 embedded in the separator:

 ", $ #" matches all of:  ",$#"   ", $#"   ",$ #"

 That's a documented feature:

Fair enough.

OK, I've written a patch that allows newlines to be interpreted as 
separators in addition to whatever is specified in sep.

In the process of testing, I found again these issues, which are still 
marked as needs decision.

http://projects.scipy.org/numpy/ticket/883

In short: what to do with missing values?

I'd like to address this bug, but I need a decision to do so.


My proposal:

Raise a ValueError with missing values.


Justification:

No function should EVER return data that is not there. Period. It is 
simply asking for hard to find bugs. Therefore:

fromstring("3, 4,,5", sep=",")

Should never, ever, return:

array([ 3.,  4.,  0.,  5.])

Which is what it does now. bad. bad. bad.




Alternatives:

   A) Raising a ValueError is the easiest way to get proper behavior. 
Folks can use a more sophisticated file reader if they want missing 
values handled. I'm willing to contribute this patch.

   B) If the dtype is a floating point type, NaN could fill in the 
missing values -- a fine idea, but you can't use it for integers, and 
zero is a really bad replacement!

   C) The user could specify what they want filled in for missing 
values. This is a fine idea, though I'm not sure I want to take the time 
to implement it.
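[Editor's sketch of option (C) in pure Python, with a hypothetical `missing`
keyword -- the name and behavior are illustrative, not an existing numpy API:]

```python
def parse_floats(s, sep=',', missing=None):
    """Parse separated floats; substitute `missing` for empty fields,
    or raise ValueError when no fill value was given."""
    out = []
    for token in s.split(sep):
        token = token.strip()
        if token:
            out.append(float(token))
        elif missing is None:
            raise ValueError("missing value and no fill value specified")
        else:
            out.append(missing)
    return out

parse_floats("3, 4,,5", missing=float('nan'))  # [3.0, 4.0, nan, 5.0]
```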

Oh, and this is a bug too, with probably the same solution:

In [20]: np.fromstring("hjba", sep=',')
Out[20]: array([ 0.])

In [26]: np.fromstring("34gytf39", sep=',')
Out[26]: array([ 34.])


One more unresolved question:

what should:

np.fromstring("3, 4, 5,", sep=",")

return?

it currently returns:

array([ 3.,  4.,  5.])

which seems a bit inconsistent with missing value handling. I also found 
a bug:

In [6]: np.fromstring("3, 4, 5 , ", sep=",")
Out[6]: array([ 3.,  4.,  5.,  0.])

so if there is some extra whitespace in there, it does return a missing 
value. With my proposal, that wouldn't happen, but you might get an 
exception. I think you should, but it'll be easier to implement my 
allow newlines code if not.


so, should I do (A) ?


Another question:

I've got a patch mostly working (except for the above issues) that will 
allow fromfile/string to read multiline non-whitespace separated data in 
one shot:


In [15]: str
Out[15]: '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12'

In [16]: np.fromstring(str, sep=',', allow_newlines=True)
Out[16]:
array([  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,  11.,
 12.])


I think this is a very helpful enhancement, and, as it is a new kwarg, 
backward compatible:

1) Might it be accepted for inclusion?

2) Is the name for the flag OK: allow_newlines? It's pretty explicit, 
but also long -- I used it for the flag name in the C code, too.

3) What C datatype should I use for a boolean flag? I used a char, but I 
don't know what the numpy standard is.


-Chris














Re: [Numpy-discussion] fromfile() for reading text (one more time!)

2010-01-07 Thread josef . pktd
On Thu, Jan 7, 2010 at 3:08 PM, Christopher Barker
chris.bar...@noaa.gov wrote:
 ... big snip

I don't know much about this, just a few more test cases

comma and newline
str =  '1, 2, 3, 4,\n5, 6, 7, 8,\n9, 10, 11, 12'

extra comma at end of file
str =  '1, 2, 3, 4,\n5, 6, 7, 8,\n9, 10, 11, 12,'

extra newlines at end of file
str =  '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n\n\n'

It would be nice if these cases would go through without missing
values or exception, but I don't often have files that are clean
enough for fromfile().

I'm in favor of nan for missing values with floating point numbers. It
would make it easy to read correctly formatted csv files, even if the
data is not complete.
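[Editor's note: in current NumPy, genfromtxt can already do this fill-in via
its `filling_values` keyword; a sketch with an in-memory file:]

```python
import io

import numpy as np

data = "3, 4,, 5\n6,, 8, 9\n"

# Empty fields count as missing; fill them with NaN instead of failing.
arr = np.genfromtxt(io.StringIO(data), delimiter=',',
                    filling_values=np.nan)

print(arr.shape)  # (2, 4)
```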

Josef












Re: [Numpy-discussion] fromfile() for reading text (one more time!)

2010-01-07 Thread Bruce Southey
On Thu, Jan 7, 2010 at 2:32 PM,  josef.p...@gmail.com wrote:
 On Thu, Jan 7, 2010 at 3:08 PM, Christopher Barker
 chris.bar...@noaa.gov wrote:
 ... big snip

 I don't know much about this, just a few more test cases

 comma and newline
 str =  '1, 2, 3, 4,\n5, 6, 7, 8,\n9, 10, 11, 12'

 extra comma at end of file
 str =  '1, 2, 3, 4,\n5, 6, 7, 8,\n9, 10, 11, 12,'

 extra newlines at end of file
 str =  '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n\n\n'

 It would be nice if these cases would go through without missing
 values or exception, but I don't often have files that are clean
 enough for fromfile().

 I'm in favor of nan for missing values with floating point numbers. It
 would make it easy to read correctly formatted csv files, even if the
 data is not complete.



Using the numpy NaN or similar (noting R's approach to missing values,
which in turn allows it to have the above functionality) is just a
very bad idea for missing values, because you always have to check
which NaN is a missing value and which was due to some numerical
calculation. It is a very bad idea because we have masked arrays that
nicely, but slowly, handle this situation.

From what I can see, you expect that fromfile() should only
split at the supplied delimiters, optionally(?) strip any whitespace,
and force a specific dtype. I would agree that the failure of any one
of these should create an exception by default rather than making the
best guess. So 'missing data' would potentially fail when forcing the
specified dtype. Thus, you should either create an exception for
invalid data (with appropriate location) or use masked arrays.

Your output from this string '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12'
actually assumes multiple delimiters, because there is no comma between
4 and 5 or between 8 and 9. So I think it would be better if fromfile
accepted multiple delimiters. In Josef's last case, how many 'missing
values' should there be?

Bruce

Re: [Numpy-discussion] fromfile() for reading text (one more time!)

2010-01-07 Thread Travis Oliphant

On Jan 7, 2010, at 2:32 PM, josef.p...@gmail.com wrote:

 ... big snip

 I don't know much about this, just a few more test cases

 comma and newline
 str =  '1, 2, 3, 4,\n5, 6, 7, 8,\n9, 10, 11, 12'

 extra comma at end of file
 str =  '1, 2, 3, 4,\n5, 6, 7, 8,\n9, 10, 11, 12,'

 extra newlines at end of file
 str =  '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n\n\n'

 It would be nice if these cases would go through without missing
 values or exception, but I don't often have files that are clean
 enough for fromfile().

+1 (ignoring new-lines transparently is a nice feature).  You can also  
use sscanf with weave to read most files.


 I'm in favor of nan for missing values with floating point numbers. It
 would make it easy to read correctly formatted csv files, even if the
 data is not complete.

+1   (much preferable to insert NaN or another user value than to raise
ValueError, in my opinion)

-Travis



Re: [Numpy-discussion] fromfile() for reading text (one more time!)

2010-01-07 Thread Christopher Barker
Bruce Southey wrote:

 Using the numpy NaN or similar (noting R's approach to missing values
 which in turn allows it to have the above functionality) is just a
 very bad idea for missing values because you always have to check that
 which NaN is a missing value and which was due to some numerical
 calculation.

well, this is specific to reading files, so you know where it came from. 
And the principle of fromfile() is that it is fast and simple; if you 
want masked arrays, use slower but more full-featured methods.

However, in this case:

In [9]: np.fromstring("3, 4, NaN, 5", sep=",")
Out[9]: array([  3.,   4.,  NaN,   5.])


An actual NaN is read from the file, rather than a missing value. 
Perhaps the user does want the distinction, so maybe it should really 
only fill it in if the user asks for it, by specifying 
missing_value=np.nan or something.

From what I can see, you expect that fromfile() should only
 split at the supplied delimiters, optionally(?) strip any whitespace

whitespace stripping is not optional.

 Your output from this string '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12'
 actually assumes multiple delimiters because there is no comma between
 4 and 5 and 8 and 9.

Yes, that's the point. I thought about allowing arbitrary multiple 
delimiters, but I think '\n' is a special case - for instance, a comma 
at the end of some numbers might mean missing data, but a '\n' would not.

And I couldn't really think of a useful use-case for arbitrary multiple 
delimiters.

 In Josef's last case, how many 'missing values' should there be?

  extra newlines at end of file
  str =  '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n\n\n'

none -- exactly why I think \n is a special case.

What about:
  extra newlines in the middle of the file
  str =  '1, 2, 3, 4\n\n5, 6, 7, 8\n9, 10, 11, 12\n'

I think they should be ignored, but I hope I'm not making something that 
is too specific to my personal needs.
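[Editor's note: that ignore-blank-lines behavior is easy to state as a tiny
pre-filter -- a sketch of the intent, not the proposed C patch:]

```python
text = '1, 2, 3, 4\n\n5, 6, 7, 8\n9, 10, 11, 12\n'

# Drop empty lines before handing the text to the parser.
cleaned = '\n'.join(line for line in text.splitlines() if line.strip())

print(repr(cleaned))  # '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12'
```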

Travis Oliphant wrote:
 +1 (ignoring new-lines transparently is a nice feature).  You can also  
 use sscanf with weave to read most files.

right -- but that requires weave. In fact, MATLAB has a fscanf function 
that allows you to pass in a C format string and it vectorizes it to use 
the same one over an over again until it's done. It's actually quite 
powerful and flexible. I once started with that in mind, but didn't have 
the C chops to do it. I ended up with a tool that only did doubles (come 
to think of it, MATLAB only does doubles, anyway...)

I may some day write a whole new C (or, more likely, Cython) function 
that does something like that, but for now, I'm just trying to get 
fromfile to be useful for me.


 +1   (much preferable to insert NaN or another user value than to raise
 ValueError, in my opinion)

But raise an error for integer types?

I guess this is still up the air -- no consensus yet.

Thanks,

-Chris











Re: [Numpy-discussion] fromfile() for reading text (one more time!)

2010-01-07 Thread josef.pktd
On Thu, Jan 7, 2010 at 4:45 PM, Christopher Barker
chris.bar...@noaa.gov wrote:
 Bruce Southey wrote:
 chris.bar...@noaa.gov wrote:

 Using the numpy NaN or similar (noting R's approach to missing values
 which in turn allows it to have the above functionality) is just a
 very bad idea for missing values because you always have to check
 which NaN is a missing value and which was due to some numerical
 calculation.

 well, this is specific to reading files, so you know where it came from.
 And the principle of fromfile() is that it is fast and simple, if you
 want masked arrays, use slower, but more full-featured methods.

 However, in this case:

 In [9]: np.fromstring("3, 4, NaN, 5", sep=",")
 Out[9]: array([  3.,   4.,  NaN,   5.])


 An actual NaN is read from the file, rather than a missing value.
 Perhaps the user does want the distinction, so maybe it should really
 only fill it in if the user asks for it, by specifying
 missing_value=np.nan or something.

 From what I can see, you expect that fromfile() should only
 split at the supplied delimiters, optionally(?) strip any whitespace

 whitespace stripping is not optional.

 Your output from this string '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12'
 actually assumes multiple delimiters because there is no comma between
 4 and 5 and 8 and 9.

 Yes, that's the point. I thought about allowing arbitrary multiple
 delimiters, but I think '\n' is a special case - for instance, a comma
 at the end of some numbers might mean missing data, but a '\n' would not.

 And I couldn't really think of a useful use-case for arbitrary multiple
 delimiters.

  In Josef's last case, how many 'missing values' should there be?

   extra newlines at end of file
   str =  '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n\n\n'

 none -- exactly why I think \n is a special case.

 What about:
   extra newlines in the middle of the file
   str =  '1, 2, 3, 4\n\n5, 6, 7, 8\n9, 10, 11, 12\n'

 I think they should be ignored, but I hope I'm not making something that
 is too specific to my personal needs.

 Travis Oliphant wrote:
 +1 (ignoring new-lines transparently is a nice feature).  You can also
 use sscanf with weave to read most files.

 right -- but that requires weave. In fact, MATLAB has a fscanf function
 that allows you to pass in a C format string and it vectorizes it to use
  the same one over and over again until it's done. It's actually quite
 powerful and flexible. I once started with that in mind, but didn't have
 the C chops to do it. I ended up with a tool that only did doubles (come
 to think of it, MATLAB only does doubles, anyway...)

 I may some day write a whole new C (or, more likely, Cython) function
  that does something like that, but for now, I'm just trying to get
 fromfile to be useful for me.


  +1   (much preferable to insert NaN or other user value than raise
 ValueError in my opinion)

 But raise an error for integer types?

  I guess this is still up in the air -- no consensus yet.

Raise an exception; I hate the silent cast of NaN to integer zero: it
means too much debugging, and it is useless if there are real zeros.
(Or use some -999 kind of code if user-defined NaN codes are allowed,
but I just work with floats if I expect NaNs/missing values.)

Josef


 Thanks,

 -Chris









 --
 Christopher Barker, Ph.D.
 Oceanographer

 Emergency Response Division
 NOAA/NOS/ORR            (206) 526-6959   voice
 7600 Sand Point Way NE   (206) 526-6329   fax
 Seattle, WA  98115       (206) 526-6317   main reception

 chris.bar...@noaa.gov


Re: [Numpy-discussion] fromfile() for reading text (one more time!)

2010-01-07 Thread Christopher Barker
josef.p...@gmail.com wrote:
  +1   (much preferable to insert NaN or other user value than raise
 ValueError in my opinion)

 But raise an error for integer types?

  I guess this is still up in the air -- no consensus yet.
 
 raise an exception, I hate the silent cast of nan to integer zero,

me too -- I'm sorry, I wasn't clear -- I'm not going to write any code 
that returns a zero for a missing value. These are the options I'd consider:

1) Have the user specify what to use for missing values, otherwise, 
raise an exception

2) Insert a NaN for floating points types, and raise an exception for 
integer types.

what's not clear is whether (2) is a good idea. As for (1), I just don't 
know if I'm going to get around to writing the code, and maybe more 
kwargs is a bad idea -- though maybe not.
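As a sketch of those two options (pure Python with a hypothetical helper name, not the actual C implementation): option (1) uses a caller-supplied fill value, and option (2) defaults floats to NaN while integer dtypes raise, since they have no NaN:

```python
import numpy as np

def convert_fields(tokens, dtype=float, missing=None):
    """Sketch of the two policies for empty (missing) fields:
    option (1): the caller supplies `missing`;
    option (2): float dtypes default to NaN, integer dtypes raise."""
    out = []
    for tok in tokens:
        tok = tok.strip()
        if tok:
            out.append(dtype(tok))
        elif missing is not None:
            out.append(missing)                    # option (1)
        elif np.issubdtype(np.dtype(dtype), np.floating):
            out.append(np.nan)                     # option (2), float case
        else:
            raise ValueError("missing value in integer data")
    return np.array(out, dtype=dtype)
```

For example, `convert_fields(["3", "4", "", "5"])` yields a float array with NaN in the third slot, while the same call with `dtype=int` raises unless a `missing` code is given.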

Enough talk: I've got ugly C code to wade through...

-Chris


-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR             (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Numpy-discussion] fromfile() for reading text (one more time!)

2010-01-07 Thread Bruce Southey
On Thu, Jan 7, 2010 at 3:45 PM, Christopher Barker
chris.bar...@noaa.gov wrote:
 Bruce Southey wrote:
 chris.bar...@noaa.gov wrote:

 Using the numpy NaN or similar (noting R's approach to missing values
 which in turn allows it to have the above functionality) is just a
 very bad idea for missing values because you always have to check
 which NaN is a missing value and which was due to some numerical
 calculation.

 well, this is specific to reading files, so you know where it came from.

You can only know where it came from when you compare the original
array to the transformed one. Also a user has to check for missing
values or numpy has to warn a user that missing values are present
immediately after reading the data so the appropriate action can be
taken (like using functions that handle missing values appropriately).
That is my second problem with using codes (NaN, -9 etc)  for
missing values.



 And the principle of fromfile() is that it is fast and simple, if you
 want masked arrays, use slower, but more full-featured methods.

So in that case it should fail with missing data.


 However, in this case:

  In [9]: np.fromstring("3, 4, NaN, 5", sep=",")
 Out[9]: array([  3.,   4.,  NaN,   5.])


 An actual NaN is read from the file, rather than a missing value.
 Perhaps the user does want the distinction, so maybe it should really
  only fill it in if the user asks for it, by specifying
 missing_value=np.nan or something.

Yes, that is my first problem with using predefined codes for missing
values, as you do not always know what is going to occur in the data.



 From what I can see, you expect that fromfile() should only
 split at the supplied delimiters, optionally(?) strip any whitespace

 whitespace stripping is not optional.

 Your output from this string '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12'
 actually assumes multiple delimiters because there is no comma between
 4 and 5 and 8 and 9.

 Yes, that's the point. I thought about allowing arbitrary multiple
  delimiters, but I think '\n' is a special case - for instance, a comma
 at the end of some numbers might mean missing data, but a '\n' would not.

 And I couldn't really think of a useful use-case for arbitrary multiple
 delimiters.

  In Josef's last case, how many 'missing values' should there be?

   extra newlines at end of file
   str =  '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n\n\n'

 none -- exactly why I think \n is a special case.

What about '\r' and '\r\n'?


 What about:
   extra newlines in the middle of the file
   str =  '1, 2, 3, 4\n\n5, 6, 7, 8\n9, 10, 11, 12\n'

 I think they should be ignored, but I hope I'm not making something that
 is too specific to my personal needs.

Not really, it is more that I am being somewhat difficult to ensure I
understand what you actually need.

My problem with this is that you are reading one huge 1-D array  (that
you can resize later) rather than a 2-D array with rows and columns
(which is what I deal with). But I agree that you can have an option
to say treat '\n' or '\r' as a delimiter but I think it should be
turned off by default.



 Travis Oliphant wrote:
 +1 (ignoring new-lines transparently is a nice feature).  You can also
 use sscanf with weave to read most files.

 right -- but that requires weave. In fact, MATLAB has a fscanf function
 that allows you to pass in a C format string and it vectorizes it to use
  the same one over and over again until it's done. It's actually quite
 powerful and flexible. I once started with that in mind, but didn't have
 the C chops to do it. I ended up with a tool that only did doubles (come
 to think of it, MATLAB only does doubles, anyway...)

 I may some day write a whole new C (or, more likely, Cython) function
  that does something like that, but for now, I'm just trying to get
 fromfile to be useful for me.


  +1   (much preferable to insert NaN or other user value than raise
 ValueError in my opinion)

 But raise an error for integer types?

  I guess this is still up in the air -- no consensus yet.

 Thanks,

 -Chris


You should have a corresponding value for ints, because raising an
exception would be inconsistent with allowing floats to have a value.
If you must keep the user-defined dtype then, as Josef suggests, just
use some code, be it -999 or the most negative number supported by the
OS for the defined dtype, or just convert the ints into floats if the
user does not define a missing value code. It would be nice to either
return the number of missing values or display a warning indicating
how many occurred.

Bruce


Re: [Numpy-discussion] fromfile() for reading text (one more time!)

2010-01-07 Thread josef.pktd
On Thu, Jan 7, 2010 at 11:10 PM, Bruce Southey bsout...@gmail.com wrote:
 On Thu, Jan 7, 2010 at 3:45 PM, Christopher Barker
 chris.bar...@noaa.gov wrote:
 Bruce Southey wrote:
 chris.bar...@noaa.gov wrote:

 Using the numpy NaN or similar (noting R's approach to missing values
 which in turn allows it to have the above functionality) is just a
  very bad idea for missing values because you always have to check
 which NaN is a missing value and which was due to some numerical
 calculation.

 well, this is specific to reading files, so you know where it came from.

 You can only know where it came from when you compare the original
 array to the transformed one. Also a user has to check for missing
 values or numpy has to warn a user that missing values are present
 immediately after reading the data so the appropriate action can be
 taken (like using functions that handle missing values appropriately).
 That is my second problem with using codes (NaN, -9 etc)  for
 missing values.



 And the principle of fromfile() is that it is fast and simple, if you
 want masked arrays, use slower, but more full-featured methods.

 So in that case it should fail with missing data.


 However, in this case:

  In [9]: np.fromstring("3, 4, NaN, 5", sep=",")
 Out[9]: array([  3.,   4.,  NaN,   5.])


 An actual NaN is read from the file, rather than a missing value.
 Perhaps the user does want the distinction, so maybe it should really
  only fill it in if the user asks for it, by specifying
 missing_value=np.nan or something.

  Yes, that is my first problem with using predefined codes for missing
  values, as you do not always know what is going to occur in the data.



  From what I can see, you expect that fromfile() should only
 split at the supplied delimiters, optionally(?) strip any whitespace

 whitespace stripping is not optional.

 Your output from this string '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12'
 actually assumes multiple delimiters because there is no comma between
 4 and 5 and 8 and 9.

 Yes, that's the point. I thought about allowing arbitrary multiple
  delimiters, but I think '\n' is a special case - for instance, a comma
 at the end of some numbers might mean missing data, but a '\n' would not.

 And I couldn't really think of a useful use-case for arbitrary multiple
 delimiters.

  In Josef's last case, how many 'missing values' should there be?

   extra newlines at end of file
   str =  '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n\n\n'

 none -- exactly why I think \n is a special case.

 What about '\r' and '\r\n'?

Yes, I forgot about this, and it will be the most common case for
Windows users like myself.

I think \r should be stripped automatically, like in non-binary
reading of files in python.
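That is indeed what text (non-binary) mode does: in today's Python 3 it is the default, and in the Python 2 of this thread, opening with mode 'U' did the same. A minimal check of the translation, using an in-memory stream rather than a real file:

```python
import io

# Universal-newline text mode turns '\r\n' (and a lone '\r') into '\n',
# so a parser that special-cases only '\n' handles Windows files too.
raw = b"1, 2, 3, 4\r\n5, 6, 7, 8\r\n"
text = io.TextIOWrapper(io.BytesIO(raw)).read()
assert "\r" not in text
assert text == "1, 2, 3, 4\n5, 6, 7, 8\n"
```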



 What about:
   extra newlines in the middle of the file
   str =  '1, 2, 3, 4\n\n5, 6, 7, 8\n9, 10, 11, 12\n'

 I think they should be ignored, but I hope I'm not making something that
 is too specific to my personal needs.

 Not really, it is more that I am being somewhat difficult to ensure I
 understand what you actually need.

 My problem with this is that you are reading one huge 1-D array  (that
 you can resize later) rather than a 2-D array with rows and columns
 (which is what I deal with). But I agree that you can have an option
 to say treat '\n' or '\r' as a delimiter but I think it should be
 turned off by default.



 Travis Oliphant wrote:
 +1 (ignoring new-lines transparently is a nice feature).  You can also
 use sscanf with weave to read most files.

 right -- but that requires weave. In fact, MATLAB has a fscanf function
 that allows you to pass in a C format string and it vectorizes it to use
  the same one over and over again until it's done. It's actually quite
 powerful and flexible. I once started with that in mind, but didn't have
 the C chops to do it. I ended up with a tool that only did doubles (come
 to think of it, MATLAB only does doubles, anyway...)

 I may some day write a whole new C (or, more likely, Cython) function
  that does something like that, but for now, I'm just trying to get
 fromfile to be useful for me.


  +1   (much preferable to insert NaN or other user value than raise
 ValueError in my opinion)

 But raise an error for integer types?

  I guess this is still up in the air -- no consensus yet.

 Thanks,

 -Chris


 You should have a corresponding value for ints because raising an
 exception would be inconsistent with allowing floats to have a value.

No, I think different nan/missing value handling between integers and
floats is a natural distinction. There is no default nan code for
integers, but nan (and inf) are valid floating point numbers (even if
nan is not a number). And the default treatment of nans in numpy is
getting pretty good (e.g. I like the new (nan)sort).


 If you must keep the user defined dtype then, as Josef suggests, just
 use some code be it -999 or most negative number supported by the OS
 for the defined dtype or, just convert the ints 

Re: [Numpy-discussion] fromfile() for reading text (one more time!)

2010-01-05 Thread Paul Ivanov
Christopher Barker, on 2010-01-04 17:05, wrote:
 Hi folks,
 
 I'm taking a look once again at fromfile() for reading text files. I 
 often have the need to read a LOT of numbers from a text file, and it 
 can actually be pretty darn slow to do it the normal Python way:
 
 for line in file:
 data = map(float, line.strip().split())
 
 
 or various other versions that are similar. It really does take longer 
 to read the text, split it up, convert to a number, then put that number 
 into a numpy array, than it does to simply read it straight into the array.
 
 However, as it stands, fromfile() turns out to be next to useless for 
 anything but whitespace separated text. Full set of ideas here:
 
 http://projects.scipy.org/numpy/ticket/909
 
 However, for the moment, I'm digging into the code to address a 
 particular problem -- reading files like this:
 
 123, 65.6, 789
 23,  3.2,  34
 ...
 
 That is comma (or whatever) separated text -- pretty common stuff.
 
 The problem with the current code is that you can't read more than one 
 line at a time with fromfile:
 
 a = np.fromfile(infile, sep=",")
 
 will read until it doesn't find a comma, and thus only one line, as 
 there is no comma after each line. As this is a really typical case, I 
 think it should be supported.

Just a potshot, but have you tried np.loadtxt?

I find it pretty fast.

 
 Here is the question:
 
 The work of finding the separator is done in:
 
 multiarray/ctors.c:  fromfile_skip_separator()
 
 It looks like it wouldn't be too hard to add some code in there to look 
 for a newline, and consider that a valid separator. However, that would 
 break backward compatibility. So maybe a flag could be passed in, saying 
 you wanted to support newlines. The problem is that flag would have to 
 get passed all the way through to this function (and also for fromstring).
 
 I also notice that it supports separators of arbitrary length, which I 
 wonder how useful that is. But it also does odd things with spaces 
 embedded in the separator:
 
 ", $ #" matches all of:  ",$#"   ", $#"  ",$ #"
 
 Is it worth trying to fix that?
 
 
 In the longer term, it would be really nice to support comments as well, 
 though that would require more of a re-factoring of the code, I think 
 (though maybe not -- I suppose a call to fromfile_skip_separator() could 
 look for a comment character, then if it found one, skip to where the 
 comment ends -- hmmm).
 
 thanks for any feedback,
 
 -Chris
 
 
 
 
 
 
 



Re: [Numpy-discussion] fromfile() for reading text (one more time!)

2010-01-05 Thread Christopher Barker
josef.p...@gmail.com wrote:
 On Mon, Jan 4, 2010 at 10:39 PM,  a...@ajackson.org wrote:
 I rather like the R command(s) for reading text files

 Aren't the newly improved
 
 numpy.genfromtxt()

...

 and friends intended to handle all this

Yes, they are, and they are great, but not really all that fast. If 
you've got big complicated tables of data to read, then genfromtxt is 
the way to go -- it's a great tool. However, for the simple stuff, it's 
not really optimized. I also find I have to read a lot of text files 
that aren't tables of data, but rather an odd mix of stuff that still 
involves reading lots of numbers from a file. As far as I can tell, 
genfromtxt and loadtxt can only load the entire file as a table (a very 
common situation, of course).


Paul Ivanov wrote:
 Just a potshot, but have you tried np.loadtxt?
 
 I find it pretty fast.

I guess I should have posted timings in the first place:

In [19]: timeit timing.time_genfromtxt()
10 loops, best of 3: 216 ms per loop

In [20]: timeit timing.time_loadtxt()
10 loops, best of 3: 166 ms per loop

In [21]: timeit timing.time_fromfile()
10 loops, best of 3: 47.1 ms per loop

(40,000 doubles from a space-delimited text file)

so fromfile() is 3.5 times as fast as loadtxt and 4.5 times as fast as 
genfromtxt. That does make a difference for me -- the user waiting 4 
seconds, rather than one second to load a file matters.
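The timing script itself wasn't posted; a comparable harness (the file name, shape, and loop count here are illustrative assumptions, matching the 40,000-double description) might look like:

```python
import os
import timeit

import numpy as np

# Build a throwaway space-delimited file of 40,000 doubles (10,000 x 4).
np.savetxt("timing_data.txt", np.random.rand(10000, 4), fmt="%.6f")

for stmt in ("np.genfromtxt('timing_data.txt')",
             "np.loadtxt('timing_data.txt')",
             "np.fromfile('timing_data.txt', sep=' ')"):
    per_loop = timeit.timeit(stmt, globals={"np": np}, number=3) / 3
    print(f"{stmt}: {per_loop * 1e3:.1f} ms per loop")

loaded = np.fromfile("timing_data.txt", sep=" ")
os.remove("timing_data.txt")
```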

I suppose another option might be to see if I can optimize the inner 
scanning function of genfromtxt with Cython or C, but I'm not sure 
that's possible, as it's really very flexible, and re-writing all of 
that without Python would be really painful!


-Chris





-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR             (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Numpy-discussion] fromfile() for reading text (one more time!)

2010-01-05 Thread Pierre GM
On Jan 5, 2010, at 12:32 PM, Christopher Barker wrote:
 josef.p...@gmail.com wrote:
 On Mon, Jan 4, 2010 at 10:39 PM,  a...@ajackson.org wrote:
 I rather like the R command(s) for reading text files
 
 Aren't the newly improved
 
 numpy.genfromtxt()
 
 ...
 
 and friends indented to handle all this
 
 Yes, they are, and they are great, but not really all that fast. If 
 you've got big complicated tables of data to read, then genfromtxt is 
 the way to go -- it's a great tool. However, for the simple stuff, it's 
 not really optimized.

genfromtxt is nothing but loadtxt overloaded to deal with undefined dtype and 
missing entries. It's doomed to be slower, and it shouldn't be used if you know 
your data is well-defined and well-behaved. Stick to loadtxt.

 I also find I have to read a lot of text files 
 that aren't tables of data, but rather an odd mix of stuff, but still a 
 lot of reading lots of numbers from a file.

Well, everything depends on what kind of stuff you have in your mix, I guess...

 so fromfile() is 3.5 times as fast as loadtxt and 4.5 times as fast as 
 genfromtxt. That does make a difference for me -- the user waiting 4 
 seconds, rather than one second to load a file matters.

Remember that fromfile is C while loadtxt and genfromtxt are Python...

 I suppose another option might be to see if I can optimize the inner 
 scanning function of genfromtxt with Cython or C, but I'm not sure 
 that's possible, as it's really very flexible, and re-writing all of 
 that without Python would be really painful!


Well, there's room for some optimization for particular cases (dtype!=None), 
but the generic case will be tricky...




Re: [Numpy-discussion] fromfile() for reading text (one more time!)

2010-01-05 Thread Pauli Virtanen
ma, 2010-01-04 kello 17:05 -0800, Christopher Barker kirjoitti:
[clip]
 I also notice that it supports separators of arbitrary length, which I 
 wonder how useful that is. But it also does odd things with spaces 
 embedded in the separator:
 
 ", $ #" matches all of:  ",$#"   ", $#"  ",$ #"
 
 Is it worth trying to fix that?

That's a documented feature:

sep : str
Separator between items if file is a text file.
Empty ("") separator means the file should be treated as binary.
Spaces (" ") in the separator match zero or more whitespace
characters. A separator consisting only of spaces must match at
least one whitespace.
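A quick demonstration of that rule with fromstring (the mixed separator is just an illustration; note that newer NumPy deprecates text-mode fromstring in favor of the dedicated text readers):

```python
import numpy as np

# In the separator " , " the spaces match zero or more whitespace
# characters and the comma is literal, so all of these parse the same.
for text in ("1,2,3", "1 , 2 , 3", "1  ,2,   3"):
    assert np.fromstring(text, sep=" , ").tolist() == [1.0, 2.0, 3.0]
```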

-- 
Pauli Virtanen





Re: [Numpy-discussion] fromfile() for reading text (one more time!)

2010-01-04 Thread alan
Hi folks,

I'm taking a look once again at fromfile() for reading text files. I 
often have the need to read a LOT of numbers from a text file, and it 
can actually be pretty darn slow to do it the normal Python way:

for line in file:
data = map(float, line.strip().split())


or various other versions that are similar. It really does take longer 
to read the text, split it up, convert to a number, then put that number 
into a numpy array, than it does to simply read it straight into the array.
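As written, that loop discards each line's values; a complete version of the "normal Python way" (a sketch, assuming a file of whitespace-separated floats) is:

```python
import numpy as np

def read_floats_slow(path):
    """The plain-Python approach: split and convert line by line,
    then copy the accumulated list into an array at the end."""
    data = []
    with open(path) as f:
        for line in f:
            data.extend(float(tok) for tok in line.split())
    return np.array(data)

# The fast path under discussion, usable today only for
# whitespace-separated text:
#     a = np.fromfile(path, sep=" ")
```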

However, as it stands, fromfile() turns out to be next to useless for 
anything but whitespace separated text. Full set of ideas here:

http://projects.scipy.org/numpy/ticket/909

However, for the moment, I'm digging into the code to address a 
particular problem -- reading files like this:

123, 65.6, 789
23,  3.2,  34
...

That is comma (or whatever) separated text -- pretty common stuff.

The problem with the current code is that you can't read more than one 
line at a time with fromfile:

a = np.fromfile(infile, sep=",")

will read until it doesn't find a comma, and thus only one line, as 
there is no comma after each line. As this is a really typical case, I 
think it should be supported.

Here is the question:

The work of finding the separator is done in:

multiarray/ctors.c:  fromfile_skip_separator()

It looks like it wouldn't be too hard to add some code in there to look 
for a newline, and consider that a valid separator. However, that would 
break backward compatibility. So maybe a flag could be passed in, saying 
you wanted to support newlines. The problem is that flag would have to 
get passed all the way through to this function (and also for fromstring).

I also notice that it supports separators of arbitrary length, which I 
wonder how useful that is. But it also does odd things with spaces 
embedded in the separator:

", $ #" matches all of:  ",$#"   ", $#"  ",$ #"

Is it worth trying to fix that?


In the longer term, it would be really nice to support comments as well, 
though that would require more of a re-factoring of the code, I think 
(though maybe not -- I suppose a call to fromfile_skip_separator() could 
look for a comment character, then if it found one, skip to where the 
comment ends -- hmmm).

thanks for any feedback,

-Chris


I agree. I've tried using it, and usually find that it doesn't quite get there.

I rather like the R command(s) for reading text files - except then I have to
use R which is painful after using python and numpy. Although ggplot2 is
awfully nice too ... but that is a later post.

 read.table(file, header = FALSE, sep = "", quote = "\"'",
            dec = ".", row.names, col.names,
            as.is = !stringsAsFactors,
            na.strings = "NA", colClasses = NA, nrows = -1,
            skip = 0, check.names = TRUE, fill = !blank.lines.skip,
            strip.white = FALSE, blank.lines.skip = TRUE,
            comment.char = "#",
            allowEscapes = FALSE, flush = FALSE,
            stringsAsFactors = default.stringsAsFactors(),
            fileEncoding = "", encoding = "unknown")

 read.csv(file, header = TRUE, sep = ",", quote = "\"", dec = ".",
          fill = TRUE, comment.char = "", ...)

 read.csv2(file, header = TRUE, sep = ";", quote = "\"", dec = ",",
           fill = TRUE, comment.char = "", ...)

 read.delim(file, header = TRUE, sep = "\t", quote = "\"", dec = ".",
            fill = TRUE, comment.char = "", ...)

 read.delim2(file, header = TRUE, sep = "\t", quote = "\"", dec = ",",
             fill = TRUE, comment.char = "", ...)


There is really only read.table, the others are just aliases with different
defaults.  But the flexibility is great, as you can see.

-- 
---
| Alan K. Jackson| To see a World in a Grain of Sand  |
| a...@ajackson.org  | And a Heaven in a Wild Flower, |
| www.ajackson.org   | Hold Infinity in the palm of your hand |
| Houston, Texas | And Eternity in an hour. - Blake   |
---


Re: [Numpy-discussion] fromfile() for reading text (one more time!)

2010-01-04 Thread josef.pktd
On Mon, Jan 4, 2010 at 10:39 PM,  a...@ajackson.org wrote:
Hi folks,

I'm taking a look once again at fromfile() for reading text files. I
often have the need to read a LOT of numbers from a text file, and it
can actually be pretty darn slow to do it the normal Python way:

for line in file:
    data = map(float, line.strip().split())


or various other versions that are similar. It really does take longer
to read the text, split it up, convert to a number, then put that number
into a numpy array, than it does to simply read it straight into the array.

However, as it stands, fromfile() turns out to be next to useless for
anything but whitespace separated text. Full set of ideas here:

http://projects.scipy.org/numpy/ticket/909

However, for the moment, I'm digging into the code to address a
particular problem -- reading files like this:

123, 65.6, 789
23,  3.2,  34
...

That is comma (or whatever) separated text -- pretty common stuff.

The problem with the current code is that you can't read more than one
line at a time with fromfile:

a = np.fromfile(infile, sep=",")

will read until it doesn't find a comma, and thus only one line, as
there is no comma after each line. As this is a really typical case, I
think it should be supported.

Here is the question:

The work of finding the separator is done in:

multiarray/ctors.c:  fromfile_skip_separator()

It looks like it wouldn't be too hard to add some code in there to look
for a newline, and consider that a valid separator. However, that would
break backward compatibility. So maybe a flag could be passed in, saying
you wanted to support newlines. The problem is that flag would have to
get passed all the way through to this function (and also for fromstring).

I also notice that it supports separators of arbitrary length, which I
wonder how useful that is. But it also does odd things with spaces
embedded in the separator:

", $ #" matches all of:  ",$#"   ", $#"  ",$ #"

Is it worth trying to fix that?


In the longer term, it would be really nice to support comments as well,
though that would require more of a re-factoring of the code, I think
(though maybe not -- I suppose a call to fromfile_skip_separator() could
look for a comment character, then if it found one, skip to where the
comment ends -- hmmm).

thanks for any feedback,

-Chris


 I agree. I've tried using it, and usually find that it doesn't quite get 
 there.

 I rather like the R command(s) for reading text files - except then I have to
 use R which is painful after using python and numpy. Although ggplot2 is
 awfully nice too ... but that is a later post.

      read.table(file, header = FALSE, sep = "", quote = "\"'",
                 dec = ".", row.names, col.names,
                 as.is = !stringsAsFactors,
                 na.strings = "NA", colClasses = NA, nrows = -1,
                 skip = 0, check.names = TRUE, fill = !blank.lines.skip,
                 strip.white = FALSE, blank.lines.skip = TRUE,
                 comment.char = "#",
                 allowEscapes = FALSE, flush = FALSE,
                 stringsAsFactors = default.stringsAsFactors(),
                 fileEncoding = "", encoding = "unknown")

      read.csv(file, header = TRUE, sep = ",", quote = "\"", dec = ".",
               fill = TRUE, comment.char = "", ...)

      read.csv2(file, header = TRUE, sep = ";", quote = "\"", dec = ",",
                fill = TRUE, comment.char = "", ...)

      read.delim(file, header = TRUE, sep = "\t", quote = "\"", dec = ".",
                 fill = TRUE, comment.char = "", ...)

      read.delim2(file, header = TRUE, sep = "\t", quote = "\"", dec = ",",
                  fill = TRUE, comment.char = "", ...)


 There is really only read.table, the others are just aliases with different
 defaults.  But the flexibility is great, as you can see.


Aren't the newly improved

numpy.genfromtxt(fname, dtype=&lt;type 'float'&gt;, comments='#',
delimiter=None, skiprows=0, converters=None, missing='',
missing_values=None, usecols=None, names=None, excludelist=None,
deletechars=None, case_sensitive=True, unpack=None, usemask=False,
loose=True)

and friends intended to handle all this

Josef


 --
 ---
 | Alan K. Jackson            | To see a World in a Grain of Sand      |
 | a...@ajackson.org          | And a Heaven in a Wild Flower,         |
 | www.ajackson.org           | Hold Infinity in the palm of your hand |
 | Houston, Texas             | And Eternity in an hour. - Blake       |
 ---