Re: [Numpy-discussion] fromfile() for reading text (one more time!)
On Sat, Jan 23, 2010 at 1:53 PM, Christopher Barker wrote:
> a...@ajackson.org wrote:
>> it doesn't support the reasonable (IMHO) behavior of treating quote
>> delimited strings in the input file as a single field.
>
> I'd use the csv module for that.
>
> Which makes me wonder if it might make sense to build some of the numpy
> table-reading stuff on top of it...
>
> -Chris

csv was also my standard module for this; it handles csv dialects and
unicode (with some detour), but having automatic conversion in
genfromtxt is nicer.

>>> reader = csv.reader(open(r'C:\Josef\work-oth\testdata.csv', 'rb'),
...                     delimiter=' ')
>>> for line in reader:
...     print line
...
['Greenmantle', '2.5', '650', '16.083']
['Carnethy', '6', '2500', '48.35']
['Craig Dunain', '6', '900', '33.65']
['Ben Rha', '7.5', '800', '45.6']
['Ben Lomond', '8', '3070', '62.267']
['Goatfell', '8', '2866', '73.217']
['Bens of Jura', '16', '7500', '204.617']
['Cairnpapple', '6', '800', '36.367']
['Scolty', '5', '800', '29.75']
['Traprain', '6', '650', '39.75']
['Lairig Ghru', '28', '2100', '192.667']

Josef

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
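For reference, Josef's csv approach plus a manual conversion step looks like this in modern Python 3 (a sketch on inline data standing in for testdata.csv, which isn't available here):

```python
import csv
import io

# Inline stand-in for Josef's testdata.csv: space-delimited, with the
# first field quoted so embedded spaces survive as one field.
data = '"Greenmantle" 2.5 650 16.083\n"Bens of Jura" 16 7500 204.617\n'

reader = csv.reader(io.StringIO(data), delimiter=' ', quotechar='"')
# csv keeps each quoted field whole; convert the numeric columns by hand.
rows = [[r[0]] + [float(x) for x in r[1:]] for r in reader]
print(rows)
```

This is the "automatic conversion" piece that genfromtxt provides and plain csv does not: csv hands back strings, so the float() step has to be added by the caller.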
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
a...@ajackson.org wrote:
> it doesn't support the reasonable (IMHO) behavior of treating quote
> delimited strings in the input file as a single field.

I'd use the csv module for that.

Which makes me wonder if it might make sense to build some of the numpy
table-reading stuff on top of it...

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

NOAA/OR&R/HAZMAT    (206) 526-6959 voice
7600 Sand Point Way NE    (206) 526-6329 fax
Seattle, WA 98115    (206) 526-6317 main reception
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
>On Mon, Jan 4, 2010 at 10:39 PM, wrote:
>>>Hi folks,
>>>
>>>I'm taking a look once again at fromfile() for reading text files. I
>>>often have the need to read a LOT of numbers from a text file, and it
>>>can actually be pretty darn slow to do it the normal python way:
>>> .big snip
>>
>> I agree. I've tried using it, and usually find that it doesn't quite
>> get there.
>>
>> I rather like the R command(s) for reading text files - except then I
>> have to use R, which is painful after using python and numpy. Although
>> ggplot2 is awfully nice too ... but that is a later post.
>>
>> read.table(file, header = FALSE, sep = "", quote = "\"'",
>>            dec = ".", row.names, col.names,
>>            as.is = !stringsAsFactors,
>>            na.strings = "NA", colClasses = NA, nrows = -1,
>>            skip = 0, check.names = TRUE, fill = !blank.lines.skip,
>>            strip.white = FALSE, blank.lines.skip = TRUE,
>>            comment.char = "#",
>>            allowEscapes = FALSE, flush = FALSE,
>>            stringsAsFactors = default.stringsAsFactors(),
>>            fileEncoding = "", encoding = "unknown")
... big snip
>
>Aren't the newly improved
>
>numpy.genfromtxt(fname, dtype=, comments='#', delimiter=None,
>skiprows=0, converters=None, missing='', missing_values=None,
>usecols=None, names=None, excludelist=None, deletechars=None,
>case_sensitive=True, unpack=None, usemask=False, loose=True)
>
>and friends intended to handle all this?
>
>Josef

Reopening an old thread...

genfromtxt is a big step forward. Something I'm fiddling with is trying
to work through the book "Using R for Data Analysis and Graphics,
Introduction, Code, and Commentary" by J H Maindonald (available
online), in python. So I am trying to see what it takes in python/numpy
to work his examples and problems, sort of a learning exercise for me.

So anyway, with that introduction, here is a case that I believe
genfromtxt fails on, because it doesn't support the reasonable (IMHO)
behavior of treating quote delimited strings in the input file as a
single field.
Below is the example from the book... So we have 2 issues. The header
for the first field is quote-blank-quote, and various values for field
one have 1 to 3 blank delimited strings, but encapsulated in quotes. I'm
putting something together to read it using shlex.split, since it honors
strings protected by quote pairs. I'm not an Excel person, but I think
it might export data in a format similar to what is shown below.

" " "distance" "climb" "time"
"Greenmantle" 2.5 650 16.083
"Carnethy" 6 2500 48.35
"Craig Dunain" 6 900 33.65
"Ben Rha" 7.5 800 45.6
"Ben Lomond" 8 3070 62.267
"Goatfell" 8 2866 73.217
"Bens of Jura" 16 7500 204.617
"Cairnpapple" 6 800 36.367
"Scolty" 5 800 29.75
"Traprain" 6 650 39.75
"Lairig Ghru" 28 2100 192.667

--
-----------------------------------------------------------------
| Alan K. Jackson   | To see a World in a Grain of Sand         |
| a...@ajackson.org | And a Heaven in a Wild Flower,            |
| www.ajackson.org  | Hold Infinity in the palm of your hand    |
| Houston, Texas    | And Eternity in an hour. - Blake          |
-----------------------------------------------------------------
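The shlex.split approach Alan mentions handles both issues (the quote-blank-quote header and the multi-word quoted names); a minimal sketch on a few of the rows above:

```python
import shlex

# A few lines of the book's data, inline instead of read from a file.
lines = [
    '" " "distance" "climb" "time"',
    '"Ben Lomond" 8 3070 62.267',
    '"Bens of Jura" 16 7500 204.617',
]

# shlex.split honors quote pairs, so '"Bens of Jura"' stays one token
# and the quoted-blank header field survives as ' '.
header = shlex.split(lines[0])
rows = [shlex.split(line) for line in lines[1:]]
records = [(r[0], float(r[1]), float(r[2]), float(r[3])) for r in rows]
print(header)
print(records)
```

The same effect is available from the csv module with delimiter=' ' and the default quotechar, as discussed elsewhere in this thread; shlex is just the more general shell-style tokenizer.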
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
Pauli Virtanen wrote:
> I don't really like handling newlines specially. For instance, I could
> have data like
>
> 1, 2, 3;
> 4, 5, 6;
> 7, 8, 9;
>
> Allowing an "alternative separator" would sound better to me. The above
> data could then be read like
>
> fromfile('foo.txt', sep=' , ', sep2=' ; ')
>
> or perhaps
>
> fromfile('foo.txt', sep=[' , ', ' ; '])

I like this syntax better, but:

1) Yes, you "could" have data like that, but do you? I've never seen it.
Maybe others have.

2) If you did, it would probably indicate something the user would want
preserved, like the shape of the array.

And newlines really are a special case -- they have a special meaning,
and they are very, very common (universal, even)!

So, it's just more code than I'm probably going to write. If someone
does want to write more code than I do, it would probably make sense to
do what someone suggested in the ticket: write an optimized version of
loadtxt in C.

Anyway, I'll think about it when I poke at the code more.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R    (206) 526-6959 voice
7600 Sand Point Way NE    (206) 526-6329 fax
Seattle, WA 98115    (206) 526-6317 main reception

chris.bar...@noaa.gov
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
Fri, 2010-01-08 at 15:12 -0800, Christopher Barker wrote:
> 1) optionally allow newlines to serve as a delimiter, so large tables
> can be read.

I don't really like handling newlines specially. For instance, I could
have data like

1, 2, 3;
4, 5, 6;
7, 8, 9;

Allowing an "alternative separator" would sound better to me. The above
data could then be read like

fromfile('foo.txt', sep=' , ', sep2=' ; ')

or perhaps

fromfile('foo.txt', sep=[' , ', ' ; '])

Since whitespace also matches newlines, this would work.

Pauli
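Pauli's multiple-separator idea can be prototyped in pure Python with a regular expression (a sketch of the semantics only, not the proposed C implementation):

```python
import re
import numpy as np

data = "1, 2, 3;\n4, 5, 6;\n7, 8, 9;"

# Split on commas, semicolons, or any run of whitespace (newlines
# included) -- mimicking sep=[' , ', ' ; '] from the proposal, where
# whitespace around the separator characters is absorbed.
tokens = [t for t in re.split(r'[,;\s]+', data) if t]
arr = np.array(tokens, dtype=float)
print(arr)
```

Note that this flattens the data to 1-D, which is exactly Chris's point (2): the separator that marked row boundaries carried shape information that is lost unless the reader records it.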
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
On Fri, Jan 8, 2010 at 5:12 PM, Christopher Barker wrote:
> Bruce Southey wrote:
>> Also a user has to check for missing values or numpy has to warn a user
>
> I think warnings are next to useless for all but interactive work -- so
> I don't want to rely on them.
>
>> that missing values are present immediately after reading the data so
>> the appropriate action can be taken (like using functions that handle
>> missing values appropriately). That is my second problem with using
>> codes (NaN, -9 etc) for missing values.
>
> But I think you're right -- if someone writes code, tests with good
> input, then later runs it on input with missing values, they are likely
> to have never bothered to test for missing values.
>
> So I think missing values should only be replaced by something if the
> user specifically asks for it.
>
>>> And the principle of fromfile() is that it is fast and simple; if you
>>> want masked arrays, use slower, but more full-featured methods.
>>
>> So in that case it should fail with missing data.
>
> Well, I'm not so sure -- the point is performance; there's no reason
> not to have high-performing code that handles missing data.
>
>> What about '\r' and '\n\r'?
>
> I have thought about that -- I'm hoping that python's text file reading
> will just take care of it, but as we're working with C file handles
> here (I think), I guess not. '\n\r' is easy -- the '\r' is just extra
> whitespace. '\r' alone is another case to handle.
>
>> My problem with this is that you are reading one huge 1-D array (that
>> you can resize later) rather than a 2-D array with rows and columns
>> (which is what I deal with).
>
> That's because fromfile() is not designed to be row-oriented at all,
> and the binary read certainly isn't. I'm just trying to make this easy
> -- though it's not turning out that way!
>
>> But I agree that you can have an option to say treat '\n' or '\r' as
>> a delimiter, but I think it should be turned off by default.
>
> that's what I've done.
>
>> You should have a corresponding value for ints because raising an
>> exception would be inconsistent with allowing floats to have a value.
>
> I'm not sure I care, really -- but I think having the user specify the
> fill value is the best option, anyway.
>
> josef.p...@gmail.com wrote:
>>>> none -- exactly why I think \n is a special case.
>>> What about '\r' and '\n\r'?
>>
>> Yes, I forgot about this, and it will be the most common case for
>> Windows users like myself.
>>
>> I think \r should be stripped automatically, like in non-binary
>> reading of files in python.
>
> except for folks like me that have old mac files laying around... so I
> want this to work like "Universal newlines" support.
>
>> A warning would be good, but doing np.any(np.isnan(x)) or
>> np.isnan(x).sum() on the result is always a good idea for a user when
>> missing values are a possibility.
>
> right, but the issue is the user has to know that they are possible,
> and we all know how carefully we all read docs!
>
> Thanks for your input -- I think I know what I'd like to do, but it's
> proving less than trivial to do it, so we'll see. In short:
>
> 1) optionally allow newlines to serve as a delimiter, so large tables
> can be read.
>
> 2) raise an exception for missing values, unless:
>
> 3) the user specifies a fill value of their choice (compatible with
> the chosen data type).
>
> -Chris

I fully agree with your approach! Thanks for considering my thoughts!

Bruce
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
Bruce Southey wrote:
> Also a user has to check for missing values or numpy has to warn a user

I think warnings are next to useless for all but interactive work -- so
I don't want to rely on them.

> that missing values are present immediately after reading the data so
> the appropriate action can be taken (like using functions that handle
> missing values appropriately). That is my second problem with using
> codes (NaN, -9 etc) for missing values.

But I think you're right -- if someone writes code, tests with good
input, then later runs it on input with missing values, they are likely
to have never bothered to test for missing values.

So I think missing values should only be replaced by something if the
user specifically asks for it.

>> And the principle of fromfile() is that it is fast and simple; if you
>> want masked arrays, use slower, but more full-featured methods.
>
> So in that case it should fail with missing data.

Well, I'm not so sure -- the point is performance; there's no reason not
to have high-performing code that handles missing data.

> What about '\r' and '\n\r'?

I have thought about that -- I'm hoping that python's text file reading
will just take care of it, but as we're working with C file handles here
(I think), I guess not. '\n\r' is easy -- the '\r' is just extra
whitespace. '\r' alone is another case to handle.

> My problem with this is that you are reading one huge 1-D array (that
> you can resize later) rather than a 2-D array with rows and columns
> (which is what I deal with).

That's because fromfile() is not designed to be row-oriented at all, and
the binary read certainly isn't. I'm just trying to make this easy --
though it's not turning out that way!

> But I agree that you can have an option to say treat '\n' or '\r' as a
> delimiter, but I think it should be turned off by default.

that's what I've done.

> You should have a corresponding value for ints because raising an
> exception would be inconsistent with allowing floats to have a value.

I'm not sure I care, really -- but I think having the user specify the
fill value is the best option, anyway.

josef.p...@gmail.com wrote:
>>> none -- exactly why I think \n is a special case.
>> What about '\r' and '\n\r'?
>
> Yes, I forgot about this, and it will be the most common case for
> Windows users like myself.
>
> I think \r should be stripped automatically, like in non-binary
> reading of files in python.

except for folks like me that have old mac files laying around... so I
want this to work like "Universal newlines" support.

> A warning would be good, but doing np.any(np.isnan(x)) or
> np.isnan(x).sum() on the result is always a good idea for a user when
> missing values are a possibility.

right, but the issue is the user has to know that they are possible, and
we all know how carefully we all read docs!

Thanks for your input -- I think I know what I'd like to do, but it's
proving less than trivial to do it, so we'll see. In short:

1) optionally allow newlines to serve as a delimiter, so large tables
can be read.

2) raise an exception for missing values, unless:

3) the user specifies a fill value of their choice (compatible with the
chosen data type).

-Chris
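Chris's three rules can be mocked up in pure Python to make the intended semantics concrete (a sketch only -- the name `fromtext` and its signature are hypothetical; the real work was planned in the C fromfile/fromstring code):

```python
import numpy as np

def fromtext(text, sep=',', allow_newlines=True, fill=None):
    """Hypothetical mockup of the proposed fromfile() text rules."""
    if allow_newlines:
        # Rule 1: treat universal newlines ('\n', '\r\n', '\r') as
        # extra delimiters; blank lines are ignored.
        text = text.replace('\r\n', '\n').replace('\r', '\n')
        rows = [r for r in text.split('\n') if r.strip()]
    else:
        rows = [text]
    values = []
    for row in rows:
        for tok in row.split(sep):
            tok = tok.strip()
            if not tok:
                if fill is None:
                    # Rule 2: missing value and no fill -> exception.
                    raise ValueError('missing value in input')
                values.append(fill)  # Rule 3: user-specified fill value
            else:
                values.append(float(tok))
    return np.array(values)

print(fromtext('1, 2, 3, 4\n5, 6, 7, 8'))
print(fromtext('3, 4,,5', fill=np.nan))
```

With `fill` left unset, `fromtext('3, 4,,5')` raises ValueError instead of silently producing a zero -- the behavior the thread argues for.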
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
On Thu, Jan 7, 2010 at 11:10 PM, Bruce Southey wrote:
> On Thu, Jan 7, 2010 at 3:45 PM, Christopher Barker wrote:
>> Bruce Southey wrote:
>>> Using the numpy NaN or similar (noting R's approach to missing
>>> values, which in turn allows it to have the above functionality) is
>>> just a very bad idea for missing values, because you always have to
>>> check which NaN is a missing value and which was due to some
>>> numerical calculation.
>>
>> well, this is specific to reading files, so you know where it came
>> from.
>
> You can only know where it came from when you compare the original
> array to the transformed one. Also a user has to check for missing
> values or numpy has to warn a user that missing values are present
> immediately after reading the data so the appropriate action can be
> taken (like using functions that handle missing values appropriately).
> That is my second problem with using codes (NaN, -9 etc) for missing
> values.
>
>> And the principle of fromfile() is that it is fast and simple; if you
>> want masked arrays, use slower, but more full-featured methods.
>
> So in that case it should fail with missing data.
>
>> However, in this case:
>>
>> In [9]: np.fromstring("3, 4, NaN, 5", sep=",")
>> Out[9]: array([ 3., 4., NaN, 5.])
>>
>> an actual NaN is read from the file, rather than a missing value.
>> Perhaps the user does want the distinction, so maybe it should really
>> only fill it in if the user asks for it, by specifying
>> "missing_value=np.nan" or something.
>
> Yes, that is my first problem with using predefined codes for missing
> values: you do not always know what is going to occur in the data.
>
>>> From what I can see, you expect that fromfile() should only split at
>>> the supplied delimiters, optionally(?) strip any whitespace
>>
>> whitespace stripping is not optional.
>>
>>> Your output from this string '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12'
>>> actually assumes multiple delimiters because there is no comma
>>> between 4 and 5 and 8 and 9.
>>
>> Yes, that's the point. I thought about allowing arbitrary multiple
>> delimiters, but I think '\n' is a special case -- for instance, a
>> comma at the end of some numbers might mean missing data, but a '\n'
>> would not.
>>
>> And I couldn't really think of a useful use-case for arbitrary
>> multiple delimiters.
>>
>>> In Josef's last case how many missing values should there be?
>>>
>>> extra newlines at end of file
>>>
>>> str = '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n\n\n'
>>
>> none -- exactly why I think \n is a special case.
>
> What about '\r' and '\n\r'?

Yes, I forgot about this, and it will be the most common case for
Windows users like myself. I think \r should be stripped automatically,
like in non-binary reading of files in python.

>> What about extra newlines in the middle of the file:
>>
>> str = '1, 2, 3, 4\n\n5, 6, 7, 8\n9, 10, 11, 12\n'
>>
>> I think they should be ignored, but I hope I'm not making something
>> that is too specific to my personal needs.
>
> Not really, it is more that I am being somewhat difficult to ensure I
> understand what you actually need.
>
> My problem with this is that you are reading one huge 1-D array (that
> you can resize later) rather than a 2-D array with rows and columns
> (which is what I deal with). But I agree that you can have an option
> to say treat '\n' or '\r' as a delimiter, but I think it should be
> turned off by default.
>
>> Travis Oliphant wrote:
>>> +1 (ignoring new-lines transparently is a nice feature). You can
>>> also use sscanf with weave to read most files.
>>
>> right -- but that requires weave. In fact, MATLAB has a fscanf
>> function that allows you to pass in a C format string, and it
>> vectorizes it to use the same one over and over again until it's
>> done. It's actually quite powerful and flexible. I once started with
>> that in mind, but didn't have the C chops to do it. I ended up with a
>> tool that only did doubles (come to think of it, MATLAB only does
>> doubles, anyway...)
>>
>> I may some day write a whole new C (or, more likely, Cython) function
>> that does something like that, but for now, I'm just trying to get
>> fromfile to be useful for me.
>>
>>> +1 (much preferable to insert NaN or other user value than raise
>>> ValueError in my opinion)
>>
>> But raise an error for integer types?
>>
>> I guess this is still up in the air -- no consensus yet.
>
> You should have a corresponding value for ints because raising an
> exception would be inconsistent with allowing floats to have a value.

No, I think different nan/missing-value handling between integers and
floats is a natural distinction. There is no default nan code for
integers, but nan (and inf) are valid floating point numbers (even if
nan is not a number). And the default treatment of nans in numpy is
getting pretty good (e.g. I like the new (nan)sort).

> If you must keep the
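Josef's float-versus-int distinction can be seen directly; the silent NaN-to-integer cast is the behavior the thread wants to avoid (a small demonstration, not part of any proposed API):

```python
import numpy as np

x = np.array([3.0, 4.0, np.nan, 5.0])

# With a float dtype, a missing value encoded as NaN stays detectable:
print(int(np.isnan(x).sum()))

# Casting to an integer dtype silently replaces NaN with a meaningless
# integer (the exact value is platform-dependent) -- exactly the
# hard-to-debug behavior being objected to.
y = x.astype(np.int64)
print(y.dtype)
```

Once the data is integer there is no way to tell a former NaN from a legitimate value, which is why the thread converges on raising an exception for integer dtypes unless the user supplies a fill value.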
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
On Thu, Jan 7, 2010 at 3:45 PM, Christopher Barker wrote:
> Bruce Southey wrote:
>> Using the numpy NaN or similar (noting R's approach to missing
>> values, which in turn allows it to have the above functionality) is
>> just a very bad idea for missing values, because you always have to
>> check which NaN is a missing value and which was due to some
>> numerical calculation.
>
> well, this is specific to reading files, so you know where it came
> from.

You can only know where it came from when you compare the original
array to the transformed one. Also a user has to check for missing
values or numpy has to warn a user that missing values are present
immediately after reading the data so the appropriate action can be
taken (like using functions that handle missing values appropriately).
That is my second problem with using codes (NaN, -9 etc) for missing
values.

> And the principle of fromfile() is that it is fast and simple; if you
> want masked arrays, use slower, but more full-featured methods.

So in that case it should fail with missing data.

> However, in this case:
>
> In [9]: np.fromstring("3, 4, NaN, 5", sep=",")
> Out[9]: array([ 3., 4., NaN, 5.])
>
> an actual NaN is read from the file, rather than a missing value.
> Perhaps the user does want the distinction, so maybe it should really
> only fill it in if the user asks for it, by specifying
> "missing_value=np.nan" or something.

Yes, that is my first problem with using predefined codes for missing
values: you do not always know what is going to occur in the data.

>> From what I can see, you expect that fromfile() should only split at
>> the supplied delimiters, optionally(?) strip any whitespace
>
> whitespace stripping is not optional.
>
>> Your output from this string '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12'
>> actually assumes multiple delimiters because there is no comma
>> between 4 and 5 and 8 and 9.
>
> Yes, that's the point. I thought about allowing arbitrary multiple
> delimiters, but I think '\n' is a special case -- for instance, a
> comma at the end of some numbers might mean missing data, but a '\n'
> would not.
>
> And I couldn't really think of a useful use-case for arbitrary
> multiple delimiters.
>
>> In Josef's last case how many missing values should there be?
>>
>> extra newlines at end of file
>>
>> str = '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n\n\n'
>
> none -- exactly why I think \n is a special case.

What about '\r' and '\n\r'?

> What about extra newlines in the middle of the file:
>
> str = '1, 2, 3, 4\n\n5, 6, 7, 8\n9, 10, 11, 12\n'
>
> I think they should be ignored, but I hope I'm not making something
> that is too specific to my personal needs.

Not really, it is more that I am being somewhat difficult to ensure I
understand what you actually need.

My problem with this is that you are reading one huge 1-D array (that
you can resize later) rather than a 2-D array with rows and columns
(which is what I deal with). But I agree that you can have an option to
say treat '\n' or '\r' as a delimiter, but I think it should be turned
off by default.

> Travis Oliphant wrote:
>> +1 (ignoring new-lines transparently is a nice feature). You can also
>> use sscanf with weave to read most files.
>
> right -- but that requires weave. In fact, MATLAB has a fscanf
> function that allows you to pass in a C format string, and it
> vectorizes it to use the same one over and over again until it's done.
> It's actually quite powerful and flexible. I once started with that in
> mind, but didn't have the C chops to do it. I ended up with a tool
> that only did doubles (come to think of it, MATLAB only does doubles,
> anyway...)
>
> I may some day write a whole new C (or, more likely, Cython) function
> that does something like that, but for now, I'm just trying to get
> fromfile to be useful for me.
>
>> +1 (much preferable to insert NaN or other user value than raise
>> ValueError in my opinion)
>
> But raise an error for integer types?
>
> I guess this is still up in the air -- no consensus yet.

You should have a corresponding value for ints because raising an
exception would be inconsistent with allowing floats to have a value.
If you must keep the user-defined dtype then, as Josef suggests, just
use some code, be it -999 or the most negative number supported by the
OS for the defined dtype, or just convert the ints into floats if the
user does not define a missing value code. It would be nice to either
return the number of missing values or display a warning indicating how
many occurred.

Bruce
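The "slower, but more full-featured" path Chris alludes to already exists in genfromtxt, which can fill empty fields with a user-chosen code like the -999 Bruce suggests (parameter names as in numpy's genfromtxt; behavior sketched on inline data):

```python
import io
import numpy as np

data = io.StringIO('1, 2, 3, 4\n5, , 7, 8\n9, 10, , 12')

# Empty fields are treated as missing and replaced by filling_values;
# passing usemask=True would instead return a masked array.
arr = np.genfromtxt(data, delimiter=',', filling_values=-999)
print(arr)
```

This also gives Bruce a way to count what was filled afterwards, e.g. `(arr == -999).sum()`, at the cost of the speed that the fromfile() discussion is about.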
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
josef.p...@gmail.com wrote:
>>> +1 (much preferable to insert NaN or other user value than raise
>>> ValueError in my opinion)
>>
>> But raise an error for integer types?
>>
>> I guess this is still up in the air -- no consensus yet.
>
> raise an exception, I hate the silent cast of nan to integer zero,

me too -- I'm sorry, I wasn't clear -- I'm not going to write any code
that returns a zero for a missing value. These are the options I'd
consider:

1) Have the user specify what to use for missing values; otherwise,
raise an exception.

2) Insert a NaN for floating point types, and raise an exception for
integer types.

What's not clear is whether (2) is a good idea. As for (1), I just
don't know if I'm going to get around to writing the code, and maybe
more kwargs is a bad idea -- though maybe not.

Enough talk: I've got ugly C code to wade through...

-Chris
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
On Thu, Jan 7, 2010 at 4:45 PM, Christopher Barker wrote:
> Bruce Southey wrote:
>> Using the numpy NaN or similar (noting R's approach to missing
>> values, which in turn allows it to have the above functionality) is
>> just a very bad idea for missing values, because you always have to
>> check which NaN is a missing value and which was due to some
>> numerical calculation.
>
> well, this is specific to reading files, so you know where it came
> from. And the principle of fromfile() is that it is fast and simple;
> if you want masked arrays, use slower, but more full-featured methods.
>
> However, in this case:
>
> In [9]: np.fromstring("3, 4, NaN, 5", sep=",")
> Out[9]: array([ 3., 4., NaN, 5.])
>
> an actual NaN is read from the file, rather than a missing value.
> Perhaps the user does want the distinction, so maybe it should really
> only fill it in if the user asks for it, by specifying
> "missing_value=np.nan" or something.
>
>> From what I can see, you expect that fromfile() should only split at
>> the supplied delimiters, optionally(?) strip any whitespace
>
> whitespace stripping is not optional.
>
>> Your output from this string '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12'
>> actually assumes multiple delimiters because there is no comma
>> between 4 and 5 and 8 and 9.
>
> Yes, that's the point. I thought about allowing arbitrary multiple
> delimiters, but I think '\n' is a special case -- for instance, a
> comma at the end of some numbers might mean missing data, but a '\n'
> would not.
>
> And I couldn't really think of a useful use-case for arbitrary
> multiple delimiters.
>
>> In Josef's last case how many missing values should there be?
>>
>> extra newlines at end of file
>>
>> str = '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n\n\n'
>
> none -- exactly why I think \n is a special case.
>
> What about extra newlines in the middle of the file:
>
>> str = '1, 2, 3, 4\n\n5, 6, 7, 8\n9, 10, 11, 12\n'
>
> I think they should be ignored, but I hope I'm not making something
> that is too specific to my personal needs.
>
> Travis Oliphant wrote:
>> +1 (ignoring new-lines transparently is a nice feature). You can also
>> use sscanf with weave to read most files.
>
> right -- but that requires weave. In fact, MATLAB has a fscanf
> function that allows you to pass in a C format string, and it
> vectorizes it to use the same one over and over again until it's done.
> It's actually quite powerful and flexible. I once started with that in
> mind, but didn't have the C chops to do it. I ended up with a tool
> that only did doubles (come to think of it, MATLAB only does doubles,
> anyway...)
>
> I may some day write a whole new C (or, more likely, Cython) function
> that does something like that, but for now, I'm just trying to get
> fromfile to be useful for me.
>
>> +1 (much preferable to insert NaN or other user value than raise
>> ValueError in my opinion)
>
> But raise an error for integer types?
>
> I guess this is still up in the air -- no consensus yet.

raise an exception, I hate the silent cast of nan to integer zero, too
much debugging and useless if there are real zeros. (Or use some -999
kind of thing if user-defined nan codes are allowed, but I just work
with float if I expect nans/missing values.)

Josef

> Thanks,
>
> -Chris
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
Bruce Southey wrote:
> Using the numpy NaN or similar (noting R's approach to missing values,
> which in turn allows it to have the above functionality) is just a
> very bad idea for missing values, because you always have to check
> which NaN is a missing value and which was due to some numerical
> calculation.

well, this is specific to reading files, so you know where it came from.
And the principle of fromfile() is that it is fast and simple; if you
want masked arrays, use slower, but more full-featured methods.

However, in this case:

In [9]: np.fromstring("3, 4, NaN, 5", sep=",")
Out[9]: array([ 3., 4., NaN, 5.])

an actual NaN is read from the file, rather than a missing value.
Perhaps the user does want the distinction, so maybe it should really
only fill it in if the user asks for it, by specifying
"missing_value=np.nan" or something.

> From what I can see, you expect that fromfile() should only split at
> the supplied delimiters, optionally(?) strip any whitespace

whitespace stripping is not optional.

> Your output from this string '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12'
> actually assumes multiple delimiters because there is no comma between
> 4 and 5 and 8 and 9.

Yes, that's the point. I thought about allowing arbitrary multiple
delimiters, but I think '\n' is a special case -- for instance, a comma
at the end of some numbers might mean missing data, but a '\n' would
not.

And I couldn't really think of a useful use-case for arbitrary multiple
delimiters.

> In Josef's last case how many missing values should there be?
>
> extra newlines at end of file
>
> str = '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n\n\n'

none -- exactly why I think \n is a special case.

What about:

> extra newlines in the middle of the file
>
> str = '1, 2, 3, 4\n\n5, 6, 7, 8\n9, 10, 11, 12\n'

I think they should be ignored, but I hope I'm not making something that
is too specific to my personal needs.

Travis Oliphant wrote:
> +1 (ignoring new-lines transparently is a nice feature). You can also
> use sscanf with weave to read most files.

right -- but that requires weave. In fact, MATLAB has a fscanf function
that allows you to pass in a C format string, and it vectorizes it to
use the same one over and over again until it's done. It's actually
quite powerful and flexible. I once started with that in mind, but
didn't have the C chops to do it. I ended up with a tool that only did
doubles (come to think of it, MATLAB only does doubles, anyway...)

I may some day write a whole new C (or, more likely, Cython) function
that does something like that, but for now, I'm just trying to get
fromfile to be useful for me.

> +1 (much preferable to insert NaN or other user value than raise
> ValueError in my opinion)

But raise an error for integer types?

I guess this is still up in the air -- no consensus yet.

Thanks,

-Chris
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
On Jan 7, 2010, at 2:32 PM, josef.p...@gmail.com wrote: > On Thu, Jan 7, 2010 at 3:08 PM, Christopher Barker > wrote: >> Pauli Virtanen wrote: >>> ma, 2010-01-04 kello 17:05 -0800, Christopher Barker kirjoitti: >>> it also does odd things with spaces embedded in the separator: ", $ #" matches all of: ",$#" ", $#" ",$ #" >> >>> That's a documented feature: >> >> Fair enough. >> >> OK, I've written a patch that allows newlines to be interpreted as >> separators in addition to whatever is specified in sep. >> >> In the process of testing, I found again these issues, which are >> still >> marked as "needs decision". >> >> http://projects.scipy.org/numpy/ticket/883 >> >> In short: what to do with missing values? >> >> I'd like to address this bug, but I need a decision to do so. >> >> >> My proposal: >> >> Raise an ValueError with missing values. >> >> >> Justification: >> >> No function should EVER return data that is not there. Period. It is >> simply asking for hard to find bugs. Therefore: >> >> fromstring("3, 4,,5", sep=",") >> >> Should never, ever, return: >> >> array([ 3., 4., 0., 5.]) >> >> Which is what it does now. bad. bad. bad. >> >> >> >> >> Alternatives: >> >> A) Raising a ValueError is the easiest way to get "proper" >> behavior. >> Folks can use a more sophisticated file reader if they want missing >> values handled. I'm willing to contribute this patch. >> >> B) If the dtype is a floating point type, NaN could fill in the >> missing values -- a fine idea, but you can't use it for integers, and >> zero is a really bad replacement! >> >> C) The user could specify what they want filled in for missing >> values. This is a fine idea, though I'm not sure I want to take the >> time >> to impliment it. 
>> >> Oh, and this is a bug too, with probably the same solution: >> >> In [20]: np.fromstring("hjba", sep=',') >> Out[20]: array([ 0.]) >> >> In [26]: np.fromstring("34gytf39", sep=',') >> Out[26]: array([ 34.]) >> >> >> One more unresolved question: >> >> what should: >> >> np.fromstring("3, 4, 5,", sep=",") >> >> return? >> >> it currently returns: >> >> array([ 3., 4., 5.]) >> >> which seems a bit inconsitent with missing value handling. I also >> found >> a bug: >> >> In [6]: np.fromstring("3, 4, 5 , ", sep=",") >> Out[6]: array([ 3., 4., 5., 0.]) >> >> so if there is some extra whitespace in there, it does return a >> missing >> value. With my proposal, that wouldn't happen, but you might get an >> exception. I think you should, but it'll be easier to implement my >> "allow newlines" code if not. >> >> >> so, should I do (A) ? >> >> >> Another question: >> >> I've got a patch mostly working (except for the above issues) that >> will >> allow fromfile/string to read multiline non-whitespace separated >> data in >> one shot: >> >> >> In [15]: str >> Out[15]: '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12' >> >> In [16]: np.fromstring(str, sep=',', allow_newlines=True) >> Out[16]: >> array([ 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., >> 11., >> 12.]) >> >> >> I think this is a very helpful enhancement, and, as it is a new >> kwarg, >> backward compatible: >> >> 1) Might it be accepted for inclusion? >> >> 2) Is the name for the flag OK: "allow_newlines"? It's pretty >> explicit, >> but also long -- I used it for the flag name in the C code, too. >> >> 3) What C datatype should I use for a boolean flag? I used a char, >> but I >> don't know what the numpy standard is. 
>> >> -Chris >> >> > > I don't know much about this, just a few more test cases > > comma and newline > str = '1, 2, 3, 4,\n5, 6, 7, 8,\n9, 10, 11, 12' > > extra comma at end of file > str = '1, 2, 3, 4,\n5, 6, 7, 8,\n9, 10, 11, 12,' > > extra newlines at end of file > str = '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n\n\n' > > It would be nice if these cases would go through without missing > values or exception, but I don't often have files that are clean > enough for fromfile(). +1 (ignoring new-lines transparently is a nice feature). You can also use sscanf with weave to read most files. > > I'm in favor of nan for missing values with floating point numbers. It > would make it easy to read correctly formatted csv files, even if the > data is not complete. +1 (much preferable to insert NaN or other user value than raise ValueError in my opinion) -Travis
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
On Thu, Jan 7, 2010 at 2:32 PM, wrote: > On Thu, Jan 7, 2010 at 3:08 PM, Christopher Barker > wrote: >> Pauli Virtanen wrote: >>> ma, 2010-01-04 kello 17:05 -0800, Christopher Barker kirjoitti: >>> it also does odd things with spaces embedded in the separator: ", $ #" matches all of: ",$#" ", $#" ",$ #" >> >>> That's a documented feature: >> >> Fair enough. >> >> OK, I've written a patch that allows newlines to be interpreted as >> separators in addition to whatever is specified in sep. >> >> In the process of testing, I found again these issues, which are still >> marked as "needs decision". >> >> http://projects.scipy.org/numpy/ticket/883 >> >> In short: what to do with missing values? >> >> I'd like to address this bug, but I need a decision to do so. >> >> >> My proposal: >> >> Raise an ValueError with missing values. >> >> >> Justification: >> >> No function should EVER return data that is not there. Period. It is >> simply asking for hard to find bugs. Therefore: >> >> fromstring("3, 4,,5", sep=",") >> >> Should never, ever, return: >> >> array([ 3., 4., 0., 5.]) >> >> Which is what it does now. bad. bad. bad. >> >> >> >> >> Alternatives: >> >> A) Raising a ValueError is the easiest way to get "proper" behavior. >> Folks can use a more sophisticated file reader if they want missing >> values handled. I'm willing to contribute this patch. >> >> B) If the dtype is a floating point type, NaN could fill in the >> missing values -- a fine idea, but you can't use it for integers, and >> zero is a really bad replacement! >> >> C) The user could specify what they want filled in for missing >> values. This is a fine idea, though I'm not sure I want to take the time >> to impliment it. 
>> >> Oh, and this is a bug too, with probably the same solution: >> >> In [20]: np.fromstring("hjba", sep=',') >> Out[20]: array([ 0.]) >> >> In [26]: np.fromstring("34gytf39", sep=',') >> Out[26]: array([ 34.]) >> >> >> One more unresolved question: >> >> what should: >> >> np.fromstring("3, 4, 5,", sep=",") >> >> return? >> >> it currently returns: >> >> array([ 3., 4., 5.]) >> >> which seems a bit inconsitent with missing value handling. I also found >> a bug: >> >> In [6]: np.fromstring("3, 4, 5 , ", sep=",") >> Out[6]: array([ 3., 4., 5., 0.]) >> >> so if there is some extra whitespace in there, it does return a missing >> value. With my proposal, that wouldn't happen, but you might get an >> exception. I think you should, but it'll be easier to implement my >> "allow newlines" code if not. >> >> >> so, should I do (A) ? >> >> >> Another question: >> >> I've got a patch mostly working (except for the above issues) that will >> allow fromfile/string to read multiline non-whitespace separated data in >> one shot: >> >> >> In [15]: str >> Out[15]: '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12' >> >> In [16]: np.fromstring(str, sep=',', allow_newlines=True) >> Out[16]: >> array([ 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11., >> 12.]) >> >> >> I think this is a very helpful enhancement, and, as it is a new kwarg, >> backward compatible: >> >> 1) Might it be accepted for inclusion? >> >> 2) Is the name for the flag OK: "allow_newlines"? It's pretty explicit, >> but also long -- I used it for the flag name in the C code, too. >> >> 3) What C datatype should I use for a boolean flag? I used a char, but I >> don't know what the numpy standard is. 
>> >> -Chris >> >> > > I don't know much about this, just a few more test cases > > comma and newline > str = '1, 2, 3, 4,\n5, 6, 7, 8,\n9, 10, 11, 12' > > extra comma at end of file > str = '1, 2, 3, 4,\n5, 6, 7, 8,\n9, 10, 11, 12,' > > extra newlines at end of file > str = '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n\n\n' > > It would be nice if these cases would go through without missing > values or exception, but I don't often have files that are clean > enough for fromfile(). > > I'm in favor of nan for missing values with floating point numbers. It > would make it easy to read correctly formatted csv files, even if the > data is not complete. > Using the numpy NaN or similar (noting R's approach to missing values which in turn allows it to have the above functionality) is just a very bad idea for missing values because you always have to check which NaN is a missing value and which was due to some numerical calculation. It is a very bad idea because we have masked arrays that nicely, but slowly, handle this situation. From what I can see, you expect that fromfile() should only split at the supplied delimiters, optionally(?) strip any whitespace, and force a specific dtype. I would agree that the failure of any one of these should create an exception by default rather than making the best guess. So 'missing data' would potentially fail when forcing the specified dtype. Thus, you should either create an exception for invalid data (with appropriate location) or use masked arrays. Your output from this string '1, 2
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
On Thu, Jan 7, 2010 at 3:08 PM, Christopher Barker wrote: > Pauli Virtanen wrote: >> ma, 2010-01-04 kello 17:05 -0800, Christopher Barker kirjoitti: >> it also does odd things with spaces >>> embedded in the separator: >>> >>> ", $ #" matches all of: ",$#" ", $#" ",$ #" > >> That's a documented feature: > > Fair enough. > > OK, I've written a patch that allows newlines to be interpreted as > separators in addition to whatever is specified in sep. > > In the process of testing, I found again these issues, which are still > marked as "needs decision". > > http://projects.scipy.org/numpy/ticket/883 > > In short: what to do with missing values? > > I'd like to address this bug, but I need a decision to do so. > > > My proposal: > > Raise an ValueError with missing values. > > > Justification: > > No function should EVER return data that is not there. Period. It is > simply asking for hard to find bugs. Therefore: > > fromstring("3, 4,,5", sep=",") > > Should never, ever, return: > > array([ 3., 4., 0., 5.]) > > Which is what it does now. bad. bad. bad. > > > > > Alternatives: > > A) Raising a ValueError is the easiest way to get "proper" behavior. > Folks can use a more sophisticated file reader if they want missing > values handled. I'm willing to contribute this patch. > > B) If the dtype is a floating point type, NaN could fill in the > missing values -- a fine idea, but you can't use it for integers, and > zero is a really bad replacement! > > C) The user could specify what they want filled in for missing > values. This is a fine idea, though I'm not sure I want to take the time > to impliment it. > > Oh, and this is a bug too, with probably the same solution: > > In [20]: np.fromstring("hjba", sep=',') > Out[20]: array([ 0.]) > > In [26]: np.fromstring("34gytf39", sep=',') > Out[26]: array([ 34.]) > > > One more unresolved question: > > what should: > > np.fromstring("3, 4, 5,", sep=",") > > return? 
> > it currently returns: > > array([ 3., 4., 5.]) > > which seems a bit inconsitent with missing value handling. I also found > a bug: > > In [6]: np.fromstring("3, 4, 5 , ", sep=",") > Out[6]: array([ 3., 4., 5., 0.]) > > so if there is some extra whitespace in there, it does return a missing > value. With my proposal, that wouldn't happen, but you might get an > exception. I think you should, but it'll be easier to implement my > "allow newlines" code if not. > > > so, should I do (A) ? > > > Another question: > > I've got a patch mostly working (except for the above issues) that will > allow fromfile/string to read multiline non-whitespace separated data in > one shot: > > > In [15]: str > Out[15]: '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12' > > In [16]: np.fromstring(str, sep=',', allow_newlines=True) > Out[16]: > array([ 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11., > 12.]) > > > I think this is a very helpful enhancement, and, as it is a new kwarg, > backward compatible: > > 1) Might it be accepted for inclusion? > > 2) Is the name for the flag OK: "allow_newlines"? It's pretty explicit, > but also long -- I used it for the flag name in the C code, too. > > 3) What C datatype should I use for a boolean flag? I used a char, but I > don't know what the numpy standard is. > > > -Chris > > I don't know much about this, just a few more test cases comma and newline str = '1, 2, 3, 4,\n5, 6, 7, 8,\n9, 10, 11, 12' extra comma at end of file str = '1, 2, 3, 4,\n5, 6, 7, 8,\n9, 10, 11, 12,' extra newlines at end of file str = '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n\n\n' It would be nice if these cases would go through without missing values or exception, but I don't often have files that are clean enough for fromfile(). I'm in favor of nan for missing values with floating point numbers. It would make it easy to read correctly formatted csv files, even if the data is not complete. Josef > > > > > > > > > > -- > Christopher Barker, Ph.D. 
> Oceanographer > > Emergency Response Division > NOAA/NOS/OR&R (206) 526-6959 voice > 7600 Sand Point Way NE (206) 526-6329 fax > Seattle, WA 98115 (206) 526-6317 main reception > > chris.bar...@noaa.gov > ___ > NumPy-Discussion mailing list > NumPy-Discussion@scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
Pauli Virtanen wrote: > ma, 2010-01-04 kello 17:05 -0800, Christopher Barker kirjoitti: > it also does odd things with spaces >> embedded in the separator: >> >> ", $ #" matches all of: ",$#" ", $#" ",$ #" > That's a documented feature: Fair enough. OK, I've written a patch that allows newlines to be interpreted as separators in addition to whatever is specified in sep. In the process of testing, I found again these issues, which are still marked as "needs decision". http://projects.scipy.org/numpy/ticket/883 In short: what to do with missing values? I'd like to address this bug, but I need a decision to do so. My proposal: Raise a ValueError on missing values. Justification: No function should EVER return data that is not there. Period. It is simply asking for hard-to-find bugs. Therefore: fromstring("3, 4,,5", sep=",") should never, ever, return: array([ 3., 4., 0., 5.]) which is what it does now. Bad. Bad. Bad. Alternatives: A) Raising a ValueError is the easiest way to get "proper" behavior. Folks can use a more sophisticated file reader if they want missing values handled. I'm willing to contribute this patch. B) If the dtype is a floating point type, NaN could fill in the missing values -- a fine idea, but you can't use it for integers, and zero is a really bad replacement! C) The user could specify what they want filled in for missing values. This is a fine idea, though I'm not sure I want to take the time to implement it. Oh, and this is a bug too, with probably the same solution: In [20]: np.fromstring("hjba", sep=',') Out[20]: array([ 0.]) In [26]: np.fromstring("34gytf39", sep=',') Out[26]: array([ 34.]) One more unresolved question: what should: np.fromstring("3, 4, 5,", sep=",") return? It currently returns: array([ 3., 4., 5.]) which seems a bit inconsistent with missing value handling. 
I also found a bug: In [6]: np.fromstring("3, 4, 5 , ", sep=",") Out[6]: array([ 3., 4., 5., 0.]) So if there is some extra whitespace in there, it does return a missing value. With my proposal, that wouldn't happen, but you might get an exception. I think you should, but it'll be easier to implement my "allow newlines" code if not. So, should I do (A)? Another question: I've got a patch mostly working (except for the above issues) that will allow fromfile/fromstring to read multiline, non-whitespace-separated data in one shot: In [15]: str Out[15]: '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12' In [16]: np.fromstring(str, sep=',', allow_newlines=True) Out[16]: array([ 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11., 12.]) I think this is a very helpful enhancement, and, as it is a new kwarg, backward compatible: 1) Might it be accepted for inclusion? 2) Is the name for the flag OK: "allow_newlines"? It's pretty explicit, but also long -- I used it for the flag name in the C code, too. 3) What C datatype should I use for a boolean flag? I used a char, but I don't know what the numpy standard is. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov
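[Editor's note: the proposed semantics can be pinned down with a pure-Python reference sketch -- not the actual C patch -- combining proposal (A) with the newline handling discussed above: runs of newlines act as a single extra separator (so leading/trailing blank lines are ignored), while an empty comma-delimited field raises instead of silently becoming 0:]

```python
import re
import numpy as np

def fromstring_newlines(s, sep=","):
    """Reference sketch (pure Python, not the C patch) of the proposed
    allow_newlines=True behavior: runs of newlines act as one extra
    separator, blank lines are ignored, and an empty delimited field
    raises instead of silently becoming 0."""
    tokens = re.sub(r"\n+", sep, s.strip("\n")).split(sep)
    values = []
    for tok in tokens:
        tok = tok.strip()
        if not tok:
            raise ValueError("missing value in input")
        values.append(float(tok))
    return np.array(values)

print(fromstring_newlines("1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12"))
```

Under this sketch, Josef's "extra newlines at end of file" case parses cleanly, while his "extra comma at end of file" case raises, matching the positions taken in the thread.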
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
ma, 2010-01-04 kello 17:05 -0800, Christopher Barker kirjoitti: [clip] > I also notice that it supports separators of arbitrary length, which I > wonder how useful that is. But it also does odd things with spaces > embedded in the separator: > > ", $ #" matches all of: ",$#" ", $#" ",$ #" > > Is it worth trying to fix that? That's a documented feature: sep : str Separator between items if file is a text file. Empty ("") separator means the file should be treated as binary. Spaces (" ") in the separator match zero or more whitespace characters. A separator consisting only of spaces must match at least one whitespace. -- Pauli Virtanen
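[Editor's note: the documented rule can be restated as a small separator-to-regex translation. This is a paraphrase of the docstring, not the actual C implementation:]

```python
import re

def sep_to_regex(sep):
    """Translate a fromfile/fromstring sep string into a regex, per the
    documented rule: each space matches zero or more whitespace
    characters; an all-space sep must match at least one."""
    if sep.strip() == "" and sep != "":
        return r"\s+"  # all-space separator: at least one whitespace
    return "".join(r"\s*" if c.isspace() else re.escape(c) for c in sep)

# The three variants Chris lists all match sep=", $ #":
pat = re.compile(sep_to_regex(", $ #"))
for text in (",$#", ", $#", ",$ #"):
    assert pat.fullmatch(text)
```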
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
On Jan 5, 2010, at 12:32 PM, Christopher Barker wrote: > josef.p...@gmail.com wrote: >> On Mon, Jan 4, 2010 at 10:39 PM, wrote: >>> I rather like the R command(s) for reading text files > >> Aren't the newly improved >> >> numpy.genfromtxt() > > ... > >> and friends intended to handle all this > > Yes, they are, and they are great, but not really all that fast. If > you've got big complicated tables of data to read, then genfromtxt is > the way to go -- it's a great tool. However, for the simple stuff, it's > not really optimized. genfromtxt is nothing but loadtxt overloaded to deal with undefined dtype and missing entries. It's doomed to be slower, and it shouldn't be used if you know your data is well-defined and well-behaved. Stick to loadtxt. > I also find I have to read a lot of text files > that aren't tables of data, but rather an odd mix of stuff, but still a > lot of reading lots of numbers from a file. Well, everything depends on what kind of stuff you have in your mix, I guess... > so fromfile() is 3.5 times as fast as loadtxt and 4.5 times as fast as > genfromtxt. That does make a difference for me -- the user waiting 4 > seconds, rather than one second to load a file matters. Remember that fromfile is C, while loadtxt and genfromtxt are Python... > I suppose another option might be to see if I can optimize the inner > scanning function of genfromtxt with Cython or C, but I'm not sure > that's possible, as it's really very flexible, and re-writing all of > that without Python would be really painful! Well, there's room for some optimization for particular cases (dtype!=None), but the generic case will be tricky...
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
josef.p...@gmail.com wrote: > On Mon, Jan 4, 2010 at 10:39 PM, wrote: >> I rather like the R command(s) for reading text files > Aren't the newly improved > > numpy.genfromtxt() ... > and friends intended to handle all this Yes, they are, and they are great, but not really all that fast. If you've got big complicated tables of data to read, then genfromtxt is the way to go -- it's a great tool. However, for the simple stuff, it's not really optimized. I also find I have to read a lot of text files that aren't tables of data, but rather an odd mix of stuff, but still a lot of reading lots of numbers from a file. As far as I can tell, genfromtxt and loadtxt can only load the entire file as a table (a very common situation, of course). Paul Ivanov wrote: > Just a potshot, but have you tried np.loadtxt? > > I find it pretty fast. I guess I should have posted timings in the first place: In [19]: timeit timing.time_genfromtxt() 10 loops, best of 3: 216 ms per loop In [20]: timeit timing.time_loadtxt() 10 loops, best of 3: 166 ms per loop In [21]: timeit timing.time_fromfile() 10 loops, best of 3: 47.1 ms per loop (40,000 doubles from a space-delimited text file) So fromfile() is 3.5 times as fast as loadtxt and 4.5 times as fast as genfromtxt. That does make a difference for me -- the user waiting four seconds rather than one second to load a file matters. I suppose another option might be to see if I can optimize the inner scanning function of genfromtxt with Cython or C, but I'm not sure that's possible, as it's really very flexible, and re-writing all of that without Python would be really painful! -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov
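[Editor's note: the timing.py module behind those numbers wasn't posted, so this is a hypothetical re-creation of the harness on a smaller array. fromfile needs a real file object rather than a StringIO, so only the two Python-level readers are timed here; absolute numbers will vary by machine:]

```python
import io
import timeit
import numpy as np

# Space-delimited doubles, like the benchmark data described above.
rng = np.random.default_rng(0)
text = "\n".join(" ".join("%.6f" % x for x in row)
                 for row in rng.random((1000, 10)))

def time_loadtxt():
    return np.loadtxt(io.StringIO(text))

def time_genfromtxt():
    return np.genfromtxt(io.StringIO(text))

for fn in (time_genfromtxt, time_loadtxt):
    print("%-16s %.3f s for 3 runs" % (fn.__name__,
                                       timeit.timeit(fn, number=3)))
```

loadtxt should come out ahead of genfromtxt, consistent with the figures quoted in the message.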
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
Christopher Barker, on 2010-01-04 17:05, wrote: > Hi folks, > > I'm taking a look once again at fromfile() for reading text files. I > often have the need to read a LOT of numbers form a text file, and it > can actually be pretty darn slow do i the normal python way: > > for line in file: > data = map(float, line.strip().split()) > > > or various other versions that are similar. It really does take longer > to read the text, split it up, convert to a number, then put that number > into a numpy array, than it does to simply read it straight into the array. > > However, as it stands, fromfile() turn out to be next to useless for > anything but whitespace separated text. Full set of ideas here: > > http://projects.scipy.org/numpy/ticket/909 > > However, for the moment, I'm digging into the code to address a > particular problem -- reading files like this: > > 123, 65.6, 789 > 23, 3.2, 34 > ... > > That is comma (or whatever) separated text -- pretty common stuff. > > The problem with the current code is that you can't read more than one > line at time with fromfile: > > a = np.fromfile(infile, sep=",") > > will read until it doesn't find a comma, and thus only one line, as > there is no comma after each line. As this is a really typical case, I > think it should be supported. Just a potshot, but have you tried np.loadtxt? I find it pretty fast. > > Here is the question: > > The work of finding the separator is done in: > > multiarray/ctors.c: fromfile_skip_separator() > > It looks like it wouldn't be too hard to add some code in there to look > for a newline, and consider that a valid separator. However, that would > break backward compatibility. So maybe a flag could be passed in, saying > you wanted to support newlines. The problem is that flag would have to > get passed all the way through to this function (and also for fromstring). > > I also notice that it supports separators of arbitrary length, which I > wonder how useful that is. 
But it also does odd things with spaces > embedded in the separator: > > ", $ #" matches all of: ",$#" ", $#" ",$ #" > > Is it worth trying to fix that? > > > In the longer term, it would be really nice to support comments as well, > tough that would require more of a re-factoring of the code, I think > (though maybe not -- I suppose a call to fromfile_skip_separator() could > look for a comment character, then if it found one, skip to where the > comment ends -- hmmm. > > thanks for any feedback, > > -Chris > > > > > > > ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
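[Editor's note: the quoted message's two approaches, made concrete. The temp file is illustrative. With a whitespace sep, fromfile already reads across newlines in one shot; with a comma sep it stops at the first newline, which is the whole complaint of the thread:]

```python
import os
import tempfile
import numpy as np

rows = [[1.5, 2.0, 3.25], [4.0, 5.5, 6.0]]
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    for row in rows:
        f.write(" ".join(str(v) for v in row) + "\n")
    path = f.name

# The "normal python way" loop from the original post:
data_loop = []
with open(path) as infile:
    for line in infile:
        data_loop.extend(float(v) for v in line.split())

# One-shot C-level read; works here because the separator is whitespace:
data_fast = np.fromfile(path, sep=" ")

os.remove(path)
print(data_loop == list(data_fast))
```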
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
On Mon, Jan 4, 2010 at 10:39 PM, wrote: >>Hi folks, >> >>I'm taking a look once again at fromfile() for reading text files. I >>often have the need to read a LOT of numbers form a text file, and it >>can actually be pretty darn slow do i the normal python way: >> >>for line in file: >> data = map(float, line.strip().split()) >> >> >>or various other versions that are similar. It really does take longer >>to read the text, split it up, convert to a number, then put that number >>into a numpy array, than it does to simply read it straight into the array. >> >>However, as it stands, fromfile() turn out to be next to useless for >>anything but whitespace separated text. Full set of ideas here: >> >>http://projects.scipy.org/numpy/ticket/909 >> >>However, for the moment, I'm digging into the code to address a >>particular problem -- reading files like this: >> >>123, 65.6, 789 >>23, 3.2, 34 >>... >> >>That is comma (or whatever) separated text -- pretty common stuff. >> >>The problem with the current code is that you can't read more than one >>line at time with fromfile: >> >>a = np.fromfile(infile, sep=",") >> >>will read until it doesn't find a comma, and thus only one line, as >>there is no comma after each line. As this is a really typical case, I >>think it should be supported. >> >>Here is the question: >> >>The work of finding the separator is done in: >> >>multiarray/ctors.c: fromfile_skip_separator() >> >>It looks like it wouldn't be too hard to add some code in there to look >>for a newline, and consider that a valid separator. However, that would >>break backward compatibility. So maybe a flag could be passed in, saying >>you wanted to support newlines. The problem is that flag would have to >>get passed all the way through to this function (and also for fromstring). >> >>I also notice that it supports separators of arbitrary length, which I >>wonder how useful that is. 
But it also does odd things with spaces >>embedded in the separator: >> >>", $ #" matches all of: ",$#" ", $#" ",$ #" >> >>Is it worth trying to fix that? >> >> >>In the longer term, it would be really nice to support comments as well, >>tough that would require more of a re-factoring of the code, I think >>(though maybe not -- I suppose a call to fromfile_skip_separator() could >>look for a comment character, then if it found one, skip to where the >>comment ends -- hmmm. >> >>thanks for any feedback, >> >>-Chris >> > > I agree. I've tried using it, and usually find that it doesn't quite get > there. > > I rather like the R command(s) for reading text files - except then I have to > use R which is painful after using python and numpy. Although ggplot2 is > awfully nice too ... but that is a later post. > > read.table(file, header = FALSE, sep = "", quote = "\"'", > dec = ".", row.names, col.names, > as.is = !stringsAsFactors, > na.strings = "NA", colClasses = NA, nrows = -1, > skip = 0, check.names = TRUE, fill = !blank.lines.skip, > strip.white = FALSE, blank.lines.skip = TRUE, > comment.char = "#", > allowEscapes = FALSE, flush = FALSE, > stringsAsFactors = default.stringsAsFactors(), > fileEncoding = "", encoding = "unknown") > > read.csv(file, header = TRUE, sep = ",", quote="\"", dec=".", > fill = TRUE, comment.char="", ...) > > read.csv2(file, header = TRUE, sep = ";", quote="\"", dec=",", > fill = TRUE, comment.char="", ...) > > read.delim(file, header = TRUE, sep = "\t", quote="\"", dec=".", > fill = TRUE, comment.char="", ...) > > read.delim2(file, header = TRUE, sep = "\t", quote="\"", dec=",", > fill = TRUE, comment.char="", ...) > > > There is really only read.table, the others are just aliases with different > defaults. But the flexibility is great, as you can see. 
Aren't the newly improved numpy.genfromtxt(fname, dtype=, comments='#', delimiter=None, skiprows=0, converters=None, missing='', missing_values=None, usecols=None, names=None, excludelist=None, deletechars=None, case_sensitive=True, unpack=None, usemask=False, loose=True) and friends intended to handle all this? Josef > > -- > --- > | Alan K. Jackson | To see a World in a Grain of Sand | > | a...@ajackson.org | And a Heaven in a Wild Flower, | > | www.ajackson.org | Hold Infinity in the palm of your hand | > | Houston, Texas | And Eternity in an hour. - Blake | > ---
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
>Hi folks, > >I'm taking a look once again at fromfile() for reading text files. I >often have the need to read a LOT of numbers from a text file, and it >can actually be pretty darn slow doing it the normal python way: > >for line in file: >data = map(float, line.strip().split()) > > >or various other versions that are similar. It really does take longer >to read the text, split it up, convert to a number, then put that number >into a numpy array, than it does to simply read it straight into the array. > >However, as it stands, fromfile() turns out to be next to useless for >anything but whitespace separated text. Full set of ideas here: > >http://projects.scipy.org/numpy/ticket/909 > >However, for the moment, I'm digging into the code to address a >particular problem -- reading files like this: > >123, 65.6, 789 >23, 3.2, 34 >... > >That is comma (or whatever) separated text -- pretty common stuff. > >The problem with the current code is that you can't read more than one >line at a time with fromfile: > >a = np.fromfile(infile, sep=",") > >will read until it doesn't find a comma, and thus only one line, as >there is no comma after each line. As this is a really typical case, I >think it should be supported. > >Here is the question: > >The work of finding the separator is done in: > >multiarray/ctors.c: fromfile_skip_separator() > >It looks like it wouldn't be too hard to add some code in there to look >for a newline, and consider that a valid separator. However, that would >break backward compatibility. So maybe a flag could be passed in, saying >you wanted to support newlines. The problem is that flag would have to >get passed all the way through to this function (and also for fromstring). > >I also notice that it supports separators of arbitrary length, which I >wonder how useful that is. But it also does odd things with spaces >embedded in the separator: > >", $ #" matches all of: ",$#" ", $#" ",$ #" > >Is it worth trying to fix that? 
> > >In the longer term, it would be really nice to support comments as well, >though that would require more of a re-factoring of the code, I think >(though maybe not -- I suppose a call to fromfile_skip_separator() could >look for a comment character, then if it found one, skip to where the >comment ends -- hmmm. > >thanks for any feedback, > >-Chris > I agree. I've tried using it, and usually find that it doesn't quite get there. I rather like the R command(s) for reading text files - except then I have to use R, which is painful after using python and numpy. Although ggplot2 is awfully nice too ... but that is a later post. read.table(file, header = FALSE, sep = "", quote = "\"'", dec = ".", row.names, col.names, as.is = !stringsAsFactors, na.strings = "NA", colClasses = NA, nrows = -1, skip = 0, check.names = TRUE, fill = !blank.lines.skip, strip.white = FALSE, blank.lines.skip = TRUE, comment.char = "#", allowEscapes = FALSE, flush = FALSE, stringsAsFactors = default.stringsAsFactors(), fileEncoding = "", encoding = "unknown") read.csv(file, header = TRUE, sep = ",", quote="\"", dec=".", fill = TRUE, comment.char="", ...) read.csv2(file, header = TRUE, sep = ";", quote="\"", dec=",", fill = TRUE, comment.char="", ...) read.delim(file, header = TRUE, sep = "\t", quote="\"", dec=".", fill = TRUE, comment.char="", ...) read.delim2(file, header = TRUE, sep = "\t", quote="\"", dec=",", fill = TRUE, comment.char="", ...) There is really only read.table, the others are just aliases with different defaults. But the flexibility is great, as you can see. -- --- | Alan K. Jackson | To see a World in a Grain of Sand | | a...@ajackson.org | And a Heaven in a Wild Flower, | | www.ajackson.org | Hold Infinity in the palm of your hand | | Houston, Texas | And Eternity in an hour. - Blake | ---