On Thu, Jan 7, 2010 at 3:45 PM, Christopher Barker <chris.bar...@noaa.gov> wrote:
> Bruce Southey wrote:
>>> <chris.bar...@noaa.gov> wrote:
>
>> Using the numpy NaN or similar (noting R's approach to missing values
>> which in turn allows it to have the above functionality) is just a
>> very bad idea for missing values because you always have to check that
>> which NaN is a missing value and which was due to some numerical
>> calculation.
>
> well, this is specific to reading files, so you know where it came from.
You can only know where it came from when you compare the original array
to the transformed one. Also, either the user has to check for missing
values, or numpy has to warn the user that missing values are present
immediately after reading the data, so that the appropriate action can be
taken (like using functions that handle missing values appropriately).
That is my second problem with using codes (NaN, -99999, etc.) for
missing values.

> And the principle of fromfile() is that it is fast and simple, if you
> want masked arrays, use slower, but more full-featured methods.

So in that case it should fail with missing data.

>
> However, in this case:
>
> In [9]: np.fromstring("3, 4, NaN, 5", sep=",")
> Out[9]: array([ 3., 4., NaN, 5.])
>
>
> An actual NaN is read from the file, rather than a missing value.
> Perhaps the user does want the distinction, so maybe it should really
> only fill it in if the user asks for it, by specifying
> "missing_value=np.nan" or something.

Yes, that is my first problem with using predefined codes for missing
values: you do not always know what is going to occur in the data.

>
>> From what I can see is that you expect that fromfile() should only
>> split at the supplied delimiters, optionally(?) strip any whitespace
>
> whitespace stripping is not optional.
>
>> Your output from this string '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12'
>> actually assumes multiple delimiters because there is no comma between
>> 4 and 5 and 8 and 9.
>
> Yes, that's the point. I thought about allowing arbitrary multiple
> delimiters, but I think '\n' is a special case - for instance, a comma
> at the end of some numbers might mean missing data, but a '\n' would not.
>
> And I couldn't really think of a useful use-case for arbitrary multiple
> delimiters.
>
>> In Josef's last case how many 'missing values' should there be?
>
> >> extra newlines at end of file
> >> str = '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n\n\n'
>
> none -- exactly why I think \n is a special case.
What about '\r' and '\r\n'?

>
> What about:
> >> extra newlines in the middle of the file
> >> str = '1, 2, 3, 4\n\n5, 6, 7, 8\n9, 10, 11, 12\n'
>
> I think they should be ignored, but I hope I'm not making something that
> is too specific to my personal needs.

Not really, it is more that I am being somewhat difficult to ensure I
understand what you actually need. My problem with this is that you are
reading one huge 1-D array (that you can resize later) rather than a 2-D
array with rows and columns (which is what I deal with). I agree that
there could be an option to treat '\n' or '\r' as a delimiter, but I
think it should be turned off by default.

>
> Travis Oliphant wrote:
>> +1 (ignoring new-lines transparently is a nice feature). You can also
>> use sscanf with weave to read most files.
>
> right -- but that requires weave. In fact, MATLAB has a fscanf function
> that allows you to pass in a C format string and it vectorizes it to use
> the same one over and over again until it's done. It's actually quite
> powerful and flexible. I once started with that in mind, but didn't have
> the C chops to do it. I ended up with a tool that only did doubles (come
> to think of it, MATLAB only does doubles, anyway...)
>
> I may some day write a whole new C (or, more likely, Cython) function
> that does something like that, but for now, I'm just trying to get
> fromfile to be useful for me.
>
>
>> +1 (much preferable to insert NaN or other user value than raise
>> ValueError in my opinion)
>
> But raise an error for integer types?
>
> I guess this is still up in the air -- no consensus yet.
>
> Thanks,
>
> -Chris
>

You should have a corresponding value for ints because raising an
exception would be inconsistent with allowing floats to have a value.
If you must keep the user-defined dtype then, as Josef suggests, just use
some code, be it -999 or the most negative number supported for the
defined dtype, or convert the ints into floats if the user does not
define a missing-value code. It would be nice to either return the number
of missing values or display a warning indicating how many occurred.

Bruce
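
A minimal sketch of the missing-value handling suggested above, written in plain Python on top of numpy rather than as a change to fromfile(). The function name `parse_with_missing` and its `missing_value` argument are hypothetical, chosen only for illustration; nothing here is an existing NumPy API. It assumes that an empty field between separators means "missing", that blank lines are ignored, and that ints are converted to floats when no code is given.

```python
import warnings

import numpy as np


def parse_with_missing(text, sep=",", dtype=np.int64, missing_value=None):
    """Hypothetical helper, not a NumPy function.

    Returns (array, number_of_missing_values).
    """
    dtype = np.dtype(dtype)

    # splitlines() handles '\n', '\r' and '\r\n'; blank lines are skipped,
    # so extra newlines never count as missing values.
    fields = [
        field.strip()
        for line in text.splitlines()
        if line.strip()
        for field in line.split(sep)
    ]
    n_missing = fields.count("")

    if missing_value is None:
        # No code supplied: fill with NaN, converting ints to floats if needed.
        missing_value = np.nan
        if n_missing and dtype.kind in "iu":
            dtype = np.dtype(np.float64)

    convert = int if dtype.kind in "iu" else float
    values = [missing_value if f == "" else convert(f) for f in fields]
    arr = np.array(values, dtype=dtype)

    if n_missing:
        # Report how many fields were filled so the user can act on it.
        warnings.warn(f"{n_missing} missing value(s) filled with {missing_value!r}")
    return arr, n_missing


# Integer dtype is kept because an explicit code (-999) is supplied.
arr, n = parse_with_missing("1, 2, , 4\n\n5, 6, 7, 8\n\n\n", missing_value=-999)
```

Returning the count alongside the array and also warning covers both options mentioned above; a real implementation would probably pick one. Note that a literal "NaN" in the input is still parsed as an ordinary value and is not counted as missing, which keeps the two cases distinguishable.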