Re: [Numpy-discussion] fromfile() for reading text (one more time!)
On Sat, Jan 23, 2010 at 1:53 PM, Christopher Barker wrote:
> a...@ajackson.org wrote:
>> it doesn't support the reasonable (IMHO) behavior of treating quote
>> delimited strings in the input file as a single field.
>
> I'd use the csv module for that.
>
> Which makes me wonder if it might make sense to build some of the numpy
> table-reading stuff on top of it...
>
> -Chris

csv was also my standard module for this; it handles csv dialects and
unicode (with some detour), but having automatic conversion in
genfromtxt is nicer.

>>> reader = csv.reader(open(r'C:\Josef\work-oth\testdata.csv', 'rb'),
...                     delimiter=' ')
>>> for line in reader:
...     print line
...
['Greenmantle', '2.5', '650', '16.083']
['Carnethy', '6', '2500', '48.35']
['Craig Dunain', '6', '900', '33.65']
['Ben Rha', '7.5', '800', '45.6']
['Ben Lomond', '8', '3070', '62.267']
['Goatfell', '8', '2866', '73.217']
['Bens of Jura', '16', '7500', '204.617']
['Cairnpapple', '6', '800', '36.367']
['Scolty', '5', '800', '29.75']
['Traprain', '6', '650', '39.75']
['Lairig Ghru', '28', '2100', '192.667']

Josef

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
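For reference, Josef's csv approach plus a manual conversion step looks like this in modern Python 3 (a sketch on inline data standing in for testdata.csv, which isn't available here):

```python
import csv
import io

# Inline stand-in for Josef's testdata.csv: space-delimited, with the
# first field quoted so embedded spaces survive as one field.
data = '"Greenmantle" 2.5 650 16.083\n"Bens of Jura" 16 7500 204.617\n'

reader = csv.reader(io.StringIO(data), delimiter=' ', quotechar='"')
# csv keeps each quoted field whole; convert the numeric columns by hand.
rows = [[r[0]] + [float(x) for x in r[1:]] for r in reader]
print(rows)
```

This is the "automatic conversion" piece that genfromtxt provides and plain csv does not: csv hands back strings, so the float() step has to be added by the caller.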
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
a...@ajackson.org wrote:
> it doesn't support the reasonable (IMHO) behavior of treating quote
> delimited strings in the input file as a single field.

I'd use the csv module for that.

Which makes me wonder if it might make sense to build some of the numpy
table-reading stuff on top of it...

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

NOAA/OR&R/HAZMAT    (206) 526-6959 voice
7600 Sand Point Way NE    (206) 526-6329 fax
Seattle, WA 98115    (206) 526-6317 main reception
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
>On Mon, Jan 4, 2010 at 10:39 PM, wrote:
>>>Hi folks,
>>>
>>>I'm taking a look once again at fromfile() for reading text files. I
>>>often have the need to read a LOT of numbers from a text file, and it
>>>can actually be pretty darn slow to do it the normal python way:
>>> .big snip
>>
>> I agree. I've tried using it, and usually find that it doesn't quite
>> get there.
>>
>> I rather like the R command(s) for reading text files - except then I
>> have to use R, which is painful after using python and numpy. Although
>> ggplot2 is awfully nice too ... but that is a later post.
>>
>> read.table(file, header = FALSE, sep = "", quote = "\"'",
>>            dec = ".", row.names, col.names,
>>            as.is = !stringsAsFactors,
>>            na.strings = "NA", colClasses = NA, nrows = -1,
>>            skip = 0, check.names = TRUE, fill = !blank.lines.skip,
>>            strip.white = FALSE, blank.lines.skip = TRUE,
>>            comment.char = "#",
>>            allowEscapes = FALSE, flush = FALSE,
>>            stringsAsFactors = default.stringsAsFactors(),
>>            fileEncoding = "", encoding = "unknown")
... big snip
>
>Aren't the newly improved
>
>numpy.genfromtxt(fname, dtype=, comments='#', delimiter=None,
>skiprows=0, converters=None, missing='', missing_values=None,
>usecols=None, names=None, excludelist=None, deletechars=None,
>case_sensitive=True, unpack=None, usemask=False, loose=True)
>
>and friends intended to handle all this?
>
>Josef

Reopening an old thread...

genfromtxt is a big step forward. Something I'm fiddling with is trying
to work through the book "Using R for Data Analysis and Graphics,
Introduction, Code, and Commentary" by J H Maindonald (available
online), in python. So I am trying to see what it takes in python/numpy
to work his examples and problems, sort of a learning exercise for me.

So anyway, with that introduction, here is a case that I believe
genfromtxt fails on, because it doesn't support the reasonable (IMHO)
behavior of treating quote delimited strings in the input file as a
single field.
Below is the example from the book... So we have 2 issues. The header
for the first field is quote-blank-quote, and various values for field
one have 1 to 3 blank delimited strings, but encapsulated in quotes. I'm
putting something together to read it using shlex.split, since it honors
strings protected by quote pairs. I'm not an Excel person, but I think
it might export data in a format similar to what is shown below.

" " "distance" "climb" "time"
"Greenmantle" 2.5 650 16.083
"Carnethy" 6 2500 48.35
"Craig Dunain" 6 900 33.65
"Ben Rha" 7.5 800 45.6
"Ben Lomond" 8 3070 62.267
"Goatfell" 8 2866 73.217
"Bens of Jura" 16 7500 204.617
"Cairnpapple" 6 800 36.367
"Scolty" 5 800 29.75
"Traprain" 6 650 39.75
"Lairig Ghru" 28 2100 192.667

--
-----------------------------------------------------------------
| Alan K. Jackson   | To see a World in a Grain of Sand         |
| a...@ajackson.org | And a Heaven in a Wild Flower,            |
| www.ajackson.org  | Hold Infinity in the palm of your hand    |
| Houston, Texas    | And Eternity in an hour. - Blake          |
-----------------------------------------------------------------
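The shlex.split approach Alan mentions handles both issues (the quote-blank-quote header and the multi-word quoted names); a minimal sketch on a few of the rows above:

```python
import shlex

# A few lines of the book's data, inline instead of read from a file.
lines = [
    '" " "distance" "climb" "time"',
    '"Ben Lomond" 8 3070 62.267',
    '"Bens of Jura" 16 7500 204.617',
]

# shlex.split honors quote pairs, so '"Bens of Jura"' stays one token
# and the quoted-blank header field survives as ' '.
header = shlex.split(lines[0])
rows = [shlex.split(line) for line in lines[1:]]
records = [(r[0], float(r[1]), float(r[2]), float(r[3])) for r in rows]
print(header)
print(records)
```

The same effect is available from the csv module with delimiter=' ' and the default quotechar, as discussed elsewhere in this thread; shlex is just the more general shell-style tokenizer.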
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
Pauli Virtanen wrote:
> I don't really like handling newlines specially. For instance, I could
> have data like
>
> 1, 2, 3;
> 4, 5, 6;
> 7, 8, 9;
>
> Allowing an "alternative separator" would sound better to me. The above
> data could then be read like
>
> fromfile('foo.txt', sep=' , ', sep2=' ; ')
>
> or perhaps
>
> fromfile('foo.txt', sep=[' , ', ' ; '])

I like this syntax better, but:

1) Yes, you "could" have data like that, but do you? I've never seen it.
Maybe others have.

2) If you did, it would probably indicate something the user would want
preserved, like the shape of the array.

And newlines really are a special case -- they have a special meaning,
and they are very, very common (universal, even)!

So, it's just more code than I'm probably going to write. If someone
does want to write more code than I do, it would probably make sense to
do what someone suggested in the ticket: write an optimized version of
loadtxt in C.

Anyway, I'll think about it when I poke at the code more.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R    (206) 526-6959 voice
7600 Sand Point Way NE    (206) 526-6329 fax
Seattle, WA 98115    (206) 526-6317 main reception

chris.bar...@noaa.gov
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
Fri, 2010-01-08 at 15:12 -0800, Christopher Barker wrote:
> 1) optionally allow newlines to serve as a delimiter, so large tables
> can be read.

I don't really like handling newlines specially. For instance, I could
have data like

1, 2, 3;
4, 5, 6;
7, 8, 9;

Allowing an "alternative separator" would sound better to me. The above
data could then be read like

fromfile('foo.txt', sep=' , ', sep2=' ; ')

or perhaps

fromfile('foo.txt', sep=[' , ', ' ; '])

Since whitespace also matches newlines, this would work.

Pauli
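Pauli's multiple-separator idea can be prototyped in pure Python with a regular expression (a sketch of the semantics only, not the proposed C implementation):

```python
import re
import numpy as np

data = "1, 2, 3;\n4, 5, 6;\n7, 8, 9;"

# Split on commas, semicolons, or any run of whitespace (newlines
# included) -- mimicking sep=[' , ', ' ; '] from the proposal, where
# whitespace around the separator characters is absorbed.
tokens = [t for t in re.split(r'[,;\s]+', data) if t]
arr = np.array(tokens, dtype=float)
print(arr)
```

Note that this flattens the data to 1-D, which is exactly Chris's point (2): the separator that marked row boundaries carried shape information that is lost unless the reader records it.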
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
On Fri, Jan 8, 2010 at 5:12 PM, Christopher Barker wrote:
> Bruce Southey wrote:
>> Also a user has to check for missing values or numpy has to warn a user
>
> I think warnings are next to useless for all but interactive work -- so
> I don't want to rely on them.
>
>> that missing values are present immediately after reading the data so
>> the appropriate action can be taken (like using functions that handle
>> missing values appropriately). That is my second problem with using
>> codes (NaN, -9 etc) for missing values.
>
> But I think you're right -- if someone writes code, tests with good
> input, then later runs it on input with missing values, they are likely
> to have never bothered to test for missing values.
>
> So I think missing values should only be replaced by something if the
> user specifically asks for it.
>
>>> And the principle of fromfile() is that it is fast and simple; if you
>>> want masked arrays, use slower, but more full-featured methods.
>>
>> So in that case it should fail with missing data.
>
> Well, I'm not so sure -- the point is performance; there's no reason
> not to have high-performing code that handles missing data.
>
>> What about '\r' and '\n\r'?
>
> I have thought about that -- I'm hoping that python's text file reading
> will just take care of it, but as we're working with C file handles
> here (I think), I guess not. '\n\r' is easy -- the '\r' is just extra
> whitespace. '\r' alone is another case to handle.
>
>> My problem with this is that you are reading one huge 1-D array (that
>> you can resize later) rather than a 2-D array with rows and columns
>> (which is what I deal with).
>
> That's because fromfile() is not designed to be row-oriented at all,
> and the binary read certainly isn't. I'm just trying to make this easy
> -- though it's not turning out that way!
>
>> But I agree that you can have an option to say treat '\n' or '\r' as
>> a delimiter, but I think it should be turned off by default.
>
> that's what I've done.
>
>> You should have a corresponding value for ints because raising an
>> exception would be inconsistent with allowing floats to have a value.
>
> I'm not sure I care, really -- but I think having the user specify the
> fill value is the best option, anyway.
>
> josef.p...@gmail.com wrote:
>>>> none -- exactly why I think \n is a special case.
>>> What about '\r' and '\n\r'?
>>
>> Yes, I forgot about this, and it will be the most common case for
>> Windows users like myself.
>>
>> I think \r should be stripped automatically, like in non-binary
>> reading of files in python.
>
> except for folks like me that have old mac files laying around... so I
> want this to work like "Universal newlines" support.
>
>> A warning would be good, but doing np.any(np.isnan(x)) or
>> np.isnan(x).sum() on the result is always a good idea for a user when
>> missing values are a possibility.
>
> right, but the issue is the user has to know that they are possible,
> and we all know how carefully we all read docs!
>
> Thanks for your input -- I think I know what I'd like to do, but it's
> proving less than trivial to do it, so we'll see. In short:
>
> 1) optionally allow newlines to serve as a delimiter, so large tables
> can be read.
>
> 2) raise an exception for missing values, unless:
>
> 3) the user specifies a fill value of their choice (compatible with
> the chosen data type).
>
> -Chris

I fully agree with your approach! Thanks for considering my thoughts!

Bruce
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
Bruce Southey wrote:
> Also a user has to check for missing values or numpy has to warn a user

I think warnings are next to useless for all but interactive work -- so
I don't want to rely on them.

> that missing values are present immediately after reading the data so
> the appropriate action can be taken (like using functions that handle
> missing values appropriately). That is my second problem with using
> codes (NaN, -9 etc) for missing values.

But I think you're right -- if someone writes code, tests with good
input, then later runs it on input with missing values, they are likely
to have never bothered to test for missing values.

So I think missing values should only be replaced by something if the
user specifically asks for it.

>> And the principle of fromfile() is that it is fast and simple; if you
>> want masked arrays, use slower, but more full-featured methods.
>
> So in that case it should fail with missing data.

Well, I'm not so sure -- the point is performance; there's no reason not
to have high-performing code that handles missing data.

> What about '\r' and '\n\r'?

I have thought about that -- I'm hoping that python's text file reading
will just take care of it, but as we're working with C file handles here
(I think), I guess not. '\n\r' is easy -- the '\r' is just extra
whitespace. '\r' alone is another case to handle.

> My problem with this is that you are reading one huge 1-D array (that
> you can resize later) rather than a 2-D array with rows and columns
> (which is what I deal with).

That's because fromfile() is not designed to be row-oriented at all, and
the binary read certainly isn't. I'm just trying to make this easy --
though it's not turning out that way!

> But I agree that you can have an option to say treat '\n' or '\r' as a
> delimiter, but I think it should be turned off by default.

that's what I've done.

> You should have a corresponding value for ints because raising an
> exception would be inconsistent with allowing floats to have a value.

I'm not sure I care, really -- but I think having the user specify the
fill value is the best option, anyway.

josef.p...@gmail.com wrote:
>>> none -- exactly why I think \n is a special case.
>> What about '\r' and '\n\r'?
>
> Yes, I forgot about this, and it will be the most common case for
> Windows users like myself.
>
> I think \r should be stripped automatically, like in non-binary
> reading of files in python.

except for folks like me that have old mac files laying around... so I
want this to work like "Universal newlines" support.

> A warning would be good, but doing np.any(np.isnan(x)) or
> np.isnan(x).sum() on the result is always a good idea for a user when
> missing values are a possibility.

right, but the issue is the user has to know that they are possible, and
we all know how carefully we all read docs!

Thanks for your input -- I think I know what I'd like to do, but it's
proving less than trivial to do it, so we'll see. In short:

1) optionally allow newlines to serve as a delimiter, so large tables
can be read.

2) raise an exception for missing values, unless:

3) the user specifies a fill value of their choice (compatible with the
chosen data type).

-Chris
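Chris's three rules can be mocked up in pure Python to make the intended semantics concrete (a sketch only -- the name `fromtext` and its signature are hypothetical; the real work was planned in the C fromfile/fromstring code):

```python
import numpy as np

def fromtext(text, sep=',', allow_newlines=True, fill=None):
    """Hypothetical mockup of the proposed fromfile() text rules."""
    if allow_newlines:
        # Rule 1: treat universal newlines ('\n', '\r\n', '\r') as
        # extra delimiters; blank lines are ignored.
        text = text.replace('\r\n', '\n').replace('\r', '\n')
        rows = [r for r in text.split('\n') if r.strip()]
    else:
        rows = [text]
    values = []
    for row in rows:
        for tok in row.split(sep):
            tok = tok.strip()
            if not tok:
                if fill is None:
                    # Rule 2: missing value and no fill -> exception.
                    raise ValueError('missing value in input')
                values.append(fill)  # Rule 3: user-specified fill value
            else:
                values.append(float(tok))
    return np.array(values)

print(fromtext('1, 2, 3, 4\n5, 6, 7, 8'))
print(fromtext('3, 4,,5', fill=np.nan))
```

With `fill` left unset, `fromtext('3, 4,,5')` raises ValueError instead of silently producing a zero -- the behavior the thread argues for.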
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
On Thu, Jan 7, 2010 at 11:10 PM, Bruce Southey wrote:
> On Thu, Jan 7, 2010 at 3:45 PM, Christopher Barker wrote:
>> Bruce Southey wrote:
>>> Using the numpy NaN or similar (noting R's approach to missing
>>> values, which in turn allows it to have the above functionality) is
>>> just a very bad idea for missing values, because you always have to
>>> check which NaN is a missing value and which was due to some
>>> numerical calculation.
>>
>> well, this is specific to reading files, so you know where it came
>> from.
>
> You can only know where it came from when you compare the original
> array to the transformed one. Also a user has to check for missing
> values or numpy has to warn a user that missing values are present
> immediately after reading the data so the appropriate action can be
> taken (like using functions that handle missing values appropriately).
> That is my second problem with using codes (NaN, -9 etc) for missing
> values.
>
>> And the principle of fromfile() is that it is fast and simple; if you
>> want masked arrays, use slower, but more full-featured methods.
>
> So in that case it should fail with missing data.
>
>> However, in this case:
>>
>> In [9]: np.fromstring("3, 4, NaN, 5", sep=",")
>> Out[9]: array([ 3., 4., NaN, 5.])
>>
>> an actual NaN is read from the file, rather than a missing value.
>> Perhaps the user does want the distinction, so maybe it should really
>> only fill it in if the user asks for it, by specifying
>> "missing_value=np.nan" or something.
>
> Yes, that is my first problem with using predefined codes for missing
> values: you do not always know what is going to occur in the data.
>
>>> From what I can see, you expect that fromfile() should only split at
>>> the supplied delimiters, optionally(?) strip any whitespace
>>
>> whitespace stripping is not optional.
>>
>>> Your output from this string '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12'
>>> actually assumes multiple delimiters because there is no comma
>>> between 4 and 5 and 8 and 9.
>>
>> Yes, that's the point. I thought about allowing arbitrary multiple
>> delimiters, but I think '\n' is a special case -- for instance, a
>> comma at the end of some numbers might mean missing data, but a '\n'
>> would not.
>>
>> And I couldn't really think of a useful use-case for arbitrary
>> multiple delimiters.
>>
>>> In Josef's last case how many missing values should there be?
>>>
>>> extra newlines at end of file
>>>
>>> str = '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n\n\n'
>>
>> none -- exactly why I think \n is a special case.
>
> What about '\r' and '\n\r'?

Yes, I forgot about this, and it will be the most common case for
Windows users like myself. I think \r should be stripped automatically,
like in non-binary reading of files in python.

>> What about extra newlines in the middle of the file:
>>
>> str = '1, 2, 3, 4\n\n5, 6, 7, 8\n9, 10, 11, 12\n'
>>
>> I think they should be ignored, but I hope I'm not making something
>> that is too specific to my personal needs.
>
> Not really, it is more that I am being somewhat difficult to ensure I
> understand what you actually need.
>
> My problem with this is that you are reading one huge 1-D array (that
> you can resize later) rather than a 2-D array with rows and columns
> (which is what I deal with). But I agree that you can have an option
> to say treat '\n' or '\r' as a delimiter, but I think it should be
> turned off by default.
>
>> Travis Oliphant wrote:
>>> +1 (ignoring new-lines transparently is a nice feature). You can
>>> also use sscanf with weave to read most files.
>>
>> right -- but that requires weave. In fact, MATLAB has a fscanf
>> function that allows you to pass in a C format string, and it
>> vectorizes it to use the same one over and over again until it's
>> done. It's actually quite powerful and flexible. I once started with
>> that in mind, but didn't have the C chops to do it. I ended up with a
>> tool that only did doubles (come to think of it, MATLAB only does
>> doubles, anyway...)
>>
>> I may some day write a whole new C (or, more likely, Cython) function
>> that does something like that, but for now, I'm just trying to get
>> fromfile to be useful for me.
>>
>>> +1 (much preferable to insert NaN or other user value than raise
>>> ValueError in my opinion)
>>
>> But raise an error for integer types?
>>
>> I guess this is still up in the air -- no consensus yet.
>
> You should have a corresponding value for ints because raising an
> exception would be inconsistent with allowing floats to have a value.

No, I think different nan/missing-value handling between integers and
floats is a natural distinction. There is no default nan code for
integers, but nan (and inf) are valid floating point numbers (even if
nan is not a number). And the default treatment of nans in numpy is
getting pretty good (e.g. I like the new (nan)sort).

> If you must keep the
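Josef's float-versus-int distinction can be seen directly; the silent NaN-to-integer cast is the behavior the thread wants to avoid (a small demonstration, not part of any proposed API):

```python
import numpy as np

x = np.array([3.0, 4.0, np.nan, 5.0])

# With a float dtype, a missing value encoded as NaN stays detectable:
print(int(np.isnan(x).sum()))

# Casting to an integer dtype silently replaces NaN with a meaningless
# integer (the exact value is platform-dependent) -- exactly the
# hard-to-debug behavior being objected to.
y = x.astype(np.int64)
print(y.dtype)
```

Once the data is integer there is no way to tell a former NaN from a legitimate value, which is why the thread converges on raising an exception for integer dtypes unless the user supplies a fill value.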
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
On Thu, Jan 7, 2010 at 3:45 PM, Christopher Barker wrote:
> Bruce Southey wrote:
>> Using the numpy NaN or similar (noting R's approach to missing
>> values, which in turn allows it to have the above functionality) is
>> just a very bad idea for missing values, because you always have to
>> check which NaN is a missing value and which was due to some
>> numerical calculation.
>
> well, this is specific to reading files, so you know where it came
> from.

You can only know where it came from when you compare the original
array to the transformed one. Also a user has to check for missing
values or numpy has to warn a user that missing values are present
immediately after reading the data so the appropriate action can be
taken (like using functions that handle missing values appropriately).
That is my second problem with using codes (NaN, -9 etc) for missing
values.

> And the principle of fromfile() is that it is fast and simple; if you
> want masked arrays, use slower, but more full-featured methods.

So in that case it should fail with missing data.

> However, in this case:
>
> In [9]: np.fromstring("3, 4, NaN, 5", sep=",")
> Out[9]: array([ 3., 4., NaN, 5.])
>
> an actual NaN is read from the file, rather than a missing value.
> Perhaps the user does want the distinction, so maybe it should really
> only fill it in if the user asks for it, by specifying
> "missing_value=np.nan" or something.

Yes, that is my first problem with using predefined codes for missing
values: you do not always know what is going to occur in the data.

>> From what I can see, you expect that fromfile() should only split at
>> the supplied delimiters, optionally(?) strip any whitespace
>
> whitespace stripping is not optional.
>
>> Your output from this string '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12'
>> actually assumes multiple delimiters because there is no comma
>> between 4 and 5 and 8 and 9.
>
> Yes, that's the point. I thought about allowing arbitrary multiple
> delimiters, but I think '\n' is a special case -- for instance, a
> comma at the end of some numbers might mean missing data, but a '\n'
> would not.
>
> And I couldn't really think of a useful use-case for arbitrary
> multiple delimiters.
>
>> In Josef's last case how many missing values should there be?
>>
>> extra newlines at end of file
>>
>> str = '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n\n\n'
>
> none -- exactly why I think \n is a special case.

What about '\r' and '\n\r'?

> What about extra newlines in the middle of the file:
>
> str = '1, 2, 3, 4\n\n5, 6, 7, 8\n9, 10, 11, 12\n'
>
> I think they should be ignored, but I hope I'm not making something
> that is too specific to my personal needs.

Not really, it is more that I am being somewhat difficult to ensure I
understand what you actually need.

My problem with this is that you are reading one huge 1-D array (that
you can resize later) rather than a 2-D array with rows and columns
(which is what I deal with). But I agree that you can have an option to
say treat '\n' or '\r' as a delimiter, but I think it should be turned
off by default.

> Travis Oliphant wrote:
>> +1 (ignoring new-lines transparently is a nice feature). You can also
>> use sscanf with weave to read most files.
>
> right -- but that requires weave. In fact, MATLAB has a fscanf
> function that allows you to pass in a C format string, and it
> vectorizes it to use the same one over and over again until it's done.
> It's actually quite powerful and flexible. I once started with that in
> mind, but didn't have the C chops to do it. I ended up with a tool
> that only did doubles (come to think of it, MATLAB only does doubles,
> anyway...)
>
> I may some day write a whole new C (or, more likely, Cython) function
> that does something like that, but for now, I'm just trying to get
> fromfile to be useful for me.
>
>> +1 (much preferable to insert NaN or other user value than raise
>> ValueError in my opinion)
>
> But raise an error for integer types?
>
> I guess this is still up in the air -- no consensus yet.

You should have a corresponding value for ints because raising an
exception would be inconsistent with allowing floats to have a value.
If you must keep the user-defined dtype then, as Josef suggests, just
use some code, be it -999 or the most negative number supported by the
OS for the defined dtype, or just convert the ints into floats if the
user does not define a missing value code. It would be nice to either
return the number of missing values or display a warning indicating how
many occurred.

Bruce
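The "slower, but more full-featured" path Chris alludes to already exists in genfromtxt, which can fill empty fields with a user-chosen code like the -999 Bruce suggests (parameter names as in numpy's genfromtxt; behavior sketched on inline data):

```python
import io
import numpy as np

data = io.StringIO('1, 2, 3, 4\n5, , 7, 8\n9, 10, , 12')

# Empty fields are treated as missing and replaced by filling_values;
# passing usemask=True would instead return a masked array.
arr = np.genfromtxt(data, delimiter=',', filling_values=-999)
print(arr)
```

This also gives Bruce a way to count what was filled afterwards, e.g. `(arr == -999).sum()`, at the cost of the speed that the fromfile() discussion is about.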
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
josef.p...@gmail.com wrote:
>>> +1 (much preferable to insert NaN or other user value than raise
>>> ValueError in my opinion)
>>
>> But raise an error for integer types?
>>
>> I guess this is still up in the air -- no consensus yet.
>
> raise an exception, I hate the silent cast of nan to integer zero,

me too -- I'm sorry, I wasn't clear -- I'm not going to write any code
that returns a zero for a missing value. These are the options I'd
consider:

1) Have the user specify what to use for missing values; otherwise,
raise an exception.

2) Insert a NaN for floating point types, and raise an exception for
integer types.

What's not clear is whether (2) is a good idea. As for (1), I just
don't know if I'm going to get around to writing the code, and maybe
more kwargs is a bad idea -- though maybe not.

Enough talk: I've got ugly C code to wade through...

-Chris
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
On Thu, Jan 7, 2010 at 4:45 PM, Christopher Barker wrote:
> Bruce Southey wrote:
>> Using the numpy NaN or similar (noting R's approach to missing
>> values, which in turn allows it to have the above functionality) is
>> just a very bad idea for missing values, because you always have to
>> check which NaN is a missing value and which was due to some
>> numerical calculation.
>
> well, this is specific to reading files, so you know where it came
> from. And the principle of fromfile() is that it is fast and simple;
> if you want masked arrays, use slower, but more full-featured methods.
>
> However, in this case:
>
> In [9]: np.fromstring("3, 4, NaN, 5", sep=",")
> Out[9]: array([ 3., 4., NaN, 5.])
>
> an actual NaN is read from the file, rather than a missing value.
> Perhaps the user does want the distinction, so maybe it should really
> only fill it in if the user asks for it, by specifying
> "missing_value=np.nan" or something.
>
>> From what I can see, you expect that fromfile() should only split at
>> the supplied delimiters, optionally(?) strip any whitespace
>
> whitespace stripping is not optional.
>
>> Your output from this string '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12'
>> actually assumes multiple delimiters because there is no comma
>> between 4 and 5 and 8 and 9.
>
> Yes, that's the point. I thought about allowing arbitrary multiple
> delimiters, but I think '\n' is a special case -- for instance, a
> comma at the end of some numbers might mean missing data, but a '\n'
> would not.
>
> And I couldn't really think of a useful use-case for arbitrary
> multiple delimiters.
>
>> In Josef's last case how many missing values should there be?
>>
>> extra newlines at end of file
>>
>> str = '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n\n\n'
>
> none -- exactly why I think \n is a special case.
>
> What about extra newlines in the middle of the file:
>
>> str = '1, 2, 3, 4\n\n5, 6, 7, 8\n9, 10, 11, 12\n'
>
> I think they should be ignored, but I hope I'm not making something
> that is too specific to my personal needs.
>
> Travis Oliphant wrote:
>> +1 (ignoring new-lines transparently is a nice feature). You can also
>> use sscanf with weave to read most files.
>
> right -- but that requires weave. In fact, MATLAB has a fscanf
> function that allows you to pass in a C format string, and it
> vectorizes it to use the same one over and over again until it's done.
> It's actually quite powerful and flexible. I once started with that in
> mind, but didn't have the C chops to do it. I ended up with a tool
> that only did doubles (come to think of it, MATLAB only does doubles,
> anyway...)
>
> I may some day write a whole new C (or, more likely, Cython) function
> that does something like that, but for now, I'm just trying to get
> fromfile to be useful for me.
>
>> +1 (much preferable to insert NaN or other user value than raise
>> ValueError in my opinion)
>
> But raise an error for integer types?
>
> I guess this is still up in the air -- no consensus yet.

raise an exception, I hate the silent cast of nan to integer zero, too
much debugging and useless if there are real zeros. (Or use some -999
kind of thing if user-defined nan codes are allowed, but I just work
with float if I expect nans/missing values.)

Josef

> Thanks,
>
> -Chris
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
Bruce Southey wrote:
> Using the numpy NaN or similar (noting R's approach to missing values,
> which in turn allows it to have the above functionality) is just a
> very bad idea for missing values, because you always have to check
> which NaN is a missing value and which was due to some numerical
> calculation.

well, this is specific to reading files, so you know where it came from.
And the principle of fromfile() is that it is fast and simple; if you
want masked arrays, use slower, but more full-featured methods.

However, in this case:

In [9]: np.fromstring("3, 4, NaN, 5", sep=",")
Out[9]: array([ 3., 4., NaN, 5.])

an actual NaN is read from the file, rather than a missing value.
Perhaps the user does want the distinction, so maybe it should really
only fill it in if the user asks for it, by specifying
"missing_value=np.nan" or something.

> From what I can see, you expect that fromfile() should only split at
> the supplied delimiters, optionally(?) strip any whitespace

whitespace stripping is not optional.

> Your output from this string '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12'
> actually assumes multiple delimiters because there is no comma between
> 4 and 5 and 8 and 9.

Yes, that's the point. I thought about allowing arbitrary multiple
delimiters, but I think '\n' is a special case -- for instance, a comma
at the end of some numbers might mean missing data, but a '\n' would
not.

And I couldn't really think of a useful use-case for arbitrary multiple
delimiters.

> In Josef's last case how many missing values should there be?
>
> extra newlines at end of file
>
> str = '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n\n\n'

none -- exactly why I think \n is a special case.

What about:

> extra newlines in the middle of the file
>
> str = '1, 2, 3, 4\n\n5, 6, 7, 8\n9, 10, 11, 12\n'

I think they should be ignored, but I hope I'm not making something that
is too specific to my personal needs.

Travis Oliphant wrote:
> +1 (ignoring new-lines transparently is a nice feature). You can also
> use sscanf with weave to read most files.

right -- but that requires weave. In fact, MATLAB has a fscanf function
that allows you to pass in a C format string, and it vectorizes it to
use the same one over and over again until it's done. It's actually
quite powerful and flexible. I once started with that in mind, but
didn't have the C chops to do it. I ended up with a tool that only did
doubles (come to think of it, MATLAB only does doubles, anyway...)

I may some day write a whole new C (or, more likely, Cython) function
that does something like that, but for now, I'm just trying to get
fromfile to be useful for me.

> +1 (much preferable to insert NaN or other user value than raise
> ValueError in my opinion)

But raise an error for integer types?

I guess this is still up in the air -- no consensus yet.

Thanks,

-Chris
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
On Jan 7, 2010, at 2:32 PM, josef.p...@gmail.com wrote: > On Thu, Jan 7, 2010 at 3:08 PM, Christopher Barker > wrote: >> Pauli Virtanen wrote: >>> ma, 2010-01-04 kello 17:05 -0800, Christopher Barker kirjoitti: >>> it also does odd things with spaces embedded in the separator: ", $ #" matches all of: ",$#" ", $#" ",$ #" >> >>> That's a documented feature: >> >> Fair enough. >> >> OK, I've written a patch that allows newlines to be interpreted as >> separators in addition to whatever is specified in sep. >> >> In the process of testing, I found again these issues, which are >> still >> marked as "needs decision". >> >> http://projects.scipy.org/numpy/ticket/883 >> >> In short: what to do with missing values? >> >> I'd like to address this bug, but I need a decision to do so. >> >> >> My proposal: >> >> Raise an ValueError with missing values. >> >> >> Justification: >> >> No function should EVER return data that is not there. Period. It is >> simply asking for hard to find bugs. Therefore: >> >> fromstring("3, 4,,5", sep=",") >> >> Should never, ever, return: >> >> array([ 3., 4., 0., 5.]) >> >> Which is what it does now. bad. bad. bad. >> >> >> >> >> Alternatives: >> >> A) Raising a ValueError is the easiest way to get "proper" >> behavior. >> Folks can use a more sophisticated file reader if they want missing >> values handled. I'm willing to contribute this patch. >> >> B) If the dtype is a floating point type, NaN could fill in the >> missing values -- a fine idea, but you can't use it for integers, and >> zero is a really bad replacement! >> >> C) The user could specify what they want filled in for missing >> values. This is a fine idea, though I'm not sure I want to take the >> time >> to impliment it. 
>> >> Oh, and this is a bug too, with probably the same solution: >> >> In [20]: np.fromstring("hjba", sep=',') >> Out[20]: array([ 0.]) >> >> In [26]: np.fromstring("34gytf39", sep=',') >> Out[26]: array([ 34.]) >> >> >> One more unresolved question: >> >> what should: >> >> np.fromstring("3, 4, 5,", sep=",") >> >> return? >> >> it currently returns: >> >> array([ 3., 4., 5.]) >> >> which seems a bit inconsitent with missing value handling. I also >> found >> a bug: >> >> In [6]: np.fromstring("3, 4, 5 , ", sep=",") >> Out[6]: array([ 3., 4., 5., 0.]) >> >> so if there is some extra whitespace in there, it does return a >> missing >> value. With my proposal, that wouldn't happen, but you might get an >> exception. I think you should, but it'll be easier to implement my >> "allow newlines" code if not. >> >> >> so, should I do (A) ? >> >> >> Another question: >> >> I've got a patch mostly working (except for the above issues) that >> will >> allow fromfile/string to read multiline non-whitespace separated >> data in >> one shot: >> >> >> In [15]: str >> Out[15]: '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12' >> >> In [16]: np.fromstring(str, sep=',', allow_newlines=True) >> Out[16]: >> array([ 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., >> 11., >> 12.]) >> >> >> I think this is a very helpful enhancement, and, as it is a new >> kwarg, >> backward compatible: >> >> 1) Might it be accepted for inclusion? >> >> 2) Is the name for the flag OK: "allow_newlines"? It's pretty >> explicit, >> but also long -- I used it for the flag name in the C code, too. >> >> 3) What C datatype should I use for a boolean flag? I used a char, >> but I >> don't know what the numpy standard is. 
>> >> -Chris >> >> > > I don't know much about this, just a few more test cases > > comma and newline > str = '1, 2, 3, 4,\n5, 6, 7, 8,\n9, 10, 11, 12' > > extra comma at end of file > str = '1, 2, 3, 4,\n5, 6, 7, 8,\n9, 10, 11, 12,' > > extra newlines at end of file > str = '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n\n\n' > > It would be nice if these cases would go through without missing > values or exception, but I don't often have files that are clean > enough for fromfile(). +1 (ignoring new-lines transparently is a nice feature). You can also use sscanf with weave to read most files. > > I'm in favor of nan for missing values with floating point numbers. It > would make it easy to read correctly formatted csv files, even if the > data is not complete. +1 (much preferable to insert NaN or other user value than raise ValueError in my opinion) -Travis
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
On Thu, Jan 7, 2010 at 2:32 PM, wrote: > On Thu, Jan 7, 2010 at 3:08 PM, Christopher Barker > wrote: >> Pauli Virtanen wrote: >>> ma, 2010-01-04 kello 17:05 -0800, Christopher Barker kirjoitti: >>> it also does odd things with spaces embedded in the separator: ", $ #" matches all of: ",$#" ", $#" ",$ #" >> >>> That's a documented feature: >> >> Fair enough. >> >> OK, I've written a patch that allows newlines to be interpreted as >> separators in addition to whatever is specified in sep. >> >> In the process of testing, I found again these issues, which are still >> marked as "needs decision". >> >> http://projects.scipy.org/numpy/ticket/883 >> >> In short: what to do with missing values? >> >> I'd like to address this bug, but I need a decision to do so. >> >> >> My proposal: >> >> Raise an ValueError with missing values. >> >> >> Justification: >> >> No function should EVER return data that is not there. Period. It is >> simply asking for hard to find bugs. Therefore: >> >> fromstring("3, 4,,5", sep=",") >> >> Should never, ever, return: >> >> array([ 3., 4., 0., 5.]) >> >> Which is what it does now. bad. bad. bad. >> >> >> >> >> Alternatives: >> >> A) Raising a ValueError is the easiest way to get "proper" behavior. >> Folks can use a more sophisticated file reader if they want missing >> values handled. I'm willing to contribute this patch. >> >> B) If the dtype is a floating point type, NaN could fill in the >> missing values -- a fine idea, but you can't use it for integers, and >> zero is a really bad replacement! >> >> C) The user could specify what they want filled in for missing >> values. This is a fine idea, though I'm not sure I want to take the time >> to impliment it. 
>> >> Oh, and this is a bug too, with probably the same solution: >> >> In [20]: np.fromstring("hjba", sep=',') >> Out[20]: array([ 0.]) >> >> In [26]: np.fromstring("34gytf39", sep=',') >> Out[26]: array([ 34.]) >> >> >> One more unresolved question: >> >> what should: >> >> np.fromstring("3, 4, 5,", sep=",") >> >> return? >> >> it currently returns: >> >> array([ 3., 4., 5.]) >> >> which seems a bit inconsitent with missing value handling. I also found >> a bug: >> >> In [6]: np.fromstring("3, 4, 5 , ", sep=",") >> Out[6]: array([ 3., 4., 5., 0.]) >> >> so if there is some extra whitespace in there, it does return a missing >> value. With my proposal, that wouldn't happen, but you might get an >> exception. I think you should, but it'll be easier to implement my >> "allow newlines" code if not. >> >> >> so, should I do (A) ? >> >> >> Another question: >> >> I've got a patch mostly working (except for the above issues) that will >> allow fromfile/string to read multiline non-whitespace separated data in >> one shot: >> >> >> In [15]: str >> Out[15]: '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12' >> >> In [16]: np.fromstring(str, sep=',', allow_newlines=True) >> Out[16]: >> array([ 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11., >> 12.]) >> >> >> I think this is a very helpful enhancement, and, as it is a new kwarg, >> backward compatible: >> >> 1) Might it be accepted for inclusion? >> >> 2) Is the name for the flag OK: "allow_newlines"? It's pretty explicit, >> but also long -- I used it for the flag name in the C code, too. >> >> 3) What C datatype should I use for a boolean flag? I used a char, but I >> don't know what the numpy standard is. 
>> >> -Chris >> >> > > I don't know much about this, just a few more test cases > > comma and newline > str = '1, 2, 3, 4,\n5, 6, 7, 8,\n9, 10, 11, 12' > > extra comma at end of file > str = '1, 2, 3, 4,\n5, 6, 7, 8,\n9, 10, 11, 12,' > > extra newlines at end of file > str = '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n\n\n' > > It would be nice if these cases would go through without missing > values or exception, but I don't often have files that are clean > enough for fromfile(). > > I'm in favor of nan for missing values with floating point numbers. It > would make it easy to read correctly formatted csv files, even if the > data is not complete. > Using the numpy NaN or similar (noting R's approach to missing values which in turn allows it to have the above functionality) is just a very bad idea for missing values because you always have to check which NaN is a missing value and which was due to some numerical calculation. It is a very bad idea because we have masked arrays that nicely, but slowly, handle this situation. From what I can see, you expect that fromfile() should only split at the supplied delimiters, optionally(?) strip any whitespace, and force a specific dtype. I would agree that the failure of any one of these should create an exception by default rather than making the best guess. So 'missing data' would potentially fail when forcing the specified dtype. Thus, you should either create an exception for invalid data (with appropriate location) or use masked arrays. Your output from this string '1, 2
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
On Thu, Jan 7, 2010 at 3:08 PM, Christopher Barker wrote: > Pauli Virtanen wrote: >> ma, 2010-01-04 kello 17:05 -0800, Christopher Barker kirjoitti: >> it also does odd things with spaces >>> embedded in the separator: >>> >>> ", $ #" matches all of: ",$#" ", $#" ",$ #" > >> That's a documented feature: > > Fair enough. > > OK, I've written a patch that allows newlines to be interpreted as > separators in addition to whatever is specified in sep. > > In the process of testing, I found again these issues, which are still > marked as "needs decision". > > http://projects.scipy.org/numpy/ticket/883 > > In short: what to do with missing values? > > I'd like to address this bug, but I need a decision to do so. > > > My proposal: > > Raise an ValueError with missing values. > > > Justification: > > No function should EVER return data that is not there. Period. It is > simply asking for hard to find bugs. Therefore: > > fromstring("3, 4,,5", sep=",") > > Should never, ever, return: > > array([ 3., 4., 0., 5.]) > > Which is what it does now. bad. bad. bad. > > > > > Alternatives: > > A) Raising a ValueError is the easiest way to get "proper" behavior. > Folks can use a more sophisticated file reader if they want missing > values handled. I'm willing to contribute this patch. > > B) If the dtype is a floating point type, NaN could fill in the > missing values -- a fine idea, but you can't use it for integers, and > zero is a really bad replacement! > > C) The user could specify what they want filled in for missing > values. This is a fine idea, though I'm not sure I want to take the time > to impliment it. > > Oh, and this is a bug too, with probably the same solution: > > In [20]: np.fromstring("hjba", sep=',') > Out[20]: array([ 0.]) > > In [26]: np.fromstring("34gytf39", sep=',') > Out[26]: array([ 34.]) > > > One more unresolved question: > > what should: > > np.fromstring("3, 4, 5,", sep=",") > > return? 
> > it currently returns: > > array([ 3., 4., 5.]) > > which seems a bit inconsitent with missing value handling. I also found > a bug: > > In [6]: np.fromstring("3, 4, 5 , ", sep=",") > Out[6]: array([ 3., 4., 5., 0.]) > > so if there is some extra whitespace in there, it does return a missing > value. With my proposal, that wouldn't happen, but you might get an > exception. I think you should, but it'll be easier to implement my > "allow newlines" code if not. > > > so, should I do (A) ? > > > Another question: > > I've got a patch mostly working (except for the above issues) that will > allow fromfile/string to read multiline non-whitespace separated data in > one shot: > > > In [15]: str > Out[15]: '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12' > > In [16]: np.fromstring(str, sep=',', allow_newlines=True) > Out[16]: > array([ 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11., > 12.]) > > > I think this is a very helpful enhancement, and, as it is a new kwarg, > backward compatible: > > 1) Might it be accepted for inclusion? > > 2) Is the name for the flag OK: "allow_newlines"? It's pretty explicit, > but also long -- I used it for the flag name in the C code, too. > > 3) What C datatype should I use for a boolean flag? I used a char, but I > don't know what the numpy standard is. > > > -Chris > > I don't know much about this, just a few more test cases comma and newline str = '1, 2, 3, 4,\n5, 6, 7, 8,\n9, 10, 11, 12' extra comma at end of file str = '1, 2, 3, 4,\n5, 6, 7, 8,\n9, 10, 11, 12,' extra newlines at end of file str = '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n\n\n' It would be nice if these cases would go through without missing values or exception, but I don't often have files that are clean enough for fromfile(). I'm in favor of nan for missing values with floating point numbers. It would make it easy to read correctly formatted csv files, even if the data is not complete. Josef > > > > > > > > > > -- > Christopher Barker, Ph.D. 
> Oceanographer > > Emergency Response Division > NOAA/NOS/OR&R (206) 526-6959 voice > 7600 Sand Point Way NE (206) 526-6329 fax > Seattle, WA 98115 (206) 526-6317 main reception > > chris.bar...@noaa.gov > ___ > NumPy-Discussion mailing list > NumPy-Discussion@scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
Pauli Virtanen wrote: > ma, 2010-01-04 kello 17:05 -0800, Christopher Barker kirjoitti: > it also does odd things with spaces >> embedded in the separator: >> >> ", $ #" matches all of: ",$#" ", $#" ",$ #" > That's a documented feature: Fair enough. OK, I've written a patch that allows newlines to be interpreted as separators in addition to whatever is specified in sep. In the process of testing, I found again these issues, which are still marked as "needs decision". http://projects.scipy.org/numpy/ticket/883 In short: what to do with missing values? I'd like to address this bug, but I need a decision to do so. My proposal: Raise a ValueError on missing values. Justification: No function should EVER return data that is not there. Period. It is simply asking for hard-to-find bugs. Therefore: fromstring("3, 4,,5", sep=",") should never, ever, return: array([ 3., 4., 0., 5.]) which is what it does now. Bad. Bad. Bad. Alternatives: A) Raising a ValueError is the easiest way to get "proper" behavior. Folks can use a more sophisticated file reader if they want missing values handled. I'm willing to contribute this patch. B) If the dtype is a floating point type, NaN could fill in the missing values -- a fine idea, but you can't use it for integers, and zero is a really bad replacement! C) The user could specify what they want filled in for missing values. This is a fine idea, though I'm not sure I want to take the time to implement it. Oh, and this is a bug too, with probably the same solution: In [20]: np.fromstring("hjba", sep=',') Out[20]: array([ 0.]) In [26]: np.fromstring("34gytf39", sep=',') Out[26]: array([ 34.]) One more unresolved question: what should: np.fromstring("3, 4, 5,", sep=",") return? It currently returns: array([ 3., 4., 5.]) which seems a bit inconsistent with missing value handling. 
I also found a bug: In [6]: np.fromstring("3, 4, 5 , ", sep=",") Out[6]: array([ 3., 4., 5., 0.]) So if there is some extra whitespace in there, it does return a missing value. With my proposal, that wouldn't happen, but you might get an exception. I think you should, but it'll be easier to implement my "allow newlines" code if not. So, should I do (A)? Another question: I've got a patch mostly working (except for the above issues) that will allow fromfile/fromstring to read multiline, non-whitespace-separated data in one shot: In [15]: str Out[15]: '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12' In [16]: np.fromstring(str, sep=',', allow_newlines=True) Out[16]: array([ 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11., 12.]) I think this is a very helpful enhancement, and, as it is a new kwarg, backward compatible: 1) Might it be accepted for inclusion? 2) Is the name for the flag OK: "allow_newlines"? It's pretty explicit, but also long -- I used it for the flag name in the C code, too. 3) What C datatype should I use for a boolean flag? I used a char, but I don't know what the numpy standard is. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov
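[Editor's note: the proposed semantics can be pinned down with a pure-Python reference sketch -- not the actual C patch -- combining proposal (A) with the newline handling discussed above: runs of newlines act as a single extra separator (so leading/trailing blank lines are ignored), while an empty comma-delimited field raises instead of silently becoming 0:]

```python
import re
import numpy as np

def fromstring_newlines(s, sep=","):
    """Reference sketch (pure Python, not the C patch) of the proposed
    allow_newlines=True behavior: runs of newlines act as one extra
    separator, blank lines are ignored, and an empty delimited field
    raises instead of silently becoming 0."""
    tokens = re.sub(r"\n+", sep, s.strip("\n")).split(sep)
    values = []
    for tok in tokens:
        tok = tok.strip()
        if not tok:
            raise ValueError("missing value in input")
        values.append(float(tok))
    return np.array(values)

print(fromstring_newlines("1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12"))
```

Under this sketch, Josef's "extra newlines at end of file" case parses cleanly, while his "extra comma at end of file" case raises, matching the positions taken in the thread.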
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
ma, 2010-01-04 kello 17:05 -0800, Christopher Barker kirjoitti: [clip] > I also notice that it supports separators of arbitrary length, which I > wonder how useful that is. But it also does odd things with spaces > embedded in the separator: > > ", $ #" matches all of: ",$#" ", $#" ",$ #" > > Is it worth trying to fix that? That's a documented feature: sep : str Separator between items if file is a text file. Empty ("") separator means the file should be treated as binary. Spaces (" ") in the separator match zero or more whitespace characters. A separator consisting only of spaces must match at least one whitespace. -- Pauli Virtanen
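[Editor's note: the documented rule can be restated as a small separator-to-regex translation. This is a paraphrase of the docstring, not the actual C implementation:]

```python
import re

def sep_to_regex(sep):
    """Translate a fromfile/fromstring sep string into a regex, per the
    documented rule: each space matches zero or more whitespace
    characters; an all-space sep must match at least one."""
    if sep.strip() == "" and sep != "":
        return r"\s+"  # all-space separator: at least one whitespace
    return "".join(r"\s*" if c.isspace() else re.escape(c) for c in sep)

# The three variants Chris lists all match sep=", $ #":
pat = re.compile(sep_to_regex(", $ #"))
for text in (",$#", ", $#", ",$ #"):
    assert pat.fullmatch(text)
```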
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
On Jan 5, 2010, at 12:32 PM, Christopher Barker wrote: > josef.p...@gmail.com wrote: >> On Mon, Jan 4, 2010 at 10:39 PM, wrote: >>> I rather like the R command(s) for reading text files > >> Aren't the newly improved >> >> numpy.genfromtxt() > > ... > >> and friends intended to handle all this > > Yes, they are, and they are great, but not really all that fast. If > you've got big complicated tables of data to read, then genfromtxt is > the way to go -- it's a great tool. However, for the simple stuff, it's > not really optimized. genfromtxt is nothing but loadtxt overloaded to deal with undefined dtype and missing entries. It's doomed to be slower, and it shouldn't be used if you know your data is well-defined and well-behaved. Stick to loadtxt. > I also find I have to read a lot of text files > that aren't tables of data, but rather an odd mix of stuff, but still a > lot of reading lots of numbers from a file. Well, everything depends on what kind of stuff you have in your mix, I guess... > so fromfile() is 3.5 times as fast as loadtxt and 4.5 times as fast as > genfromtxt. That does make a difference for me -- the user waiting 4 > seconds, rather than one second to load a file matters. Remember that fromfile is C, while loadtxt and genfromtxt are Python... > I suppose another option might be to see if I can optimize the inner > scanning function of genfromtxt with Cython or C, but I'm not sure > that's possible, as it's really very flexible, and re-writing all of > that without Python would be really painful! Well, there's room for some optimization for particular cases (dtype!=None), but the generic case will be tricky...
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
josef.p...@gmail.com wrote: > On Mon, Jan 4, 2010 at 10:39 PM, wrote: >> I rather like the R command(s) for reading text files > Aren't the newly improved > > numpy.genfromtxt() ... > and friends intended to handle all this Yes, they are, and they are great, but not really all that fast. If you've got big complicated tables of data to read, then genfromtxt is the way to go -- it's a great tool. However, for the simple stuff, it's not really optimized. I also find I have to read a lot of text files that aren't tables of data, but rather an odd mix of stuff, but still a lot of reading lots of numbers from a file. As far as I can tell, genfromtxt and loadtxt can only load the entire file as a table (a very common situation, of course). Paul Ivanov wrote: > Just a potshot, but have you tried np.loadtxt? > > I find it pretty fast. I guess I should have posted timings in the first place: In [19]: timeit timing.time_genfromtxt() 10 loops, best of 3: 216 ms per loop In [20]: timeit timing.time_loadtxt() 10 loops, best of 3: 166 ms per loop In [21]: timeit timing.time_fromfile() 10 loops, best of 3: 47.1 ms per loop (40,000 doubles from a space-delimited text file) So fromfile() is 3.5 times as fast as loadtxt and 4.5 times as fast as genfromtxt. That does make a difference for me -- the user waiting four seconds rather than one second to load a file matters. I suppose another option might be to see if I can optimize the inner scanning function of genfromtxt with Cython or C, but I'm not sure that's possible, as it's really very flexible, and re-writing all of that without Python would be really painful! -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov
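[Editor's note: the timing.py module behind those numbers wasn't posted, so this is a hypothetical re-creation of the harness on a smaller array. fromfile needs a real file object rather than a StringIO, so only the two Python-level readers are timed here; absolute numbers will vary by machine:]

```python
import io
import timeit
import numpy as np

# Space-delimited doubles, like the benchmark data described above.
rng = np.random.default_rng(0)
text = "\n".join(" ".join("%.6f" % x for x in row)
                 for row in rng.random((1000, 10)))

def time_loadtxt():
    return np.loadtxt(io.StringIO(text))

def time_genfromtxt():
    return np.genfromtxt(io.StringIO(text))

for fn in (time_genfromtxt, time_loadtxt):
    print("%-16s %.3f s for 3 runs" % (fn.__name__,
                                       timeit.timeit(fn, number=3)))
```

loadtxt should come out ahead of genfromtxt, consistent with the figures quoted in the message.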
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
Christopher Barker, on 2010-01-04 17:05, wrote: > Hi folks, > > I'm taking a look once again at fromfile() for reading text files. I > often have the need to read a LOT of numbers form a text file, and it > can actually be pretty darn slow do i the normal python way: > > for line in file: > data = map(float, line.strip().split()) > > > or various other versions that are similar. It really does take longer > to read the text, split it up, convert to a number, then put that number > into a numpy array, than it does to simply read it straight into the array. > > However, as it stands, fromfile() turn out to be next to useless for > anything but whitespace separated text. Full set of ideas here: > > http://projects.scipy.org/numpy/ticket/909 > > However, for the moment, I'm digging into the code to address a > particular problem -- reading files like this: > > 123, 65.6, 789 > 23, 3.2, 34 > ... > > That is comma (or whatever) separated text -- pretty common stuff. > > The problem with the current code is that you can't read more than one > line at time with fromfile: > > a = np.fromfile(infile, sep=",") > > will read until it doesn't find a comma, and thus only one line, as > there is no comma after each line. As this is a really typical case, I > think it should be supported. Just a potshot, but have you tried np.loadtxt? I find it pretty fast. > > Here is the question: > > The work of finding the separator is done in: > > multiarray/ctors.c: fromfile_skip_separator() > > It looks like it wouldn't be too hard to add some code in there to look > for a newline, and consider that a valid separator. However, that would > break backward compatibility. So maybe a flag could be passed in, saying > you wanted to support newlines. The problem is that flag would have to > get passed all the way through to this function (and also for fromstring). > > I also notice that it supports separators of arbitrary length, which I > wonder how useful that is. 
But it also does odd things with spaces > embedded in the separator: > > ", $ #" matches all of: ",$#" ", $#" ",$ #" > > Is it worth trying to fix that? > > > In the longer term, it would be really nice to support comments as well, > tough that would require more of a re-factoring of the code, I think > (though maybe not -- I suppose a call to fromfile_skip_separator() could > look for a comment character, then if it found one, skip to where the > comment ends -- hmmm. > > thanks for any feedback, > > -Chris > > > > > > > ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
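[Editor's note: the quoted message's two approaches, made concrete. The temp file is illustrative. With a whitespace sep, fromfile already reads across newlines in one shot; with a comma sep it stops at the first newline, which is the whole complaint of the thread:]

```python
import os
import tempfile
import numpy as np

rows = [[1.5, 2.0, 3.25], [4.0, 5.5, 6.0]]
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    for row in rows:
        f.write(" ".join(str(v) for v in row) + "\n")
    path = f.name

# The "normal python way" loop from the original post:
data_loop = []
with open(path) as infile:
    for line in infile:
        data_loop.extend(float(v) for v in line.split())

# One-shot C-level read; works here because the separator is whitespace:
data_fast = np.fromfile(path, sep=" ")

os.remove(path)
print(data_loop == list(data_fast))
```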
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
On Mon, Jan 4, 2010 at 10:39 PM, wrote: >>Hi folks, >> >>I'm taking a look once again at fromfile() for reading text files. I >>often have the need to read a LOT of numbers form a text file, and it >>can actually be pretty darn slow do i the normal python way: >> >>for line in file: >> data = map(float, line.strip().split()) >> >> >>or various other versions that are similar. It really does take longer >>to read the text, split it up, convert to a number, then put that number >>into a numpy array, than it does to simply read it straight into the array. >> >>However, as it stands, fromfile() turn out to be next to useless for >>anything but whitespace separated text. Full set of ideas here: >> >>http://projects.scipy.org/numpy/ticket/909 >> >>However, for the moment, I'm digging into the code to address a >>particular problem -- reading files like this: >> >>123, 65.6, 789 >>23, 3.2, 34 >>... >> >>That is comma (or whatever) separated text -- pretty common stuff. >> >>The problem with the current code is that you can't read more than one >>line at time with fromfile: >> >>a = np.fromfile(infile, sep=",") >> >>will read until it doesn't find a comma, and thus only one line, as >>there is no comma after each line. As this is a really typical case, I >>think it should be supported. >> >>Here is the question: >> >>The work of finding the separator is done in: >> >>multiarray/ctors.c: fromfile_skip_separator() >> >>It looks like it wouldn't be too hard to add some code in there to look >>for a newline, and consider that a valid separator. However, that would >>break backward compatibility. So maybe a flag could be passed in, saying >>you wanted to support newlines. The problem is that flag would have to >>get passed all the way through to this function (and also for fromstring). >> >>I also notice that it supports separators of arbitrary length, which I >>wonder how useful that is. 
But it also does odd things with spaces >>embedded in the separator: >> >>", $ #" matches all of: ",$#" ", $#" ",$ #" >> >>Is it worth trying to fix that? >> >> >>In the longer term, it would be really nice to support comments as well, >>tough that would require more of a re-factoring of the code, I think >>(though maybe not -- I suppose a call to fromfile_skip_separator() could >>look for a comment character, then if it found one, skip to where the >>comment ends -- hmmm. >> >>thanks for any feedback, >> >>-Chris >> > > I agree. I've tried using it, and usually find that it doesn't quite get > there. > > I rather like the R command(s) for reading text files - except then I have to > use R which is painful after using python and numpy. Although ggplot2 is > awfully nice too ... but that is a later post. > > read.table(file, header = FALSE, sep = "", quote = "\"'", > dec = ".", row.names, col.names, > as.is = !stringsAsFactors, > na.strings = "NA", colClasses = NA, nrows = -1, > skip = 0, check.names = TRUE, fill = !blank.lines.skip, > strip.white = FALSE, blank.lines.skip = TRUE, > comment.char = "#", > allowEscapes = FALSE, flush = FALSE, > stringsAsFactors = default.stringsAsFactors(), > fileEncoding = "", encoding = "unknown") > > read.csv(file, header = TRUE, sep = ",", quote="\"", dec=".", > fill = TRUE, comment.char="", ...) > > read.csv2(file, header = TRUE, sep = ";", quote="\"", dec=",", > fill = TRUE, comment.char="", ...) > > read.delim(file, header = TRUE, sep = "\t", quote="\"", dec=".", > fill = TRUE, comment.char="", ...) > > read.delim2(file, header = TRUE, sep = "\t", quote="\"", dec=",", > fill = TRUE, comment.char="", ...) > > > There is really only read.table, the others are just aliases with different > defaults. But the flexibility is great, as you can see. 
Aren't the newly improved numpy.genfromtxt(fname, dtype=, comments='#', delimiter=None, skiprows=0, converters=None, missing='', missing_values=None, usecols=None, names=None, excludelist=None, deletechars=None, case_sensitive=True, unpack=None, usemask=False, loose=True) and friends intended to handle all this? Josef > > -- > --- > | Alan K. Jackson | To see a World in a Grain of Sand | > | a...@ajackson.org | And a Heaven in a Wild Flower, | > | www.ajackson.org | Hold Infinity in the palm of your hand | > | Houston, Texas | And Eternity in an hour. - Blake | > ---
Re: [Numpy-discussion] fromfile() for reading text (one more time!)
>Hi folks, > >I'm taking a look once again at fromfile() for reading text files. I >often have the need to read a LOT of numbers from a text file, and it >can actually be pretty darn slow doing it the normal python way: > >for line in file: >data = map(float, line.strip().split()) > > >or various other versions that are similar. It really does take longer >to read the text, split it up, convert to a number, then put that number >into a numpy array, than it does to simply read it straight into the array. > >However, as it stands, fromfile() turns out to be next to useless for >anything but whitespace separated text. Full set of ideas here: > >http://projects.scipy.org/numpy/ticket/909 > >However, for the moment, I'm digging into the code to address a >particular problem -- reading files like this: > >123, 65.6, 789 >23, 3.2, 34 >... > >That is comma (or whatever) separated text -- pretty common stuff. > >The problem with the current code is that you can't read more than one >line at a time with fromfile: > >a = np.fromfile(infile, sep=",") > >will read until it doesn't find a comma, and thus only one line, as >there is no comma after each line. As this is a really typical case, I >think it should be supported. > >Here is the question: > >The work of finding the separator is done in: > >multiarray/ctors.c: fromfile_skip_separator() > >It looks like it wouldn't be too hard to add some code in there to look >for a newline, and consider that a valid separator. However, that would >break backward compatibility. So maybe a flag could be passed in, saying >you wanted to support newlines. The problem is that flag would have to >get passed all the way through to this function (and also for fromstring). > >I also notice that it supports separators of arbitrary length, which I >wonder how useful that is. But it also does odd things with spaces >embedded in the separator: > >", $ #" matches all of: ",$#" ", $#" ",$ #" > >Is it worth trying to fix that? 
> > >In the longer term, it would be really nice to support comments as well, >though that would require more of a re-factoring of the code, I think >(though maybe not -- I suppose a call to fromfile_skip_separator() could >look for a comment character, then if it found one, skip to where the >comment ends -- hmmm. > >thanks for any feedback, > >-Chris > I agree. I've tried using it, and usually find that it doesn't quite get there. I rather like the R command(s) for reading text files - except then I have to use R, which is painful after using python and numpy. Although ggplot2 is awfully nice too ... but that is a later post. read.table(file, header = FALSE, sep = "", quote = "\"'", dec = ".", row.names, col.names, as.is = !stringsAsFactors, na.strings = "NA", colClasses = NA, nrows = -1, skip = 0, check.names = TRUE, fill = !blank.lines.skip, strip.white = FALSE, blank.lines.skip = TRUE, comment.char = "#", allowEscapes = FALSE, flush = FALSE, stringsAsFactors = default.stringsAsFactors(), fileEncoding = "", encoding = "unknown") read.csv(file, header = TRUE, sep = ",", quote="\"", dec=".", fill = TRUE, comment.char="", ...) read.csv2(file, header = TRUE, sep = ";", quote="\"", dec=",", fill = TRUE, comment.char="", ...) read.delim(file, header = TRUE, sep = "\t", quote="\"", dec=".", fill = TRUE, comment.char="", ...) read.delim2(file, header = TRUE, sep = "\t", quote="\"", dec=",", fill = TRUE, comment.char="", ...) There is really only read.table, the others are just aliases with different defaults. But the flexibility is great, as you can see. -- --- | Alan K. Jackson | To see a World in a Grain of Sand | | a...@ajackson.org | And a Heaven in a Wild Flower, | | www.ajackson.org | Hold Infinity in the palm of your hand | | Houston, Texas | And Eternity in an hour. - Blake | ---