Re: [Numpy-discussion] Fast Reading of ASCII files
On Wed, Dec 14, 2011 at 4:11 PM, Bruce Southey bsout...@gmail.com wrote:

On 12/14/2011 01:03 AM, Chris Barker wrote:

On Tue, Dec 13, 2011 at 1:21 PM, Ralf Gommers ralf.gomm...@googlemail.com wrote:

genfromtxt sure looks close for an API

This I don't agree with. It has a huge number of keywords that just confuse or intimidate a beginning user. There should be a dead simple interface; even the loadtxt API is on the heavy side.

Well, yes, though it does do a lot -- do you have a simpler one in mind?

Just looking at what I normally wouldn't need for simple data files, and/or what a beginning user won't understand at once, the `unpack` and `ndmin` keywords could certainly be left out. `converters` is also questionable. That's probably as simple as it can get. Note that I don't think this should be changed now; that's not worth the trouble.

But anyway, the really simple cases are really simple, even with genfromtxt. I guess it's a matter of debate about what is a better API: a few functions, each adding a layer of sophistication, or one function, with layers of sophistication added through an array of keyword arguments.

There's always a trade-off, but looking at the docstring for genfromtxt should make it an easy call in this case.

In either case, though, I wish the multiple functionality were built on the same, well-optimized core code.

I wish that too, but I'm fairly certain that you can't write that core code with the ability to handle missing and irregular data and make it close to the same speed as an optimized reader for regular data.

I am not sure that you can even create a simple API here, as even Python's csv module is rather complex, especially since it just reads data as strings. It also 'hides' many arguments in the Dialect class, although these are just the collection of 7 'fmtparam' arguments. It also provides the Sniffer class, which tries to find the correct format that can then be passed to the reader function.
Then you still have to convert the data into the required types -- another set of arguments, as well as yet another pass through the data. In comparison, genfromtxt can perform sniffing.

I assume you mean the ``dtype=None`` example in the docstring? That works to some extent, but you still need to specify the delimiter. I commented on that on the loadtable PR.

And both genfromtxt and loadtxt can read and convert the data. These also add some useful features like skipping rows (start, end and commented) and columns. However, it could be possible to create a sniffer function and a single data reader function, leading to a 'simple' reader function, but that probably would not change the API of the underlying data reader function.

Better auto-detection of things like delimiters would indeed be quite useful.

Ralf

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
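To make the csv-module complexity Bruce describes concrete, here is a minimal sketch of the Sniffer/Dialect workflow (the sample data is invented for illustration):

```python
import csv
import io

# A small comma-delimited sample, as if read from the start of a file.
sample = "x,y,z\n1,2,3\n4,5,6\n"

# Sniffer inspects a chunk of text and guesses the Dialect -- the
# collection of 'fmtparam' settings (delimiter, quoting, etc.).
dialect = csv.Sniffer().sniff(sample)
print(dialect.delimiter)   # ','

# The guessed dialect can be handed straight to the reader.  Note the
# reader still yields plain strings -- type conversion is a separate
# pass, which is exactly the extra cost discussed above.
rows = list(csv.reader(io.StringIO(sample), dialect))
print(rows[1])             # ['1', '2', '3']
```

Converting those strings to a typed numpy array would then be yet another pass over the data, as Bruce points out.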
Re: [Numpy-discussion] Fast Reading of ASCII files
On Wed, Dec 14, 2011 at 1:22 PM, Ralf Gommers ralf.gomm...@googlemail.com wrote:

On Wed, Dec 14, 2011 at 4:11 PM, Bruce Southey bsout...@gmail.com wrote:

On 12/14/2011 01:03 AM, Chris Barker wrote:

On Tue, Dec 13, 2011 at 1:21 PM, Ralf Gommers ralf.gomm...@googlemail.com wrote:

genfromtxt sure looks close for an API

This I don't agree with. It has a huge number of keywords that just confuse or intimidate a beginning user. There should be a dead simple interface; even the loadtxt API is on the heavy side.

Well, yes, though it does do a lot -- do you have a simpler one in mind?

Just looking at what I normally wouldn't need for simple data files, and/or what a beginning user won't understand at once, the `unpack` and `ndmin` keywords could certainly be left out. `converters` is also questionable. That's probably as simple as it can get.

Just my two cents (and I was one of those who championed its inclusion): the ndmin feature is designed to prevent unexpected results that users (particularly beginners) may encounter with their datasets. Now, maybe it might be difficult to tell a beginner *why* they might need to be aware of it, but it is very easy to describe *how* to use it. How many dimensions is your data? Two? Ok, just set ndmin=2 and you are good to go!

Cheers!
Ben Root
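The surprise that ndmin guards against is easy to demonstrate; a minimal sketch using an in-memory file (the sample data is invented):

```python
import io

import numpy as np

# A file with a single data row: without ndmin, loadtxt collapses the
# result to a 1-D array, which surprises code that expects a table.
one_row = io.StringIO("1.0 2.0 3.0\n")
a = np.loadtxt(one_row)
print(a.shape)   # (3,)

# With ndmin=2 the result is always at least 2-D, so downstream code
# can index rows and columns uniformly regardless of the row count.
one_row = io.StringIO("1.0 2.0 3.0\n")
b = np.loadtxt(one_row, ndmin=2)
print(b.shape)   # (1, 3)
```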
Re: [Numpy-discussion] Fast Reading of ASCII files
On Wed, Dec 14, 2011 at 11:36 AM, Benjamin Root ben.r...@ou.edu wrote:

Well, yes, though it does do a lot -- do you have a simpler one in mind?

Just looking at what I normally wouldn't need for simple data files, and/or what a beginning user won't understand at once, the `unpack` and `ndmin` keywords could certainly be left out. `converters` is also questionable. That's probably as simple as it can get.

This may be a function of a well-written docstring -- if it is clear to the newbie that all the rest of this you don't need unless you have a weird data file, then extra keyword arguments don't really hurt. A few examples of the basic use-cases go a long way.

And yes, the core reader for the complex cases isn't going to be fast (it's going to be complex C code...), but we could still have a core reader that handled most cases.

Anyway, I think it's time to write code, and see if it can be rolled in somehow...

-Chris

--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/ORR
(206) 526-6959 voice
7600 Sand Point Way NE
(206) 526-6329 fax
Seattle, WA 98115
(206) 526-6317 main reception
chris.bar...@noaa.gov
Re: [Numpy-discussion] Fast Reading of ASCII files
On Wed, Dec 14, 2011 at 9:54 PM, Chris Barker chris.bar...@noaa.gov wrote:

This may be a function of a well-written docstring -- if it is clear to the newbie that all the rest of this you don't need unless you have a weird data file, then extra keyword arguments don't really hurt. A few examples of the basic use-cases go a long way. And yes, the core reader for the complex cases isn't going to be fast (it's going to be complex C code...), but we could still have a core reader that handled most cases.

Okay, now we're on the same page, I think.

Anyway, I think it's time to write code, and see if it can be rolled in somehow...

Agreed.

Ralf
Re: [Numpy-discussion] Fast Reading of ASCII files
NOTE: Let's keep this on the list.

On Tue, Dec 13, 2011 at 9:19 AM, denis denis-bz...@t-online.de wrote:

Chris, unified, consistent save / load is a nice goal

1) header lines with date, pwd etc.: where'd this come from ?
# (5, 5) svm.py bz/py/ml/svm 2011-12-13 Dec 11:56 -- automatic
# 80.6 % correct -- user info
24539 4 526 ...

I'm not sure I understand what you are expecting here: what would be automatic? If it parses a datetime in the header, what would it do with it? But anyway, this seems to me:
- very application specific -- this is for the user's code to write
- not what we are talking about at this point anyway -- I think this discussion is about a lower-level, does-the-simple-things-fast reader -- that may or may not be able to form the basis of a higher-level, fuller-featured reader.

2) read any CSVs: comma or blank-delimited, with/without column names, a la loadcsv() below

Yup -- though the column name reading would be part of a higher-level reader as far as I'm concerned.

3) sparse or masked arrays ?

Sparse, probably not; that seems pretty domain dependent to me -- though hopefully one could build such a thing on top of the lower-level reader. Masked support would be good -- once we're convinced what the future of masked arrays is in numpy. I was thinking that the masked array issue would really be a higher-level feature -- it certainly could be if you need to mask special-value-style files, but we may have to build it into the lower-level reader for cases where the mask is specified by non-numerical values -- i.e. there are some met files that use MM or some other text, so you can't put it into a numerical array first.

Longterm wishes: beyond the scope of one file - one array, but essential for larger projects:
1) dicts / dotdicts: Dotdict( A=anysizearray, N=scalar ... ) - a directory of little files is easy, better than np.savez (Haven't used hdf5; I believe Matlab v7 does.)
2) workflows: has anyone there used VisTrails ?
Outside of the spec of this thread...

Anyway it seems to me (old grey cynic) that Numpy/scipy developers prefer to code first, spec and doc later. Too pessimistic ?

Well, I think many of us believe in a more agile-style approach -- incremental development. But really, as an open source project, it's really about scratching an itch -- so there is usually a spec in mind for the itch at hand. In this case, however, that has been a weakness -- clearly a number of us have written small solutions to our particular problem at hand, but no, we haven't arrived at a more general-purpose solution yet. So a bit of spec-ing ahead of time may be called for.

On that: I've been thinking from the bottom-up -- imagining what I need for the simple case, and how it might apply to more complex cases -- but maybe we should think about this another way.

What we're talking about here is really about core software engineering -- optimization. It's easy to write a pure-python simple file parser, and reasonable to write a complex one (genfromtxt) -- the issue is performance -- we need some more C (or Cython) code to really speed it up, but none of us wants to write the complex-case code in C. So: genfromtxt is really nice for many of the complex cases. So perhaps another approach is to look at genfromtxt, and see what high-performance lower-level functionality we could develop that could make it fast -- then we are done.

This actually mirrors exactly what we all usually recommend for python development in general -- write it in Python, then, if it's really not fast enough, write the bottleneck in C.

So where are the bottlenecks in genfromtxt? Are there self-contained portions that could be re-written in C/Cython?

-Chris
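Chris's closing question ("where are the bottlenecks?") can be answered empirically. A minimal sketch of profiling genfromtxt with the standard library's cProfile (the data here is invented, and large enough for the profile to be meaningful):

```python
import cProfile
import io
import pstats

import numpy as np

# An in-memory CSV with 10000 identical rows of floats.
data = io.StringIO("\n".join("1.5,2.5,3.5" for _ in range(10000)))

profiler = cProfile.Profile()
profiler.enable()
arr = np.genfromtxt(data, delimiter=",")
profiler.disable()

print(arr.shape)   # (10000, 3)

# Sorting by cumulative time shows where genfromtxt spends its time --
# typically in per-line Python-level splitting and string conversion.
stats = pstats.Stats(profiler).sort_stats("cumulative")
# stats.print_stats(10)  # uncomment to dump the ten most expensive calls
```

This kind of measurement is exactly what would tell us which self-contained portions are worth rewriting in C/Cython.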
Re: [Numpy-discussion] Fast Reading of ASCII files
On 12/13/2011 12:08 PM, Chris Barker wrote:

[snip -- Chris's message, quoted in full above]

Reading data is hard, and writing code that suits the diversity in the Numerical Python community is even harder!
Both the loadtxt and genfromtxt functions (other functions are perhaps less important) perhaps need an upgrade to incorporate the new NA object. I think that adding the NA object will simplify some of the process, because invalid data (missing, or a string in a numerical field) can be set to NA without requiring the creation of a new masked array or returning an error.

Here I think loadtxt is a better target than genfromtxt because, as I understand it, it assumes the user really knows the data, whereas genfromtxt can ask the data for the appropriate format. So I agree that a new 'superfast custom CSV reader for well-behaved data' function would be rather useful, especially as a replacement for loadtxt. By that I mean reading data using a user-specified format that essentially follows the CSV format (http://en.wikipedia.org/wiki/Comma-separated_values) - its needs are to allow for the NA object, skipping lines, and user-defined delimiters.
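For reference, genfromtxt can already map a text sentinel (like the "MM" met-file marker mentioned earlier in the thread) onto a fill value in a single pass; a minimal sketch with invented sample data:

```python
import io

import numpy as np

# A well-behaved CSV except that missing values are marked "MM".
txt = io.StringIO("1.0,MM,3.0\n4.0,5.0,MM\n")

# missing_values names the sentinel string; filling_values is what gets
# substituted -- no separate masked-array pass is needed for this case.
arr = np.genfromtxt(txt, delimiter=",",
                    missing_values="MM", filling_values=np.nan)
print(arr)   # row 0: 1.0, nan, 3.0; row 1: 4.0, 5.0, nan
```

Passing `usemask=True` instead returns a masked array with those cells masked, which is the behavior an NA-aware reader would presumably subsume.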
Re: [Numpy-discussion] Fast Reading of ASCII files
On Tue, Dec 13, 2011 at 11:29 AM, Bruce Southey bsout...@gmail.com wrote:

Reading data is hard, and writing code that suits the diversity in the Numerical Python community is even harder!

Yup.

Both the loadtxt and genfromtxt functions (other functions are perhaps less important) perhaps need an upgrade to incorporate the new NA object.

Yes, if we are satisfied that the new NA object is, in fact, the way of the future.

Here I think loadtxt is a better target than genfromtxt because, as I understand it, it assumes the user really knows the data, whereas genfromtxt can ask the data for the appropriate format. So I agree that a new 'superfast custom CSV reader for well-behaved data' function would be rather useful, especially as a replacement for loadtxt. By that I mean reading data using a user-specified format that essentially follows the CSV format (http://en.wikipedia.org/wiki/Comma-separated_values) - its needs are to allow for the NA object, skipping lines, and user-defined delimiters.

I think that, ideally, there could be one interface for reading tabular data -- hopefully, it would be easy for the user to specify what they want, and if they don't, the code tries to figure it out. Also, under the hood, the easy cases are special-cased to high-performing versions.

genfromtxt sure looks close for an API -- it just needs the high-performance special cases under the hood. It may be that the way it's designed makes it very difficult to do that, though -- I haven't looked closely enough to tell. At least that's what I'm thinking at the moment.

-Chris
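The ``dtype=None`` "sniffing" referred to above looks like this in practice (a minimal sketch with invented data; note the delimiter still has to be given explicitly):

```python
import io

import numpy as np

txt = io.StringIO("name,x,y\nalpha,1,2.5\nbeta,3,4.5\n")

# dtype=None asks genfromtxt to infer a type per column; names=True
# takes field names from the header row.  That is where the built-in
# auto-detection stops -- the delimiter is not guessed.
arr = np.genfromtxt(txt, delimiter=",", dtype=None, names=True,
                    encoding="utf-8")
print(arr["name"])   # ['alpha' 'beta']
print(arr["y"])      # 2.5 and 4.5
```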
Re: [Numpy-discussion] Fast Reading of ASCII files
On Tue, Dec 13, 2011 at 10:07 PM, Chris Barker chris.bar...@noaa.gov wrote:

On Tue, Dec 13, 2011 at 11:29 AM, Bruce Southey bsout...@gmail.com wrote:

Reading data is hard, and writing code that suits the diversity in the Numerical Python community is even harder!

Yup.

Both the loadtxt and genfromtxt functions (other functions are perhaps less important) perhaps need an upgrade to incorporate the new NA object.

Yes, if we are satisfied that the new NA object is, in fact, the way of the future.

Here I think loadtxt is a better target than genfromtxt because, as I understand it, it assumes the user really knows the data, whereas genfromtxt can ask the data for the appropriate format. So I agree that a new 'superfast custom CSV reader for well-behaved data' function would be rather useful, especially as a replacement for loadtxt. By that I mean reading data using a user-specified format that essentially follows the CSV format (http://en.wikipedia.org/wiki/Comma-separated_values) - its needs are to allow for the NA object, skipping lines, and user-defined delimiters.

I think that, ideally, there could be one interface for reading tabular data -- hopefully, it would be easy for the user to specify what they want, and if they don't, the code tries to figure it out. Also, under the hood, the easy cases are special-cased to high-performing versions.

genfromtxt sure looks close for an API

This I don't agree with. It has a huge number of keywords that just confuse or intimidate a beginning user. There should be a dead simple interface; even the loadtxt API is on the heavy side.

Ralf

-- it just needs the high-performance special cases under the hood. It may be that the way it's designed makes it very difficult to do that, though -- I haven't looked closely enough to tell. At least that's what I'm thinking at the moment.

-Chris
Re: [Numpy-discussion] Fast Reading of ASCII files
On Mon, Dec 12, 2011 at 12:34 PM, Warren Weckesser warren.weckes...@enthought.com wrote:

On Mon, Dec 12, 2011 at 10:22 AM, Chris.Barker chris.bar...@noaa.gov wrote:

[snip -- Chris's message, quoted in full below]

I'm also working on a faster text file reader, so count me in. I've been experimenting in both C and Cython. I'll put it on github as soon as I can.

Warren

Cool, Warren, I look forward to seeing it. I'm hopeful we can craft a performant tool that will meet the needs of many projects (NumPy, pandas, etc.)...
Re: [Numpy-discussion] Fast Reading of ASCII files
On 12/11/11 8:40 AM, Ralf Gommers wrote:

On Wed, Dec 7, 2011 at 7:50 PM, Chris.Barker chris.bar...@noaa.gov

* If we have a good, fast ascii (or unicode?) to array reader, hopefully it could be leveraged for use in the more complex cases. So that rather than genfromtxt() being written from scratch, it would be a wrapper around the lower-level reader.

You seem to be contradicting yourself here. The more complex cases are Wes' 10% and why genfromtxt is so hairy internally. There's always a trade-off between speed and handling complex corner cases. You want both.

I don't think the version in my mind is contradictory (not quite). What I'm imagining is that a good, fast ascii-to-numpy-array reader could read a whole table in at once (the common, easy, fast case), but it could also be used to read snippets of a file in at a time, which could be leveraged to handle many of the more complex cases.

I suppose there will always be cases where the user needs to write their own converter from string to dtype, and there is simply no way to leverage what I'm imagining to support that. Hmm, maybe there is -- for instance, if a record consisted of mostly standard, easy-to-parse numbers, but one field was some weird text that needed custom parsing, we could read it as a dtype with a string for that one weird field, and that could be converted in a post-processing step. Maybe that wouldn't be any faster or easier, but it could be done...

Anyway, whether you can leverage it for the full-featured version or not, I do think there is call for a good, fast, 90%-case text file parser. Would anyone like to join/form a small working group to work on this? Wes, I'd like to see your Cython version -- maybe a starting point?

-Chris
Re: [Numpy-discussion] Fast Reading of ASCII files
On Mon, Dec 12, 2011 at 10:22 AM, Chris.Barker chris.bar...@noaa.gov wrote:

[snip -- Chris's message, quoted in full above]

I'm also working on a faster text file reader, so count me in. I've been experimenting in both C and Cython. I'll put it on github as soon as I can.

Warren
Re: [Numpy-discussion] Fast Reading of ASCII files
On Wed, Dec 7, 2011 at 7:50 PM, Chris.Barker chris.bar...@noaa.gov wrote:

Hi folks, this is a continuation of a conversation already started, but I gave it a new, more appropriate, thread and subject.

On 12/6/11 2:13 PM, Wes McKinney wrote:

we should start talking about building a *high performance* flat file loading solution with good column type inference and sensible defaults, etc. ... I personally don't believe in sacrificing an order of magnitude of performance in the 90% case for the 10% case -- so maybe it makes sense to have two functions around: a superfast custom CSV reader for well-behaved data, and a slower, but highly flexible, function like loadtable to fall back on.

I've wanted this for ages, and have done some work towards it, but like others, only had the time for a my-use-case-specific solution. A few thoughts:

* If we have a good, fast ascii (or unicode?) to array reader, hopefully it could be leveraged for use in the more complex cases. So that rather than genfromtxt() being written from scratch, it would be a wrapper around the lower-level reader.

You seem to be contradicting yourself here. The more complex cases are Wes' 10% and why genfromtxt is so hairy internally. There's always a trade-off between speed and handling complex corner cases. You want both. A very fast reader for well-behaved files would be very welcome, but I see it as a separate topic from genfromtxt/loadtable. The question for the loadtable pull request is whether it is different enough from genfromtxt that we need/want both, or whether loadtable should replace genfromtxt.

Cheers,
Ralf

* key to performance is to have the text-to-number-to-numpy-type conversion happening in C -- if you read the text with python, then convert to numbers, then to numpy arrays, it's simply going to be slow.

* I think we want a solution that can be adapted to arbitrary text files -- not just tabular, CSV-style data. I have a lot of those to read - and some thoughts about how.
Efforts I have made so far, and what I've learned from them:

1) fromfile():

fromfile (for text) is nice and fast, but buggy, and a bit too limited. I've posted various notes about this in the past (and, I'm pretty sure, a couple of tickets). The key missing features are:

a) no support for commented lines (this is a lesser need, I think)

b) there can be only one delimiter, and newlines are treated as generic whitespace. What this means is that if you have a whitespace-delimited file, you can read multiple lines, but if it is, for instance, comma-delimited, then you can only read one line at a time, killing performance.

c) there are various bugs if the text is malformed, or doesn't quite match what you're asking for (i.e. reading integers, but the text is float) -- mostly really limited error checking.

I spent some time digging into the code, and found it to be really hard-to-track C code, and very hard to update. The core idea is pretty nice -- each dtype should know how to read itself from a text file -- but the implementation is painful. The key issue is that for floats and ints, anyway, it relies on the C atoi and atof functions. However, there have been patches to these that handle NaN better, etc., for numpy, and I think a python patch as well. So the code calls the numpy atoi, which does some checks, then calls the python atoi, which then calls the C lib atoi (I think all that...). In any case, the core bugs are due to the fact that atoi and friends don't return an error code, so you have to check if the pointer has been incremented to see if the read was successful -- this error checking is not propagated through all those levels of calls. It got really ugly to try to fix! Also, the use of the C atoi() means that locales may only be handled in the default way -- i.e. no way to read European-style floats on a system with a US locale.

My conclusion -- the current code is too much of a mess to try to deal with and fix!
I also think it's a mistake to have text file reading be a special case of fromfile(); it really should be a separate issue, though that's a minor API question.

2) FileScanner:

FileScanner is some code I wrote years ago as a C extension - it's limited, but does the job and is pretty fast. It essentially calls fscanf() as many times as it gets a successful scan, skipping all invalid text, then returning a numpy array. You can also specify how many numbers you want read from the file. It only supports floats. Travis O. asked if it could be included in Scipy way back when, but I suspect none of my code actually made it in. If I had to do it again, I might write something similar in Cython, though I am still using it.

My conclusions: I think what we need is something similar to MATLAB's fscanf(): what it does is take a C-style format string, and apply it to your file over and over again as many times as it can, and returns an array. What's nice about this is
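A rough pure-Python approximation of the fscanf()-style behavior described above -- repeatedly matching numbers and skipping everything else -- might look like this. The function name and regex are hypothetical, and a real implementation would live in C or Cython for speed:

```python
import io
import re

import numpy as np

# Matches ints, decimals, and exponent notation (e.g. 4e2).
_FLOAT_RE = re.compile(r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?")

def fscanf_like(stream, n=-1):
    """Read every float in the stream, skipping any non-numeric text --
    roughly what FileScanner does, but with a regex instead of C fscanf.
    If n >= 0, return at most n values (like FileScanner's count arg)."""
    values = [float(tok) for tok in _FLOAT_RE.findall(stream.read())]
    if n >= 0:
        values = values[:n]
    return np.array(values)

# Delimiters and stray text are simply skipped over, as with fscanf.
data = io.StringIO("1.0, 2.0; junk 3.5\n4e2")
print(fscanf_like(data))   # 1.0, 2.0, 3.5, 400.0
```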
[Numpy-discussion] Fast Reading of ASCII files
Hi folks,

This is a continuation of a conversation already started, but I gave it a new, more appropriate, thread and subject.

On 12/6/11 2:13 PM, Wes McKinney wrote:

> we should start talking about building a *high performance* flat file
> loading solution with good column type inference and sensible defaults,
> etc. ... I personally don't believe in sacrificing an order of magnitude
> of performance in the 90% case for the 10% case -- so maybe it makes sense
> to have two functions around: a superfast custom CSV reader for
> well-behaved data, and a slower, but highly flexible, function like
> loadtable to fall back on.

I've wanted this for ages, and have done some work towards it, but like others, only had the time for a my-use-case-specific solution. A few thoughts:

* If we have a good, fast ascii (or unicode?) to array reader, hopefully it could be leveraged for use in the more complex cases. So rather than genfromtxt() being written from scratch, it would be a wrapper around the lower-level reader.

* The key to performance is having the text-to-number-to-numpy-type conversion happen in C -- if you read the text with Python, then convert to numbers, then to numpy arrays, it's simply going to be slow.

* I think we want a solution that can be adapted to arbitrary text files -- not just tabular, CSV-style data. I have a lot of those to read -- and some thoughts about how.

Efforts I have made so far, and what I've learned from them:

1) fromfile():

fromfile (for text) is nice and fast, but buggy, and a bit too limited. I've posted various notes about this in the past (and, I'm pretty sure, a couple of tickets). The key missing features are:

a) no support for commented lines (this is a lesser need, I think)

b) there can be only one delimiter, and newlines are treated as generic whitespace. What this means is that if you have a whitespace-delimited file, you can read multiple lines, but if it is, for instance, comma-delimited, then you can only read one line at a time, killing performance.
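To make point (b) concrete, here is a minimal pure-Python sketch (with made-up file contents) of the per-line fallback a comma-delimited file forces on you -- every line takes a round trip through Python string handling and a separate array conversion, which is exactly the overhead a single C-level pass over the whole file would avoid:

```python
import io
import numpy as np

# Comma-delimited text: one split + convert per line, then stack the rows.
f = io.StringIO("1.0,2.0,3.0\n4.0,5.0,6.0\n")
rows = [np.array(line.strip().split(","), dtype=np.float64) for line in f]
arr = np.vstack(rows)
print(arr.shape)  # (2, 3)
```

A whitespace-delimited file, by contrast, can be handed to the reader in one call, since newlines count as just another delimiter.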
c) there are various bugs if the text is malformed, or doesn't quite match what you're asking for (i.e. reading integers, but the text is float) -- mostly really limited error checking.

I spent some time digging into the code, and found it to be really hard-to-track C code, and very hard to update. The core idea is pretty nice -- each dtype should know how to read itself from a text file -- but the implementation is painful. The key issue is that, for floats and ints anyway, it relies on the C atoi and atof functions. However, there have been patches to these that handle NaN better, etc., for numpy, and I think a python patch as well. So the code calls the numpy atoi, which does some checks, then calls the python atoi, which then calls the C lib atoi (I think all that...).

In any case, the core bugs are due to the fact that atoi and friends don't return an error code, so you have to check whether the pointer has been incremented to see if the read was successful -- and this error checking is not propagated through all those levels of calls. It got really ugly to try to fix! Also, the use of the C atoi() means that locales can only be handled in the default way -- i.e. there is no way to read European-style floats on a system with a US locale.

My conclusion -- the current code is too much of a mess to try to deal with and fix! I also think it's a mistake to have text file reading be a special case of fromfile(); it really should be a separate issue, though that's a minor API question.

2) FileScanner:

FileScanner is some code I wrote years ago as a C extension -- it's limited, but does the job and is pretty fast. It essentially calls fscanf() as many times as it gets a successful scan, skipping all invalid text, then returning a numpy array. You can also specify how many numbers you want read from the file. It only supports floats. Travis O. asked if it could be included in Scipy way back when, but I suspect none of my code actually made it in.
If I had to do it again, I might write something similar in Cython, though I am still using it.

My Conclusions:

I think what we need is something similar to MATLAB's fscanf(): what it does is take a C-style format string, and apply it to your file over and over again, as many times as it can, and return an array. What's nice about this is that it can be re-purposed to efficiently read a wide variety of text files fast.

For numpy, I imagine something like:

fromtextfile(f, dtype=np.float64, comment=None, shape=None):
    read data from a text file, returning a numpy array

    f: a filename or file-like object

    comment: a string of the comment signifier. Anything on a line
             after this string will be ignored.

    dtype: the numpy dtype you want read from the file

    shape: the shape of the resulting array. If shape==None, the file
           will be read until EOF or until there is a read error.

By default,
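A pure-Python sketch of how that proposed fromtextfile() might behave (the name and signature are the hypothetical API from the paragraph above, and the real thing would do all the parsing in C):

```python
import io
import numpy as np

def fromtextfile(f, dtype=np.float64, comment=None, shape=None):
    """Reference sketch of the proposed reader (hypothetical API)."""
    if isinstance(f, str):
        f = open(f)
    values = []
    for line in f:
        if comment is not None:
            line = line.split(comment, 1)[0]   # ignore text after the marker
        values.extend(line.replace(",", " ").split())
    arr = np.array(values, dtype=dtype)
    if shape is not None:
        arr = arr.reshape(shape)
    return arr

data = io.StringIO("1, 2, 3  # a trailing comment\n4, 5, 6\n")
print(fromtextfile(data, comment="#", shape=(2, 3)))
```

Even this slow Python version shows the appeal of the interface: one call, a handful of keywords, and comments, delimiters, and shape are all handled.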