Wes - I designed the recfile package to fill this need. It might be a start.
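Rough usage, to give the flavor (a sketch only; the names below are from memory and may differ from the actual API, so check the project page further down):

    import numpy as np
    import recfile  # names here are illustrative, not a verified API

    # describe the fixed-width records with a NumPy dtype
    dtype = np.dtype([('id', 'i8'), ('x', 'f8'), ('y', 'f8')])

    # opening loads nothing; the file can be binary or ascii
    robj = recfile.Open('data.csv', dtype=dtype, delim=',')

    # slicing mimics a recarray but reads only the requested rows
    chunk = robj[10000:20000]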
Some features:

- the ability to efficiently read any subset of the data without loading the whole file
- reads directly into a recarray, so there is no boxing overhead
- an object-oriented interface that mimics recarray slicing
- support for writing as well

Currently it is fixed-width fields only. It is C++, but it wouldn't be hard to convert it to C if that is a requirement. Also, it works for both binary and ascii files.

http://code.google.com/p/recfile/

The trunk is pretty far past the most recent release.

Erin Scott Sheldon
Brookhaven National Laboratory

Excerpts from Wes McKinney's message of Thu Feb 23 14:32:13 -0500 2012:
> dear all,
>
> I haven't read all 180 e-mails, but I didn't see this on Travis's
> initial list.
>
> All of the existing flat file reading solutions I have seen are
> not suitable for many applications, and they compare very unfavorably
> to tools present in other languages, like R. Here are some of the
> main issues I see:
>
> - Memory usage: creating millions of Python objects when reading
>   a large file results in horrendously bad memory utilization,
>   which the Python interpreter is loath to return to the
>   operating system. Any solution using the CSV module (like
>   pandas's parsers, which are a lot faster than anything else I
>   know of in Python) suffers from this problem because the data
>   come out boxed in tuples of PyObjects. Try loading a 1,000,000
>   x 20 CSV file into a structured array using np.genfromtxt or
>   into a DataFrame using pandas.read_csv and you will immediately
>   see the problem. R, by contrast, uses very little memory.
>
> - Performance: post-processing of Python objects results in poor
>   performance. Also, for the actual parsing, anything regular
>   expression based (like the loadtable effort over the summer,
>   all apologies to those who worked on it) is doomed to
>   failure. I think a tool with a high degree of compatibility
>   and intelligence for parsing unruly small files does make
>   sense, but it's not appropriate for large, well-behaved files.
>
> - Need to "factorize": as soon as there is an enum dtype in
>   NumPy, we will want the file parsers for structured arrays
>   and DataFrame to be able to "factorize" / convert certain
>   columns (for example, all string columns) to enum during the
>   parsing process, and not afterward. This is very important
>   for enabling fast groupby on large datasets and for reducing
>   unnecessary memory usage up front (imagine a column with a
>   million values but only 10 unique values). This would be
>   trivial to implement using a C hash table implementation
>   like khash.h.
>
> To be clear: I'm going to do this eventually whether or not it
> happens in NumPy, because it's an existing problem for heavy
> pandas users. I see no reason why the code can't emit structured
> arrays too, so we might as well have a common library component
> that I can use in pandas and specialize to the DataFrame internal
> structure.
>
> It seems clear to me that this work needs to be done at the
> lowest level possible, probably all in C (or C++?) or maybe
> Cython plus C utilities.
>
> If anyone wants to get involved in this particular problem right
> now, let me know!
>
> best,
> Wes
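PS - the memory problem Wes describes is easy to see first-hand with a quick throwaway script like this (a rough sketch, scaled down to 200,000 x 20; resource.getrusage is Unix-only and ru_maxrss is reported in KB on Linux):

    import numpy as np
    import resource

    def peak_rss_mb():
        # peak resident set size so far; KB on Linux
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

    # write a 200,000 x 20 CSV of floats
    np.savetxt('demo.csv', np.random.randn(200000, 20), delimiter=',')

    before = peak_rss_mb()
    data = np.genfromtxt('demo.csv', delimiter=',')  # boxes every field as a PyObject first
    after = peak_rss_mb()

    # the final array is only ~32 MB, but peak memory grows far more
    print('array: %.0f MB  peak growth: %.0f MB' % (data.nbytes / 1e6, after - before))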
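And to make the "factorize" idea concrete: in C a hash table like khash.h would do the work, but the logic is just this (a minimal pure-Python sketch; the function name is only for illustration):

    import numpy as np

    def factorize(values):
        # one pass: map each value to a small integer code
        table = {}                 # value -> code (khash.h plays this role in C)
        uniques = []
        codes = np.empty(len(values), dtype=np.int32)
        for i, v in enumerate(values):
            code = table.get(v)
            if code is None:
                code = len(uniques)
                table[v] = code
                uniques.append(v)
            codes[i] = code
        return codes, uniques

    # a long string column with only 3 distinct values:
    # codes is 4 bytes per row instead of a boxed PyObject per row,
    # and groupby can run directly on the integer codes
    codes, uniques = factorize(['red', 'green', 'blue'] * 333334)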