On Thu, Feb 23, 2012 at 3:23 PM, Erin Sheldon <erin.shel...@gmail.com> wrote:
> Wes -
>
> I designed the recfile package to fill this need. It might be a start.
>
> Some features:
>
> - the ability to efficiently read any subset of the data without
>   loading the whole file.
> - reads directly into a recarray, so no overhead.
> - object oriented interface, mimicking recarray slicing.
> - also supports writing
>
> Currently it is fixed-width fields only. It is C++, but it wouldn't be
> hard to convert it to C if that is a requirement. Also, it works for
> binary or ascii.
>
> http://code.google.com/p/recfile/
>
> the trunk is pretty far past the most recent release.
>
> Erin Scott Sheldon
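
To make the "read any subset without loading the whole file" idea concrete, here is a minimal sketch using np.memmap on a fixed-width binary file. This is not recfile's API: the record layout RECORD_DTYPE and the helpers write_demo_file and read_subset are hypothetical names chosen only for illustration.

```python
import os
import tempfile

import numpy as np

# Hypothetical fixed-width record layout (an assumption, not recfile's format).
RECORD_DTYPE = np.dtype([("id", "<i4"), ("x", "<f8"), ("flag", "S1")])


def write_demo_file(path, n):
    """Write n fixed-width binary records so there is something to read."""
    rec = np.zeros(n, dtype=RECORD_DTYPE)
    rec["id"] = np.arange(n)
    rec["x"] = np.arange(n) * 0.5
    rec["flag"] = b"a"
    rec.tofile(path)


def read_subset(path, rows, column=None):
    """Return only the requested rows (and optionally one column).

    The file is memory-mapped, so only the pages that hold the
    requested records are read from disk.
    """
    mapped = np.memmap(path, dtype=RECORD_DTYPE, mode="r")
    subset = mapped[rows]
    if column is not None:
        subset = subset[column]
    # Copy the selection into an ordinary in-memory array.
    return np.array(subset)


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.rec")
    write_demo_file(demo_path, 1000)
    every_other = read_subset(demo_path, slice(10, 20, 2))
    print(every_other["id"])
    # A structured array can be viewed as a recarray for attribute access.
    print(every_other.view(np.recarray).x)
```

Because every record has the same byte width, row i starts at byte offset i * RECORD_DTYPE.itemsize, which is what makes row slicing cheap. An ascii fixed-width file would need a parsing step on top of the same offset arithmetic.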
Can you relicense as BSD-compatible?

> Excerpts from Wes McKinney's message of Thu Feb 23 14:32:13 -0500 2012:
>> dear all,
>>
>> I haven't read all 180 e-mails, but I didn't see this on Travis's
>> initial list.
>>
>> All of the existing flat file reading solutions I have seen are
>> not suitable for many applications, and they compare very unfavorably
>> to tools present in other languages, like R. Here are some of the
>> main issues I see:
>>
>> - Memory usage: creating millions of Python objects when reading
>>   a large file results in horrendously bad memory utilization,
>>   which the Python interpreter is loath to return to the
>>   operating system. Any solution using the CSV module (like
>>   pandas's parsers, which are a lot faster than anything else I
>>   know of in Python) suffers from this problem because the data
>>   come out boxed in tuples of PyObjects. Try loading a 1,000,000
>>   x 20 CSV file into a structured array using np.genfromtxt or
>>   into a DataFrame using pandas.read_csv and you will immediately
>>   see the problem. R, by contrast, uses very little memory.
>>
>> - Performance: post-processing of Python objects results in poor
>>   performance. Also, for the actual parsing, anything regular
>>   expression based (like the loadtable effort over the summer,
>>   all apologies to those who worked on it), is doomed to
>>   failure. I think having a tool with a high degree of
>>   compatibility and intelligence for parsing unruly small files
>>   does make sense though, but it's not appropriate for large,
>>   well-behaved files.
>>
>> - Need to "factorize": as soon as there is an enum dtype in
>>   NumPy, we will want to enable the file parsers for structured
>>   arrays and DataFrame to be able to "factorize" / convert to
>>   enum certain columns (for example, all string columns) during
>>   the parsing process, and not afterward. This is very important
>>   for enabling fast groupby on large datasets and reducing
>>   unnecessary memory usage up front (imagine a column with a
>>   million values, with only 10 unique values occurring). This
>>   would be trivial to implement using a C hash table
>>   implementation like khash.h.
>>
>> To be clear: I'm going to do this eventually whether or not it
>> happens in NumPy because it's an existing problem for heavy
>> pandas users. I see no reason why the code can't emit structured
>> arrays, too, so we might as well have a common library component
>> that I can use in pandas and specialize to the DataFrame internal
>> structure.
>>
>> It seems clear to me that this work needs to be done at the
>> lowest level possible, probably all in C (or C++?) or maybe
>> Cython plus C utilities.
>>
>> If anyone wants to get involved in this particular problem right
>> now, let me know!
>>
>> best,
>> Wes
> --
> Erin Scott Sheldon
> Brookhaven National Laboratory
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
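
The "factorize" step described in the thread can be sketched in a few lines of Python. This is only an illustration of the idea, with a Python dict standing in for the C hash table (khash.h) mentioned above; the function name factorize and its return shape are assumptions made for this sketch, not an existing NumPy or pandas parser API.

```python
import numpy as np


def factorize(values):
    """Map each value to a small integer code, in order of first appearance.

    Returns (codes, uniques) where uniques[codes[i]] == values[i].
    """
    table = {}    # value -> code; stands in for a C hash table such as khash
    uniques = []  # code -> value
    codes = np.empty(len(values), dtype=np.int32)
    for i, value in enumerate(values):
        code = table.get(value)
        if code is None:
            # First time this value is seen: assign the next free code.
            code = len(uniques)
            table[value] = code
            uniques.append(value)
        codes[i] = code
    return codes, uniques


# A string column with few distinct values collapses to one int32 per row.
column = ["a", "b", "a", "c", "b"]
amounts = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

codes, uniques = factorize(column)
print(codes.tolist())  # [0, 1, 0, 2, 1]
print(uniques)         # ['a', 'b', 'c']

# With integer codes, a grouped sum reduces to a single bincount call.
sums = np.bincount(codes, weights=amounts, minlength=len(uniques))
print(dict(zip(uniques, sums.tolist())))  # {'a': 4.0, 'b': 7.0, 'c': 4.0}
```

Doing this inside the parser, as each field is tokenized, means the repeated strings never need to be materialized as separate Python objects; the sketch above operates on an already-parsed list only to keep the example short.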