Warren et al:

On Wed, Mar 7, 2012 at 7:49 AM, Warren Weckesser <warren.weckes...@enthought.com> wrote:
> If you are setup with Cython to build extension modules,

I am

> and you don't mind testing an unreleased and experimental reader,

and I don't.

> you can try the text reader that I'm working on:
> https://github.com/WarrenWeckesser/textreader

It just took me a while to get around to it! First of all: this is pretty much exactly what I've been looking for for years, and never got around to writing myself -- thanks!

My comments/suggestions:

1) A docstring for the textreader module would be nice.

2) "tzoffset" -- this is tricky stuff. Ideally, it should be able to parse the timezone specifier of an ISO datetime string, but short of that, I think the default should be None or UTC -- time zones are too ugly to presume anything!

3) It breaks with old MacOS-style line endings (\r only). Maybe there's no need to support that, but it turns out one of my old test files still had them!

4) When I try to read more rows than are in the file, I get:

    File "textreader.pyx", line 247, in textreader.readrows (python/textreader.c:3469)
    ValueError: negative dimensions are not allowed

Good to get an error, but it's not very informative!

5) For reading float64 values, I get something different from textreader than from Python's float():

    input: "678.901"
    float("678.901"): 678.90099999999995
    textreader      : 678.90100000000007

As close as the number of figures available allows, but curious...

6) Performance issue: in my case, I'm reading a big file that comes in chunks -- each one has a header indicating how many rows follow, then the rows -- so I parse it out bit by bit. For smallish files, textreader is much faster than pure Python, and almost as fast as some old C code of mine that is far less flexible. But for large files, it's much slower -- indeed slower than a pure-Python version for my use case.

I did a simplified test -- with 10,000 rows:

    total number of rows:  10000
    pure python took:            1.410408 seconds
    pure python chunks took:     1.613094 seconds
    textreader all at once took: 0.067098 seconds
    textreader in chunks took:   0.131802 seconds

but with 1,000,000 rows:

    total number of rows:   1000000
    total number of chunks: 1000
    pure python took:            30.712564 seconds
    pure python chunks took:     31.313225 seconds
    textreader all at once took:  1.314924 seconds
    textreader in chunks took:    9.684819 seconds

and it gets even worse as the chunk size gets smaller:

    total number of rows:   1000000
    total number of chunks: 10000
    pure python took:            30.032246 seconds
    pure python chunks took:     42.010589 seconds
    textreader all at once took:  1.318613 seconds
    textreader in chunks took:   87.743729 seconds

My own code, which is C that essentially runs fscanf over the file, takes essentially no performance hit from reading in chunks, so I think something is wrong here. Sorry, I haven't dug into the code to try to figure out what yet -- does it rewind the file each time, maybe?

Enclosed is my test code.

-Chris

-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov
test_performance.py
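For readers without the attachment, here is a minimal sketch of the chunked-read benchmark pattern described in point 6. The file format (a line giving the row count, followed by that many comma-separated rows of floats) and all names here are assumptions for illustration -- this is not the actual test_performance.py.

    # Hypothetical sketch of the chunked-read benchmark pattern described above.
    # The chunk format (a row-count header line, then that many comma-separated
    # rows) and the function names are assumptions, not the attached script.
    import time
    import numpy as np


    def write_test_file(fname, nrows, nchunks, ncols=5):
        """Write `nchunks` chunks, each preceded by a line giving its row count."""
        rows_per_chunk = nrows // nchunks
        with open(fname, "w") as f:
            for _ in range(nchunks):
                f.write("%d\n" % rows_per_chunk)
                for i in range(rows_per_chunk):
                    f.write(",".join("%f" % (i + j * 0.1) for j in range(ncols)))
                    f.write("\n")


    def read_chunk_python(f, nrows):
        """Pure-Python parse of one chunk: nrows lines of comma-separated floats."""
        return np.array([[float(x) for x in f.readline().split(",")]
                         for _ in range(nrows)])


    def read_file_in_chunks(fname):
        """Read the whole file chunk by chunk, as the chunked data files require."""
        chunks = []
        with open(fname) as f:
            line = f.readline()
            while line:
                nrows = int(line)
                chunks.append(read_chunk_python(f, nrows))
                line = f.readline()
        return np.concatenate(chunks)


    if __name__ == "__main__":
        write_test_file("junk.dat", nrows=10000, nchunks=10)
        start = time.time()
        data = read_file_in_chunks("junk.dat")
        print("pure python chunks took: %f seconds" % (time.time() - start))
        print("total number of rows:", data.shape[0])

The real benchmark would presumably time the same loop with textreader in place of the pure-Python parser (reading each chunk's rows from the already-open file); that call is omitted here because the exact readrows signature isn't shown in this thread.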