On Tue, Mar 20, 2012 at 5:59 PM, Chris Barker <chris.bar...@noaa.gov> wrote:
> Warren et al:
>
> On Wed, Mar 7, 2012 at 7:49 AM, Warren Weckesser
> <warren.weckes...@enthought.com> wrote:
>
> > If you are set up with Cython to build extension modules,
>
> I am
>
> > and you don't mind testing an unreleased and experimental reader,
>
> and I don't.
>
> > you can try the text reader that I'm working on:
> > https://github.com/WarrenWeckesser/textreader
>
> It just took me a while to get around to it!
>
> First of all: this is pretty much exactly what I've been looking for
> for years, and never got around to writing myself - thanks!
>
> My comments/suggestions:
>
> 1) A docstring for the textreader module would be nice.
>
> 2) "tzoffset" -- this is tricky stuff. Ideally, it should be able to
> parse an ISO datetime string's timezone specifier, but short of that, I
> think the default should be None or UTC -- time zones are too ugly to
> presume anything!
>
> 3) It breaks with old MacOS-style line endings (\r only). Maybe there's
> no need to support that, but it turns out one of my old test files
> still had them!
>
> 4) When I try to read more rows than are in the file, I get:
>
>   File "textreader.pyx", line 247, in textreader.readrows
>   (python/textreader.c:3469)
>   ValueError: negative dimensions are not allowed
>
> It's good to get an error, but it's not very informative!
>
> 5) For reading float64 values, I get something different with
> textreader than with Python's float():
>
>   input: "678.901"
>   float("678.901"): 678.90099999999995
>   textreader:       678.90100000000007
>
> Both are as close as the number of digits available allows, but it's
> curious...
>
> 6) Performance issue: in my case, I'm reading a big file that's stored
> in chunks -- each chunk has a header indicating how many rows follow,
> then the rows themselves -- so I parse it out bit by bit.
>
> For smallish files, it's much faster than pure Python, and almost as
> fast as some old C code of mine that is far less flexible.
>
> But for large files it's much slower -- indeed, slower than a pure
> Python version for my use case.
>
> I did a simplified test -- with 10,000 rows:
>
>   total number of rows: 10000
>   pure python took: 1.410408 seconds
>   pure python chunks took: 1.613094 seconds
>   textreader all at once took: 0.067098 seconds
>   textreader in chunks took: 0.131802 seconds
>
> but with 1,000,000 rows:
>
>   total number of rows: 1000000
>   total number of chunks: 1000
>   pure python took: 30.712564 seconds
>   pure python chunks took: 31.313225 seconds
>   textreader all at once took: 1.314924 seconds
>   textreader in chunks took: 9.684819 seconds
>
> and it gets even worse with a smaller chunk size:
>
>   total number of rows: 1000000
>   total number of chunks: 10000
>   pure python took: 30.032246 seconds
>   pure python chunks took: 42.010589 seconds
>   textreader all at once took: 1.318613 seconds
>   textreader in chunks took: 87.743729 seconds
>
> My code, which is C that essentially runs fscanf over the file, takes
> no performance hit from doing it in chunks -- so I think something is
> wrong here.
>
> Sorry, I haven't dug into the code to figure out what's going on yet --
> does it rewind the file each time, maybe?
>
> Enclosed is my test code.
>
> -Chris


Chris,

Thanks!  The feedback is great.  I won't have time to get back to this
for another week or so, but then I'll look into the issues you reported.

Warren


> --
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959  voice
> 7600 Sand Point Way NE   (206) 526-6329  fax
> Seattle, WA  98115       (206) 526-6317  main reception
>
> chris.bar...@noaa.gov
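
On the tzoffset point (item 2 above): a small illustration of what parsing
an ISO-8601 style offset looks like with Python 3's standard-library
datetime. This is not textreader's API, just the stdlib behavior the
suggestion alludes to; the timestamp string is made up for the example.

from datetime import datetime, timezone

# %z understands "+HHMM"/"-HHMM" offsets (and "Z" or "+HH:MM" on newer Pythons).
dt = datetime.strptime("2012-03-20 17:59:00-0700", "%Y-%m-%d %H:%M:%S%z")
print(dt.utcoffset())               # -1 day, 17:00:00  (i.e. -07:00)
print(dt.astimezone(timezone.utc))  # 2012-03-21 00:59:00+00:00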
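
Chris's attached test script is not reproduced in the archive. As a rough
illustration only, here is a minimal pure-Python sketch of the kind of
chunked file and chunk-by-chunk reader he describes; the file layout (a
row-count line followed by that many comma-separated float rows) and every
name in the sketch are assumptions, not his actual code. Timing the same
loop with textreader's per-chunk reads substituted for the Python parser
would show whether per-call overhead (e.g. rewinding or re-scanning the
file, as Chris wonders above) accounts for the slowdown.

import time
import numpy as np


def write_sample(path, nchunks=100, rows_per_chunk=100, ncols=5):
    """Create a small sample file: each chunk is a row-count line
    followed by that many comma-separated rows of floats."""
    with open(path, "w") as f:
        for _ in range(nchunks):
            f.write("%d\n" % rows_per_chunk)
            for row in np.random.rand(rows_per_chunk, ncols):
                f.write(",".join("%.6f" % x for x in row) + "\n")


def read_chunks_python(path, ncols=5):
    """Read the file chunk by chunk with a pure-Python parser."""
    chunks = []
    with open(path) as f:
        while True:
            header = f.readline()
            if not header:          # end of file
                break
            nrows = int(header)
            data = np.empty((nrows, ncols), dtype=np.float64)
            for i in range(nrows):
                data[i] = [float(v) for v in f.readline().split(",")]
            chunks.append(data)
    return chunks


if __name__ == "__main__":
    path = "chunked_sample.txt"
    write_sample(path)
    t0 = time.time()
    chunks = read_chunks_python(path)
    print("chunks read: %d, rows: %d, %.3f seconds"
          % (len(chunks), sum(len(c) for c in chunks), time.time() - t0))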