Elegant solution. Very readable and takes care of row0 nicely. I want to point out that this is much more efficient than my version for random/late string representation changes throughout the conversion, but it suffers from a 2*n memory footprint and large block copying if the string representation change arrives very early on huge datasets. I think we can't have the best of both, and Tim's solution is better in the general case.
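To make the trade-off concrete, here is a minimal single-column sketch of the convert-until-failure idea. It is only an illustration under my own assumptions: plain Python lists stand in for the NumPy chunks, and the names convert_in_chunks and assemble are made up here; they are not part of Tim's code.

```python
def convert_in_chunks(values, converters=(int, float)):
    """Convert strings with the narrowest converter, widening on failure.

    Returns a list of (converter, chunk) pairs, one pair per converter
    that actually produced data. `converters` is ordered narrow to wide.
    """
    chunks = []
    level = 0          # index of the converter currently in use
    current = []       # values converted since the last switch
    for text in values:
        while True:
            try:
                current.append(converters[level](text))
                break
            except ValueError:
                # Store what has been converted so far and retry the same
                # value with the next, wider converter.
                if current:
                    chunks.append((converters[level], current))
                current = []
                level += 1
                if level == len(converters):
                    raise  # no converter can handle this value
    if current:
        chunks.append((converters[level], current))
    return chunks


def assemble(chunks):
    """Copy every chunk into one new list, using the last (widest) converter."""
    widest = chunks[-1][0]
    result = []
    for _, chunk in chunks:
        result.extend(widest(x) for x in chunk)
    return result


chunks = convert_in_chunks(["1", "2", "3.5", "4"])
print(chunks)
print(assemble(chunks))
```

The assembly step builds the full result while the chunks are still alive, which is where the extra memory and the block copying come from.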
Maybe "use one_alt if rownumber < xxx else use other_alt" can fine-tune performance for some cases, but even then, with many columns, it's nearly impossible to know.

//Torgil

On 7/9/07, Timothy Hochberg <[EMAIL PROTECTED]> wrote:
>
> On 7/8/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
> > Thanks for looking into this Torgil! I agree that this is a much more
> > complicated setup. I'll check if there is anything I can do on the data end.
> > Otherwise I'll go with Timothy's suggestion and read in numbers as floats
> > and convert to int later as needed.
>
> Here is a strategy that should allow auto detection without too much in the
> way of inefficiency. The basic idea is to convert till you run into a
> problem, store that data away, and continue the conversion with a new dtype.
> At the end you assemble all the chunks of data you've accumulated into one
> large array. It should be reasonably efficient in terms of both memory and
> speed.
>
> The implementation is a little rough, but it should get the idea across.
>
> --
> .  __
> .   |-\
> .
> .  [EMAIL PROTECTED]
>
> ========================================================================
>
> def find_formats(items, last):
>     formats = []
>     for i, x in enumerate(items):
>         dt, cvt = string_to_dt_cvt(x)
>         if last is not None:
>             # entries are stored as (dtype, converter) pairs
>             last_dt, last_cvt = last[i]
>             # once a column is float, never narrow it back to int
>             if last_cvt is float and cvt is int:
>                 cvt = float
>         formats.append((dt, cvt))
>     return formats
>
> class LoadInfo(object):
>     def __init__(self, row0):
>         self.done = False
>         self.lastcols = None
>         self.row0 = row0
>
> def data_iterator(lines, converters, delim, info):
>     yield tuple(f(x) for f, x in zip(converters, info.row0.split(delim)))
>     try:
>         for row in lines:
>             yield tuple(f(x) for f, x in zip(converters, row.split(delim)))
>     except:
>         # conversion failed: remember the offending row so the caller
>         # can re-detect the formats and start a new chunk from it
>         info.row0 = row
>     else:
>         info.done = True
>
> def load2(fname, delim=',', has_varnm=True, prn_report=True):
>     """
>     Loading data from a file using the csv module. Returns a recarray.
>     """
>     f = open(fname, 'rb')
>
>     if has_varnm:
>         varnames = [i.strip() for i in f.next().split(delim)]
>     else:
>         varnames = None
>
>     info = LoadInfo(f.next())
>     chunks = []
>
>     while not info.done:
>         row0 = info.row0.split(delim)
>         formats = find_formats(row0, info.lastcols)
>         # remember the formats so the next chunk can compare against them
>         info.lastcols = formats
>         if varnames is None:
>             varnames = varnm = ['col%s' % str(i+1) for i, _ in enumerate(formats)]
>         descr = []
>         conversion_functions = []
>         for name, (dtype, cvt_fn) in zip(varnames, formats):
>             descr.append((name, dtype))
>             conversion_functions.append(cvt_fn)
>
>         chunks.append(N.fromiter(data_iterator(f, conversion_functions, delim, info), descr))
>
>     if len(chunks) > 1:
>         # copy every chunk into one array with the final dtype
>         n = sum(len(x) for x in chunks)
>         data = N.zeros([n], chunks[-1].dtype)
>         offset = 0
>         for x in chunks:
>             delta = len(x)
>             data[offset:offset+delta] = x
>             offset += delta
>     else:
>         [data] = chunks
>
>     # load report
>     if prn_report:
>         print "##########################################\n"
>         print "Loaded file: %s\n" % fname
>         print "Nr obs: %s\n" % data.shape[0]
>         print "Variables and datatypes:\n"
>         for i in data.dtype.descr:
>             print "Varname: %s, Type: %s, Sample: %s" % (i[0], i[1], str(data[i[0]][0:3]))
>         print "\n##########################################\n"
>
>     return data
>
> _______________________________________________
> Numpy-discussion mailing list
> Numpy-discussion@scipy.org
> http://projects.scipy.org/mailman/listinfo/numpy-discussion