Re: [Numpy-discussion] Buildbot for numpy

2007-07-08 Thread Albert Strasheim
Hello

On Mon, 02 Jul 2007, Barry Wark wrote:

> I have the potential to add OS X Server Intel (64-bit) and OS X Intel
> (32-bit) to the list, if I can convince my boss that the security risk

Sounds good. We could definitely use these platforms.

> (including DOS from compile times) is minimal. I've compiled both

Currently we don't allow builds to be forced from the web page, but this 
might change in future.

> numpy and scipy many times, so I'm not worried about resources for a
> single compile/test, but can any of the regular developers tell me
> about how many commits there are per day that will trigger a
> compile/test?

We currently only build NumPy. SciPy should probably be added at some 
point, once we figure out how we want to configure the Buildbot to do 
this. NumPy averages close to 0 commits per day at this point. SciPy is 
more active. Between the two, on a busy day, you could expect more than 
10 and less than 100 builds.
 
> About the more general security risk of running a buildbot slave, from
> my reading of the buildbot manual (not the source, yet), it looks like
> the slave is a Twisted server that runs as a normal user process. Is
> there any sort of sandboxing built into the buildbot slave or is that
> the responsibility of the OS (an issue I'll have to discuss with our
> IT)?

Through the buildbot master configuration, we tell your buildslave what 
to check out and which commands to execute. We have set it up to do the 
build in terms of a Makefile, so the master will tell the slave to run 
"make build" followed by "make test". In that Makefile you can make your 
own machine do anything, which hopefully amounts to little more than 
running python setup.py, etc. However, the configuration on the master 
can be changed to make your slave execute any command.

In short, any NumPy/SciPy committer or anyone who controls the build 
master configuration (i.e., me, Stefan, our admin person, a few other 
people who have root access on that machine and anybody who 
successfully breaks into it) can make your build machine execute 
arbitrary code as the build slave user.

The chance of this happening is small, but it's not impossible, so if 
this risk is unacceptable to you/your IT people, running a build slave 
might not be for you. ;-)

Cheers,

Albert
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] convert csv file into recarray without pre-specifying dtypes and variable names

2007-07-08 Thread Timothy Hochberg

On 7/8/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:


Thanks for looking into this Torgil! I agree that this is a much more
complicated setup. I'll check if there is anything I can do on the data
end.
Otherwise I'll go with Timothy's suggestion and read in numbers as floats
and convert to int later as needed.



Here is a strategy that should allow auto detection without too much in the
way of inefficiency. The basic idea is to convert till you run into a
problem, store that data away, and continue the conversion with a new dtype.
At the end you assemble all the chunks of data you've accumulated into one
large array. It should be reasonably efficient in terms of both memory and
speed.

The implementation is a little rough, but it should get the idea across.




def find_formats(items, last):
    # detect a (dtype, converter) pair per column from one sample row;
    # string_to_dt_cvt is the detection helper from Torgil's load_iter.py script
    formats = []
    for i, x in enumerate(items):
        dt, cvt = string_to_dt_cvt(x)
        if last is not None:
            last_dt, last_cvt = last[i]
            # never demote a column already seen as float back to int
            if last_cvt is float and cvt is int:
                dt, cvt = last_dt, float
        formats.append((dt, cvt))
    return formats

class LoadInfo(object):
    def __init__(self, row0):
        self.done = False
        self.lastcols = None
        self.row0 = row0

def data_iterator(lines, converters, delim, info):
    yield tuple(f(x) for f, x in zip(converters, info.row0.split(delim)))
    try:
        for row in lines:
            yield tuple(f(x) for f, x in zip(converters, row.split(delim)))
    except:
        # a conversion failed: remember the offending row so the caller can
        # redetect the formats and resume from here
        info.row0 = row
    else:
        info.done = True

def load2(fname, delim=',', has_varnm=True, prn_report=True):
    """
    Load a delimited text file, converting it in chunks with fromiter.
    Returns a structured array.
    """
    f = open(fname, 'rb')

    if has_varnm:
        varnames = [i.strip() for i in f.next().split(delim)]
    else:
        varnames = None

    info = LoadInfo(f.next())
    chunks = []

    while not info.done:
        row0 = info.row0.split(delim)
        formats = find_formats(row0, info.lastcols)
        # remember the formats so a later chunk keeps already-promoted columns as float
        info.lastcols = formats
        if varnames is None:
            varnames = ['col%s' % str(i+1) for i, _ in enumerate(formats)]
        descr = []
        conversion_functions = []
        for name, (dtype, cvt_fn) in zip(varnames, formats):
            descr.append((name, dtype))
            conversion_functions.append(cvt_fn)

        chunks.append(N.fromiter(data_iterator(f, conversion_functions,
                                               delim, info), descr))

    if len(chunks) > 1:
        # a dtype changed part way through: copy all chunks into one array
        # with the final dtype
        n = sum(len(x) for x in chunks)
        data = N.zeros([n], chunks[-1].dtype)
        offset = 0
        for x in chunks:
            delta = len(x)
            data[offset:offset+delta] = x
            offset += delta
    else:
        [data] = chunks

    # load report
    if prn_report:
        print "##\n"
        print "Loaded file: %s\n" % fname
        print "Nr obs: %s\n" % data.shape[0]
        print "Variables and datatypes:\n"
        for i in data.dtype.descr:
            print "Varname: %s, Type: %s, Sample: %s" % (i[0], i[1],
                                                         str(data[i[0]][0:3]))
        print "\n##\n"

    return data
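
A minimal usage sketch (illustrative only: it assumes numpy is imported as N,
that the string_to_dt_cvt helper from Torgil's load_iter.py script is in scope,
and the file and column names are made up):

import csv
rows = [['a', 'b'],
        ['1', '1.5'],
        ['2', '2.5']]
f = open('sample.csv', 'wb')
csv.writer(f).writerows(rows)
f.close()

data = load2('sample.csv')
print data['a'].dtype, data['b'].dtype    # 'a' detected as int, 'b' as float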
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] convert csv file into recarray without pre-specifying dtypes and variable names

2007-07-08 Thread Torgil Svensson
FWIW, one way to retype a single column in place (here col1: convert its
values, swap its entry in the dtype descriptor for an int of the same
itemsize, then write the converted data back):

>>> n,dt=descr[0]
>>> new_dt=dt.replace('f','i')
>>> descr[0]=(n,new_dt)
>>> data=ra.col1.astype(new_dt)
>>> ra.dtype=N.dtype(descr)
>>> ra.col1=data

//Torgil

On 7/9/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
>
>  Tim,
>
>  I do want to auto-detect. Reading numbers in as floats is probably not a
> huge penalty.
>
>  Is there an easy way to change the type of one column in a recarray that
> you know?
>
>  I tried this:
>
>  ra.col1 = ra.col1.astype('i')
>
>  but that didn't seem to work. I assume that means you would have to create
> a new array from the old one with an updated dtype list.
>
>  Thanks,
>
>  Vincent
>
>
>  On 7/8/07 4:51 PM, "Timothy Hochberg" <[EMAIL PROTECTED]> wrote:
>
>
>
>
>  On 7/8/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
>
> Torgil,
>
>  The function seems to work well and is slightly faster than your previous
>  version (about 1/6th faster).
>
>  Yes, I do have columns that start with, what looks like, int's and then
> turn
>  out to be floats. Something like below (col6).
>
>  data = [['col1', 'col2', 'col3', 'col4', 'col5', 'col6'],
>  ['1','3','1/97','1.12','2.11','0'],
>  ['1','2','3/97',' 1.21','3.12','0'],
>  ['2','1','2/97','1.12','2.11','0'],
>  ['2','2','4/97','1.33','2.26',' 1.23'],
>  ['2','2','5/97','1.73','2.42','1.26']]
>
>  I think what your function assumes is that the 1st element will be the
>  appropriate type. That may not hold if you have missing values or 'mixed
>  types'.
>
>
>
>  Vincent,
>
>  Do you need to auto detect the column types? Things get a lot simpler if
> you have some known schema for each file; then you can simply pass that to
> some reader function. It's also more robust since there's no way in general
> to differentiate a column of integers from a column of floats with no
> decimal part.
>
>  If you do need to auto detect, one approach would be to always read both
> int-like stuff and float-like stuff in as floats. Then after you get the
> array check over the various columns and if any have no fractional parts,
> make a new array where those columns are integers.
>
>   -tim
>
>
> Best,
>
>  Vincent
>
>
>  On 7/8/07 3:31 PM, "Torgil Svensson" < [EMAIL PROTECTED]> wrote:
>
>  > Hi
>  >
>  > I stumble on these types of problems from time to time so I'm
>  > interested in efficient solutions myself.
>  >
>  > Do you have a column which starts with something suitable for int on
>  > the first row (without decimal separator) but has decimals further
>  > down?
>  >
>  > This will be little tricky to support. One solution could be to yield
>  > StopIteration, calculate new type-conversion-functions and start over
>  > iterating over both the old data and the rest of the iterator.
>  >
>  > It'd be great if you could try the load_gen_iter.py I've attached to
>  > my response to Tim.
>  >
>  > Best Regards,
>  >
>  > //Torgil
>  >
>  > On 7/8/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
>  >> I am not (yet) very familiar with much of the functionality introduced
> in
>  >> your script Torgil (izip, imap, etc.), but I really appreciate you
> taking
>  >> the time to look at this!
>  >>
>  >> The program stopped with the following error:
>  >>
>  >>   File "load_iter.py", line 48, in 
>  >> convert_row=lambda r: tuple(fn(x) for fn,x in
>  >> izip(conversion_functions,r))
>  >> ValueError: invalid literal for int() with base 10: '2174.875'
>  >>
>  >> A lot of the data I use can have a column with a set of int's (e.g.,
> 0's),
>  >> but then the rest of that same column could be floats. I guess finding
> the
>  >> right conversion function is the tricky part. I was thinking about
> sampling
>  >> each, say, 10th obs to test which function to use. Not sure how that
> would
>  >> work however.
>  >>
>  >> If I ignore the option of an int ( i.e., everything is a float, date, or
>  >> string) then your script is about twice as fast as mine!!
>  >>
>  >> Question: If you do ignore the int's initially, once the rec array is in
>  >> memory, would there be a quick way to check if the floats could pass as
>  >> int's? This may seem like a backwards approach but it might be 'safer'
> if
>  >> you really want to preserve the int's.
>  >>
>  >> Thanks again!
>  >>
>  >> Vincent
>  >>
>  >>
>  >> On 7/8/07 5:52 AM, "Torgil Svensson" <[EMAIL PROTECTED]> wrote:
>  >>
>  >>> Given that both your script and the mlab version preloads the whole
>  >>> file before calling numpy constructor I'm curious how that compares in
>  >>> speed to using numpy's fromiter function on your data. Using fromiter
>  >>> should improve on memory usage (~50% ?).
>  >>>
>  >>> The drawback is for string columns where we don't longer know the
>  >>> width of the largest item. I made it fall-back to "object" in this
>  >>> case.
>  >>>
>  >>> Attached is a fromiter version of your script. Possible speedups could
>  >>> be done by trying different approaches to the "convert_row" function,
>  >>> for example using

Re: [Numpy-discussion] convert csv file into recarray without pre-specifying dtypes and variable names

2007-07-08 Thread Vincent Nijs
Thanks for looking into this Torgil! I agree that this is a much more
complicated setup. I'll check if there is anything I can do on the data end.
Otherwise I'll go with Timothy's suggestion and read in numbers as floats
and convert to int later as needed.

Vincent


On 7/8/07 5:40 PM, "Torgil Svensson" <[EMAIL PROTECTED]> wrote:

>> Question: If you do ignore the int's initially, once the rec array is in
>> memory, would there be a quick way to check if the floats could pass as
>> int's? This may seem like a backwards approach but it might be 'safer' if
>> you really want to preserve the int's.
> 
> In your case the floats don't pass as ints since you have decimals.
> The attached file takes another approach (sorry for lack of comments).
> If the conversion fails, the current row is stored and the iterator
> exits (without setting a 'finished' parameter to true). The program
> then re-calculates the conversion-functions and checks for changes. If
> the changes are supported (=we have a conversion function for old data
> in the format_changes dictionary) it calls fromiter again with an
> iterator like this:
> 
> def get_data_iterator(row_iter,delim,res):
>     for x0,x1,x2,x3,x4,x5 in res['data']:
>         x0=float(x0)
>         print (x0,x1,x2,x3,x4,x5)
>         yield (x0,x1,x2,x3,x4,x5)
>     yield (float('2.0'),int('2'),datestr2num('4/97'),float('1.33'),float('2.26'),float('1.23'))
>     for row in row_iter:
>         x0,x1,x2,x3,x4,x5=row.split(delim)
>         try:
>             yield (float(x0),int(x1),datestr2num(x2),float(x3),float(x4),float(x5))
>         except:
>             res['row']=row
>             return
>     res['finished']=True
> 
> res['data'] is the previously converted data. This has the obvious
> disadvantage that if only the last row has fractions in a column,
> it'll cost double memory. Also if many columns change format at
> different places it has to re-convert every time.
> 
> I don't recommend this because of the drawbacks and extra complexity.
> I think it is best to convert your files (or file generation) so that
> float columns are represented with 0.0 instead of 0.
> 
> Best Regards,
> 
> //Torgil
> 
> On 7/8/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
>> I am not (yet) very familiar with much of the functionality introduced in
>> your script Torgil (izip, imap, etc.), but I really appreciate you taking
>> the time to look at this!
>> 
>> The program stopped with the following error:
>> 
>>   File "load_iter.py", line 48, in 
>> convert_row=lambda r: tuple(fn(x) for fn,x in
>> izip(conversion_functions,r))
>> ValueError: invalid literal for int() with base 10: '2174.875'
>> 
>> A lot of the data I use can have a column with a set of int's (e.g., 0's),
>> but then the rest of that same column could be floats. I guess finding the
>> right conversion function is the tricky part. I was thinking about sampling
>> each, say, 10th obs to test which function to use. Not sure how that would
>> work however.
>> 
>> If I ignore the option of an int (i.e., everything is a float, date, or
>> string) then your script is about twice as fast as mine!!
>> 
>> Question: If you do ignore the int's initially, once the rec array is in
>> memory, would there be a quick way to check if the floats could pass as
>> int's? This may seem like a backwards approach but it might be 'safer' if
>> you really want to preserve the int's.
>> 
>> Thanks again!
>> 
>> Vincent
>> 
>> 
>> On 7/8/07 5:52 AM, "Torgil Svensson" <[EMAIL PROTECTED]> wrote:
>> 
>>> Given that both your script and the mlab version preloads the whole
>>> file before calling numpy constructor I'm curious how that compares in
>>> speed to using numpy's fromiter function on your data. Using fromiter
>>> should improve on memory usage (~50% ?).
>>> 
>>> The drawback is for string columns where we don't longer know the
>>> width of the largest item. I made it fall-back to "object" in this
>>> case.
>>> 
>>> Attached is a fromiter version of your script. Possible speedups could
>>> be done by trying different approaches to the "convert_row" function,
>>> for example using "zip" or "enumerate" instead of "izip".
>>> 
>>> Best Regards,
>>> 
>>> //Torgil
>>> 
>>> 
>>> On 7/8/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
 Thanks for the reference John! csv2rec is about 30% faster than my code on
 the same data.
 
 If I read the code in csv2rec correctly it converts the data as it is being
 read using the csv modules. My setup reads in the whole dataset into an
 array of strings and then converts the columns as appropriate.
 
 Best,
 
 Vincent
 
 
 On 7/6/07 8:53 PM, "John Hunter" <[EMAIL PROTECTED]> wrote:
 
> On 7/6/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
>> I wrote the attached (small) program to read in a text/csv file with
>> different data types and convert it into a recarray without having to
>> pre-specify the dtypes or variables names. I am just too la

Re: [Numpy-discussion] convert csv file into recarray without pre-specifying dtypes and variable names

2007-07-08 Thread Vincent Nijs
Tim,

I do want to auto-detect. Reading numbers in as floats is probably not a
huge penalty. 

Is there an easy way to change the type of one column in a recarray that you
know?

I tried this:

ra.col1 = ra.col1.astype('i')

but that didn't seem to work. I assume that means you would have to create a
new array from the old one with an updated dtype list.

Thanks,

Vincent


On 7/8/07 4:51 PM, "Timothy Hochberg" <[EMAIL PROTECTED]> wrote:

> 
> 
> On 7/8/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
>> Torgil,
>> 
>> The function seems to work well and is slightly faster than your previous
>> version (about 1/6th faster).
>> 
>> Yes, I do have columns that start with, what looks like, int's and then
>> turn
>> out to be floats. Something like below (col6).
>> 
>> data = [['col1', 'col2', 'col3', 'col4', 'col5', 'col6'],
>> ['1','3','1/97','1.12','2.11','0'],
>> ['1','2','3/97',' 1.21','3.12','0'],
>> ['2','1','2/97','1.12','2.11','0'],
>> ['2','2','4/97','1.33','2.26',' 1.23'],
>> ['2','2','5/97','1.73','2.42','1.26']]
>> 
>> I think what your function assumes is that the 1st element will be the
>> appropriate type. That may not hold if you have missing values or 'mixed
>> types'.
> 
> 
> Vincent,
> 
> Do you need to auto detect the column types? Things get a lot simpler if you
> have some known schema for each file; then you can simply pass that to some
> reader function. It's also more robust since there's no way in general to
> differentiate a column of integers from a column of floats with no decimal
> part. 
> 
> If you do need to auto detect, one approach would be to always read both
> int-like stuff and float-like stuff in as floats. Then after you get the array
> check over the various columns and if any have no fractional parts, make a new
> array where those columns are integers.
> 
>  -tim
> 
>> Best,
>> 
>> Vincent
>> 
>> 
>> On 7/8/07 3:31 PM, "Torgil Svensson" < [EMAIL PROTECTED]> wrote:
>> 
>>> > Hi
>>> >
>>> > I stumble on these types of problems from time to time so I'm
>>> > interested in efficient solutions myself.
>>> >
>>> > Do you have a column which starts with something suitable for int on
>>> > the first row (without decimal separator) but has decimals further
>>> > down?
>>> >
>>> > This will be little tricky to support. One solution could be to yield
>>> > StopIteration, calculate new type-conversion-functions and start over
>>> > iterating over both the old data and the rest of the iterator.
>>> >
>>> > It'd be great if you could try the load_gen_iter.py I've attached to
>>> > my response to Tim.
>>> >
>>> > Best Regards,
>>> >
>>> > //Torgil
>>> >
>>> > On 7/8/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
 >> I am not (yet) very familiar with much of the functionality introduced
in
 >> your script Torgil (izip, imap, etc.), but I really appreciate you
 taking
 >> the time to look at this!
 >> 
 >> The program stopped with the following error:
 >>
 >>   File "load_iter.py", line 48, in 
 >> convert_row=lambda r: tuple(fn(x) for fn,x in
 >> izip(conversion_functions,r))
 >> ValueError: invalid literal for int() with base 10: '2174.875'
 >>
 >> A lot of the data I use can have a column with a set of int's (e.g.,
 0's),
 >> but then the rest of that same column could be floats. I guess finding
 the 
 >> right conversion function is the tricky part. I was thinking about
 sampling
 >> each, say, 10th obs to test which function to use. Not sure how that
 would
 >> work however.
 >>
 >> If I ignore the option of an int ( i.e., everything is a float, date, or
 >> string) then your script is about twice as fast as mine!!
 >>
 >> Question: If you do ignore the int's initially, once the rec array is in
 >> memory, would there be a quick way to check if the floats could pass as
 >> int's? This may seem like a backwards approach but it might be 'safer'
if
 >> you really want to preserve the int's.
 >>
 >> Thanks again!
 >>
 >> Vincent 
 >>
 >>
 >> On 7/8/07 5:52 AM, "Torgil Svensson" <[EMAIL PROTECTED]> wrote:
 >>
> >>> Given that both your script and the mlab version preloads the whole
> >>> file before calling numpy constructor I'm curious how that compares in
> >>> speed to using numpy's fromiter function on your data. Using fromiter
> >>> should improve on memory usage (~50% ?).
> >>>
> >>> The drawback is for string columns where we don't longer know the
> >>> width of the largest item. I made it fall-back to "object" in this
> >>> case.
> >>>
> >>> Attached is a fromiter version of your script. Possible speedups could
> >>> be done by trying different approaches to the "convert_row" function,
> >>> for example using "zip" or "enumerate" instead of "izip".
> >>>
> >>> Best Regards,
> >>>
> >>> //Torgil
> >>>

Re: [Numpy-discussion] convert csv file into recarray without pre-specifying dtypes and variable names

2007-07-08 Thread Torgil Svensson

Question: If you do ignore the int's initially, once the rec array is in
memory, would there be a quick way to check if the floats could pass as
int's? This may seem like a backwards approach but it might be 'safer' if
you really want to preserve the int's.


In your case the floats don't pass as ints since you have decimals.
The attached file takes another approach (sorry for lack of comments).
If the conversion fails, the current row is stored and the iterator
exits (without setting a 'finished' parameter to true). The program
then re-calculates the conversion-functions and checks for changes. If
the changes are supported (=we have a conversion function for old data
in the format_changes dictionary) it calls fromiter again with an
iterator like this:

def get_data_iterator(row_iter,delim,res):
    for x0,x1,x2,x3,x4,x5 in res['data']:
        x0=float(x0)
        print (x0,x1,x2,x3,x4,x5)
        yield (x0,x1,x2,x3,x4,x5)
    yield (float('2.0'),int('2'),datestr2num('4/97'),float('1.33'),float('2.26'),float('1.23'))
    for row in row_iter:
        x0,x1,x2,x3,x4,x5=row.split(delim)
        try:
            yield (float(x0),int(x1),datestr2num(x2),float(x3),float(x4),float(x5))
        except:
            res['row']=row
            return
    res['finished']=True

res['data'] is the previously converted data. This has the obvious
disadvantage that if only the last row has fractions in a column,
it'll cost double memory. Also if many columns change format at
different places it has to re-convert every time.

I don't recommend this because of the drawbacks and extra complexity.
I think it is best to convert your files (or file generation) so that
float columns are represented with 0.0 instead of 0.
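
If rewriting the files is feasible, a rough sketch of such a one-off clean-up
(purely illustrative; the helper name, column indices and file names are
assumptions) could be:

def floatify(src, dst, float_cols=(5,), delim=','):
    # rewrite a csv so the chosen columns always carry a decimal point,
    # e.g. '0' becomes '0.0'; float_cols=(5,) would mean col6
    fin = open(src, 'rb')
    fout = open(dst, 'wb')
    fout.write(fin.next())                # copy the header line unchanged
    for line in fin:
        fields = line.rstrip('\r\n').split(delim)
        for i in float_cols:
            if fields[i] and '.' not in fields[i]:
                fields[i] += '.0'
        fout.write(delim.join(fields) + '\n')
    fin.close()
    fout.close()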

Best Regards,

//Torgil

On 7/8/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:

I am not (yet) very familiar with much of the functionality introduced in
your script Torgil (izip, imap, etc.), but I really appreciate you taking
the time to look at this!

The program stopped with the following error:

  File "load_iter.py", line 48, in 
convert_row=lambda r: tuple(fn(x) for fn,x in
izip(conversion_functions,r))
ValueError: invalid literal for int() with base 10: '2174.875'

A lot of the data I use can have a column with a set of int's (e.g., 0's),
but then the rest of that same column could be floats. I guess finding the
right conversion function is the tricky part. I was thinking about sampling
each, say, 10th obs to test which function to use. Not sure how that would
work however.

If I ignore the option of an int (i.e., everything is a float, date, or
string) then your script is about twice as fast as mine!!

Question: If you do ignore the int's initially, once the rec array is in
memory, would there be a quick way to check if the floats could pass as
int's? This may seem like a backwards approach but it might be 'safer' if
you really want to preserve the int's.

Thanks again!

Vincent


On 7/8/07 5:52 AM, "Torgil Svensson" <[EMAIL PROTECTED]> wrote:

> Given that both your script and the mlab version preloads the whole
> file before calling numpy constructor I'm curious how that compares in
> speed to using numpy's fromiter function on your data. Using fromiter
> should improve on memory usage (~50% ?).
>
> The drawback is for string columns where we don't longer know the
> width of the largest item. I made it fall-back to "object" in this
> case.
>
> Attached is a fromiter version of your script. Possible speedups could
> be done by trying different approaches to the "convert_row" function,
> for example using "zip" or "enumerate" instead of "izip".
>
> Best Regards,
>
> //Torgil
>
>
> On 7/8/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
>> Thanks for the reference John! csv2rec is about 30% faster than my code on
>> the same data.
>>
>> If I read the code in csv2rec correctly it converts the data as it is being
>> read using the csv modules. My setup reads in the whole dataset into an
>> array of strings and then converts the columns as appropriate.
>>
>> Best,
>>
>> Vincent
>>
>>
>> On 7/6/07 8:53 PM, "John Hunter" <[EMAIL PROTECTED]> wrote:
>>
>>> On 7/6/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
 I wrote the attached (small) program to read in a text/csv file with
 different data types and convert it into a recarray without having to
 pre-specify the dtypes or variable names. I am just too lazy to type-in
 stuff like that :) The supported types are int, float, dates, and strings.

 It works pretty well but it is not (yet) as fast as I would like so I was
 wondering if any of the numpy experts on this list might have some suggestions
 on how to speed it up. I need to read 500MB-1GB files so speed is important
 for me.
>>>
>>> In matplotlib.mlab svn, there is a function csv2rec that does the
>>> same.  You may want to compare implementations in case we can
>>> fruitfully cross pollinate them.  In the examples directory, there is an
>>> example script examples/loadrec.py
>>> _

Re: [Numpy-discussion] convert csv file into recarray without pre-specifying dtypes and variable names

2007-07-08 Thread Timothy Hochberg

On 7/8/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:


Torgil,

The function seems to work well and is slightly faster than your previous
version (about 1/6th faster).

Yes, I do have columns that start with, what looks like, int's and then
turn
out to be floats. Something like below (col6).

data = [['col1', 'col2', 'col3', 'col4', 'col5', 'col6'],
['1','3','1/97','1.12','2.11','0'],
['1','2','3/97','1.21','3.12','0'],
['2','1','2/97','1.12','2.11','0'],
['2','2','4/97','1.33','2.26','1.23'],
['2','2','5/97','1.73','2.42','1.26']]

I think what your function assumes is that the 1st element will be the
appropriate type. That may not hold if you have missing values or 'mixed
types'.




Vincent,

Do you need to auto detect the column types? Things get a lot simpler if you
have some known schema for each file; then you can simply pass that to some
reader function. It's also more robust since there's no way in general to
differentiate a column of integers from a column of floats with no decimal
part.

If you do need to auto detect, one approach would be to always read both
int-like stuff and float-like stuff in as floats. Then after you get the
array check over the various columns and if any have no fractional parts,
make a new array where those columns are integers.
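
A rough sketch of that post-pass (illustrative only: it assumes a recarray
ra whose numeric columns were all read in as floats, numpy imported as N,
and a made-up helper name):

def intify_columns(ra):
    descr = []
    for name, typestr in ra.dtype.descr:
        col = ra[name]
        # demote only float columns whose values all have zero fractional part
        if col.dtype.kind == 'f' and N.all(col == N.floor(col)):
            descr.append((name, N.dtype(int).str))
        else:
            descr.append((name, typestr))
    out = N.zeros(ra.shape, dtype=descr)
    for name, _ in descr:
        out[name] = ra[name]              # float -> int cast happens here
    return out.view(N.recarray)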

-tim

Best,


Vincent


On 7/8/07 3:31 PM, "Torgil Svensson" <[EMAIL PROTECTED]> wrote:

> Hi
>
> I stumble on these types of problems from time to time so I'm
> interested in efficient solutions myself.
>
> Do you have a column which starts with something suitable for int on
> the first row (without decimal separator) but has decimals further
> down?
>
> This will be little tricky to support. One solution could be to yield
> StopIteration, calculate new type-conversion-functions and start over
> iterating over both the old data and the rest of the iterator.
>
> It'd be great if you could try the load_gen_iter.py I've attached to
> my response to Tim.
>
> Best Regards,
>
> //Torgil
>
> On 7/8/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
>> I am not (yet) very familiar with much of the functionality introduced
in
>> your script Torgil (izip, imap, etc.), but I really appreciate you
taking
>> the time to look at this!
>>
>> The program stopped with the following error:
>>
>>   File "load_iter.py", line 48, in 
>> convert_row=lambda r: tuple(fn(x) for fn,x in
>> izip(conversion_functions,r))
>> ValueError: invalid literal for int() with base 10: '2174.875'
>>
 >> A lot of the data I use can have a column with a set of int's (e.g.,
 0's),
>> but then the rest of that same column could be floats. I guess finding
the
>> right conversion function is the tricky part. I was thinking about
sampling
>> each, say, 10th obs to test which function to use. Not sure how that
would
>> work however.
>>
>> If I ignore the option of an int (i.e., everything is a float, date, or
>> string) then your script is about twice as fast as mine!!
>>
>> Question: If you do ignore the int's initially, once the rec array is
in
>> memory, would there be a quick way to check if the floats could pass as
>> int's? This may seem like a backwards approach but it might be 'safer'
if
>> you really want to preserve the int's.
>>
>> Thanks again!
>>
>> Vincent
>>
>>
>> On 7/8/07 5:52 AM, "Torgil Svensson" <[EMAIL PROTECTED]> wrote:
>>
>>> Given that both your script and the mlab version preloads the whole
>>> file before calling numpy constructor I'm curious how that compares in
>>> speed to using numpy's fromiter function on your data. Using fromiter
>>> should improve on memory usage (~50% ?).
>>>
>>> The drawback is for string columns where we don't longer know the
>>> width of the largest item. I made it fall-back to "object" in this
>>> case.
>>>
>>> Attached is a fromiter version of your script. Possible speedups could
>>> be done by trying different approaches to the "convert_row" function,
>>> for example using "zip" or "enumerate" instead of "izip".
>>>
>>> Best Regards,
>>>
>>> //Torgil
>>>
>>>
>>> On 7/8/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
 Thanks for the reference John! csv2rec is about 30% faster than my
code on
 the same data.

 If I read the code in csv2rec correctly it converts the data as it is
being
 read using the csv modules. My setup reads in the whole dataset into
an
 array of strings and then converts the columns as appropriate.

 Best,

 Vincent


 On 7/6/07 8:53 PM, "John Hunter" <[EMAIL PROTECTED]> wrote:

> On 7/6/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
>> I wrote the attached (small) program to read in a text/csv file
with
>> different data types and convert it into a recarray without having
to
>> pre-specify the dtypes or variables names. I am just too lazy to
type-in
>> stuff like that :) The supported types are int, float, dates, and
>> strings.
>>
>> I works pretty well but it is not (yet) as f

Re: [Numpy-discussion] convert csv file into recarray without pre-specifying dtypes and variable names

2007-07-08 Thread Vincent Nijs
Torgil,

The function seems to work well and is slightly faster than your previous
version (about 1/6th faster).

Yes, I do have columns that start with, what looks like, int's and then turn
out to be floats. Something like below (col6).

data = [['col1', 'col2', 'col3', 'col4', 'col5', 'col6'],
['1','3','1/97','1.12','2.11','0'],
['1','2','3/97','1.21','3.12','0'],
['2','1','2/97','1.12','2.11','0'],
['2','2','4/97','1.33','2.26','1.23'],
['2','2','5/97','1.73','2.42','1.26']]

I think what your function assumes is that the 1st element will be the
appropriate type. That may not hold if you have missing values or 'mixed
types'.

Best,

Vincent


On 7/8/07 3:31 PM, "Torgil Svensson" <[EMAIL PROTECTED]> wrote:

> Hi
> 
> I stumble on these types of problems from time to time so I'm
> interested in efficient solutions myself.
> 
> Do you have a column which starts with something suitable for int on
> the first row (without decimal separator) but has decimals further
> down?
> 
> This will be little tricky to support. One solution could be to yield
> StopIteration, calculate new type-conversion-functions and start over
> iterating over both the old data and the rest of the iterator.
> 
> It'd be great if you could try the load_gen_iter.py I've attached to
> my response to Tim.
> 
> Best Regards,
> 
> //Torgil
> 
> On 7/8/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
>> I am not (yet) very familiar with much of the functionality introduced in
>> your script Torgil (izip, imap, etc.), but I really appreciate you taking
>> the time to look at this!
>> 
>> The program stopped with the following error:
>> 
>>   File "load_iter.py", line 48, in 
>> convert_row=lambda r: tuple(fn(x) for fn,x in
>> izip(conversion_functions,r))
>> ValueError: invalid literal for int() with base 10: '2174.875'
>> 
>> A lot of the data I use can have a column with a set of int's (e.g., 0's),
>> but then the rest of that same column could be floats. I guess finding the
>> right conversion function is the tricky part. I was thinking about sampling
>> each, say, 10th obs to test which function to use. Not sure how that would
>> work however.
>> 
>> If I ignore the option of an int (i.e., everything is a float, date, or
>> string) then your script is about twice as fast as mine!!
>> 
>> Question: If you do ignore the int's initially, once the rec array is in
>> memory, would there be a quick way to check if the floats could pass as
>> int's? This may seem like a backwards approach but it might be 'safer' if
>> you really want to preserve the int's.
>> 
>> Thanks again!
>> 
>> Vincent
>> 
>> 
>> On 7/8/07 5:52 AM, "Torgil Svensson" <[EMAIL PROTECTED]> wrote:
>> 
>>> Given that both your script and the mlab version preloads the whole
>>> file before calling numpy constructor I'm curious how that compares in
>>> speed to using numpy's fromiter function on your data. Using fromiter
>>> should improve on memory usage (~50% ?).
>>> 
>>> The drawback is for string columns where we don't longer know the
>>> width of the largest item. I made it fall-back to "object" in this
>>> case.
>>> 
>>> Attached is a fromiter version of your script. Possible speedups could
>>> be done by trying different approaches to the "convert_row" function,
>>> for example using "zip" or "enumerate" instead of "izip".
>>> 
>>> Best Regards,
>>> 
>>> //Torgil
>>> 
>>> 
>>> On 7/8/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
 Thanks for the reference John! csv2rec is about 30% faster than my code on
 the same data.
 
 If I read the code in csv2rec correctly it converts the data as it is being
 read using the csv modules. My setup reads in the whole dataset into an
 array of strings and then converts the columns as appropriate.
 
 Best,
 
 Vincent
 
 
 On 7/6/07 8:53 PM, "John Hunter" <[EMAIL PROTECTED]> wrote:
 
> On 7/6/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
>> I wrote the attached (small) program to read in a text/csv file with
>> different data types and convert it into a recarray without having to
>> pre-specify the dtypes or variable names. I am just too lazy to type-in
>> stuff like that :) The supported types are int, float, dates, and
>> strings.
>> 
>> It works pretty well but it is not (yet) as fast as I would like so I was
>> wondering if any of the numpy experts on this list might have some
>> suggestions
>> on how to speed it up. I need to read 500MB-1GB files so speed is
>> important
>> for me.
> 
> In matplotlib.mlab svn, there is a function csv2rec that does the
> same.  You may want to compare implementations in case we can
> > fruitfully cross pollinate them.  In the examples directory, there is an
> example script examples/loadrec.py
> ___
> Numpy-discussion mailing list
> Numpy-discussion@scipy.org
> http://p

Re: [Numpy-discussion] convert csv file into recarray without pre-specifying dtypes and variable names

2007-07-08 Thread Torgil Svensson
Hi

I stumble on these types of problems from time to time so I'm
interested in efficient solutions myself.

Do you have a column which starts with something suitable for int on
the first row (without decimal separator) but has decimals further
down?

This will be a little tricky to support. One solution could be to yield
StopIteration, calculate new type-conversion-functions and start over
iterating over both the old data and the rest of the iterator.

It'd be great if you could try the load_gen_iter.py I've attached to
my response to Tim.

Best Regards,

//Torgil

On 7/8/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
> I am not (yet) very familiar with much of the functionality introduced in
> your script Torgil (izip, imap, etc.), but I really appreciate you taking
> the time to look at this!
>
> The program stopped with the following error:
>
>   File "load_iter.py", line 48, in 
> convert_row=lambda r: tuple(fn(x) for fn,x in
> izip(conversion_functions,r))
> ValueError: invalid literal for int() with base 10: '2174.875'
>
> A lot of the data I use can have a column with a set of int's (e.g., 0's),
> but then the rest of that same column could be floats. I guess finding the
> right conversion function is the tricky part. I was thinking about sampling
> each, say, 10th obs to test which function to use. Not sure how that would
> work however.
>
> If I ignore the option of an int (i.e., everything is a float, date, or
> string) then your script is about twice as fast as mine!!
>
> Question: If you do ignore the int's initially, once the rec array is in
> memory, would there be a quick way to check if the floats could pass as
> int's? This may seem like a backwards approach but it might be 'safer' if
> you really want to preserve the int's.
>
> Thanks again!
>
> Vincent
>
>
> On 7/8/07 5:52 AM, "Torgil Svensson" <[EMAIL PROTECTED]> wrote:
>
> > Given that both your script and the mlab version preloads the whole
> > file before calling numpy constructor I'm curious how that compares in
> > speed to using numpy's fromiter function on your data. Using fromiter
> > should improve on memory usage (~50% ?).
> >
> > The drawback is for string columns where we don't longer know the
> > width of the largest item. I made it fall-back to "object" in this
> > case.
> >
> > Attached is a fromiter version of your script. Possible speedups could
> > be done by trying different approaches to the "convert_row" function,
> > for example using "zip" or "enumerate" instead of "izip".
> >
> > Best Regards,
> >
> > //Torgil
> >
> >
> > On 7/8/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
> >> Thanks for the reference John! csv2rec is about 30% faster than my code on
> >> the same data.
> >>
> >> If I read the code in csv2rec correctly it converts the data as it is being
> >> read using the csv modules. My setup reads in the whole dataset into an
> >> array of strings and then converts the columns as appropriate.
> >>
> >> Best,
> >>
> >> Vincent
> >>
> >>
> >> On 7/6/07 8:53 PM, "John Hunter" <[EMAIL PROTECTED]> wrote:
> >>
> >>> On 7/6/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
>  I wrote the attached (small) program to read in a text/csv file with
>  different data types and convert it into a recarray without having to
>  pre-specify the dtypes or variable names. I am just too lazy to type-in
>  stuff like that :) The supported types are int, float, dates, and 
>  strings.
> 
>  It works pretty well but it is not (yet) as fast as I would like so I was
>  wondering if any of the numpy experts on this list might have some 
>  suggestions
>  on how to speed it up. I need to read 500MB-1GB files so speed is 
>  important
>  for me.
> >>>
> >>> In matplotlib.mlab svn, there is a function csv2rec that does the
> >>> same.  You may want to compare implementations in case we can
> >>> fruitfully cross pollinate them.  In the examples directory, there is an
> >>> example script examples/loadrec.py
> >>> ___
> >>> Numpy-discussion mailing list
> >>> Numpy-discussion@scipy.org
> >>> http://projects.scipy.org/mailman/listinfo/numpy-discussion
> >>>
> >>
> >>
> >> ___
> >> Numpy-discussion mailing list
> >> Numpy-discussion@scipy.org
> >> http://projects.scipy.org/mailman/listinfo/numpy-discussion
> >>
> > ___
> > Numpy-discussion mailing list
> > Numpy-discussion@scipy.org
> > http://projects.scipy.org/mailman/listinfo/numpy-discussion
>
> --
> Vincent R. Nijs
> Assistant Professor of Marketing
> Kellogg School of Management, Northwestern University
> 2001 Sheridan Road, Evanston, IL 60208-2001
> Phone: +1-847-491-4574 Fax: +1-847-491-2498
> E-mail: [EMAIL PROTECTED]
> Skype: vincentnijs
>
>
>
> ___
> Numpy-discussion mailing list
> Numpy-discussion@scipy.org
> http://projects.scipy.org/mailman/listinfo/numpy-discussion
>
_

Re: [Numpy-discussion] Buildbot for numpy

2007-07-08 Thread Barry Wark
Stefan,

No worries. I thought it was something like that. Any thoughts on my
other questions? I'd love to have some ammunition to take to my boss.

Thanks,
Barry

On 7/7/07, stefan <[EMAIL PROTECTED]> wrote:
>
> On Mon, 2 Jul 2007 17:26:15 -0700, "Barry Wark" <[EMAIL PROTECTED]>
> wrote:
> > On a side note, buildbot.scipy.org goes to the DSP lab, Univ. of
> > Stellenbosch's home page, not the buildbot status page.
>
> Sorry about that -- I misconfigured Apache.  Everything should be fine now.
>
> Cheers
> Stéfan
>
>
> ___
> Numpy-discussion mailing list
> Numpy-discussion@scipy.org
> http://projects.scipy.org/mailman/listinfo/numpy-discussion
>
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] convert csv file into recarray without pre-specifying dtypes and variable names

2007-07-08 Thread Torgil Svensson

On 7/8/07, Timothy Hochberg <[EMAIL PROTECTED]> wrote:



On 7/8/07, Torgil Svensson <[EMAIL PROTECTED]> wrote:
> Given that both your script and the mlab version preloads the whole
> file before calling numpy constructor I'm curious how that compares in
> speed to using numpy's fromiter function on your data. Using fromiter
> should improve on memory usage (~50% ?).
>
> The drawback is for string columns where we don't longer know the
> width of the largest item. I made it fall-back to "object" in this
> case.
>
> Attached is a fromiter version of your script. Possible speedups could
> be done by trying different approaches to the "convert_row" function,
> for example using "zip" or "enumerate" instead of "izip".

I suspect that you'd do better here if you removed a bunch of layers from
the conversion functions. Right now it looks like:
imap->chain->convert_row->tuple->generator->izip. That's
five levels deep and Python functions are reasonably expensive. I would try
to be a lot less clever and do something like:

    def data_iterator(row_iter, delim):
        row0 = row_iter.next().split(delim)
        converters = find_formats(row0) # left as an exercise
        yield tuple(f(x) for f, x in zip(converters, row0))
        for row in row_iter:
            yield tuple(f(x) for f, x in zip(converters, row.split(delim)))




That sounds sane. I've maybe been attracted to bad habits here and got
away with it since I'm very I/O-bound in these cases. My main
objective has been reducing memory footprint to reduce swapping.



That's just a sketch and I haven't timed it, but it cuts a few levels out of
the call chain, so has a reasonable chance of being faster. If you wanted to
be really clever, you could use some exec magic after you figure out the
conversion functions to compile a special function that generates the tuples
directly without any use of tuple or zip. I don't have time to work through
the details right now, but the code you would compile would end up looking like
this:

for (x0, x1, x2) in row_iter:
   yield (int(x0), float(x1), float(x2))

Here we've assumed that find_formats determined that there are three fields,
an int and two floats. Once you have this info you can build an appropriate
function and exec it. This would cut another couple levels out of the call
chain. Again, I haven't timed it, or tried it, but it looks like it would be
fun to try.

-tim




Thank you for the lesson!  Great tip. This opens up a variety of
new coding options. I've made an attempt on the fun part. Attached is
a version that generates the following generator code for Vincent's
__name__ == '__main__' code:

def get_data_iterator(row_iter,delim):
    yield (int('1'),int('3'),datestr2num('1/97'),float('1.12'),float('2.11'),float('1.2'))
    for row in row_iter:
        x0,x1,x2,x3,x4,x5=row.split(delim)
        yield (int(x0),int(x1),datestr2num(x2),float(x3),float(x4),float(x5))

Best Regards,

//Torgil
import numpy as N
import itertools,csv
from matplotlib.dates import datestr2num
from itertools import imap,izip,chain

string_conversions=[
    # conversion function,  numpy dtype
    ( int,          N.dtype(int)    ),
    ( float,        N.dtype(float)  ),
    ( datestr2num,  N.dtype(float)  ),
    ]

def string_to_dt_cvt(s):
    """
    Find the (dtype, conversion function) pair appropriate for a sample string
    """
    for fn,dt in string_conversions:
        try:
            v=fn(s)
            return dt,fn
        except:
            pass
    return N.dtype(object),str


def load(fname,delim = ',',has_varnm = True, prn_report = True):
    """
    Loading data from a file using fromiter. Returns a recarray.
    """

    row_iter=open(fname,'rb')
    row0=map(str.strip,row_iter.next().split(delim))
    if not has_varnm:
        varnm = ['col%s' % str(i+1) for i in xrange(len(row0))]
        dt_row=row0
    else:
        varnm = [i.strip() for i in row0]
        dt_row=map(str.strip,row_iter.next().split(delim))

    str_cvt=[string_to_dt_cvt(item) for item in dt_row]
    descr=[(name,dt) for name,(dt,cvt_fn) in zip(varnm,str_cvt)]
    var_nm=["x%d" % i for i,(dt,cvt_fn) in enumerate(str_cvt)]
    fn_nm=[fn.__name__ for dt,fn in str_cvt]

    # build the source of a specialized row generator and exec it
    ident=" "*4
    generator_code="\n".join([
        "def get_data_iterator(row_iter,delim):",
        ident+"yield (" + ",".join(["%s('%s')" % (f,r) for f,r in zip(fn_nm,dt_row)])+")",
        ident+"for row in row_iter:",
        ident*2+",".join(var_nm)+"=row.split(delim)",
        ident*2+"yield (" + ",".join(["%s(%s)" % (f,v) for f,v in zip(fn_nm,var_nm)])+")",
        ])

    exec(compile(generator_code,'','exec'))

    data=N.fromiter(get_data_iterator(row_iter,delim),dtype=descr).view(N.recarray)

    # load report
    if prn_report:
        print "##\n"
        print "Loaded file: %s\n" % fname
        print "Nr obs: %s\n" % data.shape[0]
        print "Variables and datatypes:\n"
        for i in data.dtype.descr:
            print "Varname: %s, Type: %s, Sample: %s" % (i[0], i[1], str(data[i[0]][0:3]))
        print "\n##\n"

    return data

Re: [Numpy-discussion] convert csv file into recarray without pre-specifying dtypes and variable names

2007-07-08 Thread Vincent Nijs
I am not (yet) very familiar with much of the functionality introduced in
your script Torgil (izip, imap, etc.), but I really appreciate you taking
the time to look at this!

The program stopped with the following error:

  File "load_iter.py", line 48, in 
convert_row=lambda r: tuple(fn(x) for fn,x in
izip(conversion_functions,r))
ValueError: invalid literal for int() with base 10: '2174.875'

A lot of the data I use can have a column with a set of int's (e.g., 0's),
but then the rest of that same column could be floats. I guess finding the
right conversion function is the tricky part. I was thinking about sampling
each, say, 10th obs to test which function to use. Not sure how that would
work however.

If I ignore the option of an int (i.e., everything is a float, date, or
string) then your script is about twice as fast as mine!!

Question: If you do ignore the int's initially, once the rec array is in
memory, would there be a quick way to check if the floats could pass as
int's? This may seem like a backwards approach but it might be 'safer' if
you really want to preserve the int's.

Thanks again!

Vincent


On 7/8/07 5:52 AM, "Torgil Svensson" <[EMAIL PROTECTED]> wrote:

> Given that both your script and the mlab version preloads the whole
> file before calling numpy constructor I'm curious how that compares in
> speed to using numpy's fromiter function on your data. Using fromiter
> should improve on memory usage (~50% ?).
> 
> The drawback is for string columns where we don't longer know the
> width of the largest item. I made it fall-back to "object" in this
> case.
> 
> Attached is a fromiter version of your script. Possible speedups could
> be done by trying different approaches to the "convert_row" function,
> for example using "zip" or "enumerate" instead of "izip".
> 
> Best Regards,
> 
> //Torgil
> 
> 
> On 7/8/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
>> Thanks for the reference John! csv2rec is about 30% faster than my code on
>> the same data.
>> 
>> If I read the code in csv2rec correctly it converts the data as it is being
>> read using the csv modules. My setup reads in the whole dataset into an
>> array of strings and then converts the columns as appropriate.
>> 
>> Best,
>> 
>> Vincent
>> 
>> 
>> On 7/6/07 8:53 PM, "John Hunter" <[EMAIL PROTECTED]> wrote:
>> 
>>> On 7/6/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
 I wrote the attached (small) program to read in a text/csv file with
 different data types and convert it into a recarray without having to
 pre-specify the dtypes or variable names. I am just too lazy to type-in
 stuff like that :) The supported types are int, float, dates, and strings.
 
 It works pretty well but it is not (yet) as fast as I would like so I was
 wondering if any of the numpy experts on this list might have some suggestions
 on how to speed it up. I need to read 500MB-1GB files so speed is important
 for me.
>>> 
>>> In matplotlib.mlab svn, there is a function csv2rec that does the
>>> same.  You may want to compare implementations in case we can
>>> fruitfully cross pollinate them.  In the examples directory, there is an
>>> example script examples/loadrec.py
>>> ___
>>> Numpy-discussion mailing list
>>> Numpy-discussion@scipy.org
>>> http://projects.scipy.org/mailman/listinfo/numpy-discussion
>>> 
>> 
>> 
>> ___
>> Numpy-discussion mailing list
>> Numpy-discussion@scipy.org
>> http://projects.scipy.org/mailman/listinfo/numpy-discussion
>> 
> ___
> Numpy-discussion mailing list
> Numpy-discussion@scipy.org
> http://projects.scipy.org/mailman/listinfo/numpy-discussion

-- 
Vincent R. Nijs
Assistant Professor of Marketing
Kellogg School of Management, Northwestern University
2001 Sheridan Road, Evanston, IL 60208-2001
Phone: +1-847-491-4574 Fax: +1-847-491-2498
E-mail: [EMAIL PROTECTED]
Skype: vincentnijs



___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] convert csv file into recarray without pre-specifying dtypes and variable names

2007-07-08 Thread Timothy Hochberg

On 7/8/07, Torgil Svensson <[EMAIL PROTECTED]> wrote:


Given that both your script and the mlab version preload the whole
file before calling the numpy constructor, I'm curious how that compares in
speed to using numpy's fromiter function on your data. Using fromiter
should improve on memory usage (~50% ?).

The drawback is for string columns where we no longer know the
width of the largest item. I made it fall back to "object" in this
case.

Attached is a fromiter version of your script. Possible speedups could
be done by trying different approaches to the "convert_row" function,
for example using "zip" or "enumerate" instead of "izip".



I suspect that you'd do better here if you removed a bunch of layers from
the conversion functions. Right now it looks like:
imap->chain->convert_row->tuple->generator->izip. That's five levels deep
and Python functions are reasonably expensive. I would try to be a lot less
clever and do something like:

    def data_iterator(row_iter, delim):
        row0 = row_iter.next().split(delim)
        converters = find_formats(row0) # left as an exercise
        yield tuple(f(x) for f, x in zip(converters, row0))
        for row in row_iter:
            yield tuple(f(x) for f, x in zip(converters, row.split(delim)))

That's just a sketch and I haven't timed it, but it cuts a few levels out of
the call chain, so has a reasonable chance of being faster. If you wanted to
be really clever, you could use some exec magic after you figure out the
conversion functions to compile a special function that generates the tuples
directly without any use of tuple or zip. I don't have time to work through
the details right now, but the code you would compile would end up looking like
this:

for (x0, x1, x2) in row_iter:
  yield (int(x0), float(x1), float(x2))

Here we've assumed that find_formats determined that there are three fields,
an int and two floats. Once you have this info you can build an appropriate
function and exec it. This would cut another couple levels out of the call
chain. Again, I haven't timed it, or tried it, but it looks like it would be
fun to try.
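
One way such a compiled converter could be put together (purely illustrative:
the function name, the explicit namespace dict and the Python 2 exec statement
are my assumptions, and any non-builtin converters such as datestr2num would
have to be placed into that namespace):

def build_row_gen(func_names, delim=','):
    # func_names is e.g. ['int', 'float', 'float'], as detected by find_formats
    args = ",".join("x%d" % i for i in range(len(func_names)))
    body = ",".join("%s(x%d)" % (f, i) for i, f in enumerate(func_names))
    src = ("def row_gen(row_iter):\n"
           "    for row in row_iter:\n"
           "        %s = row.split(%r)\n"
           "        yield (%s)\n") % (args, delim, body)
    ns = {}
    exec src in ns                        # Python 2 exec statement
    return ns['row_gen']

row_gen = build_row_gen(['int', 'float', 'float'])   # an int and two floats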

-tim






On 7/8/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
> Thanks for the reference John! csv2rec is about 30% faster than my code
on
> the same data.
>
> If I read the code in csv2rec correctly it converts the data as it is
being
> read using the csv modules. My setup reads in the whole dataset into an
> array of strings and then converts the columns as appropriate.
>
> Best,
>
> Vincent
>
>
> On 7/6/07 8:53 PM, "John Hunter" <[EMAIL PROTECTED]> wrote:
>
> > On 7/6/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
> >> I wrote the attached (small) program to read in a text/csv file with
> >> different data types and convert it into a recarray without having to
> >> pre-specify the dtypes or variables names. I am just too lazy to
type-in
> >> stuff like that :) The supported types are int, float, dates, and
strings.
> >>
> >> I works pretty well but it is not (yet) as fast as I would like so I
was
> >> wonder if any of the numpy experts on this list might have some
suggestion
> >> on how to speed it up. I need to read 500MB-1GB files so speed is
important
> >> for me.
> >
> > In matplotlib.mlab svn, there is a function csv2rec that does the
> > same.  You may want to compare implementations in case we can
> > fruitfully cross pollinate them.  In the examples directory, there is an
> > example script examples/loadrec.py
> > ___
> > Numpy-discussion mailing list
> > Numpy-discussion@scipy.org
> > http://projects.scipy.org/mailman/listinfo/numpy-discussion
> >
>
>
> ___
> Numpy-discussion mailing list
> Numpy-discussion@scipy.org
> http://projects.scipy.org/mailman/listinfo/numpy-discussion
>

___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion






___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] convert csv file into recarray without pre-specifying dtypes and variable names

2007-07-08 Thread Torgil Svensson

Given that both your script and the mlab version preload the whole
file before calling the numpy constructor, I'm curious how that compares in
speed to using numpy's fromiter function on your data. Using fromiter
should improve on memory usage (~50% ?).

The drawback is for string columns where we no longer know the
width of the largest item. I made it fall back to "object" in this
case.

Attached is a fromiter version of your script. Possible speedups could
be done by trying different approaches to the "convert_row" function,
for example using "zip" or "enumerate" instead of "izip".

Best Regards,

//Torgil


On 7/8/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:

Thanks for the reference John! csv2rec is about 30% faster than my code on
the same data.

If I read the code in csv2rec correctly it converts the data as it is being
read using the csv modules. My setup reads in the whole dataset into an
array of strings and then converts the columns as appropriate.

Best,

Vincent


On 7/6/07 8:53 PM, "John Hunter" <[EMAIL PROTECTED]> wrote:

> On 7/6/07, Vincent Nijs <[EMAIL PROTECTED]> wrote:
>> I wrote the attached (small) program to read in a text/csv file with
>> different data types and convert it into a recarray without having to
>> pre-specify the dtypes or variable names. I am just too lazy to type-in
>> stuff like that :) The supported types are int, float, dates, and strings.
>>
>> It works pretty well but it is not (yet) as fast as I would like so I was
>> wondering if any of the numpy experts on this list might have some suggestions
>> on how to speed it up. I need to read 500MB-1GB files so speed is important
>> for me.
>
> In matplotlib.mlab svn, there is a function csv2rec that does the
> same.  You may want to compare implementations in case we can
> fruitfully cross pollinate them.  In the examples directory, there is an
> example script examples/loadrec.py
> ___
> Numpy-discussion mailing list
> Numpy-discussion@scipy.org
> http://projects.scipy.org/mailman/listinfo/numpy-discussion
>


___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion

import numpy as N
import pylab,itertools,csv
from itertools import imap,izip,chain


string_conversions=[
    # conversion function,  numpy dtype
    ( int,              N.dtype(int)    ),
    ( float,            N.dtype(float)  ),
    ( pylab.datestr2num,N.dtype(float)  ),
    ]

def string_to_dt_cvt(s):
    """
    Find the (dtype, conversion function) pair appropriate for a sample string
    """
    for fn,dt in string_conversions:
        try:
            v=fn(s)
            return dt,fn
        except:
            pass
    return N.dtype(object),str


def load(fname,delim = ',',has_varnm = True, prn_report = True):
    """
    Loading data from a file using fromiter. Returns a recarray.
    """
    global data_iterator,cvt,descr

    f=open(fname,'rb')
    row_iterator=itertools.imap(lambda x: x.split(delim),f)

    first_row=row_iterator.next()
    cols=len(first_row)
    if not has_varnm:
        varnm = ['col%s' % str(i+1) for i in xrange(cols)]
        dt_row=first_row
    else:
        varnm = [i.strip() for i in first_row]
        dt_row=row_iterator.next()

    # detect a (dtype, conversion function) pair per column from the first data row
    descr=[]
    conversion_functions=[]
    for name,item in zip(varnm,dt_row):
        dtype,cvt_fn=string_to_dt_cvt(item)
        descr.append((name,dtype))
        conversion_functions.append(cvt_fn)
    convert_row=lambda r: tuple(fn(x) for fn,x in izip(conversion_functions,r))

    data_iterator=imap(convert_row,chain([dt_row],row_iterator))
    data=N.fromiter(data_iterator,dtype=descr).view(N.recarray)

    # load report
    if prn_report:
        print "##\n"
        print "Loaded file: %s\n" % fname
        print "Nr obs: %s\n" % data.shape[0]
        print "Variables and datatypes:\n"
        for i in data.dtype.descr:
            print "Varname: %s, Type: %s, Sample: %s" % (i[0], i[1], str(data[i[0]][0:3]))
        print "\n##\n"

    return data

def show_dates(dates):
    return N.array([i.strftime('%d %b %y') for i in pylab.num2date(dates)])

if __name__ == '__main__':

    # creating data
    data = [['col1', 'col2', 'col3', 'col4', 'col5', 'col6'],
            ['1','3','1/97','1.12','2.11','1.2'],
            ['1','2','3/97','1.21','3.12','1.43'],
            ['2','1','2/97','1.12','2.11','1.28'],
            ['2','2','4/97','1.33','2.26','1.23'],
            ['2','2','5/97','1.73','2.42','1.26']]

    # saving data to csv file
    f = open('testdata.csv','wb')
    output = csv.writer(f)
    for i in data:
        output.writerow(i)
    f.close()

    # opening data file with variable names
    ra = load('testdata.csv')
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion