Re: [Numpy-discussion] How to convert a list into a structured array?

Stephen Simmons Tue, 05 May 2009 22:47:21 -0700

Wei Su wrote:

Hi, Francesc:Thanks a lot for offering me help. My code is really simple as of now.**********************************************************************************
from pyodbc import *

from rpy import *
cnxn = connect(/'DRIVER={SQLServer};SERVER=srdata01\\sql2k5;DATABASE=_Qai_;UID=;PWD='/)
cursor = cnxn.cursor()

cursor.execute(/"select IsrCode, MstrName from _qai_..qaiLinkBase"/)

data = cursor.fetchall()

cursor.close()

***************************************************
The result, data, I got from the above code tends to be a giant list,which is very hard to handle. My goal is to to turn it into a recordarray so that i can access the field directly by name or by index. Mydata is typically numerical, character and datetime variables. noother complications.From the above code, you can also see that I used R for some time. ButI have to switch to something else because I sometimes cannot evendownload all my data via R due to its memory limit under windows. Ithought NumPy might be the solution. But I am not sure. Anybody canlet me know whether Python has a memory limit? or can I use virtualmemory by calling some Python module?Thanks in advance.Wei Su


Hi Wei Su,

Below is an example from the code I use to read text files intorecarrays. The same approach can be used for your SQL data by redefiningthe inner iterator(path) function to execute your SQL query.

If your data is really big, you could also use the PyTables package(written by Francesc actually) to store SQL extracts as numpy-compatibleHDF tables. The HDF format can compress the data transparently, so theresulting data files are 1/10 the size of an equivalent text dump. Youcan then read any or all rows into memory for subsequent process usingtable.read[row_from, row_to], thereby avoiding running out of memory ifyour dataset is really big. PyTables/HDF is also really fast forreading. As an example, my three year old laptop with slow hard driveachieves up to 250,000 row per second speeds on GROUP BY-stylesubtotals. This uses PyTables for storing the data and numpy'sbincount() function for doing the aggregation.


Stephen

def text_file_to_sorted_numpy_array(path, dtype, row_fn, max_rows=None,header=None,order_by=None,min_row_length=None):

"""

Read a database extract into a numpy recarray, which is possiblysorted then returned.

       path            Path to the text file.
       dtype           String giving column names and numpy data types
                       e.g. 'COL1,S8 COL2,i4'

row_fn Optional function splitting a row into a listthat iscompatible with the numpy array's dtype. Thefunction

                       can indicate the row should be skipped by returning
                       None. If not given, the row has leading and trailing
                       whitespace removed and then is split on '|'.

order_by Optional list of column names used to sort thearray.

       header          Optional prefix for a header line. If given, there

must be a line with this prefix within the first20 lines.

                       Any leading whitespace is removed before checking.

max_rows Optional maximum number of rows that a file willcontain.min_row_length Optional length of row in text file, used toestimate

                       upper bound on size of final array. One or both of
                       max_rows and min_row_length must be given.
   """
   # Create a numpy array large enough to hold the entire file in memory
   if min_row_length:
       file_size = os.stat(path).st_size
       num_rows_upper_bound = file_size/min_row_length
   else:
       num_rows_upper_bound = max_rows
   if num_rows_upper_bound is None:

raise ValueError('No information given about size of the finalarray')

   if max_rows and num_rows_upper_bound>max_rows:

raise ValueError("'%s' is %d bytes long, too large to fit inmemory" % (os.path.basename(path), file_size))

# Define an iterator that reads the data filedef iterator(path):

       # Read the file
       with file(path, 'rU') as fh:

ftype, prefix =os.path.splitext(os.path.basename(path))[0].split('-', 2)

           pb = ProgressBar(prefix=prefix)
           # Read the data lines

ctr = idx = 0for s in fh:

               s = s.strip()
               if s in ('\x1A', '-', '') or s.startswith('-------'):
                   # Empty lines after end of real data
                   continue
               res = row_fn(s)
               if res:
                   yield res
               ctr+=1
               if ctr%1000==0:
                   total_rows = float(file_size*ctr)/float(fh.tell())
                   pb(ctr, total=total_rows)
           pb(ctr, last=True)

# Create an empty array to hold all data, then fill in blocks of5000 rows

   # Doing this by blocks is 4x faster than adding one row at a time.
   dtype = list( tuple(x.split(',')) for x in dtype.split() )
   arr = numpy.zeros(num_rows_upper_bound, dtype)
   def block_iterator(iterator, blk_size):
       "Group iterator into lists with blk_size elements"
       res = []
       for i in iterator:
           res.append(i)
           if len(res)==blk_size:
               yield res
               res = []
       if res:
           yield res

# Now fill the arrayi = 0

   try:
       for blk in block_iterator(iterator(path), 5000):
           b = len(blk)
           tmp = numpy.rec.fromrecords(blk, dtype=dtype, shape=b)
           arr[i:i+b] = tmp
           i+=b
   except KeyboardInterrupt:
       pass
   arr = arr[:i]       # Remove unused rows at the end of the array

   # Sort array if required
   if order_by:
       print "  Sorting %d-row array on %r" % (len(arr), order_by)
       arr.sort(order=order_by)

   # Return the final array
   return arr

_______________________________________________
Numpy-discussion mailing list
[email protected]
http://mail.scipy.org/mailman/listinfo/numpy-discussion

Re: [Numpy-discussion] How to convert a list into a structured array?

Reply via email to