Re: [Numpy-discussion] How to convert a list into a structured array?

Wei Su Tue, 05 May 2009 22:58:14 -0700

 
Hi, Stephen:
 
This is fantastic. I shall read your codes carefully next week. (I am taking 
the rest of the week off for vacation.) Hopefully I am not so dumb that I need 
to ask again.
 
Regards,
 
Wei Su

--- On Wed, 5/6/09, Stephen Simmons <[email protected]> wrote:

From: Stephen Simmons <[email protected]>
Subject: Re: [Numpy-discussion] How to convert a list into a structured array?
To: "Discussion of Numerical Python" <[email protected]>
Date: Wednesday, May 6, 2009, 5:46 AM

Wei Su wrote: 

Hi, Francesc:

Thanks a lot for offering me help. My code is really simple as of now.

**********************************************************************************
from pyodbc import * 
from rpy import * 

cnxn = connect('DRIVER={SQL 
Server};SERVER=srdata01\\sql2k5;DATABASE=Qai;UID=;PWD=') 
cursor = cnxn.cursor() 

cursor.execute("select IsrCode, MstrName from qai..qaiLinkBase") 
data = cursor.fetchall() 

cursor.close() 
***************************************************

The result, data, I got from the above code tends to be a giant list, which is 
very hard to handle. My goal is to to turn it into a record array so that i can 
access the field directly by name or by index. My data is typically numerical, 
character and datetime variables. no other complications.

>From the above code, you can also see that I used R for some time. But I have 
>to switch to something else because I sometimes cannot even download all my 
>data via R due to its memory limit under windows. I thought NumPy might be the 
>solution. But I am not sure. Anybody can let me know whether Python has a 
>memory limit? or can I use virtual memory by calling some Python module?

Thanks in advance.

Wei  Su

Hi Wei Su,

Below is an example from the code I use to read text files into recarrays. The 
same approach can be used for your SQL data by redefining the inner 
iterator(path) function to execute your SQL query.

If your data is really big, you could also use the PyTables package (written by 
Francesc actually) to store SQL extracts as numpy-compatible HDF tables. The 
HDF format can compress the data transparently, so the resulting data files are 
1/10 the size of an equivalent text dump. You can then read any or all rows 
into memory for subsequent process using table.read[row_from, row_to], thereby 
avoiding running out of memory if your dataset is really big. PyTables/HDF is 
also really fast for reading. As an example, my three year old laptop with slow 
hard drive achieves up to 250,000 row per second speeds on GROUP BY-style 
subtotals. This uses PyTables for storing the data and numpy's bincount() 
function for doing the aggregation.

Stephen

def text_file_to_sorted_numpy_array(path, dtype, row_fn, max_rows=None, 
header=None, 
                                           order_by=None, min_row_length=None):
    """
    Read a database extract into a numpy recarray, which is possibly sorted 
then returned.
        path            Path to the text file.
        dtype           String giving column names and numpy data types
                        e.g. 'COL1,S8 COL2,i4'
        row_fn          Optional function splitting a row into a list that is
                        compatible with the numpy array's dtype. The function
                        can indicate the row should be skipped by returning
                        None. If not given, the row has leading and trailing
                        whitespace removed and then is split on '|'.
        order_by        Optional list of column names used to sort the array.
        header          Optional prefix for a header line. If given, there
                        must be a line with this prefix within the first 20 
lines.
                        Any leading whitespace is removed before checking.
        max_rows        Optional maximum number of rows that a file will 
contain.
        min_row_length  Optional length of row in text file, used to estimate
                        upper bound on size of final array. One or both of
                        max_rows and min_row_length must be given.
    """
    # Create a numpy array large enough to hold the entire file in memory
    if min_row_length:
        file_size = os.stat(path).st_size
        num_rows_upper_bound = file_size/min_row_length
    else:
        num_rows_upper_bound = max_rows
    if num_rows_upper_bound is None:
        raise ValueError('No information given about size of the final array')
    if max_rows and num_rows_upper_bound>max_rows:
        raise ValueError("'%s' is %d bytes long, too large to fit in memory" % 
(os.path.basename(path), file_size))

    # Define an iterator that reads the data file    
    def iterator(path):
        # Read the file
        with file(path, 'rU') as fh:
            ftype, prefix = 
os.path.splitext(os.path.basename(path))[0].split('-', 2)
            pb = ProgressBar(prefix=prefix)
            # Read the data lines
            ctr = idx = 0            
            for s in fh:
                s = s.strip()
                if s in ('\x1A', '-', '') or s.startswith('-------'):
                    # Empty lines after end of real data
                    continue
                res = row_fn(s)
                if res:
                    yield res
                ctr+=1
                if ctr%1000==0:
                    total_rows = float(file_size*ctr)/float(fh.tell())
                    pb(ctr, total=total_rows)
            pb(ctr, last=True)

    # Create an empty array to hold all data, then fill in blocks of 5000 rows
    # Doing this by blocks is 4x faster than adding one row at a time..
    dtype = list( tuple(x.split(',')) for x in dtype.split() )
    arr = numpy.zeros(num_rows_upper_bound, dtype)
    def block_iterator(iterator, blk_size):
        "Group iterator into lists with blk_size elements"
        res = []
        for i in iterator:
            res.append(i)
            if len(res)==blk_size:
                yield res
                res = []
        if res:
            yield res
    # Now fill the array            
    i = 0
    try:
        for blk in block_iterator(iterator(path), 5000):
            b = len(blk)
            tmp = numpy.rec.fromrecords(blk, dtype=dtype, shape=b)
            arr[i:i+b] = tmp
            i+=b
    except KeyboardInterrupt:
        pass
    arr = arr[:i]       # Remove unused rows at the end of the array

    # Sort array if required
    if order_by:
        print "  Sorting %d-row array on %r" % (len(arr), order_by)
        arr.sort(order=order_by)

    # Return the final array
    return arr

-----Inline Attachment Follows-----

_______________________________________________
Numpy-discussion mailing list
[email protected]
http://mail.scipy.org/mailman/listinfo/numpy-discussion

_______________________________________________
Numpy-discussion mailing list
[email protected]
http://mail.scipy.org/mailman/listinfo/numpy-discussion

Re: [Numpy-discussion] How to convert a list into a structured array?

Reply via email to