Hi,
Does anyone have any suggestions for summarising data in numpy?
The quick description is that I want to do something like the SQL statement:
SELECT sum(field1), sum(field2) FROM table GROUP BY field3;
The more accurate description is that my data is stored in PyTables HDF
format, with 24 monthly files, each with 4m records describing how
customers performed that month. Each record looks something like this:
('200604', 651404500000L, '800', 'K', 12L, 162.0, 2000.0, 0.054581, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0,
8.80, 0.86, 7.80 17.46, 0.0, 70.0, 0.0, 70.0, -142.93, 0.0, 2000.0,
2063.93, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -9.71, 7.75,
87.46, 77.75, -3.45, 0.22, -0.45, -0.57, 73.95)
The first 5 fields are status fields (month_string, account_number,
product_code, account_status, months_since_customer_joined). The
remaining 48 fields represent different aspects of the customer's
performance during that month. I read 100,000 of these records at a time
and turn them into a numpy recarray with:
dat = hdf_table.read(start=pos, stop=pos+block_size)
dat = numpy.asarray(dat._flatArray, dtype=dat.array_descr)
I'd like to reduce these 96m records x 53 fields down to monthly
averages for each tuple (month_string, months_since_customer_joined)
which in the case above is ('200604', 12L). This will let me compare the
performance of newly acquired customers at the same point in their
lifecycle as customers acquired 1 or 2 years ago.
The end result should be a dataset something like
res[month_index, months_since_customer_joined]
= array([ num_records, sum_field_5, sum_field_6, sum_field_7, ...
sum_field_52 ])
with a shape of (24, 24, 49).
I've played around with lexsort(), take(), sum(), etc, but get very
confused and end up feeling that I'm making things more complicated than
they need to be. So any advice from numpy veterans on how best to
proceed would be very welcome!
Cheers
Stephen
-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys -- and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
Numpy-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/numpy-discussion