Chris Rebert wrote: > On Tue, Nov 3, 2009 at 7:43 AM, Siva Subramanian > <elpost...@rediffmail.com> wrote: >> Hello all, >> >> I am new on this list and computer programming >> >> I have two distinct statistical files (both csv) >> >> 1. Report_2_5 – this is a report dump containing over a 10 million >> records and is different every day >> >> 2. Customer_id dump – this is a daily dump of customers who have >> made payments. This is generally a million record >> >> I need to extract past history depending on customers who make regular >> payments >> >> For example, >> >> Report_2_5 >> >> Customer ID, Plan_NO, stat1, vol2, amount3 >> 2134, Ins1, 10000, 20000, 10 >> 2112, Ins3, 30000, 20000, 10 >> 2121, Ins3, 30000, 20000, 10 >> 2145, Ins2, 15000, 10000, 5 >> 2245, Ins2, 15000, 10000, 5 >> 0987, Ins1, 10000, 20000, 10 >> >> 4546, Ins1, 10020, 21000, 10 >> >> 6757, Ins1, 10200, 22000, 10 >> … >> >> customer_id dump >> >> >> 0987 >> >> 4546 >> >> 6757 >> >> 2134 >> >> I need to process the Report_2_5 and extract the following output >> >> Stat1: 40220 >> Vol2 : 83000 >> Amount3 : 40 >> >> I am new to programming and I have been extracting this data using MS – >> Access and I badly need a better solution. > > Have you considered using a proper SQL database? (See > http://en.wikipedia.org/wiki/SQL ; MySQL is one example: > http://en.wikipedia.org/wiki/MySQL) > Mucking around with CSV files like this is basically doing the work of > some simple SQL queries, only in an ad-hoc, inefficient manner. (MS > Access is essentially a non-industrial-strength SQL for > non-programmers.)
Industrial strength or not, Access should be capable of solving the OP's problem. So it would be interesting what's so bad about it in this case. Anyway, here's a database-free python solution: import csv REPORT = "report.csv" CUSTOMERS = "customers.csv" with open(CUSTOMERS) as instream: next(instream) # skip header # put customer ids into a set for fast lookup customer_ids = set(line.strip() for line in instream) with open(REPORT) as instream: rows = csv.reader(instream) # find columns headers = [column.strip() for column in rows.next()] customer_column = headers.index("Customer ID") sum_over_columns = [headers.index(s) for s in "stat1 vol2 amount3".split()] # initialize totals sigma = [0] * len(headers) # calculate totals for row in rows: if row[customer_column] in customer_ids: for index in sum_over_columns: sigma[index] += int(row[index]) # print totals for index in sum_over_columns: print "%-10s %10d" % (headers[index] + ":", sigma[index]) The limiting factor for this approach is the customer_ids set which at some point may no longer fit into memory. Peter -- http://mail.python.org/mailman/listinfo/python-list