On 2015-05-14 09:57, 20/20 Lab wrote: > On 05/13/2015 06:23 PM, Steven D'Aprano wrote: >>> I have a LARGE csv file that I need to process. 110+ columns, >>> 72k rows. I managed to write enough to reduce it to a few >>> hundred rows, and the five columns I'm interested in. > I actually stumbled across the csv module after coding enough to > make a list of lists. So that is more the reason I approached the > list; Nothing like spending hours (or days) coding something that > already exists and just dont know about. >>> Now is were I have my problem: >>> >>> myList = [ [123, "XXX", "Item", "Qty", "Noise"], >>> [72976, "YYY", "Item", "Qty", "Noise"], >>> [123, "XXX" "ItemTypo", "Qty", "Noise"] ] >>> >>> Basically, I need to check for rows with duplicate accounts >>> row[0] and staff (row[1]), and if so, remove that row, and add >>> it's Qty to the original row. I really dont have a clue how to >>> go about this. >> >> processed = {} # hold the processed data in a dict >> >> for row in myList: >> account, staff = row[0:2] >> key = (account, staff) # Put them in a tuple. >> if key in processed: >> # We've already seen this combination. >> processed[key][3] += row[3] # Add the quantities. >> else: >> # Never seen this combination before. >> processed[key] = row >> >> newlist = list(processed.values()) >> > It does, immensely. I'll make this work. Thank you again for the > link from yesterday and apologies for hitting the wrong reply > button. I'll have to study more on the usage and implementations > of dictionaries and tuples.
In processing the initial CSV file, I suspect that using a csv.DictReader would make the code a bit cleaner. Additionally, as you're processing through the initial file, unless you need the intermediate data, you should be able to do it in one pass. Something like HEADER_ACCOUNT = "account" HEADER_STAFF = "staff" HEADER_QTY = "Qty" processed = {} with open("data.csv") as f: reader = csv.DictReader(f) for row in reader: if should_process_row(row): account = row[HEADER_ACCOUNT] staff = row[HEADER_STAFF] qty = row[HEADER_QTY] try: row[HEADER_QTY] = qty = int(qty) except Exception: # not a numeric quantity? continue # from Steven's code key = (account, staff) if key in processed: processed[key][HEADER_QTY] += qty else: processed[key][HEADER_QTY] = row so_something_with(processed.values()) I find that using names is a lot clearer than using arbitrary indexing. Barring that, using indexes-as-constants still would add further clarity. -tkc . -- https://mail.python.org/mailman/listinfo/python-list