Edgar Almonte wrote: > fist time i saw the statement "with" , is part of the module csv ? , > that make a loop through the file ? is not the sortkey function > waiting for a paramenter ( the row ) ? i don't see how is get pass , > the "key" is a parameter of sorted function ? , reader is a function > of csv module ? if "with..." is a loop that go through the file is > the second with inside of that loop ? ( i dont see how the write of > the output file catch the line readed
with open(filename) as fileobj: do_something_with(fileobj) is a shortcut for fileobj = open(filename) do_something_with(fileobj) fileobj.close() But it is not only shorter; it also guarantees that the file will be closed even if an error occurs while it is being processed. In my code the file object is wrapped into a csv.reader. for row in csv.reader(fileobj, delimiter="|"): # do something with row walks through the file one row at a time where the row is a list of the columns. To be able to sort these rows you have to read them all into memory. You typically do that with rows_iter = csv.reader(fileobj, delimiter="|") rows = list(rows_iter) and can then sort the rows with rows.sort() Because this two-step process is so common there is a builtin sorted() that converts an iterable (the rows here) into a sorted list. Now consider the following infile: $ cat infile.txt aXXXXXXXX XXXX| 0000000000000.00| 0000000011111.11| bXXXXXXXX XXXX| 0000000000000.00| 0000000011111.11| XXXXXXXXX XXXX| 0000000000000.00| 0000000088115.39| XXXXXXXXX XXXX| 0000000090453.29| 0000000000000.00| XXXXXXXXX XXXX| 0000000000000.00| 0000000090443.29| cXXXXXXXX XXXX| 0000000011111.11| 0000000000000.00| XXXXXXXXX XXXX| 0000000088115.39| 0000000000000.00| XXXXXXXXX XXXX| 0000000000000.00| 0000000088335.39| XXXXXXXXX XXXX| 0000000090453.29| 0000000000000.00| XXXXXXXXX XXXX| 0000000088335.39| 0000000000000.00| XXXXXXXXX XXXX| 0000000090443.29| 0000000000000.00| dXXXXXXXX XXXX| 0000000011111.11| 0000000000000.00| If we read it and sort it we get the following: >>> import csv >>> with open("infile.txt") as fileobj: ... rows = sorted(csv.reader(fileobj, delimiter="|")) ... >>> from pprint import pprint >>> pprint(rows) [['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000088115.39', ''], ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000088335.39', ''], ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000090443.29', ''], ['XXXXXXXXX XXXX', ' 0000000088115.39', ' 0000000000000.00', ''], ['XXXXXXXXX XXXX', ' 0000000088335.39', ' 0000000000000.00', ''], ['XXXXXXXXX XXXX', ' 0000000090443.29', ' 0000000000000.00', ''], ['XXXXXXXXX XXXX', ' 0000000090453.29', ' 0000000000000.00', ''], ['XXXXXXXXX XXXX', ' 0000000090453.29', ' 0000000000000.00', ''], ['aXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000011111.11', ''], ['bXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000011111.11', ''], ['cXXXXXXXX XXXX', ' 0000000011111.11', ' 0000000000000.00', ''], ['dXXXXXXXX XXXX', ' 0000000011111.11', ' 0000000000000.00', '']] Can you infer the sort order of the list of lists above? The rows are sorted by the first item in the list, then rows whose first item compares equal are sorted by the second and so on. This sort order is not something built into the sort() method, but rather the objects that are compared. sort() uses the usual operators like < and == internally. Now what would you do if you wanted to sort your data by the third column, say? Here the key parameter comes into play. You can use it to provide a function that takes an item in the list to be sorted and returns something that is used instead of the items to compare them to each other: >> def extract_third_column(row): ... return row[2] ... >>> rows.sort(key=extract_third_column) >>> pprint(rows) [['XXXXXXXXX XXXX', ' 0000000088115.39', ' 0000000000000.00', ''], ['XXXXXXXXX XXXX', ' 0000000088335.39', ' 0000000000000.00', ''], ['XXXXXXXXX XXXX', ' 0000000090443.29', ' 0000000000000.00', ''], ['XXXXXXXXX XXXX', ' 0000000090453.29', ' 0000000000000.00', ''], ['XXXXXXXXX XXXX', ' 0000000090453.29', ' 0000000000000.00', ''], ['cXXXXXXXX XXXX', ' 0000000011111.11', ' 0000000000000.00', ''], ['dXXXXXXXX XXXX', ' 0000000011111.11', ' 0000000000000.00', ''], ['aXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000011111.11', ''], ['bXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000011111.11', ''], ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000088115.39', ''], ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000088335.39', ''], ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000090443.29', '']] The key function you actually need is a bit more sophisticated. You want rows with equal nonzero values to end close together, no matter whether the interesting value is in the second or third column. Let's try: >>> def extract_nonzero_column(row): ... if row[1] == ' 0000000000000.00': ... return row[2] ... else: ... return row[1] ... Here's a preview of the keys: >>> for row in rows: ... print extract_nonzero_column(row) ... 0000000088115.39 0000000088335.39 0000000090443.29 0000000090453.29 0000000090453.29 0000000011111.11 0000000011111.11 0000000011111.11 0000000011111.11 0000000088115.39 0000000088335.39 0000000090443.29 Looks good, let's apply: >>> pprint(sorted(rows, key=extract_nonzero_column)) [['cXXXXXXXX XXXX', ' 0000000011111.11', ' 0000000000000.00', ''], ['dXXXXXXXX XXXX', ' 0000000011111.11', ' 0000000000000.00', ''], ['aXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000011111.11', ''], ['bXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000011111.11', ''], ['XXXXXXXXX XXXX', ' 0000000088115.39', ' 0000000000000.00', ''], ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000088115.39', ''], ['XXXXXXXXX XXXX', ' 0000000088335.39', ' 0000000000000.00', ''], ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000088335.39', ''], ['XXXXXXXXX XXXX', ' 0000000090443.29', ' 0000000000000.00', ''], ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000090443.29', ''], ['XXXXXXXXX XXXX', ' 0000000090453.29', ' 0000000000000.00', ''], ['XXXXXXXXX XXXX', ' 0000000090453.29', ' 0000000000000.00', '']] Almost there, but we want the rows with nonzero values in the third column before those with nonzero values in the second. Emile's and my solution was to add a flag to the sort key, True if the nonzero value is in the first, False else because False < True in python: >>> sorted([True, False, False, True, False]) [False, False, False, True, True] Just as with the rows the flag is the second item in the result tuple, and is only used to sort the items if their first value compares equal. >>> def sortkey(row): ... if row[1] == ' 0000000000000.00': ... return row[2], False ... else: ... return row[1], True ... >>> rows.sort(key=sortkey) >>> pprint(rows) [['aXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000011111.11', ''], ['bXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000011111.11', ''], ['cXXXXXXXX XXXX', ' 0000000011111.11', ' 0000000000000.00', ''], ['dXXXXXXXX XXXX', ' 0000000011111.11', ' 0000000000000.00', ''], ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000088115.39', ''], ['XXXXXXXXX XXXX', ' 0000000088115.39', ' 0000000000000.00', ''], ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000088335.39', ''], ['XXXXXXXXX XXXX', ' 0000000088335.39', ' 0000000000000.00', ''], ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000090443.29', ''], ['XXXXXXXXX XXXX', ' 0000000090443.29', ' 0000000000000.00', ''], ['XXXXXXXXX XXXX', ' 0000000090453.29', ' 0000000000000.00', ''], ['XXXXXXXXX XXXX', ' 0000000090453.29', ' 0000000000000.00', '']] Now you just have to write the result back to a file. Two problems remain: unpaired values are not detected and duplicate values destroy the right-left-right-left alteration, too. Bonus: Python's list.sort() method is "stable", it doesn't affect the relative order of items in the list that compare equal. This allows to replace one sort with a complex key with two sorts with simpler keys: >>> rows.sort(key=lambda row: row[1]) >>> rows.sort(key=lambda row: max(row[1:])) _______________________________________________ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor