Re: [Tutor] List processing question - consolidating duplicate entries
Richard Querin wrote:
> import itertools, operator
> for k, g in itertools.groupby(sorted(data), key=operator.itemgetter(0,
> 1, 2, 3)):
>     print k, sum(item[4] for item in g)
>
> I'm trying to understand what's going on in the for statement but I'm
> having troubles. The interpreter is telling me that itemgetter expects 1
> argument and is getting 4.

You must be using an older version of Python; the ability to pass
multiple arguments to itemgetter was added in 2.5. Meanwhile it's easy
enough to define your own:

def make_key(item):
    return item[:4]

and then specify key=make_key.

BTW when you want help with an error, please copy and paste the entire
error message and traceback into your email.

> I understand that groupby takes 2 parameters, the first being the sorted
> list. The second is a key and this is where I'm confused. The itemgetter
> function is going to return a tuple of functions (f[0],f[1],f[2],f[3]).

No, it returns one function that will return a tuple of values.

> Should I only be calling itemgetter with whatever element (0 to 3) that
> I want to group the items by?

If you do that it will only group by the single item you specify.
groupby() doesn't sort, so you should also sort by the same key. But I
don't think that is what you want.

Kent
___
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
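Kent's point that itemgetter(0, 1, 2, 3) yields a single callable (not a tuple of functions) is easy to verify. A minimal sketch, using a sample row shaped like the thread's data; note it uses Python 3 print syntax, unlike the 2.x code in the thread, and make_key here returns a tuple (Kent's returned a list slice, which works just as well as a sort/group key):

```python
import operator

# itemgetter(0, 1, 2, 3) returns ONE callable; calling it on a row
# returns a tuple of the four requested fields (Python 2.5+).
row = ['Bob', '07129', 'projectA', '4001', 5]
key = operator.itemgetter(0, 1, 2, 3)
print(key(row))  # ('Bob', '07129', 'projectA', '4001')

# The hand-rolled equivalent for older Pythons:
def make_key(item):
    return tuple(item[:4])

print(make_key(row) == key(row))  # True
```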
Re: [Tutor] List processing question - consolidating duplicate entries
On Nov 27, 2007 5:40 PM, Kent Johnson <[EMAIL PROTECTED]> wrote:
> This is a two-liner using itertools.groupby() and operator.itemgetter:
>
> data = [['Bob', '07129', 'projectA', '4001',5],
>         ['Bob', '07129', 'projectA', '5001',2],
>         ['Bob', '07101', 'projectB', '4001',1],
>         ['Bob', '07140', 'projectC', '3001',3],
>         ['Bob', '07099', 'projectD', '3001',2],
>         ['Bob', '07129', 'projectA', '4001',4],
>         ['Bob', '07099', 'projectD', '4001',3],
>         ['Bob', '07129', 'projectA', '4001',2]
>         ]
>
> import itertools, operator
> for k, g in itertools.groupby(sorted(data), key=operator.itemgetter(0,
> 1, 2, 3)):
>     print k, sum(item[4] for item in g)

I'm trying to understand what's going on in the for statement but I'm
having troubles. The interpreter is telling me that itemgetter expects 1
argument and is getting 4.

I understand that groupby takes 2 parameters, the first being the sorted
list. The second is a key, and this is where I'm confused. The itemgetter
function is going to return a tuple of functions (f[0],f[1],f[2],f[3]).
Should I only be calling itemgetter with whatever element (0 to 3) that
I want to group the items by?

I'm almost getting this but not quite. ;)

RQ
Re: [Tutor] List processing question - consolidating duplicate entries
Michael Langford wrote:
> What you want is a set of entries.

Not really; he wants to aggregate entries.

> # remove duplicate entries
> #
> # myEntries is a list of lists,
> # such as [[1,2,3],[1,2,"foo"],[1,2,3]]
> #
> s=set()
> [s.add(tuple(x)) for x in myEntries]

A set can be constructed directly from a sequence, so this can be
written as

s = set(tuple(x) for x in myEntries)

BTW I personally think it is bad style to use a list comprehension just
for the side effect of iteration; IMO it is clearer to write out the
loop when you want a loop.

Kent
Re: [Tutor] List processing question - consolidating duplicate entries
> s=set()
> [s.add(tuple(x)) for x in myEntries]
> myEntries = [list(x) for x in list(s)]

This could be written more concisely as...

s = set(tuple(x) for x in myEntries)
myEntries = [list(x) for x in s]

Generator expressions are really cool.

But it's not quite what the OP asked for. He wanted to eliminate
duplicates by adding their last columns. He said the last column is a
number of hours that pertains to the first four columns. When you apply
your method, it will not consolidate duplicate projects, only the cases
where the number of hours happens to be equal as well, and of course it
will not add the hours as it should.

I like the dictionary approach for this personally...

di = {}
for x in myEntries:
    # the key must be hashable, so use a tuple of the first four fields
    try:
        di[tuple(x[:4])] += x[4]
    except KeyError:
        di[tuple(x[:4])] = x[4]

This can be written even smaller and cleaner if you use the default
value method...
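The message trails off at "the default value method"; presumably it means dict.get with a default, or collections.defaultdict. A minimal sketch of both, using sample rows shaped like the thread's data (the variable names here are illustrative, not from the original post):

```python
from collections import defaultdict

# Rows look like [name, job#, jobname, workcode, hours]; hours are
# accumulated per (name, job#, jobname, workcode) key.
myEntries = [
    ['Bob', '07129', 'projectA', '4001', 5],
    ['Bob', '07129', 'projectA', '4001', 4],
    ['Bob', '07129', 'projectA', '5001', 2],
]

totals = defaultdict(int)          # missing keys start at 0
for x in myEntries:
    totals[tuple(x[:4])] += x[4]

# The same thing with dict.get and an explicit default:
totals2 = {}
for x in myEntries:
    key = tuple(x[:4])
    totals2[key] = totals2.get(key, 0) + x[4]

print(totals[('Bob', '07129', 'projectA', '4001')])  # 9
```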
Re: [Tutor] List processing question - consolidating duplicate entries
What you want is a set of entries. Unfortunately, Python lists are not
"hashable", which means you have to convert them to something hashable
before you can use the Python set datatype. What you'd like to do is add
each to a set while converting them to a tuple, then convert them back
out of the set. In Python that is:

#
# remove duplicate entries
#
# myEntries is a list of lists,
# such as [[1,2,3],[1,2,"foo"],[1,2,3]]
#
s = set()
[s.add(tuple(x)) for x in myEntries]
myEntries = [list(x) for x in list(s)]

List comprehensions are useful for all sorts of list work, this
included.

Do not use a database; that would be very ugly and time consuming too.
This is cleaner than the dict keys approach, as you'd *also* have to
convert to tuples for that.

If you need this in non-list-comprehension form, I'd be happy to write
one if that's clearer to you on what's happening.

--Michael

--
Michael Langford
Phone: 404-386-0495
Consulting: http://www.RowdyLabs.com
Re: [Tutor] List processing question - consolidating duplicate entries
bob gailer wrote:
> 2 - Sort the list. Create a new list with an entry for the first name,
> project, workcode. Step thru the list. Each time the name, project,
> workcode is the same, accumulate hours. When any of those change,
> create a list entry for the next name, project, workcode and again
> start accumulating hours.

This is a two-liner using itertools.groupby() and operator.itemgetter:

data = [['Bob', '07129', 'projectA', '4001',5],
        ['Bob', '07129', 'projectA', '5001',2],
        ['Bob', '07101', 'projectB', '4001',1],
        ['Bob', '07140', 'projectC', '3001',3],
        ['Bob', '07099', 'projectD', '3001',2],
        ['Bob', '07129', 'projectA', '4001',4],
        ['Bob', '07099', 'projectD', '4001',3],
        ['Bob', '07129', 'projectA', '4001',2]
        ]

import itertools, operator
for k, g in itertools.groupby(sorted(data), key=operator.itemgetter(0,
1, 2, 3)):
    print k, sum(item[4] for item in g)

For some explanation see my recent post:
http://mail.python.org/pipermail/tutor/2007-November/058753.html

Kent
Re: [Tutor] List processing question - consolidating duplicate entries
Richard Querin wrote:
> I'm trying to process a list and I'm stuck. Hopefully someone can help
> me out here:
>
> I've got a list that is formatted as follows:
> [Name,job#,jobname,workcode,hours]
>
> An example might be:
>
> [Bob,07129,projectA,4001,5]
> [Bob,07129,projectA,5001,2]
> [Bob,07101,projectB,4001,1]
> [Bob,07140,projectC,3001,3]
> [Bob,07099,projectD,3001,2]
> [Bob,07129,projectA,4001,4]
> [Bob,07099,projectD,4001,3]
> [Bob,07129,projectA,4001,2]
>
> Now I'd like to consolidate entries that are duplicates. Duplicates
> meaning entries that share the same Name, job#, jobname and workcode.
> So for the list above, there are 3 entries for projectA which have a
> workcode of 4001. (There is a fourth entry for projectA but its
> workcode is 5001 and not 4001.)
>
> So I'd like to end up with a list so that the three duplicate entries
> are consolidated into one with their hours added up:
>
> [Bob,07129,projectA,4001,11]
> [Bob,07129,projectA,5001,2]
> [Bob,07101,projectB,4001,1]
> [Bob,07140,projectC,3001,3]
> [Bob,07099,projectD,3001,2]
> [Bob,07099,projectD,4001,3]

There are at least 2 more approaches:

1 - Use sqlite (or some other database), insert the data into the
database, then run a SQL statement with sum(hours) grouped by name,
project, workcode.

2 - Sort the list. Create a new list with an entry for the first name,
project, workcode. Step thru the list. Each time the name, project,
workcode is the same, accumulate hours. When any of those change, create
a list entry for the next name, project, workcode and again start
accumulating hours.

The last is IMHO the most straightforward, and easiest to code.
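Approach 2 can be sketched in a few lines. A rough illustration using some of the OP's rows (sorting puts rows with the same key next to each other, so the accumulation only ever has to look at the previous consolidated entry):

```python
# A sketch of approach 2: sort, then walk the rows, accumulating hours
# whenever the (name, job#, jobname, workcode) prefix repeats.
data = [
    ['Bob', '07129', 'projectA', '4001', 5],
    ['Bob', '07129', 'projectA', '5001', 2],
    ['Bob', '07129', 'projectA', '4001', 4],
    ['Bob', '07099', 'projectD', '3001', 2],
]

consolidated = []
for row in sorted(data):
    if consolidated and consolidated[-1][:4] == row[:4]:
        consolidated[-1][4] += row[4]      # same key: accumulate hours
    else:
        consolidated.append(row[:])        # new key: start a fresh entry

for row in consolidated:
    print(row)
```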
Re: [Tutor] List processing question - consolidating duplicate entries
On 28/11/2007, Richard Querin <[EMAIL PROTECTED]> wrote:
> I've got a list that is formatted as follows:
> [Name,job#,jobname,workcode,hours]
[...]
> Now I'd like to consolidate entries that are duplicates. Duplicates
> meaning entries that share the same Name, job#, jobname and workcode.
> So for the list above, there are 3 entries for projectA which have a
> workcode of 4001. (there is a fourth entry for projectA but it's
> workcode is 5001 and not 4001).

You use a dictionary: pull out the jobname and workcode as the
dictionary key.

import operator

# if job is an element of the list, then jobKey(job) will be
# (jobname, workcode)
jobKey = operator.itemgetter(2, 3)

jobList = [...]  # the list of jobs
jobDict = {}
for job in jobList:
    try:
        jobDict[jobKey(job)][4] += job[4]
    except KeyError:
        jobDict[jobKey(job)] = job

(Note that this will modify the jobs in your original list... if this is
Bad, you can replace the last line with "... = job[:]")

HTH!

--
John.
Re: [Tutor] List processing
> I have a load of files I need to process. Each line of a file looks
> something like this:
>
> eYAL001C1 Spar 81 3419 4518 4519 2 1
>
> So basically it's a table, separated with tabs. What I need to do is
> make a new file where all the entries in the table are those where the
> values in columns 1 and 5 were present as a pair more than once in the
> original file.

My immediate answer would be to use awk. However if that's not possible
or desirable then look at the fileinput module and the string.split
function.

Alan G
Re: [Tutor] List processing
[EMAIL PROTECTED] wrote:
> Hi,
>
> I have a load of files I need to process. Each line of a file looks
> something like this:
>
> eYAL001C1 Spar 81 3419 4518 4519 2 1
>
> So basically it's a table, separated with tabs. What I need to do is
> make a new file where all the entries in the table are those where the
> values in columns 1 and 5 were present as a pair more than once in the
> original file.
>
> I really have very little idea how to achieve this. So far I read in
> the file to a list, where each item in the list is a list of the
> entries on a line.

I would do this with two passes over the data. The first pass would
accumulate lines and count pairs of (col1, col5); the second pass would
output the lines whose count is > 1. Something like this (untested):

lines = []
counts = {}

# Build a list of split lines and count the (col1, col5) pairs
for line in open('input.txt'):
    line = line.split()  # split on whitespace (the tabs)
    key = (line[1], line[5])  # or (line[0], line[4]) depending on what you mean by col 1
    counts[key] = counts.get(key, 0) + 1  # count the key pair
    lines.append(line)

# Output the lines whose pairs appear more than once
f = open('output.txt', 'w')
for line in lines:
    if counts[(line[1], line[5])] > 1:
        f.write('\t'.join(line))
        f.write('\n')
f.close()

Kent
Re: [Tutor] List processing
On 1 Jun 2005 [EMAIL PROTECTED] wrote:
> eYAL001C1 Spar 81 3419 4518 4519 2 1
>
> So basically its a table, separated with tabs. What I need to do is
> make a new file where all the entries in the table are those where the
> values in columns 1 and 5 were present as a pair more than once in the
> original file.

This is half-baked, but I toss it out in case anyone can build on it.

Create a dictionary, keyed on column 1. Read a line and split it into
the columns. For each line, create a dictionary entry that is a
dictionary keyed by column 5, whose entry is a list of lists, the inner
list of which contains the remaining columns (2, 3, 4, 6, 7 and 8).
When a dupe is found, add an additional inner list.

So, upon processing this line, you have a dictionary d:

{'eYAL001C1': {'4518': [['Spar', '81', '3419', '4519', '2', '1']]}}

As you process each new line, one of three things is true:

1) Col 1 is used as a key, but col 5 is not used as an inner key;
2) Col 1 is used as a key, and col 5 is used as an inner key;
3) Col 1 is not used as a key.

So, for each new line:

if col1 in d:
    if col5 in d[col1]:
        d[col1][col5].append([col2, col3, col4, col6, col7, col8])
    else:
        d[col1][col5] = [[col2, col3, col4, col6, col7, col8]]
else:
    d[col1] = {col5: [[col2, col3, col4, col6, col7, col8]]}

The end result is that you'll have all your data from the file in the
form of a dictionary indexed by column 1. Each entry in the top-level
dictionary is a second-level dictionary indexed by column 5. Each entry
in that second-level dictionary is a list of lists, and each list in
that list of lists holds the remaining columns.

If the list of lists has a length of 1, then the col1/col5 combo only
appears once in the input file. But if it has a length > 1, it occurred
more than once, and satisfies your condition of "columns 1 and 5 were
present as a pair more than once". So to get at these:

for key1 in d:
    for key2 in d[key1]:
        if len(d[key1][key2]) > 1:
            for l in d[key1][key2]:
                print key1, l[0], l[1], l[2], key2, l[3], l[4], l[5]

I haven't tested this approach (or syntax) but I think the approach is
basically sound.
Re: [Tutor] List processing
On 1 Jun 2005 [EMAIL PROTECTED] wrote:
> I have a load of files I need to process.
[text cut]
> So basically its a table, separated with tabs. What I need to do is
> make a new file where all the entries in the table are those where the
> values in columns 1 and 5 were present as a pair more than once in the
> original file.

Hi Chris,

Have you thought about sorting? If you sort them based on specific
columns, then elements with the same columns will cluster together in
runs. So you may not even need Python much in this case; piping your
input through a 'sort -k1,5' might do the brunt of the work.

If you want to do this with Python alone, that's doable too in a fairly
straightforward way. Are you familiar with the "dictionary" data
structure yet?

Best of wishes to you!
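The dictionary route hinted at here might look something like the following rough sketch: count each (column 1, column 5) pair, then keep only the rows whose pair occurs more than once. The sample rows are illustrative stand-ins for the file, not data from the thread:

```python
# Count (col1, col5) pairs, then keep rows whose pair repeats.
rows = [
    ['eYAL001C1', 'Spar', '81', '3419', '4518', '4519', '2', '1'],
    ['eYAL002W',  'Spar', '10', '1000', '2000', '2001', '1', '1'],
    ['eYAL001C1', 'Smik', '77', '3400', '4518', '4520', '2', '1'],
]

counts = {}
for row in rows:
    pair = (row[0], row[4])          # columns 1 and 5, zero-indexed
    counts[pair] = counts.get(pair, 0) + 1

kept = [row for row in rows if counts[(row[0], row[4])] > 1]
for row in kept:
    print('\t'.join(row))
```

In a script this would read the rows with `line.split('\t')` from the input file and write `kept` back out, but the counting logic is the whole trick.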