Re: [Tutor] List processing question - consolidating duplicate entries

2007-11-29 Thread Kent Johnson
Richard Querin wrote:
> import itertools, operator
> for k, g in itertools.groupby(sorted(data), key=operator.itemgetter(0, 1, 2, 3)):
>     print k, sum(item[4] for item in g)
> 
> 
> 
> I'm trying to understand what's going on in the for statement but I'm 
> having troubles. The interpreter is telling me that itemgetter expects 1 
> argument and is getting 4.

You must be using an older version of Python; the ability to pass 
multiple arguments to itemgetter was added in 2.5. Meanwhile it's easy 
enough to define your own:
def make_key(item):
    return item[:4]

and then specify key=make_key.
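For instance, a minimal sketch (the sample rows here are invented for illustration; this avoids itemgetter entirely, so it runs on older Pythons too):

```python
import itertools

# sample rows in the [name, job#, jobname, workcode, hours] shape
data = [['Bob', '07129', 'projectA', '4001', 5],
        ['Bob', '07129', 'projectA', '4001', 4],
        ['Bob', '07129', 'projectA', '5001', 2]]

def make_key(item):
    # group on the first four columns, mirroring itemgetter(0, 1, 2, 3)
    return item[:4]

# sort, group on the same key, then total the hours in each group
totals = [(k, sum(item[4] for item in g))
          for k, g in itertools.groupby(sorted(data), key=make_key)]
```

Sorting the full rows is enough here: rows that agree on the first four columns end up adjacent, which is all groupby() needs.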

BTW when you want help with an error, please copy and paste the entire 
error message and traceback into your email.

> I understand that groupby takes 2 parameters the first being the sorted 
> list. The second is a key and this is where I'm confused. The itemgetter 
> function is going to return a tuple of functions (f[0],f[1],f[2],f[3]).

No, it returns one function that will return a tuple of values.
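A quick illustration of the difference (the row is invented here):

```python
import operator

row = ['Bob', '07129', 'projectA', '4001', 5]

# itemgetter with one index returns that single value...
get_name = operator.itemgetter(0)

# ...while itemgetter with several indices (Python 2.5+) is still
# one function; calling it produces a single tuple of values
get_key = operator.itemgetter(0, 1, 2, 3)

name = get_name(row)   # 'Bob'
key = get_key(row)     # ('Bob', '07129', 'projectA', '4001')
```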

> Should I only be calling itemgetter with whatever element (0 to 3) that 
> I want to group the items by?

If you do that it will only group by the single item you specify. 
groupby() doesn't sort so you should also sort by the same key. But I 
don't think that is what you want.

Kent
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] List processing question - consolidating duplicate entries

2007-11-29 Thread Richard Querin
On Nov 27, 2007 5:40 PM, Kent Johnson <[EMAIL PROTECTED]> wrote:

>
> This is a two-liner using itertools.groupby() and operator.itemgetter:
>
> data = [['Bob', '07129', 'projectA', '4001',5],
> ['Bob', '07129', 'projectA', '5001',2],
> ['Bob', '07101', 'projectB', '4001',1],
> ['Bob', '07140', 'projectC', '3001',3],
> ['Bob', '07099', 'projectD', '3001',2],
> ['Bob', '07129', 'projectA', '4001',4],
> ['Bob', '07099', 'projectD', '4001',3],
> ['Bob', '07129', 'projectA', '4001',2]
> ]
>
> import itertools, operator
> for k, g in itertools.groupby(sorted(data), key=operator.itemgetter(0, 1, 2, 3)):
>     print k, sum(item[4] for item in g)
>


I'm trying to understand what's going on in the for statement but I'm having
troubles. The interpreter is telling me that itemgetter expects 1 argument
and is getting 4.

I understand that groupby takes 2 parameters the first being the sorted
list. The second is a key and this is where I'm confused. The itemgetter
function is going to return a tuple of functions (f[0],f[1],f[2],f[3]).

Should I only be calling itemgetter with whatever element (0 to 3) that I
want to group the items by?

I'm almost getting this but not quite. ;)

RQ


Re: [Tutor] List processing question - consolidating duplicate entries

2007-11-28 Thread Kent Johnson
Michael Langford wrote:
> What you want is a set of entries.

Not really; he wants to aggregate entries.

> # remove duplicate entries
> #
> #  myEntries is a list of lists,
> #such as [[1,2,3],[1,2,"foo"],[1,2,3]]
> #
> s=set()
> [s.add(tuple(x)) for x in myEntries]

A set can be constructed directly from a sequence so this can be written as
  s=set(tuple(x) for x in myEntries)

BTW, I personally think it is bad style to use a list comprehension just 
for the side effect of iteration; IMO it is clearer to write out the 
loop when you want a loop.
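For the record, the same de-duplication written as a plain loop:

```python
myEntries = [[1, 2, 3], [1, 2, 'foo'], [1, 2, 3]]

# build the set with an explicit loop instead of a throwaway
# list comprehension; the side effect is now the whole point
s = set()
for x in myEntries:
    s.add(tuple(x))
```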

Kent


Re: [Tutor] List processing question - consolidating duplicate entries

2007-11-28 Thread Tiger12506
> s=set()
> [s.add(tuple(x)) for x in myEntries]
> myEntries = [list(x) for x in list(s)]

This could be written more concisely as...

s = set(tuple(x) for x in myEntries)
myEntries = [list(x) for x in list(s)]

Generator expressions are really cool.

Not exactly what the OP asked for, though. He wanted to eliminate duplicates 
by adding their last columns; he said the last column is a number of hours 
that pertains to the first four columns. Your method will not get rid of 
duplicate projects, only rows where the number of hours happens to be equal 
as well, and of course it will not add the hours together as it should. I 
like the dictionary approach for this personally...

di = {}
for x in myEntries:
    try: di[tuple(x[:4])] += x[4]  # key on the first four columns; tuple() because lists aren't hashable
    except KeyError: di[tuple(x[:4])] = x[4]

This can be written even smaller and cleaner if you use the default value 
method... 
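One form of that default-value approach is collections.defaultdict (a sketch; sample rows invented here, and note the key must be a tuple of the first four columns since lists are not hashable):

```python
from collections import defaultdict

myEntries = [['Bob', '07129', 'projectA', '4001', 5],
             ['Bob', '07129', 'projectA', '4001', 4],
             ['Bob', '07129', 'projectA', '5001', 2]]

# defaultdict(int) supplies a starting value of 0 on first access,
# so the try/except disappears entirely
di = defaultdict(int)
for x in myEntries:
    di[tuple(x[:4])] += x[4]
```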



Re: [Tutor] List processing question - consolidating duplicate entries

2007-11-27 Thread Michael Langford
What you want is a set of entries. Unfortunately, Python lists are not
"hashable", which means you have to convert them to something hashable
before you can use the Python set datatype.

What you'd like to do is add each to a set while converting them to a
tuple, then convert them back out of the set. In python that is:

#
# remove duplicate entries
#
#  myEntries is a list of lists,
#such as [[1,2,3],[1,2,"foo"],[1,2,3]]
#
s=set()
[s.add(tuple(x)) for x in myEntries]
myEntries = [list(x) for x in list(s)]

List comprehensions are useful for all sorts of list work, this included.

Do not use a database, that would be very ugly and time consuming too.

This is cleaner than the dict keys approach, as you'd *also* have to
convert to tuples for that.

If you need this in non-comprehension form, I'd be happy to write
one if that makes it clearer what's happening.

  --Michael
-- 
Michael Langford
Phone: 404-386-0495
Consulting: http://www.RowdyLabs.com


Re: [Tutor] List processing question - consolidating duplicate entries

2007-11-27 Thread Kent Johnson
bob gailer wrote:
> 2 - Sort the list. Create a new list with an entry for the first name, 
> project, workcode. Step thru the list. Each time the name, project, 
> workcode is the same, accumulate hours. When any of those change, create 
> a list entry for the next name, project, workcode and again start 
> accumulating hours.

This is a two-liner using itertools.groupby() and operator.itemgetter:

data = [['Bob', '07129', 'projectA', '4001',5],
['Bob', '07129', 'projectA', '5001',2],
['Bob', '07101', 'projectB', '4001',1],
['Bob', '07140', 'projectC', '3001',3],
['Bob', '07099', 'projectD', '3001',2],
['Bob', '07129', 'projectA', '4001',4],
['Bob', '07099', 'projectD', '4001',3],
['Bob', '07129', 'projectA', '4001',2]
]

import itertools, operator
for k, g in itertools.groupby(sorted(data), key=operator.itemgetter(0, 1, 2, 3)):
    print k, sum(item[4] for item in g)

For some explanation see my recent post:
http://mail.python.org/pipermail/tutor/2007-November/058753.html

Kent


Re: [Tutor] List processing question - consolidating duplicate entries

2007-11-27 Thread bob gailer
Richard Querin wrote:
> I'm trying to process a list and I'm stuck. Hopefully someone can help
> me out here:
>
> I've got a list that is formatted as follows:
> [Name,job#,jobname,workcode,hours]
>
> An example might be:
>
> [Bob,07129,projectA,4001,5]
> [Bob,07129,projectA,5001,2]
> [Bob,07101,projectB,4001,1]
> [Bob,07140,projectC,3001,3]
> [Bob,07099,projectD,3001,2]
> [Bob,07129,projectA,4001,4]
> [Bob,07099,projectD,4001,3]
> [Bob,07129,projectA,4001,2]
>
> Now I'd like to consolidate entries that are duplicates. Duplicates
> meaning entries that share the same Name, job#, jobname and workcode.
> So for the list above, there are 3 entries for projectA which have a
> workcode of 4001. (there is a fourth entry for projectA but it's
> workcode is 5001 and not 4001).
>
> So I'd like to end up with a list so that the three duplicate entries
> are consolidated into one with their hours added up:
>
> [Bob,07129,projectA,4001,11]
> [Bob,07129,projectA,5001,2]
> [Bob,07101,projectB,4001,1]
> [Bob,07140,projectC,3001,3]
> [Bob,07099,projectD,3001,2]
> [Bob,07099,projectD,4001,3]
There are at least 2 more approaches.

1 - Use sqlite (or some other database), insert the data into the 
database, then run a sql statement to sum(hours) group by name, project, 
workcode.
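A sketch of approach 1 with the standard sqlite3 module (the table and column names are invented here):

```python
import sqlite3

data = [('Bob', '07129', 'projectA', '4001', 5),
        ('Bob', '07129', 'projectA', '5001', 2),
        ('Bob', '07129', 'projectA', '4001', 4)]

conn = sqlite3.connect(':memory:')  # throwaway in-memory database
conn.execute('CREATE TABLE entries (name TEXT, job TEXT, project TEXT, '
             'workcode TEXT, hours INTEGER)')
conn.executemany('INSERT INTO entries VALUES (?, ?, ?, ?, ?)', data)

# let SQL do the grouping and summing
rows = conn.execute('SELECT name, job, project, workcode, SUM(hours) '
                    'FROM entries GROUP BY name, job, project, workcode '
                    'ORDER BY workcode').fetchall()
```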

2 - Sort the list. Create a new list with an entry for the first name, 
project, workcode. Step thru the list. Each time the name, project, 
workcode is the same, accumulate hours. When any of those change, create 
a list entry for the next name, project, workcode and again start 
accumulating hours.

The last is IMHO the most straightforward, and easiest to code.
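A sketch of approach 2 (a few sample rows borrowed from the original post):

```python
data = [['Bob', '07129', 'projectA', '4001', 5],
        ['Bob', '07101', 'projectB', '4001', 1],
        ['Bob', '07129', 'projectA', '4001', 4]]

result = []
for row in sorted(data):
    if result and result[-1][:4] == row[:4]:
        # same name/job#/project/workcode as the previous row: accumulate hours
        result[-1][4] += row[4]
    else:
        # key changed: start a new output row (copied, so the input survives)
        result.append(list(row))
```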


Re: [Tutor] List processing question - consolidating duplicate entries

2007-11-27 Thread John Fouhy
On 28/11/2007, Richard Querin <[EMAIL PROTECTED]> wrote:
> I've got a list that is formatted as follows:
> [Name,job#,jobname,workcode,hours]
[...]
> Now I'd like to consolidate entries that are duplicates. Duplicates
> meaning entries that share the same Name, job#, jobname and workcode.
> So for the list above, there are 3 entries for projectA which have a
> workcode of 4001. (there is a fourth entry for projectA but it's
> workcode is 5001 and not 4001).

You use a dictionary: pull out the jobname and workcode as the dictionary key.


import operator

# if job is an element of the list, then jobKey(job) will be (jobname, workcode)
jobKey = operator.itemgetter(2, 3)

jobList = [...]  # the list of jobs

jobDict = {}

for job in jobList:
    try:
        jobDict[jobKey(job)][4] += job[4]
    except KeyError:
        jobDict[jobKey(job)] = job

(note that this will modify the jobs in your original list... if this
is Bad, you can replace the last line with "... = job[:]")
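Once the loop finishes, the consolidated rows can be read back out of jobDict. A self-contained sketch (sample rows invented here, using the job[:] copy just mentioned):

```python
import operator

jobKey = operator.itemgetter(2, 3)  # (jobname, workcode)

jobList = [['Bob', '07129', 'projectA', '4001', 5],
           ['Bob', '07129', 'projectA', '4001', 4]]

jobDict = {}
for job in jobList:
    try:
        jobDict[jobKey(job)][4] += job[4]
    except KeyError:
        jobDict[jobKey(job)] = job[:]  # copy, so the input rows are untouched

consolidated = sorted(jobDict.values())
```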

HTH!

-- 
John.


Re: [Tutor] List processing

2005-06-01 Thread Alan G
> I have a load of files I need to process. Each line of a file looks
> something like this:
>
> eYAL001C1 Spar 81 3419 4518 4519 2 1
>
> So basically it's a table, separated with tabs. What I need to do is
> make a new file where all the entries in the table are those where
> the values in columns 1 and 5 were present as a pair more than once
> in the original file.

My immediate answer would be to use awk.

However if that's not possible or desirable then look at the fileinput
module and the string.split function.
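A rough sketch along those lines, with fileinput doing the reading and split() the column work (column numbers assumed 1-based, hence indices 0 and 4; the temp file just keeps the example self-contained):

```python
import fileinput
import os
import tempfile

# fake input file with tab-separated rows; the (col1, col5) pair
# ('a', '4518') appears twice, ('b', '9999') only once
sample = ('a\tSpar\t81\t3419\t4518\t4519\t2\t1\n'
          'a\tSpar\t82\t3420\t4518\t4519\t2\t1\n'
          'b\tSpar\t81\t3419\t9999\t4519\t2\t1\n')
tmp = tempfile.NamedTemporaryFile('w', delete=False)
tmp.write(sample)
tmp.close()

# pass 1: count each (col1, col5) pair
counts = {}
for line in fileinput.input(tmp.name):
    cols = line.split('\t')
    key = (cols[0], cols[4])
    counts[key] = counts.get(key, 0) + 1
fileinput.close()

# pass 2: keep only the lines whose pair occurred more than once
with open(tmp.name) as f:
    kept = [line for line in f
            if counts[(line.split('\t')[0], line.split('\t')[4])] > 1]
os.unlink(tmp.name)
```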

Alan G



Re: [Tutor] List processing

2005-06-01 Thread Kent Johnson
[EMAIL PROTECTED] wrote:
> Hi,
> 
> I have a load of files I need to process. Each line of a file looks 
> something like this:
> 
> eYAL001C1  Spar  81  3419  4518  4519  2  1
> 
> So basically its a table, separated with tabs. What I need to do is make a 
> new file where all the entries in the table are those where the values in 
> columns 1 and 5 were present as a pair more than once in the original file.
> 
> I really have very little idea how to achiev this. So far I read in the 
> file to a list , where each item in the list is a list of the entries on a 
> line.

I would do this with two passes over the data. The first pass would accumulate 
lines and count pairs 
of (col1, col5); the second pass would output the lines whose count is > 1. 
Something like this 
(untested):

lines = []
counts = {}

# Build a list of split lines and count the (col1, col5) pairs
for line in open('input.txt'):
    line = line.split()  # split the line on whitespace (the fields are tab-separated)
    key = (line[1], line[5])  # or (line[0], line[4]), depending on what you mean by col 1
    counts[key] = counts.get(key, 0) + 1  # count the key pair
    lines.append(line)

# Output the lines whose pairs appear more than once
f = open('output.txt', 'w')
for line in lines:
    if counts[(line[1], line[5])] > 1:
        f.write('\t'.join(line))
        f.write('\n')
f.close()

Kent



Re: [Tutor] List processing

2005-06-01 Thread Terry Carroll
On 1 Jun 2005 [EMAIL PROTECTED] wrote:

> eYAL001C1  Spar  81  3419  4518  4519  2  1
> 
> So basically its a table, separated with tabs. What I need to do is make
> a new file where all the entries in the table are those where the values
> in columns 1 and 5 were present as a pair more than once in the original
> file.

This is half-baked, but I toss it out in case anyone can build on it.

Create a dictionary, keyed on column 1.  Read a line and split it into 
the columns.  For each line, create a dictionary entry that is a 
dictionary keyed by column 5, whose entry is a list of lists, the inner 
list of which contains columns 2, 3, 4 and 6.  When a dupe is found, add 
an additional inner list.

So, upon processing this line, you have a dictionary D:

{'eYAL001C1': {'4518': [['Spar', '3419', '4519', '2', '1']]}}

As you process each new line, one of three things is true:

 1) Col 1 is used as a key, but col5 is not used as an inner key;
 2) Col 1 is used as a key, and col5 is used as an inner key
 3) Col 1 is not used as a key

So, for each new line:

 if col1 in d:
     if col5 in d[col1]:
         d[col1][col5].append([col2, col3, col4, col6])
     else:
         d[col1][col5] = [[col2, col3, col4, col6]]
 else:
     d[col1] = {col5: [[col2, col3, col4, col6]]}


The end result is that you'll have all your data from the file in the form 
of a dictionary indexed by column 1.  Each entry in the top-level 
dictionary is a second-level dictionary indexed by column 5.  Each entry 
in that second-level dictionary is a list of lists, and each list in that 
list of lists is columns 2, 3, 4 and 6.

If the list of lists has a length of 1, then the col1/col5 combo only 
appears once in the input file.  But if it has a length > 1, it occurred 
more than once, and satisfies your condition of "columns 1 and 5 were 
present as a pair more than once".

So to get at these:

 for key1 in d:
     for key2 in d[key1]:
         if len(d[key1][key2]) > 1:
             for l in d[key1][key2]:
                 print key1, l[0], l[1], l[2], key2, l[3]

I haven't tested this approach (or syntax) but I think the approach is 
basically sound.



Re: [Tutor] List processing

2005-06-01 Thread Danny Yoo


On 1 Jun 2005 [EMAIL PROTECTED] wrote:

> I have a load of files I need to process.

[text cut]

> So basically its a table, separated with tabs. What I need to do is make
> a new file where all the entries in the table are those where the values
> in columns 1 and 5 were present as a pair more than once in the original
> file.


Hi Chris,

Have you thought about sorting?

If you sort them based on specific columns, then elements with the same
columns will cluster together in runs.  So you may not even need Python
much in this case; piping your input through 'sort -k1,5' might do the
brunt of the work.

If you want to do this with Python alone, that's doable too in a fairly
straightforward way.  Are you familiar with the "dictionary" data
structure yet?


Best of wishes to you!
