Brian van den Broek wrote:
incorporating some of Wolfram's and Kent's (hope I've missed no one) suggestions:

<code>
def dups_in_list_report(a_list):
    '''Prints a duplication report for a list.'''

    items_dict = {}

    for i in a_list:
        items_dict[i] = items_dict.get(i, 0) + 1

    for k, v in sorted(items_dict.iteritems()):   # cf below
        if v > 1:
            print '%s occurred %s times' %(k, v)
</code>

And, I can't but agree that this is much better! Thanks folks.

In trying to improve the code, I first had:

    for key in sorted(items_dict.keys()):
        if items_dict[key] > 1:
            print '%s occurred %s times' %(key, items_dict[key])

in place of the for loop over .iteritems(). Am I right in thinking that the advantage of Kent's suggestion of .iteritems() is that it eliminates some of the dict lookups? Other advantages?

iteritems() doesn't create an intermediate list so that should be a slight time and memory savings. And my guess is that it is faster because it saves the lookups but if you care you should test. Mostly I prefer iteritems() because it seems cleaner and clearer to me.



Finally, in the first instance, I was aiming for the OP's stated end. To make this more general and reusable, I think I'd do:


<code>
def get_list_dup_dict(a_list, threshold=1):
    '''Returns a dict of items in list that occur threshold many times

    threshold defaults to 1. The dict returned has items occurring at least
    threshold many times as keys, and number of occurrences as values.
    '''

    items_dict, dup_dict = {}, {}   # Question below

    for i in a_list:
        items_dict[i] = items_dict.get(i, 0) + 1

    for k, v in items_dict.iteritems():
        if v >= threshold:
            dup_dict[k] = v    #Question below

    return dup_dict

def print_list_dup_report(a_list, threshold=1):
'''Prints report of items in a_list occurring at least threshold many times


threshold defaults to 1. get_list_dup_dict(a_list, threshold=0) is called.
returning a dict of items in list that occur at least threshold many times
as keys and their number of repetitions as values.


    This dict is looped over to print a sorted and formatted duplication
    report.
    '''

    dup_dict = get_list_dup_dict(a_list, threshold)
    for k, v in sorted(dup_dict.iteritems()):
        print '%s occurred %s times' %(k, v)
</code>


My question (from comment in new code):

Since I've split the task into two functions, one to return a duplication dictionary, the other to print a report based on it, I think the distinct dup_dict is needed. (I do want the get_list_dup_dict function to return a dict for possible use in other contexts.)

You could have a function that gets all of the counts, and then filter on threshold when you print.

The alternative would be to iterate over a .copy() of the items_dict and delete items not meeting the threshold from items_dict, returning the pruned items_dict at the end. But, dup_dict is guaranteed to be smaller, save for original lists with no duplications and threshold set to 1. So, the distinct dup_dict way seems better for memory.


Am I overlooking yet another dict technique that would help, here? Any other improvements?

In Python 2.4 I think you could do dup_dict = dict((k, v) for k, v in items_dict.iteritems() if v >= threshold) which I think is likely to be quite fast (but again if you care you should test).

Kent


Thanks and best to all,

Brian vdB

_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


_______________________________________________ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor

Reply via email to