On Mar 9, 2005, at 7:30 PM, Mark Wheeler wrote:

I have another question. Below is a "sample" of a larger script. I have it working great, but I'm pretty sure the sorting and manipulation of file data is inefficient. What I'm asking is, how do I make the code more efficient?

Devel::DProf is a useful tool for profiling your code to see where it's actually spending its time. Profiling works on the 90/10 rule - 90% of your program's time is spent in 10% of the code, so that's where you want to concentrate your efforts.
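
For example, assuming the script is saved as deals.pl (the name here is just a placeholder), you'd run it under the profiler and then print the report:

    perl -d:DProf deals.pl   # writes profiling data to tmon.out
    dprofpp                  # reads tmon.out and reports where the time went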


However, the data here is far too small a sample to accurately measure, so I'll take a guess. :-)

Here is a sample of what the data file looks like:

CATAGORY1:CATAGORY2:CATAGORY3:CATAGORY4
3245:CATAGORY2:7:July 7,2005:data:data:data
5764:CATAGORY1:3:April 27, 2005:data:data:data
8749:CATAGORY2:4:April 13, 2005:data:data:data
9874:CATAGORY4:7:February 10, 2006:data:data:data
0276:CATAGORY1:4:November 20, 2005:data:data:data

The script reads in the data file, then sorts all the records by the CATAGORY order found in the first line of the data file. Then each line item (now delimited by a "+") within each CATAGORY is sorted by date. After they are sorted, they are joined again by the "+" so I can display each line of data within each category. Here is the code:

<CODE>

my ($i, $html, @sorted);

open (DEALS, "< $pathtodatafile") || die "$!";
flock (DEALS, 2);
my @current_deals = <DEALS>;
close (DEALS);

chomp ($current_deals[0]);
my @order = split(/:/, $current_deals[0]);

#====================================================
# Sort deals into catagory and date order
#====================================================

my ($found, $count) = (0, 0);
foreach (@order) {
 for ($i=1; $i<=$#current_deals; $i++) {
  my ($deal, $catagory, $y) = split (/:/, $current_deals[$i]);
  if ($catagory eq $_) {
   $sorted[$count] .= $current_deals[$i]."+";
   $found = 1;
  }
 }
 if ($found == 1) {
  chop ($sorted[$count]);
 }
 $count++;
 $found = 0;
}

for ($i=0; $i<=$#sorted; $i++) {
 my @sorted_split_temp = split (/\+/, $sorted[$i]);

 my @sorted_split_final = sort {
  Date_Cmp(ParseDate((split /:/, $a)[3]), ParseDate((split /:/, $b)[3]));
 } @sorted_split_temp;

 $sorted[$i] = join ("\+", @sorted_split_final);
}

</CODE>

Like I said, it's hard to do more than guess with such a small sample of data. But from what I can see, there's one fairly obvious bottleneck - a *lot* of splitting and joining of strings going on. And - and this *really* hurts - two split()s and two calls to ParseDate(), which is a really slow function, for each comparison. What I would do is prep the data before sorting it. That way the comparison function, which will be called a *lot*, can be streamlined:


    #!/usr/bin/perl

    use strict;
    use warnings;

    use Date::Manip;

    # Read data into @current_deals
    my @current_deals = (<DATA>);

    # First make a lookup table from the first line of the data
    # The keys in this table are the categories, and values their sort order
    my $catline = shift @current_deals;
    chomp $catline;
    my $cat_order = 0;
    my %category_table = ();
    for (split /:/, $catline) {
        $category_table{$_} = $cat_order;
        $cat_order++;
    }


    # Now, we take advantage of the fact that, within a foreach() or
    # for() loop, $_ is an *alias* to the original element in the list.
    # So, modifying $_ modifies the actual list data.
    # Here, we split each row, and pre-parse the date column
    foreach (@current_deals) {
        chomp;
        my @row = split(/:/, $_);
        $row[3] = ParseDate($row[3]);
        $_ = \@row;
    }

    # Now sort the list, comparing the date fields only if the
    # categories are equal
    @current_deals = sort {
        $category_table{$a->[1]} <=> $category_table{$b->[1]} ||
        Date_Cmp($a->[3], $b->[3])
    } @current_deals;

    # Transform the dates back to the original format
    foreach (@current_deals) {
        $_->[3] = UnixDate($_->[3], "%B %e, %Y");
    }

    for (@current_deals) {
        print join(':', @$_), "\n";
    }

    __END__
    CATAGORY1:CATAGORY2:CATAGORY3:CATAGORY4
    3245:CATAGORY2:7:July 7,2005:data:data:data
    5764:CATAGORY1:3:April 27, 2005:data:data:data
    8749:CATAGORY2:4:April 13, 2005:data:data:data
    9874:CATAGORY4:7:February 10, 2006:data:data:data
    0276:CATAGORY1:4:November 20, 2005:data:data:data
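
(As a side note, here's a tiny, standalone illustration of the $_ aliasing behaviour the comments above rely on - modifying $_ inside foreach modifies the original array. The array name is made up, of course:)

    my @nums = (1, 2, 3);
    foreach (@nums) {
        $_ *= 10;
    }
    # @nums is now (10, 20, 30)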

The above code still parses all the dates - but it does so outside the comparison function, in a simple loop that parses each date only once. That makes the number of ParseDate() calls O(n) instead of the O(n log n) you get when the comparison function does the parsing - which should be a big improvement for such an expensive function. It's also worth noting that a sort like the one used above (i.e. "@foo = sort @foo") is an optimized case in the 5.9 development tree, so that the array is sorted in-place instead of being copied first. That should be a further improvement once 5.10 is released.
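
If you want to check that claim on your own data, something like the following sketch does the comparison with the standard Benchmark module. (This isn't from your script - the date list is made up, and you'd substitute your real records:)

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);
    use Date::Manip;

    # Made-up list of dates, just to have something to sort
    my @dates = map { sprintf "March %d, 20%02d", 1 + $_ % 28, $_ % 10 } 1 .. 200;

    cmpthese(-2, {
        # ParseDate() called twice per comparison, as in the original sort
        parse_in_sort => sub {
            my @s = sort { Date_Cmp(ParseDate($a), ParseDate($b)) } @dates;
        },
        # Each date parsed once, up front, then compared directly
        pre_parse => sub {
            my @parsed = map { ParseDate($_) } @dates;
            my @s = sort { Date_Cmp($a, $b) } @parsed;
        },
    });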

Obligatory disclaimers apply: No guarantees, naturally, concerning the performance of the above. I spent all of about twenty minutes on it, and I didn't profile it, or your code. It might even be slower than your code. It does at least compile and appears to sort the sample data correctly though.

sherm--

Cocoa programming in Perl: http://camelbones.sourceforge.net
Hire me! My resume: http://www.dot-app.org


