Wagner, David --- Senior Programmer Analyst --- WGO wrote:
> On 4/1/06, Frank Bax <[EMAIL PROTECTED]> wrote:
>>> At 06:59 PM 3/31/06, Mr. Shawn H. Corey wrote:
>>>
>>>> On Fri, 2006-31-03 at 15:45 -0800, Tom Phoenix wrote:
>>>>> You should loop over the input, pushing each item on to an array.
>>>>> If at any time you have 2000 items in the array, sort them and
>>>>> discard any you don't want to keep.
>>>>>
>>>>> $#data = 999 if $#data > 999; # OBperl: one way to
>>>>> discard elements
>>>>>
>>>>> When you finish looping, sort-and-discard again. You'll never need
>>>>> more than 2N items in memory at any given time. Does that
>>>>> algorithm work for your needs?
>>>>
>>>> Sorting is not necessary. If he keeps an array of the best, that
>>>> means lowest, records then all he has to do is compare every new
>>>> entry with the highest. This is called "fail early." This means,
>>>> if it's going to fail, it should at the earliest opportunity. If
>>>> it succeeds then it searches down thru the list, to find its
>>>> place. This is called "succeed early." Given that the procedure
>>>> can flip between these two methods, it is faster than any sort.
>>>
>>>
>>> I'm not the OP, but I have a script with a similar problem. The
>>> script has some logic that generates many (thousands of billions)
>>> of combinations from a little bit of data and only the best 100
>>> combos are output. For each combination produced by script, a value
>>> is calculated. I maintain an array of "best" results so far; if
>>> size of this array is <100, then a "new" result is automatically
>>> added. If there are already 100 items, I find the "worst" entry in
>>> the array. If the "new" entry is better than the "worst", the
>>> "new" one replaces the "worst".
>>>
>>> I am not sure how your reference to "fail early" and "succeed
>>> early" apply to this situation.
>>>
>>> At the moment, the array is left unsorted. If I use a sorted
>>> array, it needs to resorted every time a "worst" entry is replaced
>>> by a "new"
>>> entry. Can I avoid sorting the array every iteration? How else to
>>> find the "worst" entry? If I replace "worst" with a "new" entry,
>>> doesn't the array need to be resorted? How does the cost of
>>> searching an unsorted array compare to resorting on every update
>>> and searching a sorted array.
>>>
>>> Tom's idea about growing to 200, then chopping back to 100 also
>>> sounds interesting.
>>>
>>
>> Sorting is expensive, but keeping a sorted array isn't always,
>> especially if you know that the majority of your data will fall
>> outside your window, preferrably high. I'd probably do something
>> like the following:
>>
>> #!/usr/bin/perl
>>
>> use warnings;
>> use strict;
>>
>> my $keeprecs = 10; # the number of records you want to keep
>> my @data = (10,24,5,32,4,9,8,7,6,1,2,0,3);
>> my @array;
>>
>> while (@data) { # probably while <> in production
>> my $number = pop @data ;
>> if ($number > $array[-1]) {
>> next if @array >= $keeprecs;
>> # fail early if the number is high and we
>> # already have $keeprecs records saved
>> push @array, $number;
>> } elsif ($number < $array[0]) {
>> unshift @array, $number;
>> } else {
>> my($i, $idx);
>> for ($i = 0; $i < @array; $i++) {
>> next if $i == 0;
>> if (($number < $array[$i]) and ($number > $array[$i-1]))
>> { splice @array,$i,0,$number;
>> last;
>> }
>> }
>> }
>> pop @array if @array > $keeprecs;
>> }
>>
>> print "@array\n";
>>
> I am going to give this a try, but the data is truly a random set of
> numbers and whether most will be hi or lo, I will leave that for the
> processing. Originally it took a little over 3 hours to do the
> processing.
>
> I have added three arrays and one hash.
> a) array 1 holds the numberbs being compared ( sorted )
> b) array 2 holds a key to the hash of data needed later associated
> with each number c) array 3 has a set of 2000 keys I can rotate
> through as I get hits and use the keys. If key is dropped then I push
> back on the array and I have at twice the number needed, so should
> never have aproblem.
>
> d) hash data associated w a) ranges from 60 to 800 bytes per
> number.
>
> Again thanks to the list for the help.
>
> Shawn, I looked at your response, but did not truly understand, so
> depending on what happens here, I may need to give yours a short, but
> I was not following it. Sorry.
>
> Wags ;)
I ran it twice over the evening and it added about 8 minutes in
processing for the 3 hours 9 minutes from the original processing. I generated
an output after every 100k rcds( 100/rcds printed ). For my actual processing,
I think I will dump very hour.
Thanks again for the insight to give it a shot and for sharing.
Wags ;)
>
>> The efficiency here will depend on the distribution of the data. If
>> every new piece of data is > $array[-2] and < $array[-1], then it's
>> something on the order of O($keeprecs * N). In the best case
>> scenario, with all records falling above the current max, it's O(N).
>> In the real world, who knows, but but it should be reasonably fast.
>>
>> In any case, the key to a project like this is to figure out quickly
>> whether you want to keep a particular peice of data, and spend as
>> little time as possible on the out of bounds data which, if you want
>> 1000 records out of 1,000,000 should be the vast majority of it.
>>
>> HTH
>> -- jay
>
>
>
> *******************************************************
> This message contains information that is confidential
> and proprietary to FedEx Freight or its affiliates.
> It is intended only for the recipient named and for
> the express purpose(s) described therein.
> Any other use is prohibited.
> *******************************************************
--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
<http://learn.perl.org/> <http://learn.perl.org/first-response>