Re: Naive question on lists of duplicates

2003-06-08 Thread Dylan Thurston
On Sat, Jun 07, 2003 at 08:24:41PM -0500, Stecher, Jack wrote:
It sounds like you're on the right track...

  You could get a moderately more efficient implementation by keeping
  the active list as a heap rather than a list.
 
 I had thought about that, and took the BinomialHeap.hs file from
 Okasaki, but I must have a typo somewhere, because I was having typing
 clashes that I couldn't easily clarify.  At least, when I loaded the
 BinomialHeap.hs into Hugs, it didn't complain, but when I tried to
 create an empty heap using the heapEmpty function, Hugs screamed at me.
 I got scared and fled the scene, retreating into the safety of lists.

I don't think you should worry about this now, but the problem is
that heapEmpty returns something like 'Heap a', for an
undetermined type variable 'a'; you may need to specify the type of
your empty heap in order for Hugs not to complain.
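
For instance (untested, and assuming the module exports a heap type
Heap and a value heapEmpty, as your message suggests):

 -- A bare heapEmpty has type 'Heap a' for an unconstrained 'a',
 -- which Hugs cannot resolve on its own; an annotation pins it down:
 emptyHeap :: Heap Int
 emptyHeap = heapEmpty

An inline annotation, (heapEmpty :: Heap Int), works the same way.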

Peace,
Dylan



Re: Naive question on lists of duplicates

2003-06-07 Thread Dylan Thurston
On Thu, Jun 05, 2003 at 08:09:02AM -0500, Stecher, Jack wrote:
 I have an exceedingly simple problem to address, and am wondering if
 there are relatively straightforward ways to improve the efficiency
 of my solution.

Was there actually a problem with the efficiency of your first code?

 The task is simply to look at a lengthy list of stock keeping units
 (SKUs -- what retailers call individual items), stores, dates that a
 promotion started, dates the promotion ended, and something like
 sales amount; we want to pull out the records where promotions
 overlap.  I will have dates in yyyymmdd format, so there's probably
 no harm in treating them as Ints.

(Unless this is really a one-shot deal, I suspect using Ints for dates
is a bad decision...)

 My suggestion went something like this (I'm not at my desk so I
 don't have exactly what I typed):

I have a different algorithm, which should be nearly optimal, but I
find it harder to describe than to show the code (which is untested):

 import List(sortBy, insertBy)

 data PromotionRec = PR {sku :: String, store :: String,
                         startDate :: Int, endDate :: Int,
                         amount :: Float}

 compareStart, compareEnd :: PromotionRec -> PromotionRec -> Ordering
 compareStart x y = compare (startDate x) (startDate y)
 compareEnd x y = compare (endDate x) (endDate y)

 overlap :: [PromotionRec] -> [[PromotionRec]]
 overlap l = filter (\l -> length l > 1)
             (overlap' [] (sortBy compareStart l))

 overlap' _ [] = []
 overlap' active (x:xs) =
   let {active' = dropWhile (\y -> endDate y < startDate x) active} in
   (x:active') : overlap' (insertBy compareEnd x active') xs

The key is that, by keeping a list of the currently active promotions
in order sorted by the ending date, we only need to discard an
initial portion of the list.
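
To illustrate (untested; dates written as yyyymmdd Ints, matching your
format, and the field values made up):

 testData :: [PromotionRec]
 testData = [PR "A100" "s1" 20030601 20030610 10.0,
             PR "A200" "s1" 20030605 20030615 12.0,
             PR "A300" "s1" 20030620 20030625  8.0]

Here 'overlap testData' should yield a single group containing the
first two records, which overlap between 20030605 and 20030610; the
third record lands in a singleton group, which the filter discards.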

You could get a moderately more efficient implementation by keeping
the active list as a heap rather than a list.
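
For concreteness, a pairing heap keyed on endDate would do; here is a
sketch (untested, and the names are my own invention):

 data PHeap = Empty | Node PromotionRec [PHeap]

 findMin :: PHeap -> Maybe PromotionRec
 findMin Empty      = Nothing
 findMin (Node x _) = Just x

 insert :: PromotionRec -> PHeap -> PHeap
 insert x h = merge (Node x []) h

 merge :: PHeap -> PHeap -> PHeap
 merge Empty h = h
 merge h Empty = h
 merge h1@(Node x hs1) h2@(Node y hs2)
   | endDate x <= endDate y = Node x (h2:hs1)
   | otherwise              = Node y (h1:hs2)

 deleteMin :: PHeap -> PHeap
 deleteMin Empty       = Empty
 deleteMin (Node _ hs) = mergePairs hs
   where mergePairs []        = Empty
         mergePairs [h]       = h
         mergePairs (h1:h2:t) = merge (merge h1 h2) (mergePairs t)

In overlap', insertBy would become insert, and the dropWhile would
become repeated deleteMin while findMin returns a promotion whose
endDate precedes startDate x.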

Peace,
Dylan



RE: Naive question on lists of duplicates

2003-06-07 Thread Stecher, Jack
Thanks so much for the reply.

On Thu, Jun 06, 2003 at 08:06 PM, Dylan Thurston wrote:
 Was there actually a problem with the efficiency of your first code?

No -- it was untested, and as I developed more of the code in Haskell, I
found that there were slight differences between what I thought we
needed to do and what my co-worker had in mind.  So, even if my
co-worker goes back to writing things in SAS, this has already been
helpful for clarifying requirements.  In particular, it turns out that
he needs to know when promotions overlap for the same SKU, as this is an
indication that the data we've received are invalid.

I tested the mergeSort to make sure I didn't have any errors, though I
used the fairly standard one.  It was much faster than a sort in SAS, at
least with a PROC SQL, though I ran into size limitations because I had
to compile on the Windows side.  (We don't have gcc installed on our
UNIX side, so ghci will work but ghc won't compile my code.)  Since I'm
analyzing roughly 600MB of data, I will probably add a step or two in
order to chunk the data.  In any case, I'm not noticing any obvious
inefficiencies, but my Haskell programming is self-taught, and this is
my first attempt at solving a real problem at work using Haskell, so I
figured I'd get feedback from the list.
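
For reference, the fairly standard mergeSort I have in mind is roughly
the following (my actual version may differ in minor details):

mergeSort :: (a -> a -> Ordering) -> [a] -> [a]
mergeSort _   []  = []
mergeSort _   [x] = [x]
mergeSort cmp xs  = merge (mergeSort cmp as) (mergeSort cmp bs)
  where
    -- split in half, sort each half, merge the sorted halves
    (as, bs) = splitAt (length xs `div` 2) xs
    merge []     ys = ys
    merge xs'    [] = xs'
    merge (x:xt) (y:yt)
      | cmp x y == GT = y : merge (x:xt) yt
      | otherwise     = x : merge xt (y:yt)

Taking the left element on ties keeps the sort stable, and the
half-and-half split keeps it O(n log n).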

 (Unless this is really a one-shot deal, I suspect using Ints for dates
 is a bad decision...)

When I first read this message, I thought, "Well, this really is only
a one-shot deal."  However, as I thought more about this, I decided that I
should write this so that we can use it in case another client comes
along with similar data for us to analyze on another project.  So I will
change this and define a date type, or else I'll read the ghc
documentation and learn how to use what's in the time library.
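
As a starting point, something like this might do (untested; field
validation and a nicer Show are left out):

-- Derived Ord compares year, then month, then day, which is
-- the right chronological order.
data Date = Date {year, month, day :: Int}
            deriving (Eq, Ord, Show)

-- Convert a yyyymmdd string to a Date.
readDate :: String -> Date
readDate s = Date (read (take 4 s))
                  (read (take 2 (drop 4 s)))
                  (read (take 2 (drop 6 s)))

readDate could then replace the first two reads in toPromoRecs below.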

 import List(sortBy, insertBy)

 data PromotionRec = PR {sku :: String, store :: String,
                         startDate :: Int, endDate :: Int,
                         amount :: Float}

As it turns out, the same promotions are always on across all stores.
Thus, each SKU/startDate/endDate will seem to have a duplicate
promotion.  Also, I don't need the amount information, so I wrote the
import function so that it would ignore the amount information.  One
further change:  the SKU and store are identified by their number, not
their descriptions.

The new data type is like this:
data PromoRec = PR {startDate, endDate, skuID, storeID :: Int}
                deriving Show
though I will likely implement a better show function and will, as noted
above, change the dates to a date type.

The file is a fixed-width file, so I just do this:
import IO

loadData :: Handle -> IO [String]
loadData handle =
    do s <- hGetContents handle
       return (lines s)

toPromoRecs :: [String] -> [PromoRec]
toPromoRecs [] = []
toPromoRecs (x:xs) =
    (PR (read (take 8 x))
        (read (take 8 (drop 8 x)))
        (read (take 20 (drop 16 x)))
        (read (take 20 (drop 36 x)))) : toPromoRecs xs

There's probably a much more clever way to do this, but it seems fairly
fast.  The file allowed 20 spaces for the SKU and store numbers, but
they're at most 7 digits long.  Again, I'll need to change the first and
second reads to some function that converts a string in the form
yyyymmdd into a date type (or I'll need to write one).

 compareStart, compareEnd :: PromotionRec -> PromotionRec -> Ordering
 compareStart x y = compare (startDate x) (startDate y)
 compareEnd x y = compare (endDate x) (endDate y)

This is helpful; I'll also need to add a function like this:
compareSku :: PromoRec -> PromoRec -> Ordering
compareSku x y = compare (skuID x) (skuID y)

 overlap :: [PromotionRec] -> [[PromotionRec]]
 overlap l = filter (\l -> length l > 1)
             (overlap' [] (sortBy compareStart l))

 overlap' _ [] = []
 overlap' active (x:xs) =
   let {active' = dropWhile (\y -> endDate y < startDate x) active} in
   (x:active') : overlap' (insertBy compareEnd x active') xs

Okay; I see what additional modifications I should make.  I need to add
a step to sort by store and delete duplicate stores where the SKU, start
date, and end date are the same (perhaps along the lines sketched
below).  But I should be able to get there from here.
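
Something like the following sketch might handle the duplicates
(untested):

import List (sortBy, groupBy)

-- Keep one record per (skuID, startDate, endDate), discarding the
-- duplicates that differ only in storeID.
dedupStores :: [PromoRec] -> [PromoRec]
dedupStores = map head . groupBy sameKey . sortBy cmpKey
  where
    key x       = (skuID x, startDate x, endDate x)
    cmpKey x y  = compare (key x) (key y)
    sameKey x y = key x == key y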

 The key is that, by keeping a list of the currently active promotions
 in order sorted by the ending date, we only need to discard an
 initial portion of the list.

Good insight.  Many thanks.

 You could get a moderately more efficient implementation by keeping
 the active list as a heap rather than a list.

I had thought about that, and took the BinomialHeap.hs file from
Okasaki, but I must have a typo somewhere, because I was having typing
clashes that I couldn't easily clarify.  At least, when I loaded the
BinomialHeap.hs into Hugs, it didn't complain, but when I tried to
create an empty heap using the heapEmpty function, Hugs screamed at me.
I got scared and fled the scene, retreating into the safety of lists.

Naive question on lists of duplicates

2003-06-06 Thread Stecher, Jack
Hi, all.
 
I have an exceedingly simple problem to address, and am wondering if there are 
relatively straightforward ways to improve the efficiency of my solution.
 
The task is simply to look at a lengthy list of stock keeping units (SKUs -- what 
retailers call individual items), stores, dates that a promotion started, dates the 
promotion ended, and something like sales amount; we want to pull out the records 
where promotions overlap.  I will have dates in yyyymmdd format, so there's probably 
no harm in treating them as Ints.
 
My suggestion went something like this (I'm not at my desk so I don't have exactly 
what I typed):
 
 data PromotionRec = PR {sku :: String, store :: String,
                         startDate :: Int, endDate :: Int,
                         amount :: Float}

 match :: PromotionRec -> [PromotionRec] -> [PromotionRec]
 match _ [] = []
 match x ys = filter (meet x) ys

 nomatch :: PromotionRec -> [PromotionRec] -> [PromotionRec]
 nomatch _ [] = []
 nomatch x ys = filter (nomeet x) ys

 meet :: PromotionRec -> PromotionRec -> Bool
 meet x y = and [startDate x <= endDate y, startDate y <= endDate x]

 nomeet :: PromotionRec -> PromotionRec -> Bool
 nomeet x y = not (meet x y)

 overlaps :: [PromotionRec] -> [[PromotionRec]]
 overlaps []     = []
 overlaps (x:xs) = (x : match x xs) : overlaps (nomatch x xs)

 overlap :: [PromotionRec] -> [[PromotionRec]]
 overlap []     = []
 overlap (x:xs) = filter g1 (overlaps (x:xs)) where
   g1 ys = length ys > 1
 
What I sent might have been slightly different, and I might have some typos in the 
above (as I'm composing from memory at the keyboard), but that's the rough idea:  
treat this like a quicksort, pulling out the records that overlap the first record, 
consing the first record onto the list, and then consing the resulting list onto the 
same function applied to the records that didn't overlap.  As a last step, delete the 
records that were not on promotion while anything else was on promotion.
 
I'm pretty confident that this will be more efficient than my colleague's SAS code, as 
he was comparing each record to every other record (giving n (n-1) comparisons).  It 
seems like this, in the worst case where everything is on promotion at distinct times, 
will compare the first record to (n-1) records, the second to (n-2) records, etc., 
giving n (n-1)/2 comparisons.  Thus, while this is worst-case O(n^2), it seems like it 
should have at most half as much work to do as the earlier approach in SAS.  On the 
other hand, in the best case, when everything is on promotion at the same time, there 
should be something like n-1 comparisons made and then that should be it.  So it seems 
like, depending on how frequently promotions co-occur for this retailer, the above 
should be somewhere between O(n) and O(n^2) time complexity.  Since these people are 
always having some sale or other, I suspect this won't be much worse than O(n log n).
 
Is there anything that is an obvious improvement in efficiency -- some clever tricks 
with the fold functions, some way of loading the data into some sort of convenient 
structure first, etc -- that I could easily implement?
 
Please share your thoughts.
 
Many thanks,
Jack Stecher
[EMAIL PROTECTED]

Re: Naive question on lists of duplicates

2003-06-06 Thread Sarah Thompson

 I'm pretty confident that this will be more efficient than my
 colleague's SAS code, as he was comparing each record to every other
 record (giving n (n-1) comparisons).  It seems like this, in the worst
 case where everything is on promotion at distinct times, will compare
 the first record to (n-1) records, the second to (n-2) records, etc.,
 giving n (n-1)/2 comparisons.  Thus, while this is worst-case O(n^2),
 it seems like it should have at most half as much work to do as the
 earlier approach in SAS.  On the other hand, in the best case, when
 everything is on promotion at the same time, there should be something
 like n-1 comparisons made and then that should be it.  So it seems
 like, depending on how frequently promotions co-occur for this
 retailer, the above should be somewhere between O(n) and O(n^2) time
 complexity.  Since these people are always having some sale or other,
 I suspect this won't be much worse than O(n log n).

 Is there anything that is an obvious improvement in efficiency -- some
 clever tricks with the fold functions, some way of loading the data
 into some sort of convenient structure first, etc -- that I could
 easily implement?
 

Firstly I'm assuming that you are working with a granularity of days, and that each promotion will always have a relatively small maximum number of days. If so, how about something like the following:

1: Filter the existing data structure, resulting in a lazy list of tuples (a, b) where a is a day number and b is an identifier for the promotion. Where a promotion spans n days, the list will contain n entries, one for each day of the promotion. Complexity is O(M x N), where M is the (small) maximum number of days. If this is fixed *and* small, we can discard this and regard the complexity as just O(N).

2: Sort the list with a as the primary key and b as the secondary key. Complexity should be near enough O(N log N).

3: Traverse the results of the sort, outputting a lazy list of lists such that the elements of each sub-list are references to the promotions that overlap for one specific day number. Where no overlap is detected for one specific day, that day can simply be ignored. Complexity should be linear.

4: Sort this list, and discard duplicates. Complexity should be O(N log N) for the sort and O(N) for the 'uniq'.

5: You are now left with a list which describes all overlapping promotions.

Total complexity should effectively be O(N + N log N + N + N log N + N), which of course collapses to O(N log N).
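
In Haskell, the whole pipeline might look something like this
(untested; I've invented a small Promo record purely for the sake of
the sketch):

import List (sort, group, groupBy)

data Promo = Promo {promoId :: Int, firstDay, lastDay :: Int}

-- Step 1: one (day, id) pair per day of each promotion.
expand :: [Promo] -> [(Int, Int)]
expand ps = [(d, promoId p) | p <- ps, d <- [firstDay p .. lastDay p]]

-- Steps 2 and 3: sort by day then id, group by day, and keep only
-- the days on which more than one promotion is running.
clashes :: [Promo] -> [[Int]]
clashes = filter ((> 1) . length)
        . map (map snd)
        . groupBy (\a b -> fst a == fst b)
        . sort
        . expand

-- Step 4: sort the groups and discard duplicates.
overlapSets :: [Promo] -> [[Int]]
overlapSets = map head . group . sort . clashes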

--
Sarah Thompson
[EMAIL PROTECTED]
http://findatlantis.com/