Re: Comparing Big Lists

2002-04-29 Thread Scott Raney

On Mon, 29 Apr 2002  Gregory Lypny <[EMAIL PROTECTED]> wrote:

>   I tried your suggestion of turning smallList into an associative 
> array with the index for each element equal to the text I'm looking for 
> in bigList.  I think I must have misunderstood your suggestion because 
> the handler runs much slower than previously, perhaps because I've got 
> it asking for the keys of smallList for every line of bigList.

Exactly: keys() is a very expensive operation because it has to
traverse the whole structure and pull out specific items and put them
into a new buffer.

>  Here's what I tried.
> 
> -- Note. smallListArray array is an array made out of the original 
> smallList variable
> 
> repeat for each line i in bigList
>  if item 6 of i is in keys(smallListArray)

The trick here is that smallListArray will not necessarily have any
data in it; it's the keys that are significant.  So just put "x" into
each element when building the array, and then check whether
smallListArray[item 6 of i] is empty as you traverse the big list.
  Regards,
Scott

>  then
>put i into hitList[item 6 of i]
>  end if
>end repeat


Scott Raney  [EMAIL PROTECTED]  http://www.metacard.com
MetaCard: You know, there's an easier way to do that...
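
Scott's sentinel-array scheme translates directly to a hash map in most
languages.  A Python sketch for illustration (the thread's code is MetaTalk;
the data and names here are invented): the stored value is a throwaway "x"
because only the keys matter, and the per-line test is a constant-time key
lookup rather than a keys() scan.

```python
small_list = "a\tfoo\nb\tbar\nc\tbaz"                # 2 items per line
big_list = "1\t2\t3\t4\t5\tfoo\n1\t2\t3\t4\t5\tqux"  # 6 items per line

# Build the lookup table once: the keys carry all the information,
# so the stored value is just a sentinel, as Scott suggests.
lookup = {}
for line in small_list.split("\n"):
    lookup[line.split("\t")[1]] = "x"

# One pass over the big list; item 6 (index 5, zero-based) is the join key.
hits = {}
for line in big_list.split("\n"):
    key = line.split("\t")[5]
    if key in lookup:            # O(1) hash lookup, no keys() traversal per line
        hits[key] = line

# hits == {"foo": "1\t2\t3\t4\t5\tfoo"}
```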

___
metacard mailing list
[EMAIL PROTECTED]
http://lists.runrev.com/mailman/listinfo/metacard



Re: Comparing Big Lists

2002-04-29 Thread Gregory Lypny

Thanks Dave,

You put a lot of work into your example, and I appreciate it.  
Sorry, I didn't look at it sooner.

I suspect Dave's approach below (which is what I suspect Scott was 
suggesting as well, although I misunderstood it) will be fastest.  To 
make it more general, I might have to nest another 'repeat for...' loop 
to handle cases where the item to be compared in tLargeList is not 
always located in item 3.

For those curious, I'm helping a colleague search through the file 
pertaining to the human genome database.  It's about 80 MB.


Regards,

Greg

Gregory Lypny
Associate Professor
John Molson School of Business
Concordia University
_
"Absence of evidence is not evidence of absence."
 - Anonymous

http://rubbersoul.concordia.ca


-- a snip from Dave Cragg's test...
repeat for each line tLine in tLargeList
  if tArray[item 3 of tLine] <> empty then
put tArray[item 3 of tLine] & ":" & tLine & cr after tMergeList2
  end if
end repeat




Re: Comparing Big Lists

2002-04-29 Thread Gregory Lypny

Hi Scott,

I tried your suggestion of turning smallList into an associative 
array with the index for each element equal to the text I'm looking for 
in bigList.  I think I must have misunderstood your suggestion because 
the handler runs much slower than previously, perhaps because I've got 
it asking for the keys of smallList for every line of bigList.  Here's 
what I tried.

-- Note. smallListArray array is an array made out of the original 
smallList variable

repeat for each line i in bigList
  if item 6 of i is in keys(smallListArray) then
    put i into hitList[item 6 of i]
  end if
end repeat


Message: 3
Subject: Re: Comparing big lists
Date: Sat, 27 Apr 2002 16:10:42 -0400
From: Gregory Lypny <[EMAIL PROTECTED]>
To: "MetaCard List" <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]

Thanks for the suggestion, Scott.  I'll give it a shot.  I've also tried
looping over the lines of bigList (i.e., a nested repeat), simply using
the 'in' operator:  if x is in y, then...  It takes about 6 minutes on a
modest (300 MHz) iBook running OS X, but I'm hoping for an improvement,

  Regards,

   Greg

On 27/4/2002 12:08 PM, [EMAIL PROTECTED] wrote:

Message: 2
Date: Fri, 26 Apr 2002 12:48:53 -0600 (MDT)
From: Scott Raney <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Subject: Re: Comparing big lists
Reply-To: [EMAIL PROTECTED]

On: Thu, 25 Apr 2002 Gregory Lypny <[EMAIL PROTECTED]> wrote:

   Thought I would pick your brains on the topic of comparing two big
lists.  Both are tab delimited.  bigList has about 100,000 lines and
6 items (columns) per line.  smallList is about 15,000 lines and 2
items per line.  I want to identify the lines in bigList in which
the third item is the same as the second item in a line in
smallList, and then pull out the intersection.  I used something
like this, which works fine.

 set the itemDelimiter to tab
 repeat for each line j of smallList
   put lineOffset(item 2 of j, bigList) into thisLine
   if thisLine is not 0 then put j & tab & \
       line thisLine of bigList & return after mergedList
 end repeat
 delete last character of mergedList  -- Get rid of the trailing Return

Using the lineOffset function seemed the obvious choice to me, but I'm
also interested in other approaches.

LineOffset on such a big variable is going to be pretty expensive.
Another option would be to use split to build an array out of smallList
and then loop over each line in bigList and see if there is an array
index for it.  Split takes a while and will use up a good bit of
memory, but makes the lookups *much* faster.  You could save some of
that space by building up an array of just the relevant items in one
list or the other by looping over the lines and creating one array
index for each.
  Regards,
Scott

 Regards,
 Greg
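
Scott's split-based lookup, sketched in Python for illustration (the
thread's code is MetaTalk; the data and names below are made up): key the
small list on its second item once, then a single pass over the big list
performs the whole merge.

```python
small_list = ["s1\tk1", "s2\tk2"]                       # 2 items per line
big_list = ["a\tb\tk1\td\te\tf", "a\tb\tzz\td\te\tf"]   # 6 items per line

# Index smallList on item 2 (the join column), like MetaTalk's split.
index = {line.split("\t")[1]: line for line in small_list}

merged = []
for line in big_list:
    key = line.split("\t")[2]    # item 3 of the big list, zero-based here
    if key in index:             # constant time, instead of lineOffset's scan
        merged.append(index[key] + "\t" + line)

# merged == ["s1\tk1\ta\tb\tk1\td\te\tf"]
```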




Re: Comparing big lists

2002-04-29 Thread Dave Cragg

At 6:48 pm -0700 28/4/02, erik hansen wrote:
>how about a short example?

Not so short. :)

Here's a "useless" example to illustrate the cost of incrementing 
chunk expressions in a long loop vs using "repeat for each".  It 
simply copies a list, first using chunk expressions, then using 
"repeat for each".  Vary the value of tListSize to see how the first 
method gets quadratically slower, whereas the second method's time 
grows linearly.

on mouseUp
   #build a list of numbers
   put 1000 into tListSize
   repeat tListSize
 put random(100) & cr after tList
   end repeat
   delete char -1 of tList

   put empty into tOutList
   #test 1
   put the milliseconds into tStart
   put the number of lines of tList into tNumLines
   repeat with i = 1 to tNumLines
 put line i of tList & cr after tOutList
   end repeat
   delete char -1 of tOutList
   put the milliseconds - tStart into tTime
   put "Test 1:" && tTime into tResult
   put empty into tOutList
   #test 2
   put the milliseconds into tStart
   repeat for each line tLine in tList
 put tLine & cr after tOutList
   end repeat
   delete char -1 of tOutList
   put the milliseconds - tStart into tTime
   put cr & "Test 2:" && tTime after tResult
   answer tResult
end mouseUp

Here's an example that "kind of" deals with Greg's original question. 
It builds two lists of numbers, one with two columns and one with six.  It 
rigs the columns to be compared by prepending and appending an "x". 
This prevents lineOffset in the first method from finding the value in the 
wrong column, and avoids the problem of substrings matching.  The 
value in the second column of the small list is unique.  (I take it 
from Greg's mail that his data was something like this, but in 
general, you'll need to do more work with both methods.)

The first method uses lineOffset in a loop. The second method builds 
an array from one list, and then loops through the second list line 
at a time checking if there is an array entry for the item being 
compared. The time difference is substantial.

This doesn't mean there isn't a place for the offset functions. 
For example, when searching for how many times a single word occurs in 
some text, they work fine, and there's no overhead of building an array. 
But when comparing two large lists, they can get really slow.

on mouseUp
   #build two lists for testing
   put 1 into tLargeListSize
   put 1000 into tSmallListSize
   repeat tLargeListSize
 put empty into tLargeLine
 repeat 6
   put random(100)  & comma after tLargeLine
 end repeat
 put "x" before item 3 of tLargeLine
 put "x" after item 3 of tLargeLine
 delete char -1 of tLargeLine
 put tLargeLine & cr after tLargeList
   end repeat
   delete char -1 of tLargeList

   repeat with i = 1 to tSmallListSize
 put empty into tSmallLine
 put random(1) & comma & "x" & i & "x" after tSmallLine
 put tSmallLine & cr after tSmallList
   end repeat
   delete char -1 of tSmallList
   sort lines of tSmallList by item 1 of each ##random sort

   #start test
   #test 1 -- using lineOffset
   put the milliseconds into tStart
   --
   repeat for each line j in tLargeList
 put lineOffset(item 3 of j, tSmallList) into tOff
 if tOff is not 0 then
   put line tOff of tSmallList & ":" & j  & cr after tMergeList1
 end if
   end repeat
   delete char -1 of tMergeList1
   
   put the milliseconds - tStart into tTime
   put "Test 1:" && tTime into tResult

   #test 2 -- using an array
   put the milliseconds into tStart
   --
   #build array of item 2 of small list
   repeat for each line tLine in tSmallList
 put tLine into tArray[item 2 of tLine] ##to make tests comparable
   end repeat

   repeat for each line tLine in tLargeList
 if tArray[item 3 of tLine] <> empty then
   put tArray[item 3 of tLine] & ":" & tLine & cr after tMergeList2

 end if
   end repeat
   delete char -1 of tMergeList2
   --
   put the milliseconds - tStart into tTime
   put cr & "Test 2:" && tTime after tResult
   #put tMergeList1 into field 1
   #put tMergeList2 into field 2

   answer tResult
end mouseUp
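
Dave's array-versus-lineOffset gap reproduces in any language with hash
maps.  A rough Python sketch (sizes and data are arbitrary, not his exact
rig; absolute times will differ): the scan-per-line method does O(n*m)
work, the hash-index method O(n+m), yet both produce the same merge.

```python
import time

small = [f"row{i}\tkey{i}" for i in range(0, 1000, 2)]   # 500 lines
big = [f"a\tb\tkey{i}\tc\td\te" for i in range(1000)]    # 1000 lines

# Method 1: linear scan of the small list per big-list line (lineOffset-style).
t0 = time.perf_counter()
merge1 = [b for b in big
          if any(s.split("\t")[1] == b.split("\t")[2] for s in small)]
time1 = time.perf_counter() - t0

# Method 2: build the hash index once, then constant-time lookups.
t0 = time.perf_counter()
keys = {s.split("\t")[1] for s in small}
merge2 = [b for b in big if b.split("\t")[2] in keys]
time2 = time.perf_counter() - t0

assert merge1 == merge2      # same result, very different cost
```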



Cheers
Dave Cragg



Re: Comparing big lists

2002-04-29 Thread Ben Rubinstein

on 27/4/02 9:10 PM, Gregory Lypny at [EMAIL PROTECTED] wrote:

> Thanks for the suggestion, Scott.  I'll give it a shot.  I've also tried
> looping over the lines of bigList (i.e., a nested repeat), simply using
> the 'in' operator:  if x is in y, then...  It takes about 6 minutes on a
> modest (300 MHz) iBook running OS X, but I'm hoping for an improvement,
> 

I wrote a general version of exactly this recently (take two tab- and
return-delimited files; from each select one column to match on; from each
select columns to output; then choose to output the merged file, and/or the
discards from one or other file).  My first version used a 'clever' function
based on lineOffset to locate all the matching lines.  With inputs of 50,000
and 10,000 lines, it took about 3 minutes on a 400 MHz TiBook (OS 9).  Then I
rewrote it to put one file into an array indexed on the requested column, e.g.

  set the itemDelimiter to tab
  put empty into srcDataBarray
  repeat for each line r in srcDataB
    put r into srcDataBarray[(item LinkColB of r)]
  end repeat

and loop through the other file (using repeat for each, of course, not
repeat with a line number) testing against the array.  It went down to a few
seconds.  Then I took out the progress feedback - now it is consistently 1
second or less.   It's so fast that the difference between indexing the
'small' and 'large' files is undetectable; and it does all the extra stuff
(of making up all three lists, the merged and the two discard sets) every
time, only checking at the end which of the three I actually asked it to
save.
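
Ben's three-output pass (the merge plus both discard sets) costs little
extra once the index exists.  A Python sketch with invented column positions
and data, just to show the shape:

```python
src_a = ["a1\tk1", "a2\tk3"]     # match on column 1
src_b = ["k1\tb1", "k2\tb2"]     # match on column 0

# Index file B on its match column, as in Ben's srcDataBarray.
b_index = {line.split("\t")[0]: line for line in src_b}

merged, discards_a, matched = [], [], set()
for line in src_a:
    key = line.split("\t")[1]
    if key in b_index:
        merged.append(line + "\t" + b_index[key])
        matched.add(key)
    else:
        discards_a.append(line)

# B-side discards: rows whose key never matched an A row.
discards_b = [line for line in src_b if line.split("\t")[0] not in matched]

# merged == ["a1\tk1\tk1\tb1"]; discards_a == ["a2\tk3"]; discards_b == ["k2\tb2"]
```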

The combination of 'repeat for each' and MC/Rev's hashed arrays is just
blinding.  A fantastic illustration of why a fourth generation language can
not only give fast development, but also fast execution.
 
  Ben Rubinstein   |  Email: [EMAIL PROTECTED]
  Cognitive Applications Ltd   |  Phone: +44 (0)1273-821600
  http://www.cogapp.com|  Fax  : +44 (0)1273-728866





Re: Comparing big lists

2002-04-28 Thread erik hansen


how about a short example?
thanks, erik

--- Dave Cragg <[EMAIL PROTECTED]> wrote:
> I think you'll find adding stuff at the end of
> a variable extremely 
> fast using "put x after y". However, the "line
> thisLine of bigList" 
> part will get progressively slower as the value
> of thisLine 
> increases.  This is why the general  advice is
> to use "repeat for 
> each" where possible and avoid using
> incremented chunk expressions 
> which have to count through the data each time.
> However, going back 
> to the original example, if you don't expect to
> find many matches, 
> that part may not be so costly. On the other
> hand, lineOffset will 
> get called for each line, making it fairly
> expensive.
> 
> So, if possible, in large loops:
> 
> -- use "repeat for each" (or "split" as Scott
> suggested)
> -- use "put x after y"
> -- avoid using "line n of y"
> -- avoid using lineOffset


=
[EMAIL PROTECTED] http://www.erikhansen.org




Re: Comparing big lists

2002-04-28 Thread Dave Cragg

At 11:32 pm + 27/4/02, jbv wrote:
>I'm wondering : what is the most expensive ?
>
>Is it lineoffset, or is it :
> put j & tab & line thisLine of bigList  & return after mergedList
>
>I vaguely recall some discussion about adding new lines at the end of
>a variable inside a loop, and the loop getting slower & slower...
>I'm even pretty sure it was already a problem in HC and/or in OMO...
>

I think you'll find adding stuff at the end of a variable extremely 
fast using "put x after y". However, the "line thisLine of bigList" 
part will get progressively slower as the value of thisLine 
increases.  This is why the general advice is to use "repeat for 
each" where possible and avoid using incremented chunk expressions 
which have to count through the data each time. However, going back 
to the original example, if you don't expect to find many matches, 
that part may not be so costly. On the other hand, lineOffset will 
get called for each line, making it fairly expensive.

So, if possible, in large loops:

-- use "repeat for each" (or "split" as Scott suggested)
-- use "put x after y"
-- avoid using "line n of y"
-- avoid using lineOffset

Cheers
Dave Cragg
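
The "avoid line n of y" advice applies anywhere line n must be re-found
from the top of the buffer on every access.  A Python sketch of the shape
of the cost (re-splitting the text per access stands in for the engine
re-counting delimiters; the sizes are arbitrary):

```python
import time

text = "\n".join(str(i) for i in range(3000))

# Incremented chunk expression: each access re-scans the text for line i.
t0 = time.perf_counter()
out1 = [text.split("\n")[i] for i in range(3000)]
slow = time.perf_counter() - t0

# "repeat for each line": one pass, each line touched once.
t0 = time.perf_counter()
out2 = text.split("\n")
fast = time.perf_counter() - t0

assert out1 == out2     # identical output; the first does ~n times the work
```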



Re: Comparing big lists

2002-04-27 Thread jbv

I'm wondering: which is more expensive?

Is it lineOffset, or is it:
 put j & tab & line thisLine of bigList & return after mergedList

I vaguely recall some discussion about adding new lines at the end of
a variable inside a loop, and the loop getting slower & slower...
I'm even pretty sure it was already a problem in HC and/or in OMO...

Anyone ?

JB






Re: Comparing big lists

2002-04-27 Thread Dar Scott


On Saturday, April 27, 2002, at 02:10 PM, Gregory Lypny wrote:

> It takes about 6 minutes on a
> modest (300 MHz) iBook running OS X, but I'm hoping for an improvement,

Maybe.  Maybe if you are able to keep the lists sorted you can come 
up with a faster way that takes advantage of that.

Dar Scott
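
To make Dar's point concrete: if both lists are kept sorted on the join
column, a two-pointer merge finds the intersection in one pass, without
building an array at all.  A Python sketch (hypothetical data; assumes
unique keys in the small list):

```python
def merge_join(big_rows, small_rows):
    # big_rows sorted on item 3 (index 2); small_rows sorted on item 2 (index 1).
    out = []
    i = j = 0
    while i < len(big_rows) and j < len(small_rows):
        bk = big_rows[i].split("\t")[2]
        sk = small_rows[j].split("\t")[1]
        if bk == sk:
            out.append(small_rows[j] + "\t" + big_rows[i])
            i += 1           # keep j: later big rows may share this key
        elif bk < sk:
            i += 1
        else:
            j += 1
    return out

big = sorted(["x\ty\tk3\tz", "x\ty\tk1\tz"], key=lambda r: r.split("\t")[2])
small = sorted(["s\tk2", "s\tk1"], key=lambda r: r.split("\t")[1])
result = merge_join(big, small)
# result == ["s\tk1\tx\ty\tk1\tz"]
```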




Re: Comparing big lists

2002-04-27 Thread Gregory Lypny

Thanks for the suggestion, Scott.  I'll give it a shot.  I've also tried 
looping over the lines of bigList (i.e., a nested repeat), simply using 
the 'in' operator:  if x is in y, then...  It takes about 6 minutes on a 
modest (300 MHz) iBook running OS X, but I'm hoping for an improvement,

 Regards,

  Greg

On 27/4/2002 12:08 PM, [EMAIL PROTECTED] wrote: 

>Message: 2
>Date: Fri, 26 Apr 2002 12:48:53 -0600 (MDT)
>From: Scott Raney <[EMAIL PROTECTED]>
>To: [EMAIL PROTECTED]
>Subject: Re: Comparing big lists
>Reply-To: [EMAIL PROTECTED]
>
>On: Thu, 25 Apr 2002 Gregory Lypny <[EMAIL PROTECTED]> wrote:
>
>>   Thought I would pick your brains on the topic of comparing two big
>> lists.  Both are tab delimited.  bigList has about 100,000 lines and
>> 6 items (columns) per line.  smallList is about 15,000 lines and 2
>> items per line.  I want to identify the lines in bigList in which
>> the third item is the same as the second item in a line in
>> smallList, and then pull out the intersection.  I used something
>> like this, which works fine.
>
>> set the itemDelimiter to tab
>>   repeat for each line j of smallList
>>put lineOffset(item 2 of j, bigList) into thisLine
>>if thisLine is not 0 then put j & tab & \
>> line thisLine of bigList  & return after mergedList
>>   end repeat
>> delete last character of mergedList  -- Get rid of the trailing Return
>
>> Using the lineOffset function seemed the obvious choice to me, but I'm
>> also interested in other approaches.
>
>LineOffset on such a big variable is going to be pretty expensive.
>Another option would be to use split to build an array out of smallList
>and then loop over each line in bigList and see if there is an array
>index for it.  Split takes a while and will use up a good bit of
>memory, but makes the lookups *much* faster.  You could save some of
>that space by building up an array of just the relevant items in one
>list or the other by looping over the lines and creating one array
>index for each.
>  Regards,
>Scott
>
>> Regards,
>> Greg
>
>
>Scott Raney  [EMAIL PROTECTED]  http://www.metacard.com
>MetaCard: You know, there's an easier way to do that...



Re: Comparing big lists

2002-04-26 Thread Scott Raney

On: Thu, 25 Apr 2002 Gregory Lypny <[EMAIL PROTECTED]> wrote:

>   Thought I would pick your brains on the topic of comparing two big
> lists.  Both are tab delimited.  bigList has about 100,000 lines and
> 6 items (columns) per line.  smallList is about 15,000 lines and 2
> items per line.  I want to identify the lines in bigList in which
> the third item is the same as the second item in a line in
> smallList, and then pull out the intersection.  I used something
> like this, which works fine.

> set the itemDelimiter to tab
>   repeat for each line j of smallList
>put lineOffset(item 2 of j, bigList) into thisLine
>if thisLine is not 0 then put j & tab & \
> line thisLine of bigList  & return after mergedList
>   end repeat
> delete last character of mergedList  -- Get rid of the trailing Return

> Using the lineOffset function seemed the obvious choice to me, but I'm
> also interested in other approaches.

LineOffset on such a big variable is going to be pretty expensive.
Another option would be to use split to build an array out of smallList
and then loop over each line in bigList and see if there is an array
index for it.  Split takes a while and will use up a good bit of
memory, but makes the lookups *much* faster.  You could save some of
that space by building up an array of just the relevant items in one
list or the other by looping over the lines and creating one array
index for each.
  Regards,
Scott

> Regards,
> Greg


Scott Raney  [EMAIL PROTECTED]  http://www.metacard.com
MetaCard: You know, there's an easier way to do that...




Comparing big lists

2002-04-25 Thread Gregory Lypny
Hi Everyone,

Thought I would pick your brains on the topic of comparing two big lists.  Both are tab delimited.  bigList has about 100,000 lines and 6 items (columns) per line.  smallList is about 15,000 lines and 2 items per line.  I want to identify the lines in bigList in which the third item is the same as the second item in a line in smallList, and then pull out the intersection.  I used something like this, which works fine.

set the itemDelimiter to tab
repeat for each line j of smallList
  put lineOffset(item 2 of j, bigList) into thisLine
  if thisLine is not 0 then put j & tab & \
      line thisLine of bigList & return after mergedList
end repeat
delete last character of mergedList  -- Get rid of the trailing Return

Using the lineOffset function seemed the obvious choice to me, but I'm also interested in other approaches.

Regards,

Greg


Gregory Lypny
Associate Professor
John Molson School of Business
Concordia University
_
"Absence of evidence is not evidence of absence."
- Anonymous

http://rubbersoul.concordia.ca