Re: Comparing Big Lists
On Mon, 29 Apr 2002, Gregory Lypny <[EMAIL PROTECTED]> wrote:

> I tried your suggestion of turning smallList into an associative
> array with the index for each element equal to the text I'm looking
> for in bigList. I think I must have misunderstood your suggestion
> because the handler runs much slower than previously, perhaps because
> I've got it asking for the keys of smallList for every line of
> bigList.

Exactly: keys() is a very expensive operation because it has to
traverse the whole structure, pull out specific items, and put them
into a new buffer.

> Here's what I tried.
>
> -- Note: smallListArray is an array made out of the original
> -- smallList variable
>
> repeat for each line i in bigList
>   if item 6 of i is in keys(smallListArray) then
>     put i into hitList[item 6 of i]
>   end if
> end repeat

The trick here is that smallListArray will not necessarily have any
data in it; it's the keys that are significant. So just put "x" into
each element when building the array, and then check whether
smallListArray[item 6 of i] is empty as you traverse the big list.

Regards,
  Scott

Scott Raney  [EMAIL PROTECTED]  http://www.metacard.com
MetaCard: You know, there's an easier way to do that...

___
metacard mailing list
[EMAIL PROTECTED]
http://lists.runrev.com/mailman/listinfo/metacard
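For readers outside MetaTalk, Scott's advice maps onto using a dictionary as a set: store a dummy value under each key taken from the small list, then test membership line by line instead of recomputing the full key list on every iteration. A Python sketch with made-up sample data (the column positions follow the thread: item 2 of the small list, item 6 of the big one):

```python
# Build a dictionary keyed on the lookup column of the small list,
# storing a dummy value ("x") -- only the keys matter, as Scott notes.
small_list = "a\t101\nb\t202\nc\t303"
big_list = "1\t2\t3\t4\t5\t101\n1\t2\t3\t4\t5\t999\n1\t2\t3\t4\t5\t303"

small_index = {}
for line in small_list.split("\n"):
    key = line.split("\t")[1]      # "item 2" of the small list
    small_index[key] = "x"         # dummy value; the keys are what count

# Traverse the big list once, testing membership per line instead of
# regenerating the whole key set (the keys() call that slowed things down).
hits = {}
for line in big_list.split("\n"):
    key = line.split("\t")[5]      # "item 6" of the big list
    if key in small_index:         # O(1) average-case hash lookup
        hits[key] = line
```

The per-line test is a constant-time hash lookup, which is why this version avoids the slowdown Greg saw.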
Re: Comparing Big Lists
Thanks Dave,

You put a lot of work into your example, and I appreciate it. Sorry I
didn't look at it sooner. I suspect Dave's approach below (which is
what I suspect Scott was suggesting as well, although I misunderstood
it) will be fastest. To make it more general, I might have to nest
another 'repeat for...' loop to capture cases where the item to be
compared in tLargeList is not always located in item 3.

For those curious, I'm helping a colleague search through the file
pertaining to the human genome database. It's about 80 MB.

Regards,
Greg

Gregory Lypny
Associate Professor
John Molson School of Business
Concordia University

"Absence of evidence is not evidence of absence." - Anonymous
http://rubbersoul.concordia.ca

-- a snip from Dave Cragg's test...

repeat for each line tLine in tLargeList
  if tArray[item 3 of tLine] <> empty then
    put tArray[item 3 of tLine] & ":" & tLine & cr after tMergeList2
  end if
end repeat
Re: Comparing Big Lists
Hi Scott,

I tried your suggestion of turning smallList into an associative array
with the index for each element equal to the text I'm looking for in
bigList. I think I must have misunderstood your suggestion because the
handler runs much slower than previously, perhaps because I've got it
asking for the keys of smallList for every line of bigList.

Here's what I tried.

-- Note: smallListArray is an array made out of the original
-- smallList variable

repeat for each line i in bigList
  if item 6 of i is in keys(smallListArray) then
    put i into hitList[item 6 of i]
  end if
end repeat

Message: 3
Subject: Re: Comparing big lists
Date: Sat, 27 Apr 2002 16:10:42 -0400
From: Gregory Lypny <[EMAIL PROTECTED]>
To: "MetaCard List" <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]

Thanks for the suggestion, Scott. I'll give it a shot. I've also tried
looping over the lines of bigList (i.e., a nested repeat), simply using
the 'in' operator: if x is in y, then... It takes about 6 minutes on a
modest (300 MHz) iBook running OS X, but I'm hoping for an improvement.

Regards,
Greg

On 27/4/2002 12:08 PM, [EMAIL PROTECTED] wrote:

Message: 2
Date: Fri, 26 Apr 2002 12:48:53 -0600 (MDT)
From: Scott Raney <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Subject: Re: Comparing big lists
Reply-To: [EMAIL PROTECTED]

On Thu, 25 Apr 2002, Gregory Lypny <[EMAIL PROTECTED]> wrote:

> Thought I would pick your brains on the topic of comparing two big
> lists. Both are tab delimited. bigList has about 100,000 lines and
> 6 items (columns) per line. smallList is about 15,000 lines and 2
> items per line. I want to identify the lines in bigList in which
> the third item is the same as the second item in a line in
> smallList, and then pull out the intersection. I used something
> like this, which works fine.
>
> set the itemDelimiter to tab
> repeat for each line j of smallList
>   put lineOffset(item 2 of j, bigList) into thisLine
>   if thisLine is not 0 then put j & tab & \
>       line thisLine of bigList & return after mergedList
> end repeat
> delete last character of mergedList -- Get rid of the trailing Return
>
> Using the lineOffset function seemed the obvious choice to me, but I'm
> also interested in other approaches.

LineOffset on such a big variable is going to be pretty expensive.
Another option would be to use split to build an array out of
smallList and then loop over each line in bigList to see if there is
an array index for it. Split takes a while and will use up a good bit
of memory, but it makes the lookups *much* faster. You could save some
of that space by building up an array of just the relevant items in
one list or the other by looping over the lines and creating one array
index for each.

Regards,
Scott
Re: Comparing big lists
At 6:48 pm -0700 28/4/02, erik hansen wrote:

> how about a short example?

Not so short. :)

Here's a "useless" example to illustrate the cost of incrementing
chunk expressions in a long loop vs using "repeat for each". It simply
copies a list, first using chunk expressions, next using "repeat for
each". Vary the value of tListSize to see how the first method gets
quadratically slower, whereas the second method's time grows
proportionally.

on mouseUp
  # build a list of numbers
  put 1000 into tListSize
  repeat tListSize
    put random(100) & cr after tList
  end repeat
  delete char -1 of tList
  put empty into tOutList

  # test 1
  put the milliseconds into tStart
  put the number of lines of tList into tNumLines
  repeat with i = 1 to tNumLines
    put line i of tList & cr after tOutList
  end repeat
  delete char -1 of tOutList
  put the milliseconds - tStart into tTime
  put "Test 1:" && tTime into tResult
  put empty into tOutList

  # test 2
  put the milliseconds into tStart
  repeat for each line tLine in tList
    put tLine & cr after tOutList
  end repeat
  delete char -1 of tOutList
  put the milliseconds - tStart into tTime
  put cr & "Test 2:" && tTime after tResult

  answer tResult
end mouseUp

Here's an example that "kind of" deals with Greg's original question.
This builds two lists of numbers, one of two columns and one of six.
It rigs the columns to be compared by prepending and appending an "x".
This prevents lineOffset in the first method from finding the value in
the wrong column, and avoids the problem of substrings matching. The
value in the second column of the small list is unique. (I take it
from Greg's mail that his data was something like this, but in
general, you'll need to do more work with both methods.)

The first method uses lineOffset in a loop. The second method builds
an array from one list, and then loops through the second list a line
at a time, checking if there is an array entry for the item being
compared. The time difference is substantial.

This doesn't mean there isn't a place for the offset functions. For
example, if searching for how many times a single word occurs in some
text, they work fine, and there's no overhead of building an array.
But when comparing two large lists, they can get really slow.

on mouseUp
  # build two lists for testing
  put 1 into tLargeListSize
  put 1000 into tSmallListSize

  repeat tLargeListSize
    put empty into tLargeLine
    repeat 6
      put random(100) & comma after tLargeLine
    end repeat
    put "x" before item 3 of tLargeLine
    put "x" after item 3 of tLargeLine
    delete char -1 of tLargeLine
    put tLargeLine & cr after tLargeList
  end repeat
  delete char -1 of tLargeList

  repeat with i = 1 to tSmallListSize
    put empty into tSmallLine
    put random(1) & comma & "x" & i & "x" after tSmallLine
    put tSmallLine & cr after tSmallList
  end repeat
  delete char -1 of tSmallList
  sort lines of tSmallList by item 1 of each ## random sort

  # start test

  # test 1 -- using lineOffset
  put the milliseconds into tStart
  repeat for each line j in tLargeList
    put lineOffset(item 3 of j, tSmallList) into tOff
    if tOff is not 0 then
      put line tOff of tSmallList & ":" & j & cr after tMergeList1
    end if
  end repeat
  delete char -1 of tMergeList1
  put the milliseconds - tStart into tTime
  put "Test 1:" && tTime into tResult

  # test 2 -- using an array
  put the milliseconds into tStart
  # build array of item 2 of small list
  repeat for each line tLine in tSmallList
    put tLine into tArray[item 2 of tLine] ## to make tests comparable
  end repeat
  repeat for each line tLine in tLargeList
    if tArray[item 3 of tLine] <> empty then
      put tArray[item 3 of tLine] & ":" & tLine & cr after tMergeList2
    end if
  end repeat
  delete char -1 of tMergeList2
  put the milliseconds - tStart into tTime
  put cr & "Test 2:" && tTime after tResult

  # put tMergeList1 into field 1
  # put tMergeList2 into field 2
  answer tResult
end mouseUp

Cheers
Dave Cragg
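Dave's first pair of tests translates to a pattern common to most languages: re-deriving "line i" from the raw text on every pass costs a full scan per access, making the whole loop quadratic, while a single traversal is linear. A rough Python analogue (the deliberate re-split mimics the counting cost of "line i of tList"; exact timings are illustrative only):

```python
import time

text = "\n".join(str(n) for n in range(2000))

# Test 1: rescan the buffer for line i on every iteration,
# mimicking "line i of tList" inside a counted repeat loop.
start = time.perf_counter()
out1 = []
num_lines = text.count("\n") + 1
for i in range(num_lines):
    out1.append(text.split("\n")[i])   # re-splits the whole buffer each time
t1 = time.perf_counter() - start

# Test 2: walk the lines once, mimicking "repeat for each line".
start = time.perf_counter()
out2 = []
for line in text.split("\n"):
    out2.append(line)
t2 = time.perf_counter() - start
```

Both methods produce the same list, but test 1's time grows with the square of the list size while test 2's grows linearly.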
Re: Comparing big lists
on 27/4/02 9:10 PM, Gregory Lypny at [EMAIL PROTECTED] wrote:

> Thanks for the suggestion, Scott. I'll give it a shot. I've also tried
> looping over the lines of bigList (i.e., a nested repeat), simply using
> the 'in' operator: if x is in y, then... It takes about 6 minutes on a
> modest (300 MHz) iBook running OS X, but I'm hoping for an improvement,

I wrote a general version of exactly this recently (take two
tab-and-return-delimited files; from each select one column to match
on; from each select columns to output; then choose to output the
merged file, and/or the discards from one or other file).

My first version used a 'clever' function based on lineOffset to
locate all the matching lines. With inputs of 50,000 and 10,000 lines,
it took about 3 minutes on a 400 MHz TiBook (OS 9).

Then I rewrote it to put one file into an array indexed on the
requested column, e.g.

  set the itemDelimiter to tab
  put empty into srcDataBarray
  repeat for each line r in srcDataB
    put r into srcDataBarray[(item LinkColB of r)]
  end repeat

and loop through the other file (using repeat for each, of course, not
repeat with a line number), testing against the array. It went down to
a few seconds. Then I took out the progress feedback - now it is
consistently 1 second or less.

It's so fast that the difference between indexing the 'small' and
'large' files is undetectable; and it does all the extra stuff (making
up all three lists, the merged and the two discard sets) every time,
only checking at the end which of the three I actually asked it to
save.

The combination of 'repeat for each' and MC/Rev's hashed arrays is
just blinding. A fantastic illustration of why a fourth-generation
language can give not only fast development but also fast execution.

Ben Rubinstein           | Email: [EMAIL PROTECTED]
Cognitive Applications Ltd | Phone: +44 (0)1273-821600
http://www.cogapp.com    | Fax: +44 (0)1273-728866
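Ben's generalized tool can be sketched in Python: index one file on its chosen key column, stream the other file past that index, and collect the merged output plus both discard sets in a single pass. The function name, column numbers, and sample data below are illustrative, not taken from Ben's actual code:

```python
def merge_on_columns(src_a, src_b, key_col_a, key_col_b):
    """Index src_b on key_col_b, then stream src_a past it, producing
    the merged lines plus the discards from each file in one pass."""
    index = {}
    for line in src_b.splitlines():
        index[line.split("\t")[key_col_b]] = line

    merged, discards_a, matched = [], [], set()
    for line in src_a.splitlines():
        key = line.split("\t")[key_col_a]
        if key in index:
            merged.append(line + "\t" + index[key])   # join the two rows
            matched.add(key)
        else:
            discards_a.append(line)                   # no partner in src_b
    # lines of src_b whose key was never matched by src_a
    discards_b = [l for k, l in index.items() if k not in matched]
    return merged, discards_a, discards_b

a = "r1\tk1\nr2\tk2\nr3\tk9"
b = "k1\tv1\nk2\tv2\nk5\tv5"
merged, only_a, only_b = merge_on_columns(a, b, 1, 0)
```

As Ben observes, computing all three result sets every time costs almost nothing once the lookups are hash-based, so the caller can simply pick which outputs to save at the end.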
Re: Comparing big lists
how about a short example? thanks,

erik

--- Dave Cragg <[EMAIL PROTECTED]> wrote:

> I think you'll find adding stuff at the end of a variable extremely
> fast using "put x after y". However, the "line thisLine of bigList"
> part will get progressively slower as the value of thisLine
> increases. This is why the general advice is to use "repeat for each"
> where possible and avoid using incremented chunk expressions, which
> have to count through the data each time. However, going back to the
> original example, if you don't expect to find many matches, that part
> may not be so costly. On the other hand, lineOffset will get called
> for each line, making it fairly expensive.
>
> So, if possible, in large loops:
>
> -- use "repeat for each" (or "split" as Scott suggested)
> -- use "put x after y"
> -- avoid using "line n of y"
> -- avoid using lineOffset

[EMAIL PROTECTED]  http://www.erikhansen.org

__
Do You Yahoo!?
Yahoo! Health - your guide to health and wellness
http://health.yahoo.com
Re: Comparing big lists
At 11:32 pm + 27/4/02, jbv wrote:

> I'm wondering: which is the most expensive?
>
> Is it lineOffset, or is it:
>   put j & tab & line thisLine of bigList & return after mergedList
>
> I vaguely recall some discussion about adding new lines at the end of
> a variable inside a loop, and the loop getting slower & slower...
> I'm even pretty sure it was already a problem in HC and/or in OMO...

I think you'll find adding stuff at the end of a variable extremely
fast using "put x after y". However, the "line thisLine of bigList"
part will get progressively slower as the value of thisLine increases.
This is why the general advice is to use "repeat for each" where
possible and avoid using incremented chunk expressions, which have to
count through the data each time. However, going back to the original
example, if you don't expect to find many matches, that part may not
be so costly. On the other hand, lineOffset will get called for each
line, making it fairly expensive.

So, if possible, in large loops:

-- use "repeat for each" (or "split" as Scott suggested)
-- use "put x after y"
-- avoid using "line n of y"
-- avoid using lineOffset

Cheers
Dave Cragg
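Dave's "put x after y" rule has a counterpart in most languages: append new output to a growable buffer and assemble the final result once, rather than re-indexing or rebuilding the accumulated text inside the loop. A minimal Python sketch of the same idiom, with throwaway data:

```python
# Source data: a list standing in for the lines being processed.
lines = [str(n) for n in range(1000)]

# Accumulate by appending (amortized O(1) per append, like "put x after y"),
# then concatenate once at the end.
parts = []
for line in lines:
    parts.append(line)
result = "\n".join(parts)   # one final O(n) concatenation
```

The anti-pattern Dave warns against is the read side, not the write side: pulling "line n" out of the growing buffer each time is what forces a rescan per iteration.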
Re: Comparing big lists
I'm wondering: which is the most expensive?

Is it lineOffset, or is it:

  put j & tab & line thisLine of bigList & return after mergedList

I vaguely recall some discussion about adding new lines at the end of
a variable inside a loop, and the loop getting slower & slower...
I'm even pretty sure it was already a problem in HC and/or in OMO...

Anyone?

JB
Re: Comparing big lists
On Saturday, April 27, 2002, at 02:10 PM, Gregory Lypny wrote:

> It takes about 6 minutes on a modest (300 MHz) iBook running OS X,
> but I'm hoping for an improvement,

Maybe. If you are able to keep the lists sorted, you can come up with
a faster way that takes advantage of that.

Dar Scott
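Dar's hint corresponds to a classic merge join: when both lists are kept sorted on the compared column, one synchronized pass finds every match with no array at all. A Python sketch, assuming (as in Greg's data) that the keys in the small list are unique; the helper name and sample rows are illustrative:

```python
def merge_join(big_lines, small_lines, big_col, small_col):
    """Single pass over two lists pre-sorted on the compared column.
    Assumes the keys in small_lines are unique."""
    out = []
    i = j = 0
    while i < len(big_lines) and j < len(small_lines):
        big_key = big_lines[i].split("\t")[big_col]
        small_key = small_lines[j].split("\t")[small_col]
        if big_key == small_key:
            out.append(small_lines[j] + "\t" + big_lines[i])
            i += 1            # small keys are unique, so keep j in place
        elif big_key < small_key:
            i += 1            # advance whichever side is behind
        else:
            j += 1
    return out

big = sorted(["x\ty\tk2", "x\ty\tk1", "x\ty\tk3"],
             key=lambda l: l.split("\t")[2])
small = sorted(["a\tk1", "a\tk3"],
               key=lambda l: l.split("\t")[1])
result = merge_join(big, small, 2, 1)
```

This is linear in the combined length of the two lists, but it only pays off if the sort order can be maintained between runs; otherwise the sorting cost dominates and the hash-array approach from earlier in the thread wins.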
Re: Comparing big lists
Thanks for the suggestion, Scott. I'll give it a shot. I've also tried
looping over the lines of bigList (i.e., a nested repeat), simply
using the 'in' operator: if x is in y, then... It takes about 6
minutes on a modest (300 MHz) iBook running OS X, but I'm hoping for
an improvement.

Regards,
Greg

On 27/4/2002 12:08 PM, [EMAIL PROTECTED] wrote:

> Message: 2
> Date: Fri, 26 Apr 2002 12:48:53 -0600 (MDT)
> From: Scott Raney <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Subject: Re: Comparing big lists
> Reply-To: [EMAIL PROTECTED]
>
> On Thu, 25 Apr 2002, Gregory Lypny <[EMAIL PROTECTED]> wrote:
>
>> Thought I would pick your brains on the topic of comparing two big
>> lists. Both are tab delimited. bigList has about 100,000 lines and
>> 6 items (columns) per line. smallList is about 15,000 lines and 2
>> items per line. I want to identify the lines in bigList in which
>> the third item is the same as the second item in a line in
>> smallList, and then pull out the intersection. I used something
>> like this, which works fine.
>>
>> set the itemDelimiter to tab
>> repeat for each line j of smallList
>>   put lineOffset(item 2 of j, bigList) into thisLine
>>   if thisLine is not 0 then put j & tab & \
>>       line thisLine of bigList & return after mergedList
>> end repeat
>> delete last character of mergedList -- Get rid of the trailing Return
>>
>> Using the lineOffset function seemed the obvious choice to me, but
>> I'm also interested in other approaches.
>
> LineOffset on such a big variable is going to be pretty expensive.
> Another option would be to use split to build an array out of
> smallList and then loop over each line in bigList to see if there is
> an array index for it. Split takes a while and will use up a good bit
> of memory, but it makes the lookups *much* faster. You could save
> some of that space by building up an array of just the relevant items
> in one list or the other by looping over the lines and creating one
> array index for each.
>
> Regards,
> Scott
>
>> Regards,
>> Greg
>
> Scott Raney  [EMAIL PROTECTED]  http://www.metacard.com
> MetaCard: You know, there's an easier way to do that...
Re: Comparing big lists
On Thu, 25 Apr 2002, Gregory Lypny <[EMAIL PROTECTED]> wrote:

> Thought I would pick your brains on the topic of comparing two big
> lists. Both are tab delimited. bigList has about 100,000 lines and
> 6 items (columns) per line. smallList is about 15,000 lines and 2
> items per line. I want to identify the lines in bigList in which
> the third item is the same as the second item in a line in
> smallList, and then pull out the intersection. I used something
> like this, which works fine.
>
> set the itemDelimiter to tab
> repeat for each line j of smallList
>   put lineOffset(item 2 of j, bigList) into thisLine
>   if thisLine is not 0 then put j & tab & \
>       line thisLine of bigList & return after mergedList
> end repeat
> delete last character of mergedList -- Get rid of the trailing Return
>
> Using the lineOffset function seemed the obvious choice to me, but I'm
> also interested in other approaches.

LineOffset on such a big variable is going to be pretty expensive.
Another option would be to use split to build an array out of
smallList and then loop over each line in bigList to see if there is
an array index for it. Split takes a while and will use up a good bit
of memory, but it makes the lookups *much* faster. You could save some
of that space by building up an array of just the relevant items in
one list or the other by looping over the lines and creating one array
index for each.

Regards,
Scott

> Regards,
> Greg

Scott Raney  [EMAIL PROTECTED]  http://www.metacard.com
MetaCard: You know, there's an easier way to do that...
Comparing big lists
Hi Everyone,

Thought I would pick your brains on the topic of comparing two big
lists. Both are tab delimited. bigList has about 100,000 lines and 6
items (columns) per line. smallList is about 15,000 lines and 2 items
per line. I want to identify the lines in bigList in which the third
item is the same as the second item in a line in smallList, and then
pull out the intersection. I used something like this, which works
fine.

set the itemDelimiter to tab
repeat for each line j of smallList
  put lineOffset(item 2 of j, bigList) into thisLine
  if thisLine is not 0 then put j & tab & \
      line thisLine of bigList & return after mergedList
end repeat
delete last character of mergedList -- Get rid of the trailing Return

Using the lineOffset function seemed the obvious choice to me, but I'm
also interested in other approaches.

Regards,
Greg

Gregory Lypny
Associate Professor
John Molson School of Business
Concordia University

"Absence of evidence is not evidence of absence." - Anonymous
http://rubbersoul.concordia.ca