The "Multi dimensional array filtering" thread reminded me of a benchmarking test I've been wanting to do for some time, and since I have some tasks coming up on a client project which needs this sort of stuff it was a good time to dive in.

The goal is simple enough: one of the most common tasks I need to perform with my data is querying specific fields against criteria and, if there's a match, assembling the data from a given set of fields for display in a list to the user.

I've been using simple tab-delimited lists for this data because they're about as compact as the data can get and they perform reasonably well. But with multi-dimensional arrays, the question is whether Rev's fast hash index into array data might help me gather the data from specific fields in each record faster than chunk expressions can.
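
For a concrete sense of the two shapes, here's a made-up record in both forms (the names and values are illustrative, not from my actual data):

  Tab-delimited, one line per record, one item per field:
    Anderson <tab> Alice <tab> alice@example.com <tab> 424 <tab> Los Angeles

  The same record as array elements, keyed by record number and field number:
    tDataA[1][1] = "Anderson"
    tDataA[1][2] = "Alice"
    tDataA[1][3] = "alice@example.com"
    tDataA[1][4] = "424"
    tDataA[1][5] = "Los Angeles"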

So I took a minute to put together a simple test stack:
<http://fourthworldlabs.com/rev/speed%20test%20chunks%20vs%20arrays.rev.zip>

It has a field containing a list of contact info, another field for displaying the test results, and a third for displaying the gathered data from the query so I can verify that it's doing what I want it to.

If you download the stack you'll see that in addition to the "Test" button there's another one there which converts the list data into an array and stores that in a custom property of the field, needed for testing the array method.
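
The conversion handler itself isn't reproduced below, but there's nothing exotic about it. A minimal sketch of the approach, assuming numeric keys for both dimensions (the property name uDataA matches what the test functions below expect):

on mouseUp
  set the itemDel to tab
  put fld 1 into tData
  put 0 into tRow
  repeat for each line tLine in tData
    add 1 to tRow
    put 0 into tCol
    repeat for each item tItem in tLine
      add 1 to tCol
      -- key by record number, then by field number
      put tItem into tDataA[tRow][tCol]
    end repeat
  end repeat
  set the uDataA of fld 1 to tDataA
end mouseUp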

The code for the "Test" button is below, and I would appreciate anyone here who has the time to look it over and see what I may have missed, because the results I'm getting are not what I expected.

The test script is typical of many of the tasks I need to perform on this data: it checks one field to see if it contains a value, checks another to see if it contains a different value, and if both are true it collects data from three fields into a tab- and return-delimited list so I can drop it into a list field to display the output.

I had assumed that using chunk expressions to access items in each line would be slower than using array notation to get them through the hash in the array. But instead here's the result I'm getting (times are in milliseconds for 100 iterations):

GetFromList: 72
GetFromSubArray: 752
GetFromMainArray: 407
All results the same?: true

As noted in the code below, the GetFromList handler uses simple chunk expressions to parse the data; GetFromSubArray uses "repeat for each element" to pull the second-tier array out of each record; and GetFromMainArray walks through the keys to get the data from the main array by addressing both dimensions. The last line simply confirms that all three are returning the same result.

I can understand why GetFromSubArray is the slowest, since "repeat for each element" has to copy each record's second-tier array into the loop variable on every pass through the loop.

But I had hoped that accessing the main array by specifying the elements in both dimensions would get to the data more quickly than asking the engine to count its way through delimited items, but apparently not.

Of course there is a scaling issue with chunk expressions. In my sample data there are only eight items in each record; if there were several hundred, I imagine the list method wouldn't perform as well as the array methods. But most of the data I work with has fewer than 30 fields, and since chunk expressions are measuring about five times faster here, I'd expect to need many more fields than that before they drop below arrays in relative performance.
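
If anyone wants to poke at that scaling claim, here's a quick helper (hypothetical, not in the test stack) that pads each record out to a given item count; convert the widened data to an array and re-run the tests:

function WidenRecords pData, pItemCount
  set the itemDel to tab
  put empty into tOut
  repeat for each line tLine in pData
    put tLine into tNewLine
    put the number of items of tNewLine into tCount
    -- pad each record with dummy items out to pItemCount
    repeat with i = tCount + 1 to pItemCount
      put tab & "filler" & i after tNewLine
    end repeat
    put tNewLine & cr after tOut
  end repeat
  delete last char of tOut -- trailing CR
  return tOut
end WidenRecords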

The same could be said of the size of the data within each item, since longer items increase the time the engine needs to walk through them looking for item delimiters. But again, it's not often that I have field data that's very long (the contact list is a good example, in which the longest field data is under 200 chars), and the engine's delimiter scanning seems reasonably efficient.

Another minor drawback to arrays for this sort of thing is what they do to the size of the data, especially if you use meaningful names for your fields rather than just numbers: simple tab-delimited data needs only one set of field names, plus a function to translate those names into item numbers before dropping into any loop that uses them, while arrays replicate the field names for every record. The longer the field names, the greater the impact on data size.
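
For reference, that name-to-number translation can be as simple as this (a sketch, assuming the field names live in a header line such as line 1 of the data):

function FieldNumber pFieldName, pHeaderLine
  set the itemDel to tab
  repeat with i = 1 to the number of items of pHeaderLine
    if item i of pHeaderLine is pFieldName then return i
  end repeat
  return 0 -- not found
end FieldNumber

You'd call it once before the loop -- e.g. put FieldNumber("email", line 1 of tData) into tEmailCol -- and then use "item tEmailCol of tLine" inside the loop.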

I'd be happy to accept this as a trade-off if the speed justified it, but in my test the speed benefit just isn't there for this type of querying task.

Any thoughts on how I might optimize the array functions? Did I just overlook something obvious which would make them run faster?

--
 Richard Gaskin
 Fourth World
 Revolution training and consulting: http://www.fourthworld.com
 Webzine for Rev developers: http://www.revjournal.com


------ code for Test button -------------

on mouseUp
  put empty into fld "Results"
  put empty into fld "Output"
  wait 0 with messages -- so clearing fields is visible
  --
  -- Number of iterations:
  put 100 into n
  --
  -- Test simple chunk expressions:
  put the millisecs into t
  repeat n
    put GetFromList() into tResult1
  end repeat
  put the millisecs - t into t1
  --
  -- Test sub-array method:
  put the millisecs into t
  repeat n
    put GetFromSubArray() into tResult2
  end repeat
  put the millisecs - t into t2
  --
  -- Test main array method:
  put the millisecs into t
  repeat n
    put GetFromMainArray() into tResult3
  end repeat
  put the millisecs - t into t3
  --
  -- Display results:
  put tResult2 into fld "Output"
  put "GetFromList: "&t1 &cr\
      &"GetFromSubArray: "&t2 &cr\
      &"GetFromMainArray: "& t3 &cr\
      & "All results the same?: "&\
         ((tResult1 = tResult2) AND (tResult2 = tResult3)) \
      into fld "Results"
end mouseUp


--
-- GetFromList
-- Uses simple item and line chunk expressions to access specific data
-- elements.
--
function GetFromList
  put empty into tList
  set the itemdel to tab
  put fld 1 into tData
  repeat for each line tLine in tData
    if item 1 of tLine contains "A" and item 4 of tLine contains "4" then
      put item 1 of tLine &tab& item 3 of tLine &tab& item 5 of tLine \
       &cr after tList
    end if
  end repeat
  delete last char of tList -- trailing CR
  sort lines of tList
  return tList
end GetFromList


--
-- GetFromSubArray
-- Walks through each element in the two-dimensional array tDataA
-- by putting each into a sub-array tRecordA, then accesses specific
-- elements of tRecordA to get specific field data.
--
function GetFromSubArray
  put empty into tList
  put the uDataA of fld 1 into tDataA
  repeat for each element tRecordA in tDataA
    if tRecordA[1] contains "A" and tRecordA[4] contains "4" then
      put tRecordA[1] &tab& tRecordA[3] &tab& tRecordA[5] &\
       cr after tList
    end if
  end repeat
  delete last char of tList -- trailing CR
  sort lines of tList
  return tList
end GetFromSubArray


--
-- GetFromMainArray
-- First obtains a list of keys of the main array, and uses these
-- to walk through it, accessing specific elements by that main
-- key and a sub-key.
--
function GetFromMainArray
  put empty into tList
  put the uDataA of fld 1 into tDataA
  repeat for each key tKey in tDataA
    if tDataA[tKey][1] contains "A" and tDataA[tKey][4] contains "4" then
      put tDataA[tKey][1] &tab& tDataA[tKey][3] &tab&\
        tDataA[tKey][5] &cr after tList
    end if
  end repeat
  delete last char of tList -- trailing CR
  sort lines of tList
  return tList
end GetFromMainArray

###