The "Multi dimensional array filtering" thread reminded me of a benchmarking test I've been wanting to do for some time, and since I have some tasks coming up on a client project which needs this sort of stuff it was a good time to dive in.

The goal is simple enough: one of the most common tasks I need to perform with my data is querying specific fields against criteria and, if there's a match, assembling the data from a given set of fields for display in a list to the user.

I've been using simple tab-delimited lists for this data because they're about as compact as the data can get and they perform reasonably well. But with multi-dimensional arrays, the question is whether Rev's fast hash index into array data might help me gather the data from specific fields in each record faster than chunk expressions can.
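
For a concrete sense of the two shapes, here's a made-up record in both forms (the names and values are illustrative, not from my actual data):

  Tab-delimited, one line per record, one item per field:
    Anderson <tab> Alice <tab> alice@example.com <tab> 424 <tab> Los Angeles

  The same record as array elements, keyed by record number and field number:
    tDataA[1][1] = "Anderson"
    tDataA[1][2] = "Alice"
    tDataA[1][3] = "alice@example.com"
    tDataA[1][4] = "424"
    tDataA[1][5] = "Los Angeles"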

So I took a minute to put together a simple test stack:
<http://fourthworldlabs.com/rev/speed%20test%20chunks%20vs%20arrays.rev.zip>

It has a field containing a list of contact info, another field for displaying the test results, and a third for displaying the gathered data from the query so I can verify that it's doing what I want it to.

If you download the stack you'll see that in addition to the "Test" button there's another one there which converts the list data into an array and stores that in a custom property of the field, needed for testing the array method.
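
The conversion handler itself isn't reproduced below, but there's nothing exotic about it. A minimal sketch of the approach, assuming numeric keys for both dimensions (the property name uDataA matches what the test functions below expect):

on mouseUp
  set the itemDel to tab
  put fld 1 into tData
  put 0 into tRow
  repeat for each line tLine in tData
    add 1 to tRow
    put 0 into tCol
    repeat for each item tItem in tLine
      add 1 to tCol
      -- key by record number, then by field number
      put tItem into tDataA[tRow][tCol]
    end repeat
  end repeat
  set the uDataA of fld 1 to tDataA
end mouseUp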

The code for the "Test" button is below, and I would appreciate anyone here who has the time to look it over and see what I may have missed, because the results I'm getting are not what I expected.

The test script is typical of many of the tasks I need to perform on this data: it checks one field to see if it contains a value, checks another to see if it contains a different value, and if both are true it collects data from three fields into a tab- and return-delimited list so I can drop it into a list field to display the output.

I had assumed that using chunk expressions to access items in each line would be slower than using array notation to get them through the hash in the array. But instead here's the result I'm getting (times are in milliseconds for 100 iterations):

GetFromList: 72
GetFromSubArray: 752
GetFromMainArray: 407
All results the same?: true

As noted in the code below, the GetFromList handler uses simple chunk expressions to parse the data; GetFromSubArray uses "repeat for each element" to pull the second-tier array out of each record; and GetFromMainArray walks through the keys to get the data from the main array by addressing both dimensions. The last line simply confirms that all three are returning the same result.

I can understand why GetFromSubArray is the slowest, since "repeat for each element" has to copy each record's second-tier array into the loop variable on every pass through the loop.

But I had hoped that accessing the main array by specifying the elements in both dimensions would get to the data more quickly than asking the engine to count its way through delimited items, but apparently not.

Of course there is a scaling issue with chunk expressions. In my sample data there are only eight items in each record; if there were several hundred, I imagine the list method wouldn't perform as well as the array methods. But most of the data I work with has fewer than 30 fields, and since chunk expressions are measuring about five times faster here, I'd expect to need many more fields than that before they drop below arrays in relative performance.
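
If anyone wants to poke at that scaling claim, here's a quick helper (hypothetical, not in the test stack) that pads each record out to a given item count; convert the widened data to an array and re-run the tests:

function WidenRecords pData, pItemCount
  set the itemDel to tab
  put empty into tOut
  repeat for each line tLine in pData
    put tLine into tNewLine
    put the number of items of tNewLine into tCount
    -- pad each record with dummy items out to pItemCount
    repeat with i = tCount + 1 to pItemCount
      put tab & "filler" & i after tNewLine
    end repeat
    put tNewLine & cr after tOut
  end repeat
  delete last char of tOut -- trailing CR
  return tOut
end WidenRecords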

The same could be said of the size of the data within each item, since longer items increase the time the engine needs to walk through them looking for item delimiters. But again, it's not often that I have field data that's very long (the contact list is a good example, in which the longest field data is under 200 chars), and the engine's delimiter scanning seems reasonably efficient.

Another minor drawback to arrays for this sort of thing is what they do to the size of the data, especially if you use meaningful names for your fields rather than just numbers: simple tab-delimited data needs only one set of field names, plus a function to translate those names into item numbers before dropping into any loop that uses them, while arrays replicate the field names for every record. The longer the field names, the greater the impact on data size.
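
For reference, that name-to-number translation can be as simple as this (a sketch, assuming the field names live in a header line such as line 1 of the data):

function FieldNumber pFieldName, pHeaderLine
  set the itemDel to tab
  repeat with i = 1 to the number of items of pHeaderLine
    if item i of pHeaderLine is pFieldName then return i
  end repeat
  return 0 -- not found
end FieldNumber

You'd call it once before the loop -- e.g. put FieldNumber("email", line 1 of tData) into tEmailCol -- and then use "item tEmailCol of tLine" inside the loop.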

I'd be happy to accept this as a trade-off if the speed justified it, but in my test the speed benefit just isn't there for this type of querying task.

Any thoughts on how I might optimize the array functions? Did I just overlook something obvious which would make them run faster?

--
 Richard Gaskin
 Fourth World
 Revolution training and consulting: http://www.fourthworld.com
 Webzine for Rev developers: http://www.revjournal.com


------ code for Test button -------------

on mouseUp
  put empty into fld "Results"
  put empty into fld "Output"
  wait 0 with messages -- so clearing fields is visible
  --
  -- Number of iterations:
  put 100 into n
  --
  -- Test simple chunk expressions:
  put the millisecs into t
  repeat n
    put GetFromList() into tResult1
  end repeat
  put the millisecs - t into t1
  --
  -- Test sub-array method:
  put the millisecs into t
  repeat n
    put GetFromSubArray() into tResult2
  end repeat
  put the millisecs - t into t2
  --
  -- Test main array method:
  put the millisecs into t
  repeat n
    put GetFromMainArray() into tResult3
  end repeat
  put the millisecs - t into t3
  --
  -- Display results:
  put tResult2 into fld "Output"
  put "GetFromList: "&t1 &cr\
      &"GetFromSubArray: "&t2 &cr\
      &"GetFromMainArray: "& t3 &cr\
      & "All results the same?: "&\
         ((tResult1 = tResult2) AND (tResult2 = tResult3)) \
      into fld "Results"
end mouseUp


--
-- GetFromList
-- Uses simple item and line chunk expressions to access specific data
-- elements.
--
function GetFromList
  put empty into tList
  set the itemdel to tab
  put fld 1 into tData
  repeat for each line tLine in tData
    if item 1 of tLine contains "A" and item 4 of tLine contains "4" then
      put item 1 of tLine &tab& item 3 of tLine &tab& item 5 of tLine \
       &cr after tList
    end if
  end repeat
  delete last char of tList -- trailing CR
  sort lines of tList
  return tList
end GetFromList


--
-- GetFromSubArray
-- Walks through each element in the two-dimensional array tDataA
-- by putting each into a sub-array tRecordA, then accesses specific
-- elements of tRecordA to get specific field data.
--
function GetFromSubArray
  put empty into tList
  put the uDataA of fld 1 into tDataA
  repeat for each element tRecordA in tDataA
    if tRecordA[1] contains "A" and tRecordA[4] contains "4" then
      put tRecordA[1] &tab& tRecordA[3] &tab& tRecordA[5] &\
       cr after tList
    end if
  end repeat
  delete last char of tList -- trailing CR
  sort lines of tList
  return tList
end GetFromSubArray


--
-- GetFromMainArray
-- First obtains a list of keys of the main array, and uses these
-- to walk through it, accessing specific elements by that main
-- key and a sub-key.
--
function GetFromMainArray
  put empty into tList
  put the uDataA of fld 1 into tDataA
  repeat for each key tKey in tDataA
    if tDataA[tKey][1] contains "A" and tDataA[tKey][4] contains "4" then
      put tDataA[tKey][1] &tab& tDataA[tKey][3] &tab&\
        tDataA[tKey][5] &cr after tList
    end if
  end repeat
  delete last char of tList -- trailing CR
  sort lines of tList
  return tList
end GetFromMainArray

###