Kevin Miller wrote:

> On 05/08/2009 21:02, "Richard Gaskin" <ambassador at fourthworld.com> wrote:
>
>> Excellent sleuthing, Trevor.  Confirmed: with that change I'm getting
>> the same results.  Who would have thought there could be so much
>> overhead moving a custom property array into a variable array?
>
> Bear in mind that when retrieving a custom property, the engine has to look
> up whether there is a getProp handler for it. Locking messages should give
> you an improvement.

In my tests here the difference was measurable but minor, but then again I have no getProp handlers in the message path for these tests.
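
For anyone who wants to measure that themselves, here's a rough sketch of the sort of timing test involved (cMyDataArray is just a placeholder property name):

  on testPropLoad
     local tStart, tData
     -- with messages unlocked the engine checks for a getProp handler
     put the milliseconds into tStart
     put the cMyDataArray of this stack into tData
     put "unlocked:" && the milliseconds - tStart & cr into msg
     -- locking messages skips the getProp lookup
     lock messages
     put the milliseconds into tStart
     put the cMyDataArray of this stack into tData
     unlock messages
     put "locked:" && the milliseconds - tStart & cr after msg
  end testPropLoad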


> The main difference between arrays and chunks is that arrays will continue
> to access huge data sets in linear time, whereas chunks (depending on
> exactly what you're doing) will slow down as the data set gets larger.

Very true, though the difference becomes most significant only when the data is large enough that it may be a good time to consider SQLite over stack files for storage anyway. :)
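
For anyone new to the distinction, the contrast looks roughly like this, with tData holding return-delimited text and tArray holding the same records keyed by number:

  -- indexed chunk access: "line i of tData" counts return
  -- characters from the top of the value on every call, so
  -- random access slows down as tData grows
  repeat with i = 1 to the number of lines of tData
     get line i of tData
  end repeat

  -- array access: each lookup hashes straight to the element,
  -- regardless of how large the array is
  repeat with i = 1 to the number of lines of tData
     get tArray[i]
  end repeat

(For purely sequential reads, "repeat for each line" avoids the rescan; it's random access where chunks fall behind.)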


In my tests I ran three sets of data:

small: 5000 records, with 50 items in each record, with 50 chars in each item

medium: 10,000 records with 100 items in each record, with 100 chars in each item

large: 10,000 records with 100 items in each record, with 200 chars in each item

I put both the delimited and array data into the same stack for each, giving me a size for the small stack of about 27MB, medium was about 204MB, and large was over 408MB.
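
For anyone wanting to reproduce the setup, a simplified generator along these lines will produce both versions (property names here are just examples, and real test data would want varied content):

  on buildTestData pNumRecords, pNumItems, pItemLength
     local tItem, tRecord, tData, tArray
     -- build one item, then one tab-delimited record
     repeat pItemLength times
        put "x" after tItem
     end repeat
     repeat pNumItems times
        put tItem & tab after tRecord
     end repeat
     delete last char of tRecord -- trailing tab
     -- build the delimited version and the array version side by side
     repeat with i = 1 to pNumRecords
        put tRecord & cr after tData
        put tRecord into tArray[i]
     end repeat
     delete last char of tData -- trailing return
     -- store both in the same stack to compare file sizes
     set the cChunkData of this stack to tData
     set the cArrayData of this stack to tArray
  end buildTestData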

The small data set performed well with both methods, as did the medium, though the medium stack took some time to load (confirming my hunch that 100MB of data is probably a good boundary for considering SQLite over stack files; fortunately I'm using this only for document files, so it's unlikely I'll ever reach even half that).

The large data set could be created and saved, but the resulting stack could not be opened: there was no corruption warning, it just wouldn't open. Have I discovered an undocumented limitation?

The results were as we would expect: as the data grows in size, performance of the array search method scales roughly linearly, while the chunk method degrades much faster than linearly with data length, since each indexed chunk access must rescan the data.

Even so, chunk expressions consistently outperformed arrays in tests that included loading the data from properties.

When I altered the test to preload the data into variables before testing, the difference was just under an order of magnitude in favor of arrays.
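
I won't paste the whole test script here, but the two search styles being compared look, in spirit, something like this pair (simplified, not my exact code):

  -- chunk version: "line i of pData" rescans from the top on
  -- every pass, which is where the nonlinear slowdown comes from
  function findInChunks pData, pSearch
     local tResults
     repeat with i = 1 to the number of lines of pData
        if pSearch is in line i of pData then
           put line i of pData & cr after tResults
        end if
     end repeat
     return tResults
  end findInChunks

  -- array version: the same test against each element
  function findInArray @pArray, pSearch
     local tResults
     repeat for each key tKey in pArray
        if pSearch is in pArray[tKey] then
           put pArray[tKey] & cr after tResults
        end if
     end repeat
     return tResults
  end findInArray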

While this means changing my setup to preload data when documents are first opened, that one-time hit is more than compensated for by ongoing performance gains in nearly all other operations.
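
In practice that's just a one-time copy in the open handler, something like this (cDocData being a placeholder property name):

  local sDocData -- script-local holding the preloaded array

  on openStack
     -- one-time hit: copy the stored array into the script-local
     -- so every later operation works on the variable
     lock messages
     put the cDocData of this stack into sDocData
     unlock messages
     pass openStack
  end openStack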


So I'm strongly favoring arrays for this sort of thing, but it would be nice to have three enhancements in the engine to make it even better:

1. faster load time
-------------------
Can the operation that moves data from array custom props into array variables be optimized to reduce the surprising amount of time it takes? Grabbing a non-array property is nearly as fast as accessing a global; it'd be nice if accessing array props were at least a bit closer to that stellar performance.
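
For the curious, the gap is easy to see with a crude harness like this (placeholder property names; assumes the props were set beforehand):

  on compareAccessSpeed
     global gTestData
     local tStart, tData
     put the milliseconds into tStart
     repeat 1000 times -- global access: the baseline
        put gTestData into tData
     end repeat
     put "global:" && the milliseconds - tStart & cr into msg
     put the milliseconds into tStart
     repeat 1000 times -- non-array prop: nearly as fast
        put the cSimpleProp of this stack into tData
     end repeat
     put "simple prop:" && the milliseconds - tStart & cr after msg
     put the milliseconds into tStart
     repeat 1000 times -- array prop: the surprisingly slow one
        put the cArrayProp of this stack into tData
     end repeat
     put "array prop:" && the milliseconds - tStart & cr after msg
  end compareAccessSpeed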

2. operate directly on properties
---------------------------------
It would be very handy if we could use the same array syntax to work with properties that we can with variables. Before multi-dimensional arrays there was an enjoyable, learnable, and efficient parity between the syntax used for arrays in vars and in props, and I miss that parity when working with nested arrays.
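
To illustrate what I mean: one level of array syntax already works on props, but nested access means copying the whole array out. The last line below is the wished-for syntax, not anything the engine currently supports:

  -- this works today, one dimension deep:
  set the cContacts[tID] of this stack to tRecord
  get the cContacts[tID] of this stack

  -- nested elements currently require copying the whole array out:
  put the cContacts of this stack into tContacts
  put tContacts[tID]["Telephone"] into tPhone

  -- the wish (hypothetical syntax, not in the engine):
  get the cContacts[tID]["Telephone"] of this stack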


3. reduce data redundancy in keys
---------------------------------
Given that Rev's arrays are associative, every element is a name-value pair, so in addition to storing the value the engine must also store the name as its key. That's necessary because, for all the engine knows, every array may contain unique keys; but when building nested arrays in which the inner arrays are all uniform, the replicated key names just take up space.

For example, the Congress contact info I used in my original example is only 530 lines with less than 1/2k per line, so tucking that into a property in a new stack gives me a stack file of about 68k.

But when I make an array version of that data, using meaningful names for the elements (e.g., "Name", "Address", "Telephone", etc.), and store it in a property in another stack, the stack size goes up to 132k - nearly double.
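
A toy example of the shape (made-up values) shows where the duplication comes from:

  -- every inner array carries its own copy of the same three keys
  put "Jane Doe" into tContacts[1]["Name"]
  put "123 Main St" into tContacts[1]["Address"]
  put "555-0100" into tContacts[1]["Telephone"]
  put "John Roe" into tContacts[2]["Name"]
  put "456 Oak Ave" into tContacts[2]["Address"]
  put "555-0101" into tContacts[2]["Telephone"]
  -- ...times 530 records: "Name", "Address", and "Telephone"
  -- get stored once per record in the stack file
  set the cContacts of this stack to tContacts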

So I was daydreaming: what if we could tell the engine that a given array will be populated only with sub-arrays whose keys are always the same?

Imagine being able to define something like a struct: a key definition that could be assigned to a parent array, so the engine could store the data more efficiently, tucking only the values into its nifty hash without having to replicate the field names for every element.

I would imagine such a struct-like thing would have a great many uses, in addition to reducing memory and storage requirements for uniform arrays as elements of a parent array.
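
In purely imaginary syntax, the idea might read something like:

  -- not real syntax, just the daydream written down:
  define struct "ContactRecord" with keys "Name", "Address", "Telephone"
  set the elementStruct of tContacts to "ContactRecord"
  -- from here the engine would store only the values per record,
  -- resolving key names through the shared definition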

Doable?  By Tuesday, perhaps? :)

--
 Richard Gaskin
 Fourth World
 Revolution training and consulting: http://www.fourthworld.com
 Webzine for Rev developers: http://www.revjournal.com