The dictionary's basic structure includes the number of items, each item's
length in sequential order, a sorted list of item offsets, and the data for
each item.

Dictionary
int itemCount
int[] itemLengths
int[] sortedItemOffsets
byte[] data
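
The layout above can be sketched in Java as follows. This is an illustrative
sketch only: the class and field names are hypothetical, and item lengths are
kept here in a parallel sorted array so that the binary-search comparison is
self-contained (the real dictionary keeps lengths in sequential order).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the dictionary layout: a count, offsets into
// the data buffer sorted by item bytes, and the concatenated item data.
class DictionarySketch {
    int itemCount;
    int[] sortedOffsets;   // offsets into data, sorted by item bytes
    int[] sortedLengths;   // length of the item at each sorted offset
    byte[] data;           // concatenated item bytes

    // Build from items already in sorted order (for illustration).
    DictionarySketch(String... sortedItems) {
        itemCount = sortedItems.length;
        sortedOffsets = new int[itemCount];
        sortedLengths = new int[itemCount];
        int total = 0;
        for (String s : sortedItems) {
            total += s.getBytes(StandardCharsets.UTF_8).length;
        }
        data = new byte[total];
        int pos = 0;
        for (int i = 0; i < itemCount; i++) {
            byte[] b = sortedItems[i].getBytes(StandardCharsets.UTF_8);
            sortedOffsets[i] = pos;
            sortedLengths[i] = b.length;
            System.arraycopy(b, 0, data, pos, b.length);
            pos += b.length;
        }
    }

    // Binary search over the sorted offsets: the original lookup path.
    int indexOf(byte[] key) {
        int lo = 0, hi = itemCount - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = compare(sortedOffsets[mid], sortedLengths[mid], key);
            if (cmp == 0) return mid;
            if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return -1;  // not found
    }

    // Unsigned byte-wise comparison of the stored item against the key.
    private int compare(int off, int len, byte[] key) {
        int n = Math.min(len, key.length);
        for (int i = 0; i < n; i++) {
            int d = (data[off + i] & 0xFF) - (key[i] & 0xFF);
            if (d != 0) return d;
        }
        return len - key.length;
    }
}
```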

The basic implementation appends an item to the list the first time a lookup
for it fails, so items are stored in first-seen order. Each lookup performed
a binary search over the sorted offsets.
This morning I switched the binary search out for a sorted map (Java's
TreeMap). Now each record is added to the TreeMap, which gives O(log n)
lookups without re-searching the sorted offsets. The final structure does
not change, but the lookups are much faster. Here are the new numbers for
the two filter queries.

Query q00 (500mb)
---------------
2m11.937s Saxon
3m28.708s VXQuery - 1 partition
1m47.574s VXQuery - 2 partitions
1m17.049s VXQuery - 4 partitions

Query q01 (500mb)
---------------
2m07.096s Saxon
3m25.250s VXQuery - 1 partition
1m45.465s VXQuery - 2 partitions
1m10.984s VXQuery - 4 partitions
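
The TreeMap change described above can be sketched as follows. This is a
minimal sketch with hypothetical names (lookupOrInsert, sortedItems), not
VXQuery's actual dictionary code: on a hit the existing index is returned,
on a miss the item is appended and registered in the map, and because a
TreeMap iterates its keys in sorted order, the sorted offset list in the
final structure can still be emitted unchanged at serialization time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch: the dictionary keeps a TreeMap from item to its
// insertion index, replacing the binary search over sorted offsets.
class TreeMapDictionary {
    private final TreeMap<String, Integer> index = new TreeMap<>();
    private final List<String> itemsInInsertionOrder = new ArrayList<>();

    // Return the existing index on a hit; append the item on a miss.
    int lookupOrInsert(String item) {
        Integer id = index.get(item);   // O(log n) lookup
        if (id != null) return id;
        int newId = itemsInInsertionOrder.size();
        itemsInInsertionOrder.add(item);
        index.put(item, newId);
        return newId;
    }

    // Keys in sorted order, as needed to emit the sorted offset list.
    Iterable<String> sortedItems() {
        return index.keySet();
    }
}
```

The design keeps both orders available at once: insertion order for the
per-item lengths, and sorted order (from the TreeMap's key iteration) for
the offsets, so the on-disk structure is unaffected by the lookup change.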



On Mon, Feb 17, 2014 at 8:55 AM, Michael Carey <[email protected]> wrote:

> How is the dictionary implemented - and moreover, how is it sized? Just
> curious.
>
>
> On 2/16/14 10:22 PM, Eldon Carman wrote:
>
>> I completed an alternate method for creating our XML documents in VXQuery.
>> The data is stored in a buffer, with the structure stored in a set of
>> integer arrays. Once the document has been completely parsed, the
>> information is copied over to the output format, so the information is
>> copied only once. The new process cuts out the copies for each layer of
>> child nodes. Still, the new single-copy process is slightly slower than
>> the current method for our test tree. First off, the tree is fairly
>> shallow. In addition, the write copies are about the same for these files.
>>
>> The parser has two types of writes: write-int and write-bytes. Write-int
>> adds new data, while write-bytes copies existing data. The attached
>> profiler views show how each query spends its time in the process.
>> Most of the time goes to writing new data, or to searching the dictionary
>> for existing node names or namespace entries.
>>
>> The profiles seem to show different time measurements between the two
>> tests, even though the queries actually take a similar amount of time to
>> run. Any ideas why? The files should be read for the percentages of time
>> spent, not absolute times.
>>
>> https://www.dropbox.com/s/ta4emog7z9fxzmc/new_single_copy.png
>> https://www.dropbox.com/s/m0pc8q7lt3rih55/child_copy.png
>>
>>
>
