The dictionary's basic structure includes the number of items, each item's length in sequential order, a sorted list of item offsets, and the data for each item.
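A minimal Java sketch of such a dictionary's lookup path, using a TreeMap in place of a manual binary search. All class and method names here are illustrative, not VXQuery's actual code:

```java
import java.util.TreeMap;

// Hypothetical sketch: items are assigned ids in the order a lookup for
// them first fails, and a TreeMap (sorted map) serves repeat lookups.
public class StringDictionary {
    private final TreeMap<String, Integer> index = new TreeMap<>();

    // Returns the item's id, inserting it on a miss (mirrors the
    // "write on lookup failure" behavior described here).
    public int lookup(String item) {
        Integer id = index.get(item);   // O(log n) sorted-map lookup
        if (id == null) {
            id = index.size();          // next sequential id
            index.put(item, id);
        }
        return id;
    }

    public int itemCount() {
        return index.size();
    }

    public static void main(String[] args) {
        StringDictionary dict = new StringDictionary();
        int a = dict.lookup("ns:book");
        int b = dict.lookup("ns:title");
        int c = dict.lookup("ns:book"); // repeat lookup hits the map
        System.out.println(a + " " + b + " " + c); // prints "0 1 0"
    }
}
```

The TreeMap keeps its keys sorted, so the final on-disk structure (sorted offsets plus data) can still be emitted unchanged at the end.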
Dictionary
    int    itemCount
    int[]  itemLengths
    int[]  sortedItemOffsets
    byte[] data

The basic implementation writes an item the first time a lookup for it fails, and each item lookup performed a binary search over the sorted offsets. This morning I switched the binary search out for a sorted map (Java's TreeMap). Now each record gets added to the TreeMap, which provides fast (logarithmic) lookups. The final structure does not change, but the lookups are much faster. Here are the new numbers for the two filter queries:

Query q00 (500mb)
---------------
2m11.937s  Saxon
3m28.708s  VXQuery - 1 partition
1m47.574s  VXQuery - 2 partitions
1m17.049s  VXQuery - 4 partitions

Query q01 (500mb)
---------------
2m07.096s  Saxon
3m25.250s  VXQuery - 1 partition
1m45.465s  VXQuery - 2 partitions
1m10.984s  VXQuery - 4 partitions

On Mon, Feb 17, 2014 at 8:55 AM, Michael Carey <[email protected]> wrote:
> How is the dictionary implemented - and moreover, how is it sized? Just
> curious.
>
>
> On 2/16/14 10:22 PM, Eldon Carman wrote:
>
>> I completed an alternate method for creating our XML documents in VXQuery.
>> The data is stored in a buffer, with the structure stored in a set of
>> integer arrays. Once the document has been completely parsed, the
>> information is copied over to the output format. The process copies the
>> information only once; it cuts out the copies previously made for each
>> layer of child nodes. The new single-copy process is slightly slower than
>> the current method for our test tree: the tree is fairly shallow, and the
>> write copies are about the same for these files.
>>
>> The parser has two types of writes: write int and write bytes. Write int
>> adds new data, while write bytes copies existing data. The attached
>> profiler views show how each query spends its time in the process. Most of
>> the time is spent writing new data or searching the dictionary for
>> existing node names and namespace entries.
>>
>> The profiles seem to show different time measurements between the two
>> tests, even though the queries actually take a similar amount of time to
>> run. Any ideas why? The files should be compared by the percentage of
>> time spent in each step.
>>
>> https://www.dropbox.com/s/ta4emog7z9fxzmc/new_single_copy.png
>> https://www.dropbox.com/s/m0pc8q7lt3rih55/child_copy.png
>>
>>
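The single-copy build described in the quoted message might be sketched roughly as follows. This is only an illustrative sketch under my own naming, not VXQuery's actual classes: node bytes accumulate in one staging buffer while the structure lives in an int array, and everything reaches the output with exactly one copy at the end.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical single-copy builder: stage all node data once, then copy
// to the output format a single time when parsing finishes.
public class SingleCopyBuilder {
    private final ByteArrayOutputStream data = new ByteArrayOutputStream();
    private int[] offsets = new int[16]; // structure kept in int arrays
    private int nodeCount = 0;

    // "write bytes": append an existing node's bytes to the staging buffer.
    public void writeBytes(byte[] bytes) {
        if (nodeCount == offsets.length) { // grow the structure array
            int[] bigger = new int[offsets.length * 2];
            System.arraycopy(offsets, 0, bigger, 0, offsets.length);
            offsets = bigger;
        }
        offsets[nodeCount++] = data.size(); // record where this node starts
        data.write(bytes, 0, bytes.length);
    }

    // Once the document is completely parsed, copy the staged data to the
    // output exactly once (the "single copy").
    public byte[] finishDocument() {
        return data.toByteArray();
    }

    public static void main(String[] args) {
        SingleCopyBuilder b = new SingleCopyBuilder();
        b.writeBytes("<book>".getBytes());
        b.writeBytes("<title/>".getBytes());
        byte[] doc = b.finishDocument();
        System.out.println(doc.length + " bytes, " + b.nodeCount + " nodes");
    }
}
```

The point of the design is that child nodes are never re-copied per tree level; for a shallow tree the saving is small, which matches the observation above that the two methods perform about the same.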
