Hi Preston, your comments are a little too brief for me to understand what's happening by just reading the diffs :) Can I look at the whole implementation in your branch? Which one is it?
Thanks,
Till

On Mon, Feb 17, 2014 at 10:24 AM, Eldon Carman <[email protected]> wrote:

> sortedItemOffsets is a list of item indexes in their sorted order.
>
> Also a note on the information stored: all the data items are strings.
> Each string holds the local name, prefix code, or URI for an XML element
> or attribute.
>
> On Mon, Feb 17, 2014 at 9:56 AM, Eldon Carman <[email protected]> wrote:
>
>> The dictionary's basic structure includes the number of items, each
>> item's length in sequential order, a sorted list of item offsets, and
>> the data for each item.
>>
>> Dictionary
>>   int itemCount
>>   int[] itemLengths
>>   int[] sortedItemOffsets
>>   byte[] data
>>
>> The basic implementation appends each item in the order its lookup was
>> first requested and missed. Each lookup performed a binary search over
>> the sorted item offsets.
>> This morning I switched the binary search out for a sorted map (Java's
>> TreeMap). Now each record gets added to the TreeMap, which gives a
>> logarithmic-time lookup. The final structure does not change, but the
>> lookups are much faster. Here are the new numbers for the two filter
>> queries.
>>
>> Query q00 (500 MB)
>> ------------------
>> 2m11.937s  Saxon
>> 3m28.708s  VXQuery - 1 partition
>> 1m47.574s  VXQuery - 2 partitions
>> 1m17.049s  VXQuery - 4 partitions
>>
>> Query q01 (500 MB)
>> ------------------
>> 2m07.096s  Saxon
>> 3m25.250s  VXQuery - 1 partition
>> 1m45.465s  VXQuery - 2 partitions
>> 1m10.984s  VXQuery - 4 partitions
>>
>> On Mon, Feb 17, 2014 at 8:55 AM, Michael Carey <[email protected]> wrote:
>>
>>> How is the dictionary implemented - and moreover, how is it sized?
>>> Just curious.
>>>
>>> On 2/16/14 10:22 PM, Eldon Carman wrote:
>>>
>>>> I completed an alternate method for creating our XML documents in
>>>> VXQuery. The data is stored in a buffer, with the structure stored in
>>>> a set of integer arrays. Once the document has been completely parsed,
>>>> the information is copied over to the output format, so it is copied
>>>> only once. The new process cuts out the copies made for each layer of
>>>> child nodes. Still, the new single-copy process is slightly slower
>>>> than the current method for our test tree: the tree is fairly shallow,
>>>> and the write copies are about the same for these files.
>>>>
>>>> The parser has two types of writes, write int and write bytes. Write
>>>> int adds new data, while write bytes copies existing data. The
>>>> attached profiler views show how each query spends its time. Most of
>>>> the time goes to writing new data, or to searching the dictionary for
>>>> existing node names or namespace entries.
>>>>
>>>> The profiles seem to show different time measurements for the two
>>>> tests even though the queries actually take a similar amount of time
>>>> to run. Any ideas why? The files should be read for the percentages of
>>>> time spent, not the absolute times.
>>>>
>>>> https://www.dropbox.com/s/ta4emog7z9fxzmc/new_single_copy.png
>>>> https://www.dropbox.com/s/m0pc8q7lt3rih55/child_copy.png
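For concreteness, here is a minimal sketch of the dictionary layout Eldon describes above, with the original binary-search lookup. The class and method names are illustrative, not VXQuery's actual API, and the offset-to-length map is an assumed detail (the post only says lengths are stored in insertion order):

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Illustrative layout: lengths in insertion order, offsets in sorted
// item order, and all item bytes packed back to back in one array.
final class StringDictionary {
    final int itemCount;
    final int[] itemLengths;        // item lengths, in insertion order
    final int[] sortedItemOffsets;  // offsets into data, in sorted item order
    final byte[] data;              // all item bytes, back to back

    // Derived helper (an assumption of this sketch): length of the item
    // starting at a given offset, built from prefix sums of itemLengths.
    private final Map<Integer, Integer> lengthAtOffset = new HashMap<>();

    StringDictionary(int itemCount, int[] itemLengths,
                     int[] sortedItemOffsets, byte[] data) {
        this.itemCount = itemCount;
        this.itemLengths = itemLengths;
        this.sortedItemOffsets = sortedItemOffsets;
        this.data = data;
        int off = 0;
        for (int len : itemLengths) {
            lengthAtOffset.put(off, len);
            off += len;
        }
    }

    // Binary search over the sorted offsets; returns the item's position
    // in sorted order, or -1 when the item is absent.
    int lookup(String key) {
        byte[] target = key.getBytes(StandardCharsets.UTF_8);
        int lo = 0, hi = itemCount - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = compareAt(sortedItemOffsets[mid], target);
            if (cmp == 0) {
                return mid;
            } else if (cmp < 0) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return -1;
    }

    // Unsigned byte-wise comparison of a stored item against the target.
    private int compareAt(int offset, byte[] target) {
        int len = lengthAtOffset.get(offset);
        int n = Math.min(len, target.length);
        for (int i = 0; i < n; i++) {
            int c = (data[offset + i] & 0xFF) - (target[i] & 0xFF);
            if (c != 0) {
                return c;
            }
        }
        return len - target.length;
    }
}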
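And a sketch of the build side after the TreeMap change, again with assumed names. A miss appends the item's bytes and records the offset; hits come back in O(log n). Because a TreeMap iterates in key order, the sorted offset list needed for the final structure falls out for free at serialization time, which is why the on-disk format does not change:

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Illustrative build-time dictionary: TreeMap for lookups, with items
// appended to the data buffer in first-miss order.
final class DictionaryBuilder {
    private final TreeMap<String, Integer> offsets = new TreeMap<>();
    private final ByteArrayOutputStream data = new ByteArrayOutputStream();
    private final List<Integer> itemLengths = new ArrayList<>();

    // Returns the item's offset into the data buffer, adding it on a miss.
    int lookupOrAdd(String item) {
        Integer offset = offsets.get(item);
        if (offset != null) {
            return offset;
        }
        byte[] bytes = item.getBytes(StandardCharsets.UTF_8);
        int newOffset = data.size();
        data.write(bytes, 0, bytes.length);
        itemLengths.add(bytes.length);   // lengths stay in insertion order
        offsets.put(item, newOffset);
        return newOffset;
    }

    // Walking the TreeMap's values in key order yields sortedItemOffsets.
    int[] sortedItemOffsets() {
        int[] out = new int[offsets.size()];
        int i = 0;
        for (int off : offsets.values()) {
            out[i++] = off;
        }
        return out;
    }
}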
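Finally, a sketch of the two write paths in the single-copy document builder from the 2/16 message. writeInt and writeBytes are assumed names for the two operations described (adding new data versus copying existing data); the point is that everything is staged once and handed to the output format only after the whole document is parsed:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Illustrative single-copy builder: one staging buffer, one final copy.
final class SingleCopyBuilder {
    private final ByteArrayOutputStream staging = new ByteArrayOutputStream();
    private final DataOutputStream out = new DataOutputStream(staging);

    // New data: counts, offsets, dictionary codes.
    void writeInt(int value) throws IOException {
        out.writeInt(value);
    }

    // Existing data copied into place, e.g. parsed text content.
    void writeBytes(byte[] src, int offset, int length) throws IOException {
        out.write(src, offset, length);
    }

    // The single copy: the staged bytes move to the output format once
    // the document is fully parsed.
    byte[] finish() throws IOException {
        out.flush();
        return staging.toByteArray();
    }
}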
