Hi Till,

I just updated all my branches. "prestonc/benchmarks" has been removed, since all the good code from that branch is now in "prestonc/benchmarks_staging". All the changes ready for master are on the "prestonc/benchmarks_staging" branch. I rebased those changes onto "prestonc/parser", where I wrote an alternate SaxContentHandler to reduce the number of copies of data. I am not sure which specific change you would like to review: the change for the dictionary is on both branches, and the alternate SaxContentHandler is only on the parser branch.
Preston

Overview
"prestonc/benchmarks_staging" -- All changes ready for master.
"prestonc/parser" -- All changes ready for master and an alternate SaxContentHandler

On Mon, Feb 17, 2014 at 5:20 PM, Till Westmann <[email protected]> wrote:
> Hi Preston,
>
> your comments are a little too brief for me to understand what's happening
> by just reading the diffs :)
> Can I look at the whole implementation in your branch? Which one is it?
>
> Thanks,
> Till
>
> On Mon, Feb 17, 2014 at 10:24 AM, Eldon Carman <[email protected]> wrote:
>
> > sortedItemOffsets is a list of item indexes in their sorted order.
> >
> > Also a note on the information stored: all the data items are strings.
> > Each string holds the local name, prefix code, or URI for each XML
> > element or attribute.
> >
> > On Mon, Feb 17, 2014 at 9:56 AM, Eldon Carman <[email protected]> wrote:
> >
> > > The dictionary's basic structure includes the number of items, each
> > > item's length in sequential order, a sorted list of item offsets, and
> > > the data for each item:
> > >
> > > Dictionary
> > > int itemCount
> > > int[] itemLengths
> > > int[] sortedItemOffsets
> > > byte[] data
> > >
> > > The basic implementation writes the list in the order a lookup was
> > > requested and the result was not found. Each item lookup performed a
> > > binary search over the sorted results.
> > > This morning I switched the binary search out for a sorted map (Java
> > > TreeMap). Now each record gets added to the TreeMap, which has a fast
> > > logarithmic lookup. The final structure does not change, but the
> > > lookups are much faster. Here are the new numbers for the two filter
> > > queries.
> > >
> > > Query q00 (500mb)
> > > ---------------
> > > 2m11.937s Saxon
> > > 3m28.708s VXQuery - 1 partition
> > > 1m47.574s VXQuery - 2 partitions
> > > 1m17.049s VXQuery - 4 partitions
> > >
> > > Query q01 (500mb)
> > > ---------------
> > > 2m07.096s Saxon
> > > 3m25.250s VXQuery - 1 partition
> > > 1m45.465s VXQuery - 2 partitions
> > > 1m10.984s VXQuery - 4 partitions
> > >
> > > On Mon, Feb 17, 2014 at 8:55 AM, Michael Carey <[email protected]> wrote:
> > >
> > >> How is the dictionary implemented - and moreover, how is it sized?
> > >> Just curious.
> > >>
> > >> On 2/16/14 10:22 PM, Eldon Carman wrote:
> > >>
> > >>> I completed an alternate method for creating our XML documents in
> > >>> VXQuery. The data is stored on a buffer, with the structure stored
> > >>> on a set of integer arrays. Once the document has been completely
> > >>> parsed, the information is copied over to the output format, so the
> > >>> information is copied only once. The new process cuts out the copies
> > >>> for each layer of child nodes. The new single-copy process is
> > >>> slightly slower than the current method for our test tree. First,
> > >>> the tree is fairly shallow. In addition, the write copies are about
> > >>> the same for these files.
> > >>>
> > >>> The parser has two types of writes: write int and write bytes. The
> > >>> write int is adding new data, while write bytes is copying existing
> > >>> data. The attached profiler views show how each query spends its
> > >>> time in the process. Most of the time is spent writing new data, or
> > >>> searching the dictionary for existing node names or namespace
> > >>> entries.
> > >>>
> > >>> The profiler seems to show different time measurements between the
> > >>> two tests, while the queries actually take a similar amount of time
> > >>> to run. Any ideas why? The files should be looked at for the
> > >>> percentages of time spent.
> > >>> > > >>> https://www.dropbox.com/s/ta4emog7z9fxzmc/new_single_copy.png > > >>> https://www.dropbox.com/s/m0pc8q7lt3rih55/child_copy.png > > >>> > > >>> > > >> > > > > > >
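The dictionary layout and TreeMap-based lookup described in the thread can be sketched as follows. This is a minimal, hypothetical sketch: the class and method names (`StringDictionary`, `lookupOrInsert`) are illustrative, not VXQuery's actual code, and note that a `TreeMap` lookup is O(log n) rather than constant time.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch of the dictionary described above.
public class StringDictionary {
    // Sorted map from item to its code; replaces the binary search over
    // sortedItemOffsets for in-memory lookups (O(log n), not constant).
    private final TreeMap<String, Integer> index = new TreeMap<>();
    // Items in the order their lookup first missed (insertion order).
    private final List<String> items = new ArrayList<>();

    // Return the code for s, adding it on the first miss.
    public int lookupOrInsert(String s) {
        Integer code = index.get(s);
        if (code == null) {
            code = items.size();
            index.put(s, code);
            items.add(s);
        }
        return code;
    }

    // Serialize to the layout from the thread:
    // int itemCount | int[] itemLengths | int[] sortedItemOffsets | byte[] data
    public byte[] serialize() {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            int n = items.size();
            byte[][] bytes = new byte[n][];
            int[] offsets = new int[n];
            for (int i = 0, off = 0; i < n; i++) {
                bytes[i] = items.get(i).getBytes(StandardCharsets.UTF_8);
                offsets[i] = off;
                off += bytes[i].length;
            }
            out.writeInt(n);                                // itemCount
            for (byte[] b : bytes) out.writeInt(b.length);  // itemLengths
            for (int code : index.values())                 // sortedItemOffsets,
                out.writeInt(offsets[code]);                // in sorted-item order
            for (byte[] b : bytes) out.write(b);            // data
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);              // cannot happen in memory
        }
    }

    public static void main(String[] args) {
        StringDictionary d = new StringDictionary();
        System.out.println(d.lookupOrInsert("book"));    // 0
        System.out.println(d.lookupOrInsert("author"));  // 1
        System.out.println(d.lookupOrInsert("book"));    // 0, already present
        // 4 (count) + 2*4 (lengths) + 2*4 (offsets) + 10 (data) = 30 bytes
        System.out.println(d.serialize().length);
    }
}
```

Because `index.values()` iterates in key order, the `sortedItemOffsets` section comes out in sorted-item order while the data section stays in insertion order, matching the structure described above.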
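The single-copy parse idea might be sketched like this. Again purely illustrative (the real implementation is the alternate SaxContentHandler on the prestonc/parser branch; `SingleCopyBuilder` and its methods are invented names): node bytes are appended to one growing buffer during the parse, structure is tracked in integer lists, and the finished document is copied to the output format exactly once, instead of once per layer of child nodes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the single-copy build described above.
public class SingleCopyBuilder {
    private final ByteArrayOutputStream data = new ByteArrayOutputStream();
    private final List<Integer> nodeStart = new ArrayList<>();   // offset of each node
    private final List<Integer> nodeLength = new ArrayList<>();  // length of each node

    // "write bytes" path: record one node's content in place; parents are
    // never re-copied as children arrive.
    public int addNode(String content) {
        byte[] b = content.getBytes(StandardCharsets.UTF_8);
        nodeStart.add(data.size());
        nodeLength.add(b.length);
        data.write(b, 0, b.length);
        return nodeStart.size() - 1;   // node id
    }

    public int start(int node)  { return nodeStart.get(node); }
    public int length(int node) { return nodeLength.get(node); }

    // The single copy: hand the whole buffer to the output format at
    // end of document.
    public byte[] toOutput() {
        return data.toByteArray();
    }

    public static void main(String[] args) {
        SingleCopyBuilder b = new SingleCopyBuilder();
        System.out.println(b.addNode("<book>"));   // node 0
        System.out.println(b.addNode("<author>")); // node 1
        System.out.println(b.start(1));            // 6: starts right after "<book>"
        System.out.println(b.toOutput().length);   // 14 bytes total
    }
}
```

This also illustrates why the gain is small for a shallow tree: the per-layer copies that this scheme avoids only dominate when documents are deeply nested.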
