Hi Preston,

Your comments are a little too brief for me to understand what's happening
just by reading the diffs :)
Can I look at the whole implementation in your branch? Which one is it?

Thanks,
Till



On Mon, Feb 17, 2014 at 10:24 AM, Eldon Carman <[email protected]> wrote:

> sortedItemOffsets is a list of item indexes in their sorted order.
>
> Also a note on the information stored: all the data items are strings.
> Each string holds the local name, prefix code, or URI for an XML element
> or attribute.
>
>
> On Mon, Feb 17, 2014 at 9:56 AM, Eldon Carman <[email protected]> wrote:
>
> > The dictionary's basic structure includes the number of items, each
> > item's length in sequential order, a sorted list of item offsets, and
> > the data for each item.
> >
> > Dictionary
> > int itemCount
> > int[] itemLengths
> > int[] sortedItemOffsets
> > byte[] data
> >
> > The basic implementation appends items to the list in the order their
> > lookups were first requested and missed. Each item lookup performed a
> > binary search over the sorted results.
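> >
> > For concreteness, here is a minimal sketch of how that could look in
> > Java. The class and method names are mine, not the actual VXQuery
> > code, and I am assuming items are compared as raw unsigned bytes:
> >
> > class Dictionary {
> >     int itemCount;
> >     int[] itemLengths;       // item lengths, in insertion order
> >     int[] sortedItemOffsets; // item indexes, in sorted order
> >     byte[] data;             // item bytes, concatenated in insertion order
> >
> >     // Byte offset of item i in data: the sum of the earlier lengths.
> >     private int startOf(int i) {
> >         int start = 0;
> >         for (int k = 0; k < i; k++) start += itemLengths[k];
> >         return start;
> >     }
> >
> >     // Binary search over the sorted index list; returns the item's
> >     // index on a hit, or -1 on a miss.
> >     int lookup(byte[] key) {
> >         int lo = 0, hi = itemCount - 1;
> >         while (lo <= hi) {
> >             int mid = (lo + hi) >>> 1;
> >             int item = sortedItemOffsets[mid];
> >             int cmp = compare(startOf(item), itemLengths[item], key);
> >             if (cmp < 0) lo = mid + 1;
> >             else if (cmp > 0) hi = mid - 1;
> >             else return item;
> >         }
> >         return -1;
> >     }
> >
> >     // Lexicographic compare of the stored item against the key.
> >     private int compare(int start, int len, byte[] key) {
> >         int n = Math.min(len, key.length);
> >         for (int k = 0; k < n; k++) {
> >             int d = (data[start + k] & 0xFF) - (key[k] & 0xFF);
> >             if (d != 0) return d;
> >         }
> >         return len - key.length;
> >     }
> > }
> >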
> > This morning I switched the binary search out for a sorted map (Java's
> > TreeMap). Now each record gets added to the TreeMap, which keeps each
> > lookup fast (logarithmic in the number of items). The final structure
> > does not change, but the lookups are much faster. Here are the new
> > numbers for the two filter queries.
> >
> > Query q00 (500 MB)
> > ---------------
> > 2m11.937s Saxon
> > 3m28.708s VXQuery - 1 partition
> > 1m47.574s VXQuery - 2 partitions
> > 1m17.049s VXQuery - 4 partitions
> >
> > Query q01 (500 MB)
> > ---------------
> > 2m07.096s Saxon
> > 3m25.250s VXQuery - 1 partition
> > 1m45.465s VXQuery - 2 partitions
> > 1m10.984s VXQuery - 4 partitions
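> >
> > For reference, a rough sketch of the TreeMap-based variant. The
> > String keys and all the names are my assumptions, not the real code:
> >
> > import java.util.Map;
> > import java.util.TreeMap;
> >
> > class TreeMapDictionaryBuilder {
> >     // A TreeMap keeps its keys sorted, so sortedItemOffsets can be
> >     // read off directly once the build is done.
> >     private final Map<String, Integer> items = new TreeMap<>();
> >
> >     // Returns the item's index, adding the item on a miss, so
> >     // indexes still reflect the order in which lookups first missed.
> >     int lookupOrInsert(String item) {
> >         Integer index = items.get(item); // O(log n) in a TreeMap
> >         if (index == null) {
> >             index = items.size();
> >             items.put(item, index);
> >         }
> >         return index;
> >     }
> >
> >     // Item indexes in sorted key order, i.e. sortedItemOffsets.
> >     int[] sortedItemOffsets() {
> >         int[] sorted = new int[items.size()];
> >         int i = 0;
> >         for (int index : items.values()) sorted[i++] = index;
> >         return sorted;
> >     }
> > }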
> >
> >
> >
> > On Mon, Feb 17, 2014 at 8:55 AM, Michael Carey <[email protected]> wrote:
> >
> >> How is the dictionary implemented - and moreover, how is it sized? Just
> >> curious.
> >>
> >>
> >> On 2/16/14 10:22 PM, Eldon Carman wrote:
> >>
> >>> I completed an alternate method for creating our XML documents in
> >>> VXQuery. The data is stored in a buffer, with the structure kept in a
> >>> set of integer arrays. Once the document has been completely parsed,
> >>> the information is copied over to the output format, so the
> >>> information is only copied once. The new process cuts out the copies
> >>> for each layer of child nodes. Even so, the new single-copy process is
> >>> slightly slower than the current method for our test tree: first, the
> >>> tree is fairly shallow, and in addition the write copies are about the
> >>> same for these files.
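> >>>
> >>> In rough Java terms the idea is something like the sketch below; the
> >>> names and the output interface are my assumptions, not the real code:
> >>>
> >>> import java.io.ByteArrayOutputStream;
> >>> import java.io.DataOutput;
> >>> import java.io.IOException;
> >>> import java.util.Arrays;
> >>>
> >>> class SingleCopyBuilder {
> >>>     // Raw node data, appended as the document is parsed.
> >>>     private final ByteArrayOutputStream buffer =
> >>>             new ByteArrayOutputStream();
> >>>     // Tree structure kept on the side in integer arrays.
> >>>     private int[] nodeOffsets = new int[16];
> >>>     private int nodeCount;
> >>>
> >>>     void addNode(byte[] content) {
> >>>         if (nodeCount == nodeOffsets.length)
> >>>             nodeOffsets = Arrays.copyOf(nodeOffsets, nodeCount * 2);
> >>>         nodeOffsets[nodeCount++] = buffer.size();
> >>>         buffer.write(content, 0, content.length);
> >>>     }
> >>>
> >>>     // After the whole document has been parsed, the data is copied
> >>>     // to the output format exactly once.
> >>>     void flushTo(DataOutput out) throws IOException {
> >>>         out.writeInt(nodeCount);
> >>>         for (int i = 0; i < nodeCount; i++)
> >>>             out.writeInt(nodeOffsets[i]);
> >>>         out.write(buffer.toByteArray());
> >>>     }
> >>> }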
> >>>
> >>> The parser has two types of writes: write int and write bytes. Write
> >>> int adds new data, while write bytes copies existing data. The
> >>> attached profiler views show how each query spends its time in the
> >>> process. Most of the time is spent writing new data or searching the
> >>> dictionary for existing node names and namespace entries.
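> >>>
> >>> A minimal sketch of the two primitives; the names and the growth
> >>> policy are mine:
> >>>
> >>> class GrowableBuffer {
> >>>     private byte[] bytes = new byte[64];
> >>>     private int size;
> >>>
> >>>     private void ensure(int extra) {
> >>>         if (size + extra > bytes.length)
> >>>             bytes = java.util.Arrays.copyOf(bytes,
> >>>                     Math.max(bytes.length * 2, size + extra));
> >>>     }
> >>>
> >>>     // "Write int": adds new data, encoding the value big-endian.
> >>>     void writeInt(int v) {
> >>>         ensure(4);
> >>>         bytes[size++] = (byte) (v >>> 24);
> >>>         bytes[size++] = (byte) (v >>> 16);
> >>>         bytes[size++] = (byte) (v >>> 8);
> >>>         bytes[size++] = (byte) v;
> >>>     }
> >>>
> >>>     // "Write bytes": copies existing data into the buffer.
> >>>     void writeBytes(byte[] src, int off, int len) {
> >>>         ensure(len);
> >>>         System.arraycopy(src, off, bytes, size, len);
> >>>         size += len;
> >>>     }
> >>> }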
> >>>
> >>> The profiles seem to show different time measurements between the two
> >>> tests, while the queries actually take a similar amount of time to
> >>> run. Any ideas why? The files should be read for the percentages of
> >>> time spent.
> >>>
> >>> https://www.dropbox.com/s/ta4emog7z9fxzmc/new_single_copy.png
> >>> https://www.dropbox.com/s/m0pc8q7lt3rih55/child_copy.png
> >>>
> >>>
> >>
> >
>
