sortedItemOffsets is a list of the item indexes in sorted order. A note on the information stored: all of the data items are strings, and each string holds the local name, prefix code, or URI of an XML element or attribute.
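For illustration, here is a rough sketch of a lookup against the final structure described below. This is not the actual VXQuery code: the class and method names are made up, the sketch treats the entries of sortedItemOffsets as byte offsets into data, and for simplicity it length-prefixes each entry in data instead of going through itemLengths.

    import java.nio.charset.StandardCharsets;

    public class StringDictionary {
        private final int itemCount;           // number of distinct strings
        private final int[] sortedItemOffsets; // offsets into data, in sorted order
        private final byte[] data;             // all item bytes, concatenated

        public StringDictionary(int itemCount, int[] sortedItemOffsets,
                byte[] data) {
            this.itemCount = itemCount;
            this.sortedItemOffsets = sortedItemOffsets;
            this.data = data;
        }

        // Binary search over the sorted offsets; each probe compares the
        // key against the string stored at that offset.
        public int lookup(String key) {
            byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
            int lo = 0;
            int hi = itemCount - 1;
            while (lo <= hi) {
                int mid = (lo + hi) >>> 1;
                int cmp = compareAt(sortedItemOffsets[mid], keyBytes);
                if (cmp < 0) {
                    lo = mid + 1;
                } else if (cmp > 0) {
                    hi = mid - 1;
                } else {
                    return mid; // found: position in sorted order
                }
            }
            return -1; // not in the dictionary
        }

        // Unsigned byte-wise comparison of the length-prefixed entry at
        // 'offset' against the key bytes.
        private int compareAt(int offset, byte[] keyBytes) {
            int len = ((data[offset] & 0xFF) << 8) | (data[offset + 1] & 0xFF);
            int start = offset + 2;
            int n = Math.min(len, keyBytes.length);
            for (int i = 0; i < n; i++) {
                int d = (data[start + i] & 0xFF) - (keyBytes[i] & 0xFF);
                if (d != 0) {
                    return d;
                }
            }
            return len - keyBytes.length;
        }
    }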
On Mon, Feb 17, 2014 at 9:56 AM, Eldon Carman <[email protected]> wrote:

> The dictionary's basic structure includes the number of items, each
> item's length in sequential order, a sorted list of item offsets, and
> the data for each item.
>
> Dictionary
>   int itemCount
>   int[] itemLengths
>   int[] sortedItemOffsets
>   byte[] data
>
> The basic implementation appends an item to the list the first time a
> lookup for it misses. Each lookup performs a binary search over the
> sorted entries.
>
> This morning I switched the binary search out for a sorted map (Java
> TreeMap). Now each record gets added to the TreeMap, which gives
> logarithmic-time lookups and inserts. The final structure does not
> change, but the lookups are much faster. Here are the new numbers for
> the two filter queries.
>
> Query q00 (500mb)
> ---------------
> 2m11.937s Saxon
> 3m28.708s VXQuery - 1 partition
> 1m47.574s VXQuery - 2 partitions
> 1m17.049s VXQuery - 4 partitions
>
> Query q01 (500mb)
> ---------------
> 2m07.096s Saxon
> 3m25.250s VXQuery - 1 partition
> 1m45.465s VXQuery - 2 partitions
> 1m10.984s VXQuery - 4 partitions
>
>
> On Mon, Feb 17, 2014 at 8:55 AM, Michael Carey <[email protected]> wrote:
>
>> How is the dictionary implemented - and moreover, how is it sized?
>> Just curious.
>>
>>
>> On 2/16/14 10:22 PM, Eldon Carman wrote:
>>
>>> I completed an alternate method for creating our XML documents in
>>> VXQuery. The data is stored in a buffer, with the structure kept in
>>> a set of integer arrays. Once the document has been completely
>>> parsed, the information is copied over to the output format, so the
>>> data is copied only once. The new process cuts out the copies made
>>> for each layer of child nodes. Even so, the new single-copy process
>>> is slightly slower than the current method for our test tree: the
>>> tree is fairly shallow, and the write copies are about the same for
>>> these files.
>>>
>>> The parser has two types of writes: write int and write bytes.
>>> Write int adds new data, while write bytes copies existing data.
>>> The attached profiler views show how each query spends its time in
>>> the process. Most of the time goes to writing new data or to
>>> searching the dictionary for existing node names and namespace
>>> entries.
>>>
>>> The profiles seem to show different time measurements for the two
>>> tests, even though both runs actually take a similar amount of time.
>>> Any ideas why? When reading the files, focus on the percentages of
>>> time spent.
>>>
>>> https://www.dropbox.com/s/ta4emog7z9fxzmc/new_single_copy.png
>>> https://www.dropbox.com/s/m0pc8q7lt3rih55/child_copy.png
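P.S. For anyone skimming the thread, here is a minimal sketch of the build-time TreeMap lookup described above. Again, the names are illustrative, not the actual VXQuery code. The useful property of TreeMap is that it iterates in key order, so the sorted offset list for the final structure falls out of a single in-order pass.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.TreeMap;

    public class DictionaryBuilder {
        private final TreeMap<String, Integer> index = new TreeMap<>();
        private final List<String> items = new ArrayList<>(); // insertion order

        // Returns the id for the string, appending it on a miss. Both the
        // hit and the miss path cost O(log n) in a TreeMap.
        public int lookupOrInsert(String value) {
            Integer id = index.get(value);
            if (id == null) {
                id = items.size();
                index.put(value, id);
                items.add(value);
            }
            return id;
        }

        // In-order traversal of the TreeMap yields the ids in sorted key
        // order, which is the basis for sortedItemOffsets in the final
        // structure.
        public int[] sortedIds() {
            int[] sorted = new int[items.size()];
            int i = 0;
            for (int id : index.values()) {
                sorted[i++] = id;
            }
            return sorted;
        }
    }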
