Stephen Allen wrote:
> Hi Paolo,
> 
> I was thinking along the same lines in terms of unifying the patches.  
> Designing an interface inspired by Pig's DataBags seems to make sense.  We 
> would need three bag implementations (unordered, sorted, and distinct).  A good 
> starting point would be to have each bag proactively spill when thresholds 
> are passed rather than an external memory manager.  Pig does in fact have 
> these (InternalCachedBag, InternalSortedBag, and InternalDistinctBag [1]).
> 
> I'm not completely happy with the implementation I have for JENA-45.  I'd 
> like to redesign it a bit, as well as unify JENA-44.  The design will be 
> guided by Pig, but with some simplifications.
> 
> 1) Create implementations for the 3 bag types
> 2) Bags will be generic and accept serializer objects (to handle the 
> different tuple types: Bindings, Triples, and Quads)
> 3) Bags will proactively spill (this greatly simplifies things because there 
> is no need to deal with synchronization)
> 4) Spill based on the estimated memory size of the tuples [2] instead of just 
> the cardinality (the serializer object can generate the estimate)
> 
> My plan is to try to work on it early next week.

Hi Stephen,
thanks for your response.

I think we should create a new JIRA issue (which JENA-44 and JENA-45 depend
on), just to work on these three types of bags (i.e. unsorted, sorted and
distinct), which will spill to disk once they reach a threshold.
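
To make the idea concrete, here is a rough sketch of the simplest of the
three (the unsorted bag), following points 2-4 of your list: it is generic,
takes a serializer object, and spills proactively once the estimated memory
size passes a threshold. All the names here (TupleSerializer,
StringSerializer, SpillingBag, readAll) are made up for illustration; this
is not existing Jena or Pig code, and the size estimate is a crude
assumption.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface: one implementation per tuple type
// (Bindings, Triples, Quads). Not an existing Jena or Pig API.
interface TupleSerializer<T> {
    void write(DataOutputStream out, T item) throws IOException;
    T read(DataInputStream in) throws IOException;
    // Cheap estimate of the in-memory size, without serializing the item.
    long estimateSize(T item);
}

// Stand-in serializer for plain strings, used only to exercise the bag.
class StringSerializer implements TupleSerializer<String> {
    public void write(DataOutputStream out, String s) throws IOException { out.writeUTF(s); }
    public String read(DataInputStream in) throws IOException { return in.readUTF(); }
    // Assumption: roughly two bytes per character, ignoring object overhead.
    public long estimateSize(String s) { return 2L * s.length(); }
}

// Unsorted bag that spills its in-memory buffer to a temp file as soon as
// the estimated size exceeds the threshold. Single-threaded on purpose:
// spilling from add() means no synchronization is needed.
class SpillingBag<T> {
    private final TupleSerializer<T> serializer;
    private final long thresholdBytes;
    private final List<T> memory = new ArrayList<>();
    private long memoryBytes = 0;
    private long spilledCount = 0;
    private File spillFile;

    SpillingBag(TupleSerializer<T> serializer, long thresholdBytes) {
        this.serializer = serializer;
        this.thresholdBytes = thresholdBytes;
    }

    void add(T item) {
        memory.add(item);
        memoryBytes += serializer.estimateSize(item);
        if (memoryBytes > thresholdBytes) spill();
    }

    // Append everything held in memory to the spill file, then clear the buffer.
    private void spill() {
        try {
            if (spillFile == null) {
                spillFile = File.createTempFile("bag", ".spill");
                spillFile.deleteOnExit();
            }
            try (DataOutputStream out = new DataOutputStream(
                    new BufferedOutputStream(new FileOutputStream(spillFile, true)))) {
                for (T item : memory) serializer.write(out, item);
            }
            spilledCount += memory.size();
            memory.clear();
            memoryBytes = 0;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    long spilledCount() { return spilledCount; }

    // Demo only: returns spilled items followed by in-memory items.
    // A real bag would expose a streaming iterator instead of a list.
    List<T> readAll() {
        List<T> all = new ArrayList<>();
        if (spillFile != null) {
            try (DataInputStream in = new DataInputStream(
                    new BufferedInputStream(new FileInputStream(spillFile)))) {
                for (long i = 0; i < spilledCount; i++) all.add(serializer.read(in));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        all.addAll(memory);
        return all;
    }
}
```

The sorted and distinct variants would differ mainly in what happens inside
spill() (sort or de-duplicate the buffer first) and in how the spilled data
is merged when reading back.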

Do we need sorted+distinct as well?

If we create a new issue we can work together on that, hopefully commit it
quickly, and then make progress on JENA-44 and JENA-45 in parallel and
independently.

Paolo

> 
> -Stephen
> 
> [1] http://pig.apache.org/docs/r0.9.0/api/org/apache/pig/data/DataBag.html
> [2] I plan to estimate the size of the bindings/triples/quads by examining 
> the values and retrieving their string lengths.  This will prevent costly 
> serialization until we actually have to spill.  We could instead get an 
> average tuple size based on say the first 100 items added to the bag.
> 
> 
> -----Original Message-----
> From: Paolo Castagna [mailto:[email protected]] 
> Sent: Friday, August 12, 2011 9:30 AM
> To: [email protected]
> Subject: Re: JENA-44, JENA-45 etc - common Binding I/O
> 
> Hi Andy
> first of all, apologies for the late reply on this.
> 
> Andy Seaborne wrote:
>>
>> On 23/06/11 17:07, Paolo Castagna wrote:
>>> Hi Andy,
>>> first of all, thanks for this.
>>>
>>> Re: JENA-44... what is blocking JENA-44 going into trunk is just the
>>> lack of a common way to serialize bindings. By the way, we are using a
>>> patched version of ARQ on some of our servers (with no problems, and with
>>> improvements in stability and RAM consumption, in particular when users
>>> submit queries which need to sort large result sets and then time out).
>>>
>>> So, all this is more than welcome from my point of view (i.e. one patch
>>> less to manage).
>> Have you looked at the DeferredFileQueue / ThresholdPolicy code in
>> JENA-45?  This is another area of commonality.
> 
> ThresholdPolicyCount can be used for JENA-44 as well.
> 
> Maybe the stuff in org.openjena.atlas.io.* and org.openjena.riot.* from
> JENA-45 can be committed so that it can be used for JENA-44 as well.
> 
> However, DeferredFileQueue does not currently provide any way to sort the
> items before spilling them to disk. So, we would need something similar
> but a DeferredSortingFileQueue. Do you agree?
> 
> What we do in ExternalBindingSort is buffer a certain number of bindings
> (by default 4000), sort them and write them to disk. Then we repeat with
> the next 4000.
> 
>> Any thoughts about
>> DataBag from Pig?  (JENA-44, comment 24/May/11, pt 3 - this might be
>> too much for this round).
> 
> Something similar to SortedDataBag is what's needed for JENA-44.
> DeferredFileQueue from JENA-45 is similar to DefaultAbstractBag.
> The biggest difference is that Pig uses a SpillableMemoryManager instead
> of fixed thresholds.
> 
> We could start committing JENA-45 and JENA-44 as they are (or with minimal
> changes) with fixed and sensible thresholds and configuration parameters.
> 
> Then we could discuss a more general memory manager system which would need
> to control when to spill to disk. But, I don't see this as a blocker for
> JENA-45 nor JENA-44.
> 
> The DataBag hierarchy from Pig is something we can be inspired by (i.e.
> copy ideas) but the code would need to be changed a lot to adapt to our
> needs.
> 
>> There are various settable parameters - what makes a difference?
>> especially writeBufferSize.
> 
> ExternalBindingSort.java has the following settable parameters:
> 
>   externalSortBufferSize (default value is 4000)
>   externalSortWorkers (default value is 1)
>   externalSortDir (defaults to the directory specified by java.io.tmpdir)
> 
> writeBufferSize is set to 10MB.
> Maybe we should make that configurable as well.
> 
> Also, 10MB is perhaps too high with an externalSortBufferSize of only 4000 
> bindings.
> 
> The aim of having all these parameters configurable via ARQ's symbols is to
> allow people to make experiments and find the optimal configuration for their
> systems.
> 
> I seem to remember a spreadsheet with a few experiments but it could be
> something unrelated to these parameters. In any case I can add a sort of
> micro-benchmark for this to the src-dev area as part of JENA-44.
> 
>> I didn't notice how cancellation would stop executors, only clear up
>> afterwards.  What about a volatile flag?
> 
> Right. Once executors start they run to completion. However, we create a new
> executor every externalSortBufferSize (by default 4000) bindings and only if
> the iterator has not been canceled.
> 
> Yes, we can add a flag to stop executors immediately as soon as the iterator
> gets canceled.
> 
> Paolo
> 
>>
>>>> VARS ?x ?y .
>>>>
>>>> Set the variables in force for subsequent rows,
>>>> until the next VARS directive.
>>>> We need VARS because it's not always possible to determine all
>>>> the possible variables before starting to write out bindings.
>>> This is not completely clear to me. An example of when it's not possible
>>> to determine all the possible variables before starting to write out
>>> bindings will probably convince me and help me to clarify.
>> Suppose you have an Iterator<Binding> from a LeftJoin or a Union.  One
>> way is to statically determine the variables, the other is to be relaxed
>> and output based on the Bindings seen.  Static analysis requires the
>> info to be passed from query execution into, for example, the heart of
>>
>> The first might have ?x, ?z, the second ?x, ?y, ?z from an OPTIONAL. The
>> separation of the code from the static analysis
>>
>> If you set it once at the start, that also works.
>>
>> And you can concat streams.
>>
>>     Andy
> 
