Hi Paolo,

I was thinking along the same lines in terms of unifying the patches.
Designing an interface inspired by Pig's DataBags seems to make sense.  We
would need three bag implementations (unordered, sorted, and distinct).  A good
starting point would be to have each bag proactively spill when thresholds are
passed rather than rely on an external memory manager.  Pig does in fact have
such bags (InternalCachedBag, InternalSortedBag, and InternalDistinctBag [1]).

I'm not completely happy with the implementation I have for JENA-45.  I'd like 
to redesign it a bit, as well as unify JENA-44.  The design will be guided by 
Pig, but with some simplifications.

1) Create implementations for the 3 bag types
2) Bags will be generic and accept serializer objects (to handle the different 
tuple types: Bindings, Triples, and Quads)
3) Bags will proactively spill (this greatly simplifies things because there is 
no need to deal with synchronization)
4) Spill based on the estimated memory size of the tuples [2] instead of just 
the cardinality (the serializer object can generate the estimate)
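
To make points 2-4 concrete, here is a rough sketch of an unordered bag that
spills on its own once the estimated size passes a threshold.  All names
(TupleSerializer, SpillingBag, StringSerializer) are hypothetical and not taken
from the JENA-44/JENA-45 patches or from Pig; plain strings stand in for
Bindings/Triples/Quads, and the 2-bytes-per-char estimate is an assumption.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

/** Hypothetical serializer: writes, reads, and estimates the size of one tuple. */
interface TupleSerializer<T> {
    void write(DataOutputStream out, T item) throws IOException;
    T read(DataInputStream in) throws IOException;
    long estimateBytes(T item);
}

/** Unordered bag that spills its buffer to a temp file when the size estimate passes a threshold. */
final class SpillingBag<T> implements Closeable {
    private final TupleSerializer<T> serializer;
    private final long thresholdBytes;
    private final List<T> memory = new ArrayList<>();
    private final List<Path> spillFiles = new ArrayList<>();
    private final List<Integer> spillCounts = new ArrayList<>(); // items per spill file
    private long estimatedBytes = 0;

    SpillingBag(TupleSerializer<T> serializer, long thresholdBytes) {
        this.serializer = serializer;
        this.thresholdBytes = thresholdBytes;
    }

    /** Single-threaded by design: the bag decides to spill inside add(), so no locking is needed. */
    void add(T item) throws IOException {
        memory.add(item);
        estimatedBytes += serializer.estimateBytes(item);
        if (estimatedBytes > thresholdBytes) {
            spill();
        }
    }

    private void spill() throws IOException {
        Path file = Files.createTempFile("bag-spill", ".bin");
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            for (T item : memory) {
                serializer.write(out, item);
            }
        }
        spillFiles.add(file);
        spillCounts.add(memory.size());
        memory.clear();
        estimatedBytes = 0;
    }

    int spillCount() {
        return spillFiles.size();
    }

    /** Returns spilled items first, then in-memory ones. A real bag would stream via an iterator. */
    List<T> readAll() throws IOException {
        List<T> result = new ArrayList<>();
        for (int i = 0; i < spillFiles.size(); i++) {
            try (DataInputStream in = new DataInputStream(
                    new BufferedInputStream(Files.newInputStream(spillFiles.get(i))))) {
                for (int n = 0; n < spillCounts.get(i); n++) {
                    result.add(serializer.read(in));
                }
            }
        }
        result.addAll(memory);
        return result;
    }

    @Override
    public void close() throws IOException {
        for (Path p : spillFiles) {
            Files.deleteIfExists(p);
        }
        spillFiles.clear();
        spillCounts.clear();
        memory.clear();
    }
}

/** Stand-in serializer for strings; the size estimate assumes 2 bytes per char. */
final class StringSerializer implements TupleSerializer<String> {
    public void write(DataOutputStream out, String s) throws IOException { out.writeUTF(s); }
    public String read(DataInputStream in) throws IOException { return in.readUTF(); }
    public long estimateBytes(String s) { return 2L * s.length(); }
}
```

The sorted and distinct variants would presumably differ only in what happens
to the buffer just before it is written out (sort it, or drop duplicates).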

My plan is to try to work on it early next week.

-Stephen

[1] http://pig.apache.org/docs/r0.9.0/api/org/apache/pig/data/DataBag.html
[2] I plan to estimate the size of the bindings/triples/quads by examining the 
values and retrieving their string lengths.  This will prevent costly 
serialization until we actually have to spill.  We could instead get an average 
tuple size based on say the first 100 items added to the bag.
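
A minimal sketch of the two estimation options in [2], assuming a tuple can be
viewed as a list of string values.  The class names and the per-value overhead
constant are my own placeholders, not measured figures.

```java
import java.util.List;

/** Estimates tuple size from the string lengths of its values, without serializing. */
final class StringLengthEstimator {
    static final int BYTES_PER_CHAR = 2;      // Java chars are 2 bytes
    static final int OVERHEAD_PER_VALUE = 16; // assumed object overhead, not measured

    static long estimate(List<String> tuple) {
        long total = 0;
        for (String value : tuple) {
            total += OVERHEAD_PER_VALUE + (long) BYTES_PER_CHAR * value.length();
        }
        return total;
    }
}

/** Measures the first sampleSize tuples, then reuses their average for all later ones. */
final class SampledEstimator {
    private final int sampleSize;
    private long sampledBytes = 0;
    private int sampled = 0;

    SampledEstimator(int sampleSize) {
        if (sampleSize < 1) {
            throw new IllegalArgumentException("sampleSize must be at least 1");
        }
        this.sampleSize = sampleSize;
    }

    long estimate(List<String> tuple) {
        if (sampled < sampleSize) {
            long bytes = StringLengthEstimator.estimate(tuple);
            sampledBytes += bytes;
            sampled++;
            return bytes;
        }
        return sampledBytes / sampled; // average of the sampled tuples
    }
}
```

The sampled variant is cheaper per tuple but will be off if the first items are
not representative of the rest of the stream.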


-----Original Message-----
From: Paolo Castagna [mailto:[email protected]] 
Sent: Friday, August 12, 2011 9:30 AM
To: [email protected]
Subject: Re: JENA-44, JENA-45 etc - common Binding I/O

Hi Andy,
first of all, apologies for the late reply on this.

Andy Seaborne wrote:
> 
> 
> On 23/06/11 17:07, Paolo Castagna wrote:
>> Hi Andy,
>> first of all, thanks for this.
>>
>> Re: JENA-44... what is blocking JENA-44 going into trunk is just the lack
>> of a common way to serialize binding. By the way, we are using a patched
>> version of ARQ on some of our servers (with no problem and improvements in
>> terms of stability, RAM consumption in particular when users submit queries
>> which need to sort large resultsets and they timeout).
>>
>> So, all this is more than welcome from my point of view (i.e. one patch
>> less to manage).
> 
> Have you looked at the DeferredFileQueue / ThresholdPolicy code in
> JENA-45?  This is another area of commonality.

ThresholdPolicyCount can be used for JENA-44 as well.

Maybe the code in org.openjena.atlas.io.* and org.openjena.riot.* from
JENA-45 can be committed so that it can be used for JENA-44 as well.

However, DeferredFileQueue does not currently provide any way to sort the
items before spilling them to disk. So, we would need something similar
but a DeferredSortingFileQueue. Do you agree?

ExternalBindingSort buffers a fixed number of bindings (4000 by default),
sorts them, and writes them to disk. It then repeats with the next 4000.
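
For illustration, the buffer/sort/write loop can be sketched as below. This is
not the ExternalBindingSort code: the class and method names are hypothetical,
strings stand in for bindings, each item is assumed to fit on one line of
text, and the merge step is my own addition to show how the runs would be read
back in order.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.*;

final class ExternalSortSketch {
    /** Buffers up to bufferSize items, sorts each buffer, and writes it out as one run file. */
    static List<Path> writeSortedRuns(Iterator<String> input, int bufferSize) throws IOException {
        List<Path> runs = new ArrayList<>();
        List<String> buffer = new ArrayList<>(bufferSize);
        while (input.hasNext()) {
            buffer.add(input.next());
            if (buffer.size() >= bufferSize) {
                runs.add(writeRun(buffer));
                buffer.clear();
            }
        }
        if (!buffer.isEmpty()) {
            runs.add(writeRun(buffer)); // final, possibly partial, run
        }
        return runs;
    }

    private static Path writeRun(List<String> buffer) throws IOException {
        Collections.sort(buffer);
        Path file = Files.createTempFile("sort-run", ".txt");
        Files.write(file, buffer, StandardCharsets.UTF_8);
        return file;
    }

    /** Merges the sorted runs with a priority queue keyed on each run's current line. */
    static List<String> mergeRuns(List<Path> runs) throws IOException {
        PriorityQueue<RunCursor> heap =
                new PriorityQueue<>(Comparator.comparing((RunCursor c) -> c.head));
        List<String> out = new ArrayList<>(); // a real implementation would stream instead
        try {
            for (Path run : runs) {
                RunCursor c = new RunCursor(Files.newBufferedReader(run, StandardCharsets.UTF_8));
                if (c.advance()) heap.add(c); else c.reader.close();
            }
            while (!heap.isEmpty()) {
                RunCursor c = heap.poll();
                out.add(c.head);
                if (c.advance()) heap.add(c); else c.reader.close();
            }
        } finally {
            for (RunCursor c : heap) c.reader.close();
            for (Path run : runs) Files.deleteIfExists(run);
        }
        return out;
    }

    /** One open run file plus the line currently at its head. */
    private static final class RunCursor {
        final BufferedReader reader;
        String head;

        RunCursor(BufferedReader reader) { this.reader = reader; }

        boolean advance() throws IOException {
            head = reader.readLine();
            return head != null;
        }
    }
}
```

With a buffer size of 4000, an input of 10000 items would produce three run
files (4000, 4000, and 2000 items) before the merge.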

> Any thoughts about
> DataBag from Pig?  (JENA-44, comment 24/May/11, pt 3 - this might be
> too much for this round).

Something similar to SortedDataBag is what's needed for JENA-44.
DeferredFileQueue from JENA-45 is similar to DefaultAbstractBag.
The biggest difference is that Pig uses a SpillableMemoryManager instead
of fixed thresholds.

We could start committing JENA-45 and JENA-44 as they are (or with minimal
changes) with fixed and sensible thresholds and configuration parameters.

Then we could discuss a more general memory manager system which would need
to control when to spill to disk. But, I don't see this as a blocker for
JENA-45 nor JENA-44.

The DataBag hierarchy from Pig is something we can be inspired by (i.e.
copy ideas) but the code would need to be changed a lot to adapt to our
needs.

> There are various settable parameters - what makes a difference?
> especially writeBufferSize.

ExternalBindingSort.java has the following settable parameters:

  externalSortBufferSize (default value is 4000)
  externalSortWorkers (default value is 1)
  externalSortDir (defaults to the directory given by java.io.tmpdir)

writeBufferSize is set to 10MB.
Maybe we should make that configurable as well.

Also, 10MB is perhaps too high with an externalSortBufferSize of only 4000
bindings.

The aim of having all these parameters configurable via ARQ's symbols is to
allow people to make experiments and find the optimal configuration for their
systems.

I seem to remember a spreadsheet with a few experiments but it could be
something unrelated to these parameters. In any case I can add a sort of
micro-benchmark for this to the src-dev area as part of JENA-44.

> I didn't notice how cancellation would stop executors, only clear up
> afterwards.  What about a volatile flag?

Right. Once executors start they run to completion. However, we create a new
executor every externalSortBufferSize (by default 4000) bindings and only if
the iterator has not been canceled.

Yes, we can add a flag to stop executors immediately as soon as the iterator
gets canceled.
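
A minimal sketch of the volatile flag idea, assuming the worker can check the
flag between units of work. CancellableWorker is a hypothetical name and the
"batch" here is only a counter, not the actual sort-and-write step.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Worker that checks a volatile flag between batches so it can stop early. */
final class CancellableWorker implements Runnable {
    // volatile: a write from the cancelling thread is visible to the worker thread
    private volatile boolean cancelled = false;
    private final int totalBatches;
    private final AtomicInteger completed = new AtomicInteger();

    CancellableWorker(int totalBatches) {
        this.totalBatches = totalBatches;
    }

    /** Called by the iterator when the query is cancelled. */
    void cancel() {
        cancelled = true;
    }

    int completedBatches() {
        return completed.get();
    }

    @Override
    public void run() {
        for (int i = 0; i < totalBatches; i++) {
            if (cancelled) {
                return; // stop before starting the next batch
            }
            // sorting and writing one batch would happen here
            completed.incrementAndGet();
        }
    }
}
```

The worker still finishes the batch it is in; how quickly it stops depends on
how often the flag is checked.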

Paolo

> 
> 
>>> VARS ?x ?y .
>>>
>>> Set the variables in force for subsequent rows,
>>> until the next VARS directive.
>>> We need VARS because it's not always possible to determine all
>>> the possible variables before starting to write out bindings.
>>
>> This is not completely clear to me. An example of when it's not possible
>> to determine all the possible variables before starting to write out
>> binding
>> will probably convince me and help me to clarify.
> 
> Suppose you have an Iterator<Binding> from a LeftJoin or a Union.  One
> way is to statically determine the variables, the other is to be relaxed
> and output based on the Bindings seen.  Static analysis requires the
> info to be passed from query execution into, for example, the heart of
> 
> The first might have ?x, ?z, the second ?x, ?y, ?z from an OPTIONAL. The
> separation of the code from the static analysis
> 
> If you set it once at the start, that also works.
> 
> And you can concat streams.
> 
>     Andy
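
To illustrate the "relaxed" option quoted above, here is a rough sketch of a
writer that emits a new VARS line whenever the set of variables in a binding
changes. Only the "VARS ?x ?y ." directive comes from the thread; the row
syntax (space-separated values ending in " .") and the VarsWriterSketch name
are made up for this example, and bindings are modelled as plain maps.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeSet;

final class VarsWriterSketch {
    /** Writes bindings as rows, re-declaring VARS only when the variable set changes. */
    static String write(List<Map<String, String>> bindings) {
        StringBuilder out = new StringBuilder();
        List<String> current = null; // variables currently in force
        for (Map<String, String> binding : bindings) {
            // sort variable names so the column order is deterministic
            List<String> vars = new ArrayList<>(new TreeSet<>(binding.keySet()));
            if (!vars.equals(current)) {
                out.append("VARS");
                for (String v : vars) {
                    out.append(" ?").append(v);
                }
                out.append(" .\n");
                current = vars;
            }
            StringJoiner row = new StringJoiner(" ");
            for (String v : vars) {
                row.add(binding.get(v));
            }
            out.append(row.toString()).append(" .\n"); // hypothetical row syntax
        }
        return out.toString();
    }
}
```

No static analysis is needed here: the writer only looks at the bindings it
sees, which is also why concatenating two such streams still parses.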
