Stephen Allen wrote:
> Hi Paolo,
>
> I was thinking along the same lines in terms of unifying the patches.
> Designing an interface inspired by Pig's DataBags seems to make sense. We
> would need three bag implementations (unordered, sorted, and distinct). A good
> starting point would be to have each bag proactively spill when thresholds
> are passed rather than relying on an external memory manager. Pig does in
> fact have these (InternalCachedBag, InternalSortedBag, and
> InternalDistinctBag [1]).
>
> I'm not completely happy with the implementation I have for JENA-45. I'd
> like to redesign it a bit, as well as unify JENA-44. The design will be
> guided by Pig, but with some simplifications.
>
> 1) Create implementations for the 3 bag types
> 2) Bags will be generic and accept serializer objects (to handle the
>    different tuple types: Bindings, Triples, and Quads)
> 3) Bags will proactively spill (this greatly simplifies things because there
>    is no need to deal with synchronization)
> 4) Spill based on the estimated memory size of the tuples [2] instead of just
>    the cardinality (the serializer object can generate the estimate)
>
> My plan is to try to work on it early next week.
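
To make points 1) to 4) of the plan above concrete, here is a minimal Java sketch of an unordered bag that spills itself to a temporary file once the estimated size of its in-memory items passes a threshold. All names (TupleSerializer, SpillingBag, StringSerializer) are hypothetical and are not taken from ARQ, JENA-44/JENA-45 or Pig. Strings stand in for Bindings, Triples and Quads, and each item is assumed to serialize to a single line of text.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical serializer contract: one instance per tuple type.
interface TupleSerializer<T> {
    String write(T item);        // must not contain line breaks (assumption)
    T read(String line);
    long estimateBytes(T item);  // cheap estimate, no serialization involved
}

// Example serializer for plain strings (stand-in for Bindings/Triples/Quads).
final class StringSerializer implements TupleSerializer<String> {
    public String write(String item) { return item; }
    public String read(String line) { return line; }
    public long estimateBytes(String item) { return 2L * item.length(); }
}

// Unordered bag that decides by itself when to spill; no external memory
// manager and therefore no synchronization with one.
final class SpillingBag<T> implements Iterable<T> {
    private final TupleSerializer<T> serializer;
    private final long thresholdBytes;
    private final List<T> memory = new ArrayList<>();
    private final List<Path> spillFiles = new ArrayList<>();
    private long estimatedBytes = 0;

    SpillingBag(TupleSerializer<T> serializer, long thresholdBytes) {
        this.serializer = serializer;
        this.thresholdBytes = thresholdBytes;
    }

    void add(T item) {
        memory.add(item);
        estimatedBytes += serializer.estimateBytes(item);
        if (estimatedBytes > thresholdBytes) {
            spill();
        }
    }

    int spillCount() { return spillFiles.size(); }

    // Serialization only happens here, when the threshold has been passed.
    private void spill() {
        try {
            Path file = Files.createTempFile("bag", ".spill");
            file.toFile().deleteOnExit();
            try (BufferedWriter out =
                     Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
                for (T item : memory) {
                    out.write(serializer.write(item));
                    out.newLine();
                }
            }
            spillFiles.add(file);
            memory.clear();
            estimatedBytes = 0;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // For brevity this reads spilled items back eagerly; a real
    // implementation would stream them from disk one file at a time.
    @Override
    public Iterator<T> iterator() {
        List<T> all = new ArrayList<>();
        try {
            for (Path file : spillFiles) {
                for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                    all.add(serializer.read(line));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        all.addAll(memory);
        return all.iterator();
    }
}
```

A sorted or distinct variant would presumably differ only in what happens just before the items are written in spill() and in how the spill files are merged when iterating.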

Hi Stephen,
thanks for your response.

I think we should create a new JIRA issue (which JENA-44 and JENA-45 depend on),
just to work on these three types of bags (i.e. unsorted, sorted and distinct)
which will spill to disk once they reach a threshold.

Do we need sorted+distinct as well?

If we create a new issue we can work together on that, hopefully commit it
quickly, and then make progress on JENA-44 and JENA-45 in parallel and
independently.

Paolo

>
> -Stephen
>
> [1] http://pig.apache.org/docs/r0.9.0/api/org/apache/pig/data/DataBag.html
> [2] I plan to estimate the size of the bindings/triples/quads by examining
> the values and retrieving their string lengths. This will prevent costly
> serialization until we actually have to spill. We could instead get an
> average tuple size based on, say, the first 100 items added to the bag.
>
>
> -----Original Message-----
> From: Paolo Castagna [mailto:[email protected]]
> Sent: Friday, August 12, 2011 9:30 AM
> To: [email protected]
> Subject: Re: JENA-44, JENA-45 etc - common Binding I/O
>
> Hi Andy
> first of all, apologies for the late reply on this.
>
> Andy Seaborne wrote:
>>
>> On 23/06/11 17:07, Paolo Castagna wrote:
>>> Hi Andy,
>>> first of all, thanks for this.
>>>
>>> Re: JENA-44... what is blocking JENA-44 going into trunk is just the
>>> lack of a common way to serialize bindings. By the way, we are using a
>>> patched version of ARQ on some of our servers (with no problem and
>>> improvements in terms of stability, RAM consumption in particular when
>>> users submit queries which need to sort large resultsets and they
>>> timeout).
>>>
>>> So, all this is more than welcome from my point of view (i.e. one patch
>>> less to manage).
>> Have you looked at the DeferredFileQueue / ThresholdPolicy code in
>> JENA-45? This is another area of commonality.
>
> ThresholdPolicyCount can be used for JENA-44 as well.
>
> Maybe the org.openjena.atlas.io.* and org.openjena.riot.* code from
> JENA-45 can be committed so that it can be used for JENA-44 as well.
>
> However, DeferredFileQueue does not currently provide any way to sort the
> items before spilling them to disk. So, we would need something similar:
> a DeferredSortingFileQueue. Do you agree?
>
> What we do in ExternalBindingSort is to buffer a certain number of bindings
> (by default 4000), sort them and write them to disk. Then we repeat with
> the next 4000.
>
>> Any thoughts about
>> DataBag from Pig? (JENA-44, comment 24/May/11, pt 3 - this might be
>> too much for this round).
>
> Something similar to SortedDataBag is what's needed for JENA-44.
> DeferredFileQueue from JENA-45 is similar to DefaultAbstractBag.
> The biggest difference is that Pig uses a SpillableMemoryManager instead
> of fixed thresholds.
>
> We could start committing JENA-45 and JENA-44 as they are (or with minimal
> changes) with fixed and sensible thresholds and configuration parameters.
>
> Then we could discuss a more general memory manager system which would need
> to control when to spill to disk. But, I don't see this as a blocker for
> JENA-45 nor JENA-44.
>
> The DataBag hierarchy from Pig is something we can be inspired by (i.e.
> copy ideas) but the code would need to be changed a lot to adapt to our
> needs.
>
>> There are various settable parameters - what makes a difference?
>> especially writeBufferSize.
>
> ExternalBindingSort.java has the following settable parameters:
>
> externalSortBufferSize (default value is 4000)
> externalSortWorkers (default value is 1)
> externalSortDir (defaults to what is specified by java.io.tmpdir)
>
> writeBufferSize is set to 10MB.
> Maybe we should make that configurable as well.
>
> Also, 10MB perhaps is too high with an externalSortBufferSize of only 4000
> bindings.
>
> The aim of having all these parameters configurable via ARQ's symbols is to
> allow people to make experiments and find the optimal configuration for their
> systems.
>
> I seem to remember a spreadsheet with a few experiments but it could be
> something unrelated to these parameters. In any case I can add a sort of
> micro-benchmark for this to the src-dev area as part of JENA-44.
>
>> I didn't notice how cancellation would stop executors, only clear up
>> afterwards. What about a volatile flag?
>
> Right. Once executors start they run to completion. However, we create a new
> executor every externalSortBufferSize (by default 4000) bindings and only if
> the iterator has not been canceled.
>
> Yes, we can add a flag to stop executors immediately as soon as the iterator
> gets canceled.
>
> Paolo
>
>>
>>>> VARS ?x ?y .
>>>>
>>>> Set the variables in force for subsequent rows,
>>>> until the next VARS directive.
>>>> We need VARS because it's not always possible to determine all
>>>> the possible variables before starting to write out bindings.
>>> This is not completely clear to me. An example of when it's not possible
>>> to determine all the possible variables before starting to write out
>>> bindings will probably convince me and help me to clarify.
>> Suppose you have an Iterator<Binding> from a LeftJoin or a Union. One
>> way is to statically determine the variables, the other is to be relaxed
>> and output based on the Bindings seen. Static analysis requires the
>> info to be passed from query execution into, for example, the heart of
>>
>> The first might have ?x, ?z, the second ?x, ?y, ?z from an OPTIONAL. The
>> separation of the code from the static analysis
>>
>> If you set it once at the start, that also works.
>>
>> And you can concat streams.
>>
>> Andy
>
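
The buffer, sort and spill cycle described in the thread for ExternalBindingSort (buffer a fixed number of items, sort them, write them to disk as a run, repeat, then merge the runs), together with the volatile cancellation flag suggested above, might look roughly like the Java sketch below. This is not the JENA-44 code: ExternalLineSort and all of its members are hypothetical names, lines of text stand in for Bindings, and the natural String order stands in for the query's comparator.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: sorts lines of text, not ARQ Bindings.
final class ExternalLineSort {
    private final int bufferSize;                // items per run (4000 in the thread)
    private volatile boolean cancelled = false;  // may be set from another thread

    ExternalLineSort(int bufferSize) { this.bufferSize = bufferSize; }

    void cancel() { cancelled = true; }

    List<String> sort(Iterator<String> input) {
        List<Path> runs = new ArrayList<>();
        List<String> buffer = new ArrayList<>();
        try {
            // The flag is checked on every item, so cancellation takes
            // effect without waiting for the current run to fill up.
            while (!cancelled && input.hasNext()) {
                buffer.add(input.next());
                if (buffer.size() >= bufferSize) {
                    runs.add(writeRun(buffer));
                }
            }
            if (cancelled) {
                return Collections.emptyList();
            }
            Collections.sort(buffer);  // the last, partial run stays in memory
            return merge(runs, buffer);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Sorts the buffer, writes it out as one run and empties it.
    private Path writeRun(List<String> buffer) throws IOException {
        Collections.sort(buffer);
        Path file = Files.createTempFile("run", ".sorted");
        file.toFile().deleteOnExit();
        Files.write(file, buffer, StandardCharsets.UTF_8);
        buffer.clear();
        return file;
    }

    // One cursor per sorted run, ordered by the run's current head item.
    private static final class Cursor implements Comparable<Cursor> {
        String head;
        final Iterator<String> rest;
        Cursor(Iterator<String> items) { rest = items; head = items.next(); }
        public int compareTo(Cursor other) { return head.compareTo(other.head); }
    }

    // k-way merge of the runs. For brevity each run is read back whole and
    // the result is a list; a real implementation would stream both.
    private List<String> merge(List<Path> runs, List<String> lastRun) throws IOException {
        PriorityQueue<Cursor> heap = new PriorityQueue<>();
        for (Path run : runs) {
            Iterator<String> items = Files.readAllLines(run, StandardCharsets.UTF_8).iterator();
            if (items.hasNext()) heap.add(new Cursor(items));
        }
        if (!lastRun.isEmpty()) heap.add(new Cursor(lastRun.iterator()));

        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor smallest = heap.poll();
            out.add(smallest.head);
            if (smallest.rest.hasNext()) {
                smallest.head = smallest.rest.next();
                heap.add(smallest);
            }
        }
        return out;
    }
}
```

The sketch is single-threaded; with several workers (externalSortWorkers greater than 1) each worker would need to check the same flag inside its own loop.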
