Thanks for the reference to the paper, Tim. Definitely worth a read.
And yes, I think this topic is important enough to warrant a thorough
discussion, so the more input we can get, the better.


Parth

On Thu, Oct 9, 2014 at 12:37 AM, Timothy Chen <[email protected]> wrote:

> Hi Parth,
>
> Thanks for providing an update; it's really great to see more
> design discussions on the list!
>
> The pipeline chains idea definitely makes a lot of sense, and I still
> remember offline discussions around this in the past.
>
> Global memory efficiency seems like a scheduling problem: delaying a
> chain only helps if there are other chains in flight, and the
> scheduling could happen at the chain level or at the query level.
>
> I don't have much to add yet, but I'd love to see how we can start
> simple on this.
>
> One relevant paper that I just started reading is from the recent
> OSDI '14; I'll chime in more once I've digested it.
>
> https://www.usenix.org/conference/osdi14/technical-sessions/presentation/boutin
>
> Tim
>
>
> On Wed, Oct 8, 2014 at 11:03 PM, Parth Chandra <[email protected]>
> wrote:
> > Hi everyone,
> >
> > Aman, Jinfeng and I had an initial offline discussion about memory
> > planning in Drill. I’m summarizing the discussion below and hoping to
> > initiate a discussion around some of these ideas.
> >
> >
> > Order of Execution
> > -------------------------
> > Assertion: For memory management, Order of Execution is a fundamental
> > issue.
> >
> > One of the problems with memory usage in the execution engine is that all
> > operators start up simultaneously and start allocating memory even though
> > the downstream operators may not be ready to consume their output.
> >
> > For example, in the plan below:
> >
> >       Agg
> >        |
> >       HJ2
> >      /    \
> >     HJ1    Scan3
> >    /   \
> > Scan1   Scan2
> >
> >
> > the scan operators all begin reading data simultaneously. In this case,
> > the Hash Joins are blocking operations and the output of Scan3 cannot be
> > consumed until HJ1 is ready to emit its results. If, say, Scan2 is on the
> > build side of the hash table for HJ1, then HJ1 will not emit any records
> > until Scan2 completes its operation. If Scan3 starts immediately, it is
> > consuming memory that could be utilized by Scan2. Instead, if we delay
> > the start of Scan3 until after HJ1 is ready to emit records, we can
> > utilize memory more efficiently.
> >
> > To address this, we can think of the query plan in terms of pipeline
> > chains, where a pipeline chain is a chain of operators terminated by a
> > blocking operator.
> >
> > In the example, there would be three pipeline chains:
> > PC1: Scan1-HJ1-HJ2-Agg
> > PC2: Scan2
> > PC3: Scan3
> >
> > Now we can see that PC1 and PC2 can start immediately, but PC3 can be
> > delayed until PC2 has completed and PC1 has reached HJ2.
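> >
> > To make this concrete, below is a rough sketch of how chains could be
> > derived from a plan tree. It is only an illustration: Op, chainInput and
> > sideInputs are invented names, not Drill's physical-plan classes, and a
> > real version would have to account for major/minor fragment boundaries.
> >
> > import java.util.ArrayList;
> > import java.util.List;
> >
> > class Op {
> >   final String name;
> >   Op chainInput;                        // input whose chain this op extends
> >   final List<Op> sideInputs = new ArrayList<>(); // chains that end at the child
> >   Op(String name) { this.name = name; }
> > }
> >
> > class ChainBuilder {
> >   // Returns each pipeline chain as a leaf-to-terminator list of operators.
> >   static List<List<Op>> chains(Op root) {
> >     List<List<Op>> result = new ArrayList<>();
> >     result.add(walk(root, result));     // the root terminates its own chain
> >     return result;
> >   }
> >
> >   private static List<Op> walk(Op op, List<List<Op>> result) {
> >     for (Op side : op.sideInputs)       // e.g. a build side: its chain ends there
> >       result.add(walk(side, result));
> >     List<Op> chain = (op.chainInput == null)
> >         ? new ArrayList<Op>()           // a leaf (scan) starts a fresh chain
> >         : walk(op.chainInput, result);  // e.g. a probe side continues the chain
> >     chain.add(op);
> >     return chain;
> >   }
> >
> >   public static void main(String[] args) {
> >     Op scan1 = new Op("Scan1"), scan2 = new Op("Scan2"), scan3 = new Op("Scan3");
> >     Op hj1 = new Op("HJ1"); hj1.chainInput = scan1; hj1.sideInputs.add(scan2);
> >     Op hj2 = new Op("HJ2"); hj2.chainInput = hj1;  hj2.sideInputs.add(scan3);
> >     Op agg = new Op("Agg"); agg.chainInput = hj2;
> >     for (List<Op> chain : chains(agg)) { // prints Scan3, Scan2, Scan1-HJ1-HJ2-Agg
> >       StringBuilder sb = new StringBuilder();
> >       for (Op o : chain) sb.append(sb.length() == 0 ? "" : "-").append(o.name);
> >       System.out.println(sb);
> >     }
> >   }
> > }
> >
> > Given such a chain list, plus the dependency information about which
> > blocking operator each chain feeds, the execution engine would have what
> > it needs to hold PC3 back until PC2 completes and PC1 reaches HJ2.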
> >
> > One thing we need to consider is that multiple major fragments can be
> > part of a single pipeline chain. All of these major fragments can begin
> > executing once the pipeline chain is ready to start.
> >
> > We still need to think this one through; there are probably many details
> > to be hashed out. One thing is certain, though: the planner has to provide
> > some more information to the execution engine about the ordering of the
> > pipeline chains. In other words, implementing this needs work in both
> > the planner and the execution engine. We also need to work out the details
> > of how the idea of a pipeline chain will be reconciled with the idea of
> > major/minor fragments, which are currently the units of execution.
> >
> > Fragment memory limit
> > -------------------------
> >    We have implemented a simple method to limit the use of memory by a
> > single fragment in 0.6 (it is disabled by default). This prevents a
> > single fragment from hogging too much memory while other fragments are
> > starved. However, the current implementation has some drawbacks:
> >   i) The fragment memory allocators have no idea how much memory is
> > really needed by the fragment. The fragment limit is therefore determined
> > by dividing the available memory *equally* among the fragments. This is
> > not a fair method; a better choice would be to assign fragment limits
> > based on the relative needs of the executing fragments.
> >   ii) The idea of limiting memory use by a fragment is a little too
> > narrow, since the purpose is to allow many queries, not fragments, to
> > run together. The current mechanism favours queries that have many
> > fragments over queries with fewer fragments but equivalent memory needs.
> >
> >    To address this, we need to assign memory limits per query instead of
> > per fragment. We also have some estimates, at the query level, of the
> > amount of memory that the query may need. We should probably change the
> > memory limit implementation to use these estimates and assign limits
> > proportionately. The limit should be calculated per query and assigned
> > to all fragments of the same query. It might be even better if we could
> > estimate the memory requirement per fragment and use that as the limit.
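> >
> >    As a rough sketch of what this might look like (the names and the
> > shape of the per-query estimate are assumptions for illustration, not
> > the current allocator API):
> >
> > import java.util.LinkedHashMap;
> > import java.util.Map;
> >
> > class MemoryLimits {
> >   // What we do today: divide the node's memory equally among fragments,
> >   // regardless of what each fragment actually needs.
> >   static long equalFragmentLimit(long available, int numFragments) {
> >     return available / numFragments;
> >   }
> >
> >   // Proposed: split the node's memory across queries in proportion to
> >   // the planner's per-query estimates, then assign the resulting
> >   // query-level limit to every fragment of that query.
> >   static Map<String, Long> proportionalQueryLimits(long available,
> >                                                    Map<String, Long> estimates) {
> >     double total = 0;
> >     for (long e : estimates.values()) total += e;
> >     Map<String, Long> limits = new LinkedHashMap<>();
> >     for (Map.Entry<String, Long> q : estimates.entrySet())
> >       limits.put(q.getKey(), (long) (available * (q.getValue() / total)));
> >     return limits;
> >   }
> > }
> >
> >    For example, with 10GB available on a node and planner estimates of
> > 4GB for Q1 and 1GB for Q2, Q1's fragments would run under an 8GB query
> > limit and Q2's under a 2GB limit, instead of every fragment getting an
> > equal slice regardless of need.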
> >
> >     Again, some work needs to be done to figure out what data is
> > available to the allocators and what can be calculated or estimated at
> > the planning stage to allow the allocators to distribute memory fairly.
> >
> >
> >     All critiques and criticisms are welcome; we’re hoping to get a good
> > discussion going around this.
> >
> >
> > Parth
>
