It sounds like Paul and John would both benefit from reviewing [1] & [2].
Drill has memory management: it respects limits and uses a hierarchy of allocators to do so. The framework for constraining particular operations, fragments, or queries already exists. (Note that this is entirely focused on off-heap memory; in general, Drill tries to avoid ever moving data on heap.) Workload management is another topic, and there is an initial proposal out for comment here: [2]

The parallelization algorithms don't currently support heterogeneous nodes. I'd suggest that initial work focus on adding or removing same-sized nodes. A separate, substantial effort would be needed for smarter lopsided parallelization and workload decisions. (Let's get the basics right first.)

With regard to Paul's comments on 'inside Drill' threading, I think you're jumping to some incorrect conclusions. There haven't been any formal proposals to change the threading model. There was a very short discussion a month or two back where Hanifi said he'd throw out some prototype code, but nothing has been shared since. I suggest you assume the current threading model until there is consensus around something new.

[1] https://github.com/apache/drill/blob/master/exec/memory/base/src/main/java/org/apache/drill/exec/memory/README.md
[2] https://docs.google.com/document/d/1xK6CyxwzpEbOrjOdmkd9GXf37dVaf7z0BsvBNLgsZWs/edit

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Mon, Mar 28, 2016 at 8:43 AM, John Omernik <[email protected]> wrote:

> Great summary. I'll fill in some "non-technical" explanations of some challenges with memory as I see them. Drill devs, please keep Paul and me accurate in our understanding.
>
> First, memory is already set at the drillbit level... sorta. It's set via ENV in drill-env.sh (see the sketch below), and is not a cluster-specific thing. However, I believe there are some challenges that come into play when you have bits of different sizes. Drill "may" assume that bits are all the same size, and thus, if you run a query, then depending on which bit is the foreman and which fragments land where, the query may succeed or fail. That's not an ideal situation. I think for a holistic discussion on memory, we need to get some definitives around how Drill handles memory, especially different-sized nodes, and what changes would need to be made for bits of different sizes to work well together on a production cluster.
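>
> For concreteness, here is roughly what those knobs look like in conf/drill-env.sh (the sizes are just example values, not recommendations):
>
>     # Per-drillbit memory, applied at JVM startup:
>     export DRILL_HEAP="4G"                   # becomes -Xms/-Xmx
>     export DRILL_MAX_DIRECT_MEMORY="8G"      # becomes -XX:MaxDirectMemorySize
>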
> This discussion forms the basis of almost all work around memory management. If we can realistically only have bits of one size in the current form, then static allocations are where we are going to be for the initial YARN work. I love the idea of scaling up and down, but it will be difficult to scale an entire cluster's worth of bits up and down, so heterogeneous resource allocation must be a prerequisite to any dynamic allocation discussion (other than just adding and removing whole bits).
>
> Second, this also plays into the multiple-drillbits-per-node discussion. If static-sized bits are our only approach, then the initial reaction is to make them smaller so you have some granularity in scaling up and down. This may actually hurt a cluster. A large query may struggle to fit its fragments on 3 nodes with, say, 8GB of direct RAM each, while the same query would run fine on bits with 24GB of direct RAM. Drill devs: keep me honest here. I am going off of lots of participation in the memory/cpu discussions when I first started Drill/Marathon integration, and that is the feeling I got in talking to folks on and off list about memory management.
>
> This is a hard topic, but one that I am glad you are spearheading, Paul, because as we see more and more clusters get folded together, a citizen that plays nice with others and provides flexibility in the performance-vs-resources tradeoff will be a huge selling/implementation point for any analytics tool. If it's hard to implement and test at scale without dedicated hardware, it won't get a fair shake.
>
> John
>
>
> On Sun, Mar 27, 2016 at 3:25 PM, Paul Rogers <[email protected]> wrote:
>
> > Hi John,
> >
> > The other main topic of your discussion is memory management. Here we seem to have 6 topics:
> >
> > 1. Setting the limits for Drill.
> > 2. Drill respects the limits.
> > 3. Drill lives within its memory "budget."
> > 4. Drill throttles work based on available memory.
> > 5. Drill adapts memory usage to available memory.
> > 6. Some means to inform Drill of increases (or decreases) in memory allocation.
> >
> > YARN, via container requests, solves the first problem. Someone (the network admin) has to decide on the size of each drill-bit container, but YARN handles allocating the space, preventing memory oversubscription, and enforcing the limit (by killing processes that exceed their allocation.)
> >
> > As you pointed out, memory management is different from CPU: we can't just expect Linux to silently give us more or less depending on load. Instead, Drill itself has to actively request and release memory (and know what to do in each case.)
> >
> > Item 2 says that Drill must limit its memory use. The JVM enforces heap size. (As the heap is exhausted, a Java program gets slower due to increased garbage collection events until it finally receives an out-of-memory error.)
> >
> > At present I'm still learning the details of how Drill manages memory so, by necessity, most of what follows is at the level of "what we could do" rather than "how it works today." Drill devs, please help fill in the gaps.
> >
> > The docs suggest we have a variety of settings that configure Drill memory (heap size, off-heap size, etc.) I need to ask around more to learn whether Drill does, in fact, limit its off-heap memory usage. If not, then perhaps this is a change we want to make.
> >
> > Once Drill respects memory limits, we move to item 3: Drill should live within the limits. By this I mean that query operations should work with constrained memory, perhaps by spilling to disk; it is not sufficient to simply fail when memory is exhausted. Again, I don't yet know where we're at here, but I understand we may still have a bit of work to do to achieve this goal.
> >
> > Item 4 looks at the larger picture. Suppose a drill-bit has 32GB of memory available to it. We do the work needed so that any given query can succeed within this limit (perhaps slowly, if operations spill to disk.) But what happens when the same drill-bit now has to process 10 such queries, or 100? We now have a much harder problem: having the collection of ALL queries live within the same 32GB limit.
> >
> > One solution is to simply hold queries in a queue when memory (or even CPU) becomes constrained. That is, rather than trying to run all 100 queries at once (slowly), perhaps run 20 at a time (quickly), allowing each a much larger share of memory. A sketch of that kind of admission gate follows below.
> >
> > Drill already has queues, but they are off by default. We may have to look at turning them on by default.
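> >
> > Purely as an illustration of the throttling idea (this is not Drill's actual queueing code, and the numbers are invented):
> >
> >     import java.util.concurrent.Semaphore;
> >
> >     public class AdmissionGate {
> >         // Admit at most 20 concurrent queries; the rest wait in line
> >         // instead of all fighting over the same 32GB of memory.
> >         private final Semaphore slots = new Semaphore(20, true);
> >
> >         public <T> T run(QueryTask<T> task) throws Exception {
> >             slots.acquire();           // blocks while 20 queries run
> >             try {
> >                 return task.execute(); // query gets a real memory share
> >             } finally {
> >                 slots.release();       // admit the next queued query
> >             }
> >         }
> >
> >         public interface QueryTask<T> { T execute() throws Exception; }
> >     }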
> > Again, I'm not familiar with our queuing strategy, but there seems to be quite a bit we could do to release queries from the queue only when we can give them adequate resources on each drill-bit.
> >
> > Item 5 says that Drill should be opportunistic. If some external system can grant a temporary loan of more memory, Drill should be able to use it. When the loan is revoked, Drill should relinquish the memory, perhaps by spilling the data to disk (or moving the data to other parts of memory.) Java programs can't release heap memory, but Drill uses off-heap memory, so it is at least theoretically possible to release memory back to the OS. It sounds like Drill needs a number of improvements before it can actually release off-heap memory.
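> >
> > (To make the heap/off-heap distinction concrete, here is a minimal plain-NIO sketch. Drill's allocators are actually built on Netty buffers, so treat this only as an analogy:)
> >
> >     import java.nio.ByteBuffer;
> >
> >     public class OffHeapDemo {
> >         public static void main(String[] args) {
> >             // Off-heap: counted against -XX:MaxDirectMemorySize, not -Xmx.
> >             ByteBuffer direct = ByteBuffer.allocateDirect(64 * 1024 * 1024);
> >             direct.putLong(0, 42L);
> >
> >             // Dropping the last reference lets the JVM return these pages
> >             // to the OS once the buffer is garbage collected. Heap memory,
> >             // by contrast, typically stays reserved by the JVM up to -Xmx
> >             // even after the objects in it die.
> >             direct = null;
> >             System.gc(); // a hint only; real code tracks buffers explicitly
> >         }
> >     }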
> > Finally, item 6 says we need that external system to loan Drill the extra memory. With CPU, the process scheduler can solve the problem all on its own by looking at system load and deciding, at any instant, which processes to run. Memory is harder.
> >
> > One solution would be for YARN to resize Drill's container. But YARN does not yet support resizing containers. YARN-1197, "Support changing resources of an allocated container" [1], describes the tasks needed to get there. Once that feature is complete, YARN will let an application ask for more memory (or release excess memory). Presumably the app or a user must decide to request more memory. For example, the admin might dial up Drill memory during the day when the marketing folks are running queries, but dial it back at night when mostly batch jobs run.
> >
> > The ability to manually change memory is great, but the ideal would be to have some automated way to use free memory on each node. Llama does this in an ad-hoc manner. A quick search on YARN did not reveal anything in this vein, so we'll have to research this idea a bit more. I wonder, though, whether Drill could actually handle fast-moving allocation changes; change on the order of the lifetime of a query seems more achievable (that is, on the order of minutes to hours).
> >
> > In short, it seems we have quite a few tasks ahead in the area of memory management. Each seems achievable, but each requires work. The Drill-on-YARN project is just a start: it helps the admin allocate memory between Drill and other apps.
> >
> > Thanks,
> >
> > - Paul
> >
> > [1] https://issues.apache.org/jira/browse/YARN-1197
> >
> > On Mar 26, 2016, at 6:48 AM, John Omernik <[email protected]> wrote:
> > >
> > > Paul -
> > >
> > > Great write-up.
> > >
> > > Your description of Llama and YARN is both informative and troubling for a potential cluster administrator. Looking at this solution, it would appear that to use YARN with Llama, the "citizen" (in this case Drill) would have to be extremely well behaved and honor all requests from Llama related to deallocation and limits on resources, while in reality there are no enforcement mechanisms. Not that I don't think Drill is a great tool written well by great people, but I don't know if I would want to leave my cluster SLAs up to drillbits doing the self-regulation. Edge cases, etc. causing a drillbit to start taking more resources would be very impactful to a cluster, and with more and more people going to highly concurrent, multi-tenant solutions, this becomes a HUGE challenge.
> > >
> > > Obviously dynamic allocation, flexing up and down to use "spare" cluster resources, is very important to many cluster/architecture administrators, but if I had to guess, SLA/workload guarantees would rank higher.
> > >
> > > The Llama approach seems too much of a "house of cards" to me to be viable, and I worry that long term it may not be best for a product like Drill. Our goal, I think, should be to play nice with others; if our core philosophy in integration is playing nice with others, it will only help adoption and people giving it a try. So back to Drill on YARN (natively)...
> > >
> > > A few questions around this. You mention that resource allocations are mostly a gentlemen's agreement. Can you explore that a bit more? I do believe there is cgroup support in YARN. (I know the Myriad project is looking to use cgroups.) So is this gentlemen's agreement more about when cgroups are NOT enabled? Thus it is only the word of the process running in the container in YARN? If this is the case, then has there been any research on the stability of cgroups and their implementation in YARN? Basically, a poll: are you using YARN? If so, are you using cgroups? If not, why? If you are using them, any issues? This may be helpful in what we are looking to do with Drill.
> > >
> > > "Hanifi's work will allow us to increase or decrease the number of cores we consume." Do you have any JIRAs I can follow on this? I am very interested in it. One of the benefits of cgroups in Mesos, as it relates to CPU shares, is a sort of built-in dynamic allocation. And it would be interesting to test a YARN cluster with cgroups enabled (once a basic YARN-aware drillbit exists) to see if YARN reacts the same way.
> > >
> > > Basically, when I run a drillbit on a node with cgroup isolation enabled in Marathon on Mesos, let's say I have 16 total cores on the node. For me, I run my Mesos agent with "14" available vcores. Why? Static allocation of 2 vcores for MapR-FS. Those 14 vcores are now available to tasks on the agent. When I start the drillbit, let's say I allocate 8 vcores to it in Marathon. Drill runs queries, and let's say the actual CPU usage on this node is minimal at the time; Drill, because it is not currently CPU-aware, takes all the CPU it can (it will use all 16 cores). The query finishes, and usage goes back to 0. But what happens if MapR is heavily using its 2 cores? Well, cgroups detect the contention and limit Drill, because Drill is only allocated 8 shares of the 14 the agent is aware of; this gives priority to the MapR operations. Even more so, if there are other Mesos tasks asking for CPU shares, Drill's CPU share is scaled back, not by telling Drill it can't use cores, but by processing what Drill is trying to do more slowly compared to the rest of the workloads. I know I am dumbing this down, but that's how I understand cgroups working.
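> > >
> > > My mental model of the cpu.shares arithmetic, as a toy calculation (pure illustration using the numbers from my example, not a real cgroups API):
> > >
> > >     public class SharesModel {
> > >         // Under contention, a cgroup gets roughly
> > >         // cores * (its shares / all shares currently demanding CPU).
> > >         static double effectiveCores(int cores, int mine, int otherActive) {
> > >             return cores * (double) mine / (mine + otherActive);
> > >         }
> > >
> > >         public static void main(String[] args) {
> > >             // Idle node: Drill is the only group asking, so it bursts to all 16.
> > >             System.out.println(effectiveCores(16, 8, 0)); // 16.0
> > >             // MapR-FS busy on its 2 shares: Drill gets 8/10 of the box.
> > >             System.out.println(effectiveCores(16, 8, 2)); // 12.8
> > >         }
> > >     }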
> > > Basically, I was very concerned when I first started doing Drill queries in Mesos, and I posted to the Mesos list, to which some people smarter than I took the time to explain things. (Vinode, you are lurking on this list; thanks again!)
> > >
> > > In a way, this is actually a nice side effect of cgroup isolation: from Drill's perspective it gets all the CPU, and is only scaled back on contention. So, my long explanation here is to bring things back to the YARN/cgroups/gentlemen's-agreement comment. I really want to understand this. As a cluster administrator, I can guarantee a level of resources with Mesos. Can I get that same guarantee in YARN? Is it only with certain settings? I just want to be 100% clear that if we go this route and make Drill work on YARN, our documentation/instructions are explicit about what we are giving the user on YARN. To me, a bad situation would be someone assuming all will be well when they run Drill on YARN and, because they are not aware of their own settings (say, not enabling cgroups), blaming Drill for breaking something.
> > >
> > > So that falls back to memory, and scaling memory in Drill. Memory, for obvious reasons, can't operate like CPU under cgroups: you can't allocate all memory to all the things and then scale back on contention. So, being a complete neophyte on the inner workings of Drill memory: what options would exist for allocating memory? Could we trigger events that adjust up and down the memory a given drillbit can use, so it self-limits? It currently self-limits because we set the memory settings in drill-env.sh; but what about changing that at a later time? Is it easier to change direct memory limits than heap?
> > >
> > > Hypothesis 1: if direct memory isn't actually held while a drillbit is idle, then changing what it could POTENTIALLY use when a query comes in would be easier than actually deallocating heap that has been used. Hypothesis 2: direct memory, if it is truly deallocated when not in use, is more about the limit on what could be used than about allocating or deallocating memory. Hence, a nice step one may be to allow this limit to change as needed by an Application Master in YARN (or a Mesos framework).
> > >
> > > If changing the limit on direct memory usage is the easier job, it may be a good first step (assuming I am not completely wrong on memory allocation): if we have to statically allocate heap, and making heap dynamic is a big change in code, but direct memory is easy to change, that's a great first feature that doesn't require boiling the ocean. Obviously lots of assumptions here, but I am just thinking out loud.
> > >
> > > Paul - when it comes to applications in YARN, and the containers that the Application Master allocates, can containers be joined? Let's say I am an Application Master, and I allocated 4 CPU cores and 16 GB of RAM to a drillbit (8 for heap and 8 for direct). Then at a later time I want to add more memory to the drillbit.... If my assumptions about direct memory in Drill hold, could my Application Master tell the drillbit, "OK, you can use 16 GB of direct memory now"? (I.e.,
> > > the AM asks the RM to allocate 8 more GB of RAM on that node, the RM agrees, and it allocates another container.) Can it just resize, or would that not work? I guess what I am describing here is sorta what Llama is doing... but I am actually talking about the ability to enforce the quotas.... This may actually be a question that fits into your discussion on resizing YARN containers more than anything.
> > >
> > > So I just tossed out a bunch of ideas here to keep the discussion running. Drill devs, I would love a better understanding of the memory allocation mechanisms within Drill (high level; neophyte here). I do feel as a cluster admin, as I have said, that the Llama approach (now that I understand it better) would worry me, especially in a multi-tenant cluster. And as you said, Paul, it "feels" hacky.
> > >
> > > Thanks for this discussion. It's a great opportunity for Drill adoption as clusters go more and more multi-tenant/multi-use.
> > >
> > > John
> > >
> > >
> > > On Fri, Mar 25, 2016 at 5:45 PM, Paul Rogers <[email protected]> wrote:
> > >
> >> Hi Jacques,
> >>
> >> Llama is a very interesting approach; I read their paper [1] early on, and just went back and read it again. Basically, Llama (as best I can tell) has a two-part solution.
> >>
> >> First, Impala is run off-YARN (that is, not in a YARN container). Llama uses "dummy" containers to inform YARN of Impala's resource usage. They can grow/shrink static allocations by launching more dummy containers. Each dummy container does nothing other than inform off-YARN Impala of the container resources. Rather clever, actually, even if it "abuses the software" a bit.
> >>
> >> Secondly, Llama is able to dynamically grab spare YARN resources on each node. Specifically, Llama runs a Node Manager (NM) plugin that watches actual node usage. The plugin detects the free NM resources and informs Impala of them. Impala then consumes the resources as needed. When the NM allocates a new container, the plugin informs Impala, which relinquishes the resources. All this works because YARN allocations are mostly a gentleman's agreement. Again, this is pretty clever, but only one app per node can play this game.
> >>
> >> The Llama approach could work for Drill. The benefit is that Drill runs as it does today. Hanifi's work will allow us to increase or decrease the number of cores we consume. The drawback is that Drill is not yet ready to play the memory game: it can't release memory back to the OS when requested. Plus, the approach just smells like a hack.
> >>
> >> The "pure-YARN" approach would be to let YARN start/stop the Drill-bits. The user can grow/shrink Drill resources by starting/stopping Drill-bits. (This is simple to do if one ignores data locality and starts each Drill-bit on a separate node. It is a bit more work if one wants to preserve data locality by being rack-aware, or by running multiple drill-bits per node.)
> >>
> >> YARN has been working on the ability to resize running containers. (See YARN-1197, "Support changing resources of an allocated container" [2].) Once that is available, we can grow/shrink existing Drill-bits (assuming that Drill itself is enhanced as discussed above.) The promise of resizable containers also suggests that the "pure-YARN" approach is workable.
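> >>
> >> To make the "pure-YARN" approach concrete, here is a minimal sketch of an AM-side request for one drill-bit container, using the stock Hadoop 2.x client API (container size, node name, and class name are invented for the example; error handling and the launch step are omitted):
> >>
> >>     import org.apache.hadoop.yarn.api.records.Priority;
> >>     import org.apache.hadoop.yarn.api.records.Resource;
> >>     import org.apache.hadoop.yarn.client.api.AMRMClient;
> >>     import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
> >>     import org.apache.hadoop.yarn.conf.YarnConfiguration;
> >>
> >>     public class DrillAmSketch {
> >>         public static void main(String[] args) throws Exception {
> >>             AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
> >>             rm.init(new YarnConfiguration());
> >>             rm.start();
> >>             rm.registerApplicationMaster("", 0, "");
> >>
> >>             // One drill-bit = one container: 16 GB, 4 vcores.
> >>             Resource drillbit = Resource.newInstance(16 * 1024, 4);
> >>             // Name a specific node to preserve data locality.
> >>             ContainerRequest req = new ContainerRequest(drillbit,
> >>                 new String[] {"node1.example.com"}, null,
> >>                 Priority.newInstance(0));
> >>             rm.addContainerRequest(req);
> >>             // Granted containers come back from rm.allocate(...);
> >>             // launching the drill-bit process in one uses an NMClient.
> >>         }
> >>     }
> >>
> >> Growing the cluster is then just another addContainerRequest(); shrinking means releasing a container (graceful drill-bit shutdown is the hard part, as noted later in the thread).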
> >>
> >> Once resizable containers are available, one more piece is needed to let Drill use free resources. Some cluster-wide component must detect free resources and offer them to applications that want them, deciding how to divvy up the resources between, say, Drill and Impala. The same piece would revoke resources when paying YARN customers need them.
> >>
> >> Of course, if the resizable container feature comes too late, or does not work well, we still have the option of going off-YARN using the Llama trick. But the Llama trick does nothing to address the cluster-wide coordination discussed above.
> >>
> >> So, the thought is: start simple with a "stock" YARN app. Then we can add bells and whistles as we gain experience and as YARN offers more capabilities.
> >>
> >> The nice thing about this approach is that the same idea plays well with Mesos (though the implementation is different).
> >>
> >> Thanks,
> >>
> >> - Paul
> >>
> >> [1] http://cloudera.github.io/llama/
> >> [2] https://issues.apache.org/jira/browse/YARN-1197
> >>
> >> On Mar 24, 2016, at 2:34 PM, Jacques Nadeau <[email protected]> wrote:
> >>>
> >>> Your proposed allocation approach makes a lot of sense. I think it will solve a large number of use cases. Thanks for giving an overview of the different frameworks. I wonder if they got too focused on the simple use case....
> >>>
> >>> Have you looked at Llama to see whether we could extend it for our needs? It's Apache-licensed and probably has at least a start at a bunch of things we're trying to do.
> >>>
> >>> https://github.com/cloudera/llama
> >>>
> >>> --
> >>> Jacques Nadeau
> >>> CTO and Co-Founder, Dremio
> >>>
> >>> On Tue, Mar 22, 2016 at 7:42 PM, Paul Rogers <[email protected]> wrote:
> >>>
> >>>> Hi Jacques,
> >>>>
> >>>> I'm thinking of "semi-static" allocation at first. Spin up a cluster of Drill-bits, after which the user can add or remove nodes while the cluster runs. (The add part is easy; the remove part is a bit tricky since we don't yet have a way to gracefully shut down a Drill-bit.) Once we get the basics to work, we can incrementally try out dynamics. For example, someone could whip up a script to look at load and use the proposed YARN client app to adjust resources. Later, we can fold dynamic load management into the solution once we're sure what folks want.
> >>>>
> >>>> I did look at Slider, Twill, Kitten and REEF. Kitten is too basic. I had great hope for Slider.
But it turns out that Slider and Twill have each built an elaborate framework to isolate us from YARN. The Slider framework (written in Python) seems harder to understand than YARN itself; at the least, one has to be an expert in YARN to understand what all that Python code does. And just looking at the class count in the Twill Javadoc was overwhelming. Slider and Twill have to solve the general case. If we build our own Java solution, we only have to solve the Drill case, which is likely much simpler.
> >>>>
> >>>> A bespoke solution would seem to offer some other advantages. It lets us do things like integrate ZK monitoring so we can learn of zombie drill-bits (ones that haven't exited, but are not sending heartbeat messages.) We can also gather metrics and historical data about the cluster as a whole. We can try out different cluster topologies. (Run Drill-bits on x of y nodes on a rack, say.) And we can eventually do the dynamic load management we discussed earlier.
> >>>>
> >>>> But first, I look forward to hearing what others have tried and what we've learned about how people want to use Drill in a production YARN cluster.
> >>>>
> >>>> Thanks,
> >>>>
> >>>> - Paul
> >>>>
> >>>>
> >>>> On Mar 22, 2016, at 5:45 PM, Jacques Nadeau <[email protected]> wrote:
> >>>>>
> >>>>> This is great news, welcome!
> >>>>>
> >>>>> What are you thinking in regards to static versus dynamic resource allocation? We have some conversations going regarding workload management, but they are still early, so it seems like starting with user-controlled allocation makes sense initially.
> >>>>>
> >>>>> Also, have you spent much time evaluating whether one of the existing YARN frameworks such as Slider would be useful? Does anyone on the list have any feedback on the relative merits of these technologies?
> >>>>>
> >>>>> Again, glad to see someone picking this up.
> >>>>>
> >>>>> Jacques
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Jacques Nadeau
> >>>>> CTO and Co-Founder, Dremio
> >>>>>
> >>>>> On Tue, Mar 22, 2016 at 4:58 PM, Paul Rogers <[email protected]> wrote:
> >>>>>
> >>>>>> Hi All,
> >>>>>>
> >>>>>> I'm a new member of the Drill team here at MapR. We'd like to take a look at running Drill on YARN for production customers. JIRA suggests some early work may have been done (DRILL-142 <https://issues.apache.org/jira/browse/DRILL-142>, DRILL-1170 <https://issues.apache.org/jira/browse/DRILL-1170>, DRILL-3675 <https://issues.apache.org/jira/browse/DRILL-3675>).
> >>>>>>
> >>>>>> YARN is a complex beast and the Drill community is large and growing. So, a good place to start is to ask: has anyone already done work on integrating Drill with YARN (see DRILL-142)? Or thought about what might be needed?
> >>>>>>
> >>>>>> DRILL-1170 (YARN support for Drill) seems a good place to gather requirements, designs and so on. I've posted a "starter set" of requirements to spur discussion.
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> - Paul
