Shouldn’t those threads get reused in a fixed-size pool?

> On Jul 7, 2015, at 23:40, Mike Carey <[email protected]> wrote:
>
> Weird.... That many threads seems wrong.....
>
> On 7/7/15 8:35 PM, Ian Maxon wrote:
>> I think I have at least a workaround for the thread starvation nailed
>> down. We'll have to see, but basically I think the latest few patches
>> cause us to use more threads for whatever reason, and this pushed us
>> over the default thread cap in many circumstances (not always). Going
>> ahead and setting the number of processes to unlimited within the
>> build server and containers seems to have put out the fire, so to
>> speak. Another confounding factor is that the Docker containers run
>> on the same host and hence share the host's thread limit, in addition
>> to their own per-container limits. It's not clear to me, however,
>> whether we intend to use that many threads (~500), or if there's a
>> subtle resource leak somewhere.
>>
>> - Ian
>>
>> On Tue, Jul 7, 2015 at 5:44 PM, Eldon Carman <[email protected]> wrote:
>>> In my branch ("ecarm002/introspection_alternate"), I have adapted some
>>> code I received from Ildar to repeatedly run a set of runtime tests. I
>>> am not sure whether this testing process is related to your issue, but
>>> the class was very helpful in finding the error that was causing my
>>> problem with introspection. You could add the feeds test to
>>> repeatedtestsuite.xml and try running it. The process might help you
>>> reproduce the error locally.
>>>
>>> https://github.com/ecarm002/incubator-asterixdb/tree/ecarm002/introspection_alternate
>>>
>>> edu.uci.ics.asterix.test.runtime.RepteatedTest
>>>
>>> On Mon, Jul 6, 2015 at 8:25 PM, Ian Maxon <[email protected]> wrote:
>>>
>>>> Raman and I worked today on getting to the root of what is causing the
>>>> build instability. The investigation is still ongoing, but so far we've
>>>> discovered the following things:
>>>>
>>>> - The OOM error is specifically the machine running out of threads to
>>>> create, which is odd. We aren't creating more than 500 threads per JVM
>>>> during testing, so this is especially puzzling. Neither the heap size
>>>> nor the permgen size is the issue.
>>>>
>>>> - The OOM error can be observed at the point where only feeds was
>>>> merged (and not YARN or the Managix scripting fix).
>>>>
>>>> - Neither of us can reproduce this locally on our development machines.
>>>> It seems that the environment (hitting the thread limit on the machine)
>>>> is somehow a variable in this issue.
>>>>
>>>> - Where, or whether, the tests run out of threads is not deterministic.
>>>> It tends to fail around the feeds portion of the execution tests, but
>>>> this is only a loose pattern. They can all pass, or the OOM can be hit
>>>> during the integration tests, or during other totally unrelated
>>>> execution tests.
>>>>
>>>> - There are a few feeds tests which sometimes fail (namely issue_711
>>>> and feeds_10), but this is totally unrelated to the larger issue of
>>>> running out of threads on the build machine.
>>>>
>>>> Given all the above, it looks like there is at least a degree of
>>>> configuration/environmental influence on this issue.
>>>>
>>>> - Ian
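To tell whether those ~500 threads are genuinely needed or are slowly leaking, one option is to sample the JVM's thread counts while the tests run. Below is a minimal sketch using plain java.lang.management; the class name and sampling interval are illustrative, not something that exists in the AsterixDB test harness.

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadMXBean;

    // Illustrative helper (not part of the AsterixDB test harness): samples
    // the JVM's live and peak thread counts, so a slow leak shows up as a
    // steadily rising number rather than a one-off spike around the feeds
    // tests.
    public class ThreadCountLogger {

        public static void logThreadCounts(String label) {
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            System.out.printf("[%s] live=%d peak=%d daemon=%d%n", label,
                    threads.getThreadCount(), threads.getPeakThreadCount(),
                    threads.getDaemonThreadCount());
        }

        public static void main(String[] args) throws InterruptedException {
            // ThreadMXBean only sees the current JVM, so in practice this
            // would be called from inside the test JVM (e.g. a setUp/tearDown
            // hook); the loop here just shows periodic sampling.
            while (true) {
                logThreadCounts("sample");
                Thread.sleep(5000);
            }
        }
    }

A thread dump (e.g. jstack on the test JVM) taken once the live count stops climbing would then show which pools or components own the extra threads.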
>>>>
>>>> On Mon, Jul 6, 2015 at 2:14 PM, Raman Grover <[email protected]> wrote:
>>>>> Hi,
>>>>>
>>>>> a) The two big commits to master (YARN integration and feeds) happened
>>>>> as atomic units, which makes it easy to reset master to the version
>>>>> prior to each feature and verify whether the build began showing OOM
>>>>> after each of the suspected commits. We have a pretty deterministic
>>>>> way of nailing down the commit that introduced the problem. I would
>>>>> suggest that, instead of disabling the feeds tests, we revert to the
>>>>> earlier commit and confirm whether the feeds commit did introduce the
>>>>> behavior, and then repeat the test with the YARN commit that followed.
>>>>> We should be able to see a sudden increase/drop in build stability by
>>>>> running a sufficient number of iterations.
>>>>>
>>>>> b) I have not been able to reproduce the OOM on my setup, where I have
>>>>> been running the build repeatedly.
>>>>> @Ian, are you able to reproduce it on your system? Maybe I am not
>>>>> running the build a sufficient number of times?
>>>>> I still do not understand how removing the test cases can leave the
>>>>> OOM in place. I can go back and look at the precise changes made
>>>>> during the feeds commit that could introduce an OOM even when feeds
>>>>> are not involved at all, but as I see it, the changes made do not play
>>>>> a role if feeds are not being ingested.
>>>>>
>>>>> Regards,
>>>>> Raman
>>>>>
>>>>> On Thu, Jul 2, 2015 at 6:42 PM, Ian Maxon <[email protected]> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> We are close to having a release ready, but there are a few things
>>>>>> left on the checklist before we can cut the first Apache release. I
>>>>>> think most things on this list are underway, but I'll put them here
>>>>>> for reference/visibility. Comments and thoughts are welcome.
>>>>>>
>>>>>> - Build stability after merging YARN and Feeds seems to have
>>>>>> seriously declined. Honestly, it's hard to get a build to go through
>>>>>> to the end without going OOM at all now, so this is a Problem. I
>>>>>> think it may be related to Feeds, but even after disabling the tests
>>>>>> (https://asterix-gerrit.ics.uci.edu/#/c/312/), I still see it.
>>>>>> Therefore I am not precisely sure what is going on, but it only
>>>>>> started to happen after we merged those two features. It's not
>>>>>> exactly obvious to me where the memory leak is coming from. @Raman,
>>>>>> it would be great to get your advice/thoughts on this.
>>>>>>
>>>>>> - Metadata name changes and metadata caching consistency fixes are
>>>>>> underway by Ildar.
>>>>>>
>>>>>> - The repackaging and license checker patches still need to be merged
>>>>>> in, but this should happen after the above two features are merged.
>>>>>> They are otherwise ready for review, though.
>>>>>>
>>>>>> - Now that Feeds is merged, the Apache website should be changed to
>>>>>> the new version that has been in draft form for a few weeks now.
>>>>>> Before, it may have been a little premature, but now it should be
>>>>>> accurate. The documentation site should also be reverted to its prior
>>>>>> state, before it was quickly patched to serve as an interim website.
>>>>>>
>>>>>> If there's anything else I am missing that should be on this list,
>>>>>> please feel free to add it to this thread.
>>>>>>
>>>>>> Thanks,
>>>>>> -Ian
>>>>>
>>>>> --
>>>>> Raman
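For reference on the fixed-size pool question at the top: a minimal sketch of what thread reuse looks like with a plain java.util.concurrent executor. The pool size and task below are illustrative only, not the executors AsterixDB actually uses.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Illustrative only: a bounded pool reuses its worker threads for every
    // task, so the number of native threads stays flat no matter how many
    // tasks are submitted, whereas a thread-per-task scheme is what runs
    // into the per-user process limit (ulimit -u) on the build machine.
    public class FixedPoolExample {
        public static void main(String[] args) throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(8); // 8 workers, reused

            for (int i = 0; i < 1000; i++) {
                final int taskId = i;
                pool.submit(() -> {
                    // Stand-in for real work, e.g. one step of feed ingestion.
                    System.out.println("task " + taskId + " ran on "
                            + Thread.currentThread().getName());
                });
            }

            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        }
    }

Whether the feeds runtime can actually share a bounded pool like this, rather than dedicating a thread per feed or adapter, is a separate design question; the sketch only shows the kind of reuse the question refers to.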
Best regards, Ildar
