Weird... That many threads seems wrong.

On 7/7/15 8:35 PM, Ian Maxon wrote:
I think I have at least a workaround to the thread starvation nailed
down. We'll have to see, but basically I think the latest few patches
cause us to use more threads for whatever reason, and this pushes us
over the default thread cap in many circumstances (not always). Going
ahead and setting the number of processes to unlimited within the
build server and containers seems to have put out the fire, so to
speak. Another confounding factor is that the Docker containers all
run within the same host, so they share a thread limit among
themselves in addition to being subject to the host's thread limit.
It's still not clear to me, however, whether we actually intend to use
that many threads (~500) or whether there's a subtle resource leak
somewhere.
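
For what it's worth, one way to check whether we really do sit at ~500
threads (or whether some pool keeps growing) would be to log thread counts
from inside the tests via ThreadMXBean. A minimal sketch, assuming we add a
small diagnostic helper ourselves (the class name is made up, it's not in
the codebase):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.Map;
import java.util.TreeMap;

public class ThreadCountLogger {
    // Print live/peak thread counts plus a per-name-prefix breakdown;
    // calling this before and after each test class shows which pools grow.
    public static void logThreads(String label) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        System.out.println(label + ": live=" + mx.getThreadCount()
                + " peak=" + mx.getPeakThreadCount());
        Map<String, Integer> byPrefix = new TreeMap<String, Integer>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            // Strip trailing digits so "pool-7-thread-12" and "pool-7-thread-13"
            // count toward the same pool.
            String prefix = t.getName().replaceAll("\\d+$", "");
            Integer count = byPrefix.get(prefix);
            byPrefix.put(prefix, count == null ? 1 : count + 1);
        }
        for (Map.Entry<String, Integer> e : byPrefix.entrySet()) {
            System.out.println("  " + e.getKey() + ": " + e.getValue());
        }
    }
}

If the peak count stays flat across iterations, ~500 is just what we use;
if it keeps climbing, something isn't shutting its threads down.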

- Ian

On Tue, Jul 7, 2015 at 5:44 PM, Eldon Carman <[email protected]> wrote:
In my branch ("ecarm002/introspection_alternate"), I have adapted some code
I received from Ildar to run a set of runtime tests repeatedly. I am not
sure whether this testing process is related to your issue or not, but I
found this class very helpful in finding the error that was causing my
introspection problem. You could add the feeds test to
repeatedtestsuite.xml and try running it. The process might help you
reproduce the error locally.

https://github.com/ecarm002/incubator-asterixdb/tree/ecarm002/introspection_alternate

edu.uci.ics.asterix.test.runtime.RepteatedTest
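
In case it helps, the idea is roughly the following (just a sketch with
made-up names, not the actual class above): run the same JUnit test class
in a loop until it fails, so an intermittent problem like the OOM shows up
much sooner.

import org.junit.runner.JUnitCore;
import org.junit.runner.Result;
import org.junit.runner.notification.Failure;

public class RepeatedRunner {
    // Run the given test class repeatedly until it fails or we hit the
    // iteration limit; intermittent failures become much easier to catch.
    public static void main(String[] args) throws Exception {
        Class<?> testClass = Class.forName(args[0]);
        int iterations = args.length > 1 ? Integer.parseInt(args[1]) : 100;
        for (int i = 1; i <= iterations; i++) {
            Result result = JUnitCore.runClasses(testClass);
            System.out.println("iteration " + i + ": " + result.getRunCount()
                    + " tests, " + result.getFailureCount() + " failures");
            if (!result.wasSuccessful()) {
                for (Failure f : result.getFailures()) {
                    System.out.println(f.getTrace());
                }
                break;
            }
        }
    }
}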




On Mon, Jul 6, 2015 at 8:25 PM, Ian Maxon <[email protected]> wrote:

Raman and I worked on getting to the root of what is causing the build
instability for a while today. The investigation is still ongoing but
so far we've discovered the following things:

- The OOM error is specifically about running out of threads to create on
the machine (the JVM failing to create new native threads), which is odd.
We aren't creating more than ~500 threads per JVM during testing, so this
is especially puzzling. Heap size and permgen size are not the issue.

- The OOM error can be observed at the point where only feeds had been
merged (and not yet YARN or the managix scripting fix).

- Neither of us can reproduce this locally on our development
machines. It seems that the environment is a variable in this issue
(hitting the thread limit on the machine), somehow.

- Where (or whether) the tests run out of threads is not deterministic. It
tends to fail around the feeds portion of the execution tests, but this is
only a loose pattern: all of the tests can pass, or the OOM can be hit
during the integration tests or during other, totally unrelated execution
tests.

- There are a few feeds tests that sometimes fail (namely issue_711
and feeds_10), but this is totally unrelated to the bigger issue of
running out of threads on the build machine.

Given all the above, it looks like there is at least a degree of
configuration/environmental influence on this issue.
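
To narrow down whether threads are leaking across tests (as opposed to the
environment just imposing a lower limit), one thing we could do is snapshot
the set of live threads before each test suite and diff it afterwards. A
rough sketch of such a check, assuming a hypothetical helper we would add
ourselves (nothing like this exists in the repo yet):

import java.util.HashSet;
import java.util.Set;

public class ThreadLeakCheck {
    private Set<String> before;

    // snapshot() before a suite, report() after it: any thread that is
    // still alive afterwards but was not there before is a leak candidate.
    public void snapshot() {
        before = liveThreadNames();
    }

    public void report(String suiteName) {
        Set<String> after = liveThreadNames();
        after.removeAll(before);
        if (!after.isEmpty()) {
            System.out.println(suiteName + " left " + after.size()
                    + " new threads behind: " + after);
        }
    }

    private static Set<String> liveThreadNames() {
        Set<String> names = new HashSet<String>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            names.add(t.getName() + "#" + t.getId());
        }
        return names;
    }
}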

- Ian



On Mon, Jul 6, 2015 at 2:14 PM, Raman Grover <[email protected]>
wrote:
Hi

a) The two big commits to master (YARN integration and feeds) landed as
atomic units, which makes it easy to reset master to the version prior to
each feature and verify whether the build began showing the OOM after each
of the suspected commits. That gives us a pretty deterministic way of
nailing down the commit that introduced the problem. Instead of disabling
the feeds tests, I would suggest we revert to the earlier commit, confirm
whether the feeds commit did introduce the behavior, and then repeat the
test with the YARN commit that followed. We should be able to see a sudden
increase/drop in build stability by running a sufficient number of
iterations.

b) I have not been able to reproduce the OOM on my setup, where I have
been running the build repeatedly.
@Ian, are you able to reproduce it on your system? Maybe I am not running
the build a sufficient number of times?
I am still not able to understand how removing test cases still causes
the OOM. I can go back and look at the precise changes made in the
feeds commit that could introduce an OOM even if feeds are not involved at
all, but as I see it, those changes should not play a role if feeds are
not being ingested.


Regards,
Raman


On Thu, Jul 2, 2015 at 6:42 PM, Ian Maxon <[email protected]> wrote:

Hi all,

We are close to having a release ready, but there are a few things left
on the checklist before we can cut the first Apache release. I think
most things on this list are underway, but I'll put them here just for
reference/visibility. Comments and thoughts are welcome.

- Build stability after merging YARN and Feeds seems to have seriously
declined. Honestly, it's hard to get a build to go through to the end
without going OOM now, so this is a problem. I think it may be
related to Feeds, but even after disabling those tests
(https://asterix-gerrit.ics.uci.edu/#/c/312/), I still see it.
So I am not precisely sure what is going on, but it only
started happening after we merged those two features. It's not exactly
obvious to me where the memory leak is coming from. @Raman, it would
be great to get your advice/thoughts on this.

- Metadata name changes and Metadata caching consistency fixes are
underway by Ildar.

- The repackaging and license checker patches still need to be merged
in, but this should happen after the above two features are merged.
They are otherwise ready for review though.

- Now that Feeds is merged, the Apache website should be changed to
the new version that has been in draft form for a few weeks now.
Before, that may have been a little premature, but now it should be
accurate. The documentation site should also be reverted to its prior
state, from before it was quickly patched to serve as an interim website.


If there's anything else I am missing that should be in this list,
please feel free to add it into this thread.

Thanks,
-Ian



--
Raman
