[ https://issues.apache.org/jira/browse/ATLAS-616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hemanth Yamijala updated ATLAS-616:
-----------------------------------
Attachment: heap.png

An update: as described above, all indications pointed to the soft references holding the GremlinGroovy script bindings as the cause of the problem. From what I could see in the code, there are no knobs to adjust or tune this behaviour in the version of the library we are using.

As a next step, I looked into whether GC settings could be tuned to accomplish this, and ran across this link: http://stackoverflow.com/a/604395, which points to the GC option {{-XX:SoftRefLRUPolicyMSPerMB=<value>}}. Likewise, the Sun JDK documentation (http://docs.oracle.com/javase/7/docs/technotes/tools/solaris/java.html) says:

bq. -XX:SoftRefLRUPolicyMSPerMB=0 This flag enables aggressive processing of software references. Use this flag if the software reference count has an impact on the Java HotSpot VM garbage collector.

Given the above hints, I ran a test with this setting, first at 0 and then at 100. In both cases, GC performance improved dramatically, and I was able to increase the number of tests while keeping performance linear. [~ssainath] helped me run these tests in a server environment (still with JDK 7) with similar results. The attached graph is from a server environment running a total of 3600 queries. We even tested up to 7200 queries. Each run scaled linearly with time, and the logs showed no concurrency issues. The GC patterns are stable, as can be seen above.

We are going to test on OpenJDK 8 as well to see what the impact is, and if things go fine, I can put up a patch that documents the settings to enable on the server for such loads.
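For context on what the flag changes: a softly reachable object is kept alive for roughly (free heap in MB) * SoftRefLRUPolicyMSPerMB milliseconds after its last access, so a large heap with the default policy retains soft references for a long time. Below is a minimal sketch of soft-reference behaviour (the class and method names are illustrative, not Atlas or TinkerPop code):

```java
import java.lang.ref.SoftReference;

public class SoftRefDemo {
    // Returns true while the softly referenced payload has not been cleared.
    // Freshly created and with no GC pressure, the referent is still present;
    // the JVM only clears soft references during a GC, and is guaranteed to
    // clear them all before throwing OutOfMemoryError.
    static boolean payloadAlive() {
        SoftReference<byte[]> ref = new SoftReference<>(new byte[1024]);
        return ref.get() != null;
    }

    public static void main(String[] args) {
        System.out.println("payload alive: " + payloadAlive());
        // With -XX:SoftRefLRUPolicyMSPerMB=0, the collector clears softly
        // reachable objects at the next GC cycle instead of retaining each
        // for (free-heap-MB * policy-ms) since its last access, which is
        // why the heap stabilises instead of filling with cached bindings.
    }
}
```

This is why lowering the policy value keeps the heap from accumulating script-binding caches between collections on a 10 GB heap.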
For reference, the GC settings I use are:

{code}
export ATLAS_OPTS="-server -XX:SoftRefLRUPolicyMSPerMB=0 \
  -XX:MaxNewSize=3072m -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC \
  -XX:+CMSParallelRemarkEnabled -XX:MaxPermSize=512m -XX:PermSize=100M \
  -Djava.net.preferIPv4Stack=true -Xmx10240m -Xms10240m \
  -XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError \
  -XX:HeapDumpPath=dumps/atlas_server.hprof \
  -Xloggc:logs/gc-worker.log -verbose:gc -XX:+UseGCLogFileRotation \
  -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1m -XX:+PrintGCDetails \
  -XX:+PrintHeapAtGC -XX:+PrintGCTimeStamps \
  -Dlog4j.configuration=atlas-log4j.xml"
{code}

In addition to this effort, I also plan to write to the TinkerPop mailing list to see if they have any suggestions for tuning this, or for fixing it in code.

> Zookeeper throws exceptions when trying to fire DSL queries at Atlas at large
> scale.
> -------------------------------------------------------------------------------------
>
>                 Key: ATLAS-616
>                 URL: https://issues.apache.org/jira/browse/ATLAS-616
>             Project: Atlas
>          Issue Type: Bug
>         Environment: Atlas with external Kafka / HBase / Solr
>                      The test is run on a cluster setup.
>                      Machine 1 - Atlas, Solr
>                      Machine 2 - Kafka, HBase
>                      Machine 3 - Hive, client
>            Reporter: Sharmadha Sainath
>            Assignee: Hemanth Yamijala
>         Attachments: baseline-1000-3600-10g-heap.png, heap.png,
>                      no-dsl-1000-14400-10g-heap.png, zk-exception-stacktrace.rtf
>
>
> The test plan is to simulate 'n' users firing 'm' queries at Atlas
> simultaneously. This is accomplished with the help of Apache JMeter.
> Atlas is populated with 10,000 tables:
> • 6,000 small tables (10 columns)
> • 3,000 medium tables (50 columns)
> • 1,000 large tables (100 columns)
> The test plan consists of 30 users firing a set of 3 queries continuously,
> 20 times in a loop. Added -Xmx10240m -XX:MaxPermSize=512m to ATLAS_OPTS.
> Zookeeper throws exceptions when the test plan is run and JMeter starts
> firing queries.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)