[ https://issues.apache.org/jira/browse/ATLAS-616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hemanth Yamijala updated ATLAS-616:
-----------------------------------
Attachment: heap.png

An update: as described above, all indications pointed to the soft references holding the GremlinGroovy script bindings as the cause of the problem. From what I could see in the code, there are no knobs to adjust or tune this behaviour in the version of the library we are using.

As a next step, I looked into whether GC settings could be tuned to accomplish this, and ran across this link: http://stackoverflow.com/a/604395, which points to the GC option {{-XX:SoftRefLRUPolicyMSPerMB=<value>}}. Likewise, the Sun JDK documentation (http://docs.oracle.com/javase/7/docs/technotes/tools/solaris/java.html) says:

bq. -XX:SoftRefLRUPolicyMSPerMB=0 This flag enables aggressive processing of software references. Use this flag if the software reference count has an impact on the Java HotSpot VM garbage collector.

Given the above hints, I ran a test with this setting, first at 0 and then at 100. In both cases, GC performance improved dramatically, and I was able to increase the number of tests while keeping performance linear. [~ssainath] helped me run these tests in a server environment (still with JDK 7) with similar results. The attached graph is from a server environment running a total of 3600 queries. We even tested up to 7200 queries. Each run scaled linearly with time, and the logs showed no concurrency issues. The GC patterns are stable, as can be seen above.

We are going to test on OpenJDK 8 as well to see what the impact is, and if things go fine, I can put up a patch that documents the settings to enable on the server for such loads.
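For context on what the flag changes: a softly reachable object is kept alive for roughly (free heap in MB) * SoftRefLRUPolicyMSPerMB milliseconds after its last access, so a large heap with the default policy retains soft references for a long time. Below is a minimal sketch of soft-reference behaviour (the class and method names are illustrative, not Atlas or TinkerPop code):

```java
import java.lang.ref.SoftReference;

public class SoftRefDemo {
    // Returns true while the softly referenced payload has not been cleared.
    // Freshly created and with no GC pressure, the referent is still present;
    // the JVM only clears soft references during a GC, and is guaranteed to
    // clear them all before throwing OutOfMemoryError.
    static boolean payloadAlive() {
        SoftReference<byte[]> ref = new SoftReference<>(new byte[1024]);
        return ref.get() != null;
    }

    public static void main(String[] args) {
        System.out.println("payload alive: " + payloadAlive());
        // With -XX:SoftRefLRUPolicyMSPerMB=0, the collector clears softly
        // reachable objects at the next GC cycle instead of retaining each
        // for (free-heap-MB * policy-ms) since its last access, which is
        // why the heap stabilises instead of filling with cached bindings.
    }
}
```

This is why lowering the policy value keeps the heap from accumulating script-binding caches between collections on a 10 GB heap.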
For reference, the GC settings I use are:

{code}
export ATLAS_OPTS="-server -XX:SoftRefLRUPolicyMSPerMB=0 \
  -XX:MaxNewSize=3072m -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC \
  -XX:+CMSParallelRemarkEnabled -XX:MaxPermSize=512m -XX:PermSize=100M \
  -Djava.net.preferIPv4Stack=true -Xmx10240m -Xms10240m \
  -XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError \
  -XX:HeapDumpPath=dumps/atlas_server.hprof \
  -Xloggc:logs/gc-worker.log -verbose:gc -XX:+UseGCLogFileRotation \
  -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1m -XX:+PrintGCDetails \
  -XX:+PrintHeapAtGC -XX:+PrintGCTimeStamps \
  -Dlog4j.configuration=atlas-log4j.xml"
{code}

In addition to this effort, I also plan to write to the TinkerPop mailing list to see if they have any suggestions for tuning this, or for fixing it in code.

> Zookeeper throws exceptions when trying to fire DSL queries at Atlas at large
> scale.
> -------------------------------------------------------------------------------------
>
>                 Key: ATLAS-616
>                 URL: https://issues.apache.org/jira/browse/ATLAS-616
>             Project: Atlas
>          Issue Type: Bug
>         Environment: Atlas with external Kafka / HBase / Solr
>                      The test is run on a cluster setup.
>                      Machine 1 - Atlas, Solr
>                      Machine 2 - Kafka, HBase
>                      Machine 3 - Hive, client
>            Reporter: Sharmadha Sainath
>            Assignee: Hemanth Yamijala
>         Attachments: baseline-1000-3600-10g-heap.png, heap.png,
>                      no-dsl-1000-14400-10g-heap.png, zk-exception-stacktrace.rtf
>
>
> The test plan is to simulate 'n' users firing 'm' queries at Atlas
> simultaneously. This is accomplished with the help of Apache JMeter.
> Atlas is populated with 10,000 tables:
> • 6,000 small tables (10 columns)
> • 3,000 medium tables (50 columns)
> • 1,000 large tables (100 columns)
> The test plan consists of 30 users firing a set of 3 queries continuously,
> 20 times in a loop. Added -Xmx10240m -XX:MaxPermSize=512m to ATLAS_OPTS.
> Zookeeper throws exceptions when the test plan is run and JMeter starts
> firing queries.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)