[ https://issues.apache.org/jira/browse/HDFS-13639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108453#comment-17108453 ]
Wei-Chiu Chuang commented on HDFS-13639:
----------------------------------------

It would be really great if you could explain the charts: what are the x axis and the y axis?

> SlotReleaser is not fast enough
> -------------------------------
>
> Key: HDFS-13639
> URL: https://issues.apache.org/jira/browse/HDFS-13639
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs-client
> Affects Versions: 2.4.0, 2.6.0, 3.0.2
> Environment:
> 1. YCSB:
> recordcount=2000000000
> fieldcount=1
> fieldlength=1000
> operationcount=10000000
>
> workload=com.yahoo.ycsb.workloads.CoreWorkload
>
> table=ycsb-test
> columnfamily=C
> readproportion=1
> updateproportion=0
> insertproportion=0
> scanproportion=0
>
> maxscanlength=0
> requestdistribution=zipfian
>
> # default
> readallfields=true
> writeallfields=true
> scanlengthdistribution=constan
>
> 2. datanode:
> -Xmx2048m -Xms2048m -Xmn1024m -XX:MaxDirectMemorySize=1024m
> -XX:MaxPermSize=256m -Xloggc:$run_dir/stdout/datanode_gc_${start_time}.log
> -XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError
> -XX:HeapDumpPath=$log_dir -XX:+PrintGCApplicationStoppedTime
> -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=80
> -XX:+UseCMSInitiatingOccupancyOnly -XX:+CMSParallelRemarkEnabled
> -XX:+CMSClassUnloadingEnabled -XX:CMSMaxAbortablePrecleanTime=10000
> -XX:+CMSScavengeBeforeRemark -XX:+PrintPromotionFailure
> -XX:+CMSConcurrentMTEnabled -XX:+ExplicitGCInvokesConcurrent
> -XX:+SafepointTimeout -XX:MonitorBound=16384 -XX:-UseBiasedLocking
> -verbose:gc -XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCDateStamps
>
> 3. regionserver:
> -Xmx10g -Xms10g -XX:MaxDirectMemorySize=10g
> -XX:MaxGCPauseMillis=150 -XX:MaxTenuringThreshold=2
> -XX:+UnlockExperimentalVMOptions -XX:G1NewSizePercent=5
> -Xloggc:$run_dir/stdout/regionserver_gc_${start_time}.log -Xss256k
> -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=$log_dir -verbose:gc
> -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCApplicationStoppedTime
> -XX:+PrintHeapAtGC -XX:+PrintGCDateStamps -XX:+PrintAdaptiveSizePolicy
> -XX:+PrintTenuringDistribution -XX:+PrintSafepointStatistics
> -XX:PrintSafepointStatisticsCount=1 -XX:PrintFLSStatistics=1
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=100 -XX:GCLogFileSize=128m
> -XX:+SafepointTimeout -XX:MonitorBound=16384 -XX:-UseBiasedLocking
> -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=65
> -XX:+ParallelRefProcEnabled -XX:ConcGCThreads=4 -XX:ParallelGCThreads=16
> -XX:G1HeapRegionSize=32m -XX:G1MixedGCCountTarget=64
> -XX:G1OldCSetRegionThresholdPercent=5
>
> block cache is disabled:
> <property>
> <name>hbase.bucketcache.size</name>
> <value>0.9</value>
> </property>
>
> Reporter: Gang Xie
> Assignee: Lisheng Sun
> Priority: Major
> Attachments: HDFS-13639-2.4.diff, HDFS-13639.001.patch,
> HDFS-13639.002.patch, ShortCircuitCache_new_slotReleaser.diff,
> perf_after_improve_SlotReleaser.png, perf_before_improve_SlotReleaser.png
>
>
> When testing the performance of HDFS short-circuit reads with YCSB, we
> found that the SlotReleaser of the ShortCircuitCache has a performance issue.
> Slot releasing only reaches 1000+ qps, while slot allocation runs at ~3000
> qps. This means that the replica info on the datanode cannot be released in
> time, which causes a lot of GCs and eventually full GCs.
>
> The flame graph shows that SlotReleaser spends a lot of time connecting to
> the domain socket and throwing/catching exceptions when it closes the domain
> socket and its streams.
> It makes no sense to connect and close every time. Each time we connect to
> the domain socket, the datanode allocates a new thread to free the slot.
> That involves a lot of initialization work, and it is costly. We need to
> reuse the domain socket.
>
> After switching to reusing the domain socket (see the attached diff), we get
> a great improvement (see the perf attachments):
> # Without reusing the domain socket, the YCSB get qps gets worse and worse,
> and after about 45 minutes full GC starts. When we reuse the domain socket,
> no full GC occurs, the stress test finishes smoothly, and the qps of
> allocating and releasing match.
> # Because of datanode young GC, the YCSB get qps without the improvement is
> even lower than with it: ~3700 vs. ~4200.
> The diff is against 2.4, and I think this issue exists up to the latest
> version. I don't have a test environment with 2.7 or a higher version.
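
The connection-reuse idea described in the issue can be illustrated with a minimal Java sketch. This is not the attached patch and does not use the HDFS classes: `ReusableSlotReleaser`, `Connection`, and `Connector` are hypothetical names introduced here, and `Connection` merely stands in for an open domain socket to the datanode. The sketch assumes that one cached connection is kept across release calls and is dropped only after an I/O error, so the expensive connect step runs once instead of once per slot.

```java
import java.io.Closeable;
import java.io.IOException;

public class ReusableSlotReleaser {
    /** Hypothetical stand-in for an open domain-socket connection to the datanode. */
    public interface Connection extends Closeable {
        void releaseSlot(long slotId) throws IOException;
    }

    /** Hypothetical factory that performs the expensive connect step. */
    public interface Connector {
        Connection connect() throws IOException;
    }

    private final Connector connector;
    private Connection cached; // guarded by "this"

    public ReusableSlotReleaser(Connector connector) {
        this.connector = connector;
    }

    /** Releases one slot, connecting only if no usable connection is cached. */
    public synchronized void release(long slotId) throws IOException {
        if (cached == null) {
            cached = connector.connect();
        }
        try {
            cached.releaseSlot(slotId);
        } catch (IOException e) {
            // The connection may be broken: drop it so the next call reconnects.
            closeQuietly();
            throw e;
        }
    }

    /** Closes the cached connection, if any. */
    public synchronized void shutdown() {
        closeQuietly();
    }

    private void closeQuietly() {
        if (cached != null) {
            try {
                cached.close();
            } catch (IOException ignored) {
                // Nothing useful can be done if close fails.
            }
            cached = null;
        }
    }

    public static void main(String[] args) throws IOException {
        int[] connects = {0};
        // Fake connector that only counts how often a connection is opened.
        ReusableSlotReleaser releaser = new ReusableSlotReleaser(() -> {
            connects[0]++;
            return new Connection() {
                @Override public void releaseSlot(long slotId) { }
                @Override public void close() { }
            };
        });
        for (long id = 0; id < 1000; id++) {
            releaser.release(id);
        }
        releaser.shutdown();
        System.out.println("connects = " + connects[0]); // prints "connects = 1"
    }
}
```

With a connect-per-release design the counter in `main` would reach 1000. Here it stays at 1, which is the saving the issue attributes to reusing the domain socket.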