[ https://issues.apache.org/jira/browse/HDFS-8855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14654107#comment-14654107 ]
Bob Hansen commented on HDFS-8855:
----------------------------------

Reproducer script:
{code}
#!/bin/bash

# Check that the hadoop command is available
hadoop fs -help > /dev/null 2> /dev/null
if [ $? != 0 ]; then
  echo "The hadoop command must be in your path"
  exit 1
fi

# op=OPEN, offset, and length are added to url_base
file_size=${file_size:-$[ 1024 * 1024 * 1024 ]}
count=${count:-1000000}
reads_per_pass=${reads_per_pass:-1000}
webhdfs_namenode=${webhdfs_namenode:-"localhost:50070"}
read_size=${read_size:-64000}
concurrent_reads=${concurrent_reads:-50}

url_base="http://"$webhdfs_namenode"/webhdfs/v1/tmp/bigfile_$$"
passes=$[ $count / $reads_per_pass ]
url_list_file=/tmp/file_list_$$.txt
namenode=${namenode:-`echo $url_base | grep -Po "(?<=http://)[^:/]*"`}

echo "Environment settings:"
echo "  file_size=$file_size"
echo "  count=$count"
echo "  reads_per_pass=$reads_per_pass"
echo "  webhdfs_namenode=$webhdfs_namenode"
echo "  read_size=$read_size"
echo "  concurrent_reads=$concurrent_reads"
echo "Outputs in /tmp/curl_[out|err]_$$"
echo "Computed values:"
echo "  url_base=$url_base"
echo "  passes=$passes"
echo "  url_list_file=$url_list_file"
echo "  namenode=$namenode"
echo

echo "Copying temp data..."
blocks_to_copy=$[ ( $file_size + 1023 ) / 1024 ]
dd count=$blocks_to_copy bs=1024 if=/dev/zero | tr "\0" "+" | hadoop fs -copyFromLocal - /tmp/bigfile_$$

echo "Generating URL list..."
# Generate the load profile
rm -f $url_list_file
for j in `seq 1 $reads_per_pass`; do
  rand=$(od -N 4 -t uL -An /dev/urandom | tr -d " ")
  offset=$[ ( $rand % (file_size / read_size) * read_size ) ]
  url=$url_base?op=OPEN\&user.name=$USER\&offset=$offset\&length=$read_size
  echo url = \"$url\" >> $url_list_file
done

# Open $concurrent_reads files and do $reads_per_pass random reads of $read_size
for i in `seq 1 $passes`; do
  # Kick off concurrent random reads
  for k in `seq 1 $concurrent_reads`; do
    curl -v -L -K $url_list_file > /tmp/curl_out_$$-$k.txt 2>/tmp/curl_err_$$-$k.txt &
  done

  # Wait for all curl jobs to finish
  while [ `jobs | grep "Running.*curl" | wc -l` != 0 ]; do
    sleep 1s
    # Every second, count the connections on the webhdfs_namenode
    ssh $namenode "file=/tmp/netstat.out_\$\$ ; netstat -an > \$file ; echo -n 'ESTABLISHED: '; echo -n \`grep -c ESTABLISHED \$file\` ; echo -n ' TIME_WAIT: '; echo -n \`grep -c TIME_WAIT \$file\` ; echo -n ' CLOSE_WAIT: '; grep -c CLOSE_WAIT \$file; rm \$file" &
    echo `grep "HTTP/1.1 [^23]" /tmp/curl_err_$$-* | wc -l` errors, "`grep "HTTP/1.1 200" /tmp/curl_err_$$-* | wc -l`" successes
  done

  # Display the completion time
  echo -n "Pass $i "; date +%H:%M:%S.%N
  echo Total: `grep "HTTP/1.1 [^23]" /tmp/curl_err_$$-* | wc -l` errors, "`grep "HTTP/1.1 200" /tmp/curl_err_$$-* | wc -l`" successes
  # sleep $delay
done
{code}

> Webhdfs client leaks active NameNode connections
> ------------------------------------------------
>
>                 Key: HDFS-8855
>                 URL: https://issues.apache.org/jira/browse/HDFS-8855
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: webhdfs
>        Environment: HDP 2.2
>            Reporter: Bob Hansen
>
> The attached script simulates a process opening ~50 files via webhdfs and
> performing random reads. Note that there are at most 50 concurrent reads,
> and all webhdfs sessions are kept open. Each read is ~64k at a random
> position.
> The script periodically (once per second) shells into the NameNode and
> produces a summary of the socket states. On my test cluster with 5 nodes,
> it took ~30 seconds for the NameNode to accumulate ~25000 active
> connections and fail.
>
> It appears that each request made through the webhdfs client opens a new
> connection to the NameNode and keeps it open after the request is complete.
> If the process continues to run, eventually (~30-60 seconds) all of the
> open connections are closed and the NameNode recovers.
>
> This smells like SoftReference reaping. Are we using SoftReferences in the
> webhdfs client to cache NameNode connections but never re-using them?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
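For anyone inspecting the load profile in isolation, the heart of the reproducer is the aligned random-offset arithmetic that keeps every ~64k read inside the test file. A minimal standalone sketch of that step follows; the file size, read size, hostname, and path below are illustrative placeholders, not values taken from a real cluster:

```shell
#!/bin/bash
# Illustrative constants (the reproducer above takes these from
# environment variables instead).
file_size=$(( 1024 * 1024 * 1024 ))   # 1 GiB test file
read_size=64000                       # ~64k per read, as in the reproducer

# Draw 4 random bytes and map them to a read_size-aligned offset
# strictly inside the file, mirroring the reproducer's arithmetic.
rand=$(od -N 4 -t u4 -An /dev/urandom | tr -d " ")
offset=$(( rand % (file_size / read_size) * read_size ))

# One webhdfs random-read URL of the kind the script batches into its
# curl config file. Hostname and file path are placeholders.
echo "http://localhost:50070/webhdfs/v1/tmp/bigfile?op=OPEN&user.name=$USER&offset=${offset}&length=${read_size}"
```

Because the random value is reduced modulo `file_size / read_size` before being scaled back up, every offset is a multiple of `read_size` and strictly less than `file_size`, so no read runs past the end of the file.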