My apologies for sending this message to this group, but I'm having trouble 
sending to the right group.


________________________________
From: Steven Wong
Sent: Wednesday, January 23, 2013 11:15 AM
To: impala-u...@cloudera.org
Subject: RE: Need help with cluster setup for performance

Thanks for the suggestions. The /metrics output looks good now, and the SELECT 
COUNT(*) runs much faster than before.

But I still have the "Unknown disk id" error message. My CDH version is:

 hadoop-client        x86_64 2.0.0+552-1.cdh4.1.2.p0.27.el5 cloudera-cdh4  18 k
 hadoop-mapreduce     x86_64 2.0.0+552-1.cdh4.1.2.p0.27.el5 cloudera-cdh4 9.8 M
 hadoop-yarn          x86_64 2.0.0+552-1.cdh4.1.2.p0.27.el5 cloudera-cdh4 8.9 M



On Tuesday, January 22, 2013 5:37:30 PM UTC-8, Henry wrote:
On 22 January 2013 11:40, Steven Wong <sw...@netflix.com> wrote:
Hi,

I followed http://zenfractal.com/2012/11/15/from-zero-to-impala-in-minutes/ to 
set up a cluster on EC2. After seeing disappointing performance numbers from a 
SELECT COUNT(*), I am following 
https://ccp.cloudera.com/display/IMPALA10BETADOC/Configuring+Impala+for+Performance#ConfiguringImpalaforPerformance-TestingImpalaforHighPerformanceConfiguration
 to check my cluster setup. Questions:

1. My cluster has 3 data nodes. Is the following 
http://<hostname>:<port>/metrics output good?

statestore.backend.state.map:
{
  127.0.0.1:23000 : OK
}
statestore.live.backends:3
statestore.live.backends.list:[127.0.0.1:22000]


Hi Steven -

This looks like your problem. Your machines are registering themselves with 
'localhost' as their hostname, and this means that they all look the same to 
the statestore.

I looked at Matt's zero-to-impala link - it's awesome, but now a little out of 
date. You should modify the command that starts impalad so that --ipaddress and 
--hostname are set correctly for each node. Then check the statestore metrics 
again; each node should now show up as a separate backend, and your performance 
should improve.
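
If it helps, the metrics check can be scripted. This is only a rough sketch, not 
anything that ships with Impala: it assumes the /metrics text looks like the 
output quoted above and that multiple entries in statestore.live.backends.list 
are comma-separated. The function name check_statestore_metrics is made up for 
illustration.

```python
import re

# Address prefixes that indicate a backend registered with a loopback name.
LOOPBACK_PREFIXES = ("127.", "localhost")


def check_statestore_metrics(text):
    """Return a list of problems found in statestore /metrics text.

    Assumes the two metric lines look like the ones quoted in this thread.
    """
    count_match = re.search(r"statestore\.live\.backends:(\d+)", text)
    list_match = re.search(r"statestore\.live\.backends\.list:\[(.*?)\]", text)
    if not count_match or not list_match:
        return ["could not find the statestore.live.backends metrics"]

    count = int(count_match.group(1))
    # Assumption: entries are comma-separated host:port strings.
    backends = [b.strip() for b in list_match.group(1).split(",") if b.strip()]

    problems = []
    if len(set(backends)) != count:
        problems.append(
            "%d live backends reported but %d distinct addresses listed"
            % (count, len(set(backends)))
        )
    for backend in backends:
        if backend.startswith(LOOPBACK_PREFIXES):
            problems.append("backend registered with loopback address: " + backend)
    return problems


if __name__ == "__main__":
    sample = (
        "statestore.live.backends:3\n"
        "statestore.live.backends.list:[127.0.0.1:22000]\n"
    )
    for problem in check_statestore_metrics(sample):
        print(problem)
```

On the output from your first mail this reports both a count mismatch and a 
loopback address; an empty list means neither symptom was found.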


2. My impalad logs contain "Unknown disk id.  This will negatively affect 
performance.  Check your hdfs settings to enable block location metadata." and 
my http://<hostname>:<port>/varz doesn't contain the string 
"dfs.datanode.hdfs-blocks-metadata.enabled". But my hdfs-site.xml sets 
dfs.datanode.hdfs-blocks-metadata.enabled to true. Why?

What version of CDH are you using?
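
In the meantime, one way to double-check what the hdfs-site.xml on a given node 
actually says is sketched below. It assumes the usual Hadoop 
configuration/property/name/value layout; the function name and the sample XML 
are illustrative only. Note that it reads the file, not what the running 
daemons have loaded.

```python
import xml.etree.ElementTree as ET


def get_hadoop_property(xml_text, name):
    """Return the value of a named property in Hadoop-style XML, or None."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == name:
            return prop.findtext("value", "").strip()
    return None


if __name__ == "__main__":
    # Illustrative sample; in practice, read the node's hdfs-site.xml instead.
    sample = """
    <configuration>
      <property>
        <name>dfs.datanode.hdfs-blocks-metadata.enabled</name>
        <value>true</value>
      </property>
    </configuration>
    """
    print(get_hadoop_property(sample, "dfs.datanode.hdfs-blocks-metadata.enabled"))
```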


3. My impalad.out doesn't contain "Unable to load native-hadoop library". This 
is good, I believe.

4. My impalad logs contain the following lines matching the word "scheduler", 
but none contains "locality percentage". Why?


The locality percentage is printed only when GLOG_v=1 is set, and I note that 
the setup-impala.sh script has a typo: it sets GVLOG_v=1 instead. If you fix 
this, you should see the locality percentage.
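
To confirm the fix took effect, something like the following sketch could be 
used. It assumes the verbosity reaches impalad through the GLOG_v environment 
variable and that the relevant log lines contain the phrase "locality 
percentage"; both helper names are hypothetical.

```python
import os


def glog_verbosity(env):
    """Return the verbosity requested via GLOG_v in env (0 if unset or invalid)."""
    try:
        return int(env.get("GLOG_v", "0"))
    except ValueError:
        return 0


def locality_lines(log_lines):
    """Return the log lines that mention the locality percentage."""
    return [line.rstrip("\n") for line in log_lines if "locality percentage" in line]


if __name__ == "__main__":
    # A misspelled variable such as GVLOG_v is ignored, so this prints 0
    # unless GLOG_v itself is set in the current environment.
    print(glog_verbosity(os.environ))
```

With the typo in place, glog_verbosity({"GVLOG_v": "1"}) returns 0, which 
matches the missing log lines.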

Hope this helps - let us know if things improve.

Henry


/tmp/impalad.INFO:I0122 00:19:09.137197  5121 simple-scheduler.cc:82] Starting simple scheduler
/tmp/impalad.ip-10-170-17-154.impala.log.INFO.20130122-001901.5121:I0122 00:19:09.137197  5121 simple-scheduler.cc:82] Starting simple scheduler

Thanks.
Steven


--
Henry Robinson
Software Engineer
Cloudera
415-994-6679
