Re: Hbase import Tsv performance (slow import)

lars hofhansl Wed, 24 Oct 2012 21:11:02 -0700

This is good advice, Kevin; we should add this to the HBase Reference Guide.



________________________________
 From: Kevin O'dell <kevin.od...@cloudera.com>
To: user@hbase.apache.org 
Sent: Tuesday, October 23, 2012 10:47 AM
Subject: Re: Hbase import Tsv performance (slow import)
 
You will want to make sure your table is pre-split.  Also, the import
does puts, so you will want to make sure you are not flushing and
blocking, by raising your memstore size, HLog size, and storefile
blocking count.  This can greatly improve your write speeds.  I usually
use a 256MB memstore (you can lower it later if it is not a write-heavy
table), a 512MB HLog (same thing, you can lower it back to the default),
and then raise the storefile blocking count to about 100.
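
For reference, here is a rough sketch of how those three values might
look in hbase-site.xml. The message does not name the properties, so the
mapping is an assumption: in particular, "512MB Hlog" is read here as the
HLog block size, which may not be the exact setting intended. Check the
names against the defaults shipped with your HBase version before using
them.

```xml
<configuration>
  <!-- Sketch only: the property mapping is an assumption, not taken from the thread. -->
  <property>
    <!-- 256MB memstore flush size (256 * 1024 * 1024 bytes) -->
    <name>hbase.hregion.memstore.flush.size</name>
    <value>268435456</value>
  </property>
  <property>
    <!-- 512MB HLog (512 * 1024 * 1024 bytes); assumes "Hlog" means the HLog block size -->
    <name>hbase.regionserver.hlog.blocksize</name>
    <value>536870912</value>
  </property>
  <property>
    <!-- Number of store files in a store at which updates are blocked -->
    <name>hbase.hstore.blockingStoreFiles</name>
    <value>100</value>
  </property>
</configuration>
```

These properties would be merged into the existing hbase-site.xml on each
region server, which then needs a restart. As noted above, the memstore
and HLog values can be lowered again once the bulk of the import is done.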

On Tue, Oct 23, 2012 at 1:32 PM, Nicolas Liochon <nkey...@gmail.com> wrote:
> Thanks, checking the schema itself is still interesting (cf. the link sent).
> Also, with 3 machines and a replication factor of 3, all the machines
> are used during a write. As HBase writes all entries into a write-ahead log
> for safety, the number of writes is also doubled. So maybe your machine is
> just dying under the load. Anyway, here your cluster is going at the speed
> of the least powerful machine, and this machine has a workload multiplied
> by 6 compared to a single-machine config (i.e. just writing a file locally).
>
> On Tue, Oct 23, 2012 at 7:13 PM, Nick maillard <nicolas.maill...@fifty-five.com> wrote:
>
>> Thanks for the help!
>>
>> My conf files are : Hadoop:
>> hdfs-site
>>
>> <configuration>
>>  <property>
>>   <name>dfs.replication</name>
>>   <value>3</value>
>>   <description>Default block replication.
>>   The actual number of replications can be specified when the file is created.
>>   The default is used if replication is not specified in create time.
>>   </description>
>> </property>
>> <property>
>>   <name>dfs.data.dir</name>
>>   <value>/home/runner/app/hadoop/dfs/data</value>
>>   <description>Determines where on the local filesystem a DFS data node
>>   should store its blocks.
>>   </description>
>> </property>
>> <property>
>>         <name>dfs.datanode.max.xcievers</name>
>>         <value>4096</value>
>>       </property>
>> </configuration>
>>
>>
>> Mapred-site.xml
>>
>> <configuration>
>>  <property>
>>   <name>mapred.job.tracker</name>
>>   <value>master:54311</value>
>>   <description>The host and port that the MapReduce job tracker runs
>>   at.  If "local", then jobs are run in-process as a single map
>>   and reduce task.
>>   </description>
>> </property>
>> <property>
>>   <name>mapred.tasktracker.map.tasks.maximum</name>
>>   <value>14</value>
>>   <description>The maximum number of map tasks that will be run
>>   simultaneously by a task tracker.
>>   </description>
>> </property>
>>
>> <property>
>>   <name>mapred.tasktracker.reduce.tasks.maximum</name>
>>   <value>14</value>
>>   <description>The maximum number of reduce tasks that will be run
>>   simultaneously by a task tracker.
>>   </description>
>> </property>
>> <property>
>> <name>mapred.child.java.opts</name>
>>   <value>-Xmx400m</value>
>>   <description>Java opts for the task tracker child processes.
>>   The following symbol, if present, will be interpolated: @taskid@ is replaced
>>   by current TaskID. Any other occurrences of '@' will go unchanged.
>>   For example, to enable verbose gc logging to a file named for the taskid in
>>   /tmp and to set the heap maximum to be a gigabyte, pass a 'value' of:
>>         -Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc
>>
>>   The configuration variable mapred.child.ulimit can be used to control the
>>   maximum virtual memory of the child processes.
>>   </description>
>> </property>
>> </configuration>
>>
>>
>> core-site.xml
>>
>> <configuration>
>>  <property>
>>   <name>hadoop.tmp.dir</name>
>>   <value>/home/runner/app/hadoop/tmp</value>
>>   <description>A base for other temporary directories.</description>
>> </property>
>>
>> <property>
>>   <name>fs.default.name</name>
>>   <value>hdfs://master:54310</value>
>>   <description>The name of the default file system.  A URI whose
>>   scheme and authority determine the FileSystem implementation.  The
>>   uri's scheme determines the config property (fs.SCHEME.impl) naming
>>   the FileSystem implementation class.  The uri's authority is used to
>>   determine the host, port, etc. for a filesystem.</description>
>> </property>
>> </configuration>
>>
>>
>> For HBase:
>> hbase-site:
>> <configuration>
>>  <property>
>>     <name>hbase.rootdir</name>
>>     <value>hdfs://master:54310/hbase</value>
>>  </property>
>>   <property>
>>     <name>hbase.cluster.distributed</name>
>>     <value>true</value>
>>     <description>The mode the cluster will be in. Possible values are
>>       false: standalone and pseudo-distributed setups with managed Zookeeper
>>       true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh)
>>     </description>
>>   </property>
>> <property>
>>         <name>hbase.zookeeper.property.clientPort</name>
>>         <value>2222</value>
>>     </property>
>>     <property>
>>         <name>hbase.zookeeper.quorum</name>
>>         <value>ks25937.kimsufi.com</value>
>>     </property>
>>     <property>
>>         <name>hbase.zookeeper.property.dataDir</name>
>>         <value>/home/runner/hbase/hbase-0.94.2/tmp</value>
>>     </property>
>> </configuration>
>>
>>
>>
>>
>> I am currently running the import and looking at the logs to try to
>> understand what is going on. This definitely seems fishy:
>>
>> 2012-10-23 18:39:49,107 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201210231145_0010_m_000041_0 0.21332978%
>> 2012-10-23 18:39:50,363 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201210231145_0010_m_000028_0 0.20936884%
>> 2012-10-23 18:49:38,098 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201210231145_0010_m_000030_0: Task attempt_201210231145_0010_m_000030_0 failed to report status for 602 seconds. Killing!
>> 2012-10-23 18:49:38,116 INFO org.apache.hadoop.mapred.TaskTracker: Process Thread Dump: lost task
>> 90 active threads
>> Thread 742 (process reaper):
>>   State: RUNNABLE
>>   Blocked count: 0
>>   Waited count: 0
>>   Stack:
>>     java.lang.UNIXProcess.waitForProcessExit(Native Method)
>>     java.lang.UNIXProcess.access$200(UNIXProcess.java:54)
>>     java.lang.UNIXProcess$3.run(UNIXProcess.java:174)
>>     java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>>     java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>>     java.lang.Thread.run(Thread.java:722)
>> Thread 740 (process reaper):
>>   State: RUNNABLE
>>   Blocked count: 0
>>   Waited count: 0
>>   Stack:
>>     java.lang.UNIXProcess.waitForProcessExit(Native Method)
>>     java.lang.UNIXProcess.access$200(UNIXProcess.java:54)
>>     java.lang.UNIXProcess$3.run(UNIXProcess.java:174)
>>
>>



-- 
Kevin O'Dell
Customer Operations Engineer, Cloudera
