Thanks. Checking the schema itself is still interesting (cf. the link I sent).
Also, with 3 machines and a replication factor of 3, every machine takes part
in every write. And since HBase also writes each entry to a write-ahead log
for safety, the number of writes is doubled again. So maybe your machine is
simply dying under the load. In any case, your cluster runs at the speed of
its least powerful machine, and that machine now has roughly 6x the write
workload of a single-machine setup (3 replicas x 2 for the WAL, compared to
just writing a file locally).
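
If you want to see how much the WAL doubling actually costs you, one quick
and dirty experiment is to skip the WAL on the puts for a single throwaway
import run (you lose data on a crash, so for benchmarking only). A rough
sketch against the 0.94 Java client; the table and column names below are
placeholders, not taken from your schema:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class NoWalPutTest {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      // "testtable", "cf" and "col" are made-up names, not your real schema.
      HTable table = new HTable(conf, "testtable");
      table.setAutoFlush(false);                // batch puts on the client side
      Put put = new Put(Bytes.toBytes("row1"));
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
      put.setWriteToWAL(false);                 // skip the WAL: faster, but unsafe
      table.put(put);
      table.flushCommits();
      table.close();
    }
  }

If throughput jumps a lot with the WAL off, the disks are the bottleneck; if
it barely moves, the problem is probably elsewhere (heap, task slots, ...).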

On Tue, Oct 23, 2012 at 7:13 PM, Nick maillard <
nicolas.maill...@fifty-five.com> wrote:

> Thanks for the help!
>
> My conf files are : Hadoop:
> hdfs-site
>
> <configuration>
>  <property>
>   <name>dfs.replication</name>
>   <value>3</value>
>   <description>Default block replication.
>   The actual number of replications can be specified when the file is
> created.
>   The default is used if replication is not specified in create time.
>   </description>
> </property>
> <property>
>   <name>dfs.data.dir</name>
>   <value>/home/runner/app/hadoop/dfs/data</value>
>   <description>Determines where on the local filesystem a DFS data node
>   should store its blocks.
>   </description>
> </property>
> <property>
>         <name>dfs.datanode.max.xcievers</name>
>         <value>4096</value>
>       </property>
> </configuration>
>
>
> Mapred-site.xml
>
> <configuration>
>  <property>
>   <name>mapred.job.tracker</name>
>   <value>master:54311</value>
>   <description>The host and port that the MapReduce job tracker runs
>   at.  If "local", then jobs are run in-process as a single map
>   and reduce task.
>   </description>
> </property>
> <property>
>   <name>mapred.tasktracker.map.tasks.maximum</name>
>   <value>14</value>
>   <description>The maximum number of map tasks that will be run
>   simultaneously by a task tracker.
>   </description>
> </property>
>
> <property>
>   <name>mapred.tasktracker.reduce.tasks.maximum</name>
>   <value>14</value>
>   <description>The maximum number of reduce tasks that will be run
>   simultaneously by a task tracker.
>   </description>
> </property>
> <property>
>   <name>mapred.child.java.opts</name>
>   <value>-Xmx400m</value>
>   <description>Java opts for the task tracker child processes.
>   The following symbol, if present, will be interpolated: @taskid@ is
> replaced
>   by current TaskID. Any other occurrences of '@' will go unchanged.
>   For example, to enable verbose gc logging to a file named for the taskid
> in
>   /tmp and to set the heap maximum to be a gigabyte, pass a 'value' of:
>         -Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc
>
>   The configuration variable mapred.child.ulimit can be used to control the
>   maximum virtual memory of the child processes.
>   </description>
> </property>
> </configuration>
>
>
> core-site.xml
>
> <configuration>
>  <property>
>   <name>hadoop.tmp.dir</name>
>   <value>/home/runner/app/hadoop/tmp</value>
>   <description>A base for other temporary directories.</description>
> </property>
>
> <property>
>   <name>fs.default.name</name>
>   <value>hdfs://master:54310</value>
>   <description>The name of the default file system.  A URI whose
>   scheme and authority determine the FileSystem implementation.  The
>   uri's scheme determines the config property (fs.SCHEME.impl) naming
>   the FileSystem implementation class.  The uri's authority is used to
>   determine the host, port, etc. for a filesystem.</description>
> </property>
> </configuration>
>
>
> For Hbase:
> hbase-site:
> <configuration>
>  <property>
>     <name>hbase.rootdir</name>
>     <value>hdfs://master:54310/hbase</value>
>  </property>
>   <property>
>     <name>hbase.cluster.distributed</name>
>     <value>true</value>
>     <description>The mode the cluster will be in. Possible values are
>       false: standalone and pseudo-distributed setups with managed
> Zookeeper
>       true: fully-distributed with unmanaged Zookeeper Quorum (see
> hbase-env.sh)
>     </description>
>   </property>
> <property>
>         <name>hbase.zookeeper.property.clientPort</name>
>         <value>2222</value>
>     </property>
>     <property>
>         <name>hbase.zookeeper.quorum</name>
>         <value>ks25937.kimsufi.com</value>
>     </property>
>     <property>
>         <name>hbase.zookeeper.property.dataDir</name>
>         <value>/home/runner/hbase/hbase-0.94.2/tmp</value>
>     </property>
> </configuration>
>
>
>
>
> I am currently running the import and looking at the logs to try to
> understand. This definitely seems fishy:
>
> 2012-10-23 18:39:49,107 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201210231145_0010_m_000041_0 0.21332978%
> 2012-10-23 18:39:50,363 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201210231145_0010_m_000028_0 0.20936884%
> 2012-10-23 18:49:38,098 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201210231145_0010_m_000030_0: Task
> attempt_201210231145_0010_m_000030_0
> failed to report status for 602 seconds. Killing!
> 2012-10-23 18:49:38,116 INFO org.apache.hadoop.mapred.TaskTracker: Process
> Thread Dump: lost task
> 90 active threads
> Thread 742 (process reaper):
>   State: RUNNABLE
>   Blocked count: 0
>   Waited count: 0
>   Stack:
>     java.lang.UNIXProcess.waitForProcessExit(Native Method)
>     java.lang.UNIXProcess.access$200(UNIXProcess.java:54)
>     java.lang.UNIXProcess$3.run(UNIXProcess.java:174)
>     java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>     java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>     java.lang.Thread.run(Thread.java:722)
> Thread 740 (process reaper):
>   State: RUNNABLE
>   Blocked count: 0
>   Waited count: 0
>   Stack:
>     java.lang.UNIXProcess.waitForProcessExit(Native Method)
>     java.lang.UNIXProcess.access$200(UNIXProcess.java:54)
>     java.lang.UNIXProcess$3.run(UNIXProcess.java:174)
>
>
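
One more note on the "failed to report status for 602 seconds. Killing!"
lines above: the TaskTracker kills any task that has not reported progress
within mapred.task.timeout (600000 ms by default), which matches the ~600 s
you see. If your import is a custom MapReduce job (an assumption on my part,
I don't know how you load the data), make sure the mapper reports progress
while it is busy pushing puts, for example (hypothetical mapper, your
key/value types will differ):

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  public class ImportMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // ... parse the line and send the corresponding Put here ...
      // Tell the framework the task is still alive so it is not killed
      // after mapred.task.timeout (600000 ms by default).
      context.progress();
    }
  }

Raising mapred.task.timeout would also stop the kills, but it only hides the
fact that the tasks are stalled for ten minutes at a time.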
