Re: Hbase import Tsv performance (slow import)

2012-10-23 Thread Nicolas Liochon
Hi, The schema design is important. At a minimum, have a look at this entry: http://hbase.apache.org/book.html#rowkey.design For the config, could you pastebin the hdfs & hbase config files you used? N. On Tue, Oct 23, 2012 at 5:48 PM, Nick maillard < nicolas.maill...@fifty-five.com> wrote: > H

Re: Hbase import Tsv performance (slow import)

2012-10-23 Thread Nick maillard
Thanks for the help! My conf files are : Hadoop: hdfs-site dfs.replication 3 Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time. dfs.data.dir /home/runner/

Re: Hbase import Tsv performance (slow import)

2012-10-23 Thread Nicolas Liochon
Thanks, checking the schema itself is still interesting (cf. the link sent). As well, with 3 machines and a replication factor of 3, all the machines are used during a write. As HBase writes all entries into a write-ahead-log for safety, the number of writes is also doubled. So maybe your machine i
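To put rough numbers on the point about replication and the write-ahead log, here is a minimal sketch of the arithmetic. It assumes the "doubled" figure is exact (one WAL write plus one store-file write per entry) and ignores compactions; the function name `physical_writes` is made up for illustration.

```python
def physical_writes(logical_writes: int, replication: int, wal_enabled: bool = True) -> int:
    """Estimate the HDFS-level writes caused by `logical_writes` HBase entries."""
    # Assumption: with the WAL on, each entry is written twice
    # (once to the write-ahead log, once to a store file after a flush).
    copies_per_entry = 2 if wal_enabled else 1
    # HDFS then stores every one of those writes `replication` times.
    return logical_writes * copies_per_entry * replication


if __name__ == "__main__":
    # 3 machines with dfs.replication = 3: every machine takes part in every write.
    print(physical_writes(1_000_000, replication=3))  # 6000000
    # Lowering dfs.replication to 2 only removes a third of that load.
    print(physical_writes(1_000_000, replication=2))  # 4000000
```

Under these assumptions, a million entries become six million block-level writes on a 3-node cluster, which is why the replication factor and the WAL both matter for import speed.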

Re: Hbase import Tsv performance (slow import)

2012-10-23 Thread Kevin O'dell
You will want to make sure your table is pre-split. Also, Import does puts, so you will want to make sure you are not flushing and blocking by raising your memstore, HLog, and blocking count. This can greatly improve your write speeds. I usually do a 256MB memstore (you can lower it later if it is
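As an illustration of what pre-splitting involves, the sketch below computes evenly spaced split keys. It assumes row keys start with a uniformly distributed hex prefix (for example a hash), which the thread does not state; `hex_split_points` is a hypothetical helper, not part of HBase.

```python
from typing import List


def hex_split_points(num_regions: int, prefix_len: int = 2) -> List[str]:
    """Return num_regions - 1 split keys dividing a hex-prefixed key space evenly."""
    if num_regions < 2:
        return []  # a single region needs no split points
    keyspace = 16 ** prefix_len  # number of distinct hex prefixes of this length
    if num_regions > keyspace:
        raise ValueError("more regions requested than distinct prefixes")
    # Boundary i sits at i/num_regions of the key space, zero-padded lowercase hex.
    return [
        format(i * keyspace // num_regions, "0{}x".format(prefix_len))
        for i in range(1, num_regions)
    ]


if __name__ == "__main__":
    print(hex_split_points(4))  # ['40', '80', 'c0']
```

The resulting list could then be supplied when the table is created, for instance through the `SPLITS` option of the HBase shell `create` command, so that the import writes to several regions from the start instead of one.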

Re: Hbase import Tsv performance (slow import)

2012-10-23 Thread Anoop John
Hi, Using the ImportTSV tool you are trying to bulk load your data. Can you check and tell how many mappers and reducers there were? Out of the total time, how much is taken by the mapper phase and how much by the reducer phase? Seems like an MR related issue (maybe some conf issue). In this bulk load case most

Re: Hbase import Tsv performance (slow import)

2012-10-23 Thread ramkrishna vasudevan
As Kevin suggested we can make use of bulk load that goes through WAL and Memstore. Or the second option will be to use the o/p of mappers to create HFiles directly. Regards Ram On Wed, Oct 24, 2012 at 8:59 AM, Anoop John wrote: > Hi > Using ImportTSV tool you are trying to bulk load your dat

Re: Hbase import Tsv performance (slow import)

2012-10-23 Thread anil gupta
Hi Anoop, As per your last email, did you mean that WAL is not used while using HBase Bulk Loader? If yes, then how do we ensure "no data loss" in case of RegionServer failure? Thanks, Anil Gupta On Tue, Oct 23, 2012 at 9:55 PM, ramkrishna vasudevan < ramkrishna.s.vasude...@gmail.com> wrote: > As

Re: Hbase import Tsv performance (slow import)

2012-10-23 Thread Anoop John
Hi Anil On Wed, Oct 24, 2012 at 10:39 AM, anil gupta wrote: > Hi Anoop, > > As per your last email, did you mean that WAL is not used while using HBase > Bulk Loader? If yes, then how we ensure "no data loss" in case of > RegionServer failure? > > Thanks, > Anil Gupta > > On Tue, Oct 23, 2012 a

Re: Hbase import Tsv performance (slow import)

2012-10-23 Thread Anoop John
Hi Anil In case of bulk loading it is not like data is put into HBase one by one.. The MR job will create an o/p like HFile.. It will create the KVs and write to file in order as how HFile will look like.. Then the file is loaded into HBase finally.. Only for this final step HBase RS
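The "write to file in order" step can be pictured with a toy model. This is only a sketch of the sort order, not the HFile format: it assumes cells are ordered by row, family and qualifier ascending, with the newest timestamp first, and the `KeyValue` tuple here is a stand-in defined for the example.

```python
from typing import Iterable, List, NamedTuple


class KeyValue(NamedTuple):
    """Stand-in for an HBase cell; not the real KeyValue class."""
    row: bytes
    family: bytes
    qualifier: bytes
    timestamp: int
    value: bytes


def hfile_order(kvs: Iterable[KeyValue]) -> List[KeyValue]:
    """Sort cells the way a bulk-load job must before writing them to a file."""
    # Row, family and qualifier ascend; negating the timestamp puts newer cells first.
    return sorted(kvs, key=lambda kv: (kv.row, kv.family, kv.qualifier, -kv.timestamp))


if __name__ == "__main__":
    cells = [
        KeyValue(b"row2", b"f", b"a", 1, b"x"),
        KeyValue(b"row1", b"f", b"b", 1, b"y"),
        KeyValue(b"row1", b"f", b"a", 1, b"old"),
        KeyValue(b"row1", b"f", b"a", 2, b"new"),
    ]
    for kv in hfile_order(cells):
        print(kv)
```

Because the file is complete and sorted before the region server ever sees it, nothing sits only in memory during the load, which is why no WAL entry is needed for this path.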

Re: Hbase import Tsv performance (slow import)

2012-10-23 Thread anil gupta
That's a very interesting fact. You made it clear but my custom Bulk Loader generates a unique ID for every row in map phase. So, all my data is not in csv or text. Is there a way that I can explicitly turn on WAL for bulk loading? On Tue, Oct 23, 2012 at 10:14 PM, Anoop John wrote: > Hi Anil >

Re: Hbase import Tsv performance (slow import)

2012-10-23 Thread ramkrishna vasudevan
Anil, When you do ImportTSV the data that is present in the TSV file alone will be parsed and loaded into HBase. How are you planning to generate the UniqueID? Your usecase seems like your data is in CSV file but the unique id that you need is not part of the TSV. Now you need them to be loa

Re: Hbase import Tsv performance (slow import)

2012-10-23 Thread Anoop John
>. Is there a way that i can explicitly turn on WAL for bulk loading? no.. How you generate the unique id? Remember that initial steps wont need the HBase cluster at all. MR generates the HFiles and the o/p will be in file only.. Mappers also will write o/p to file... Only thing is that some map

Re: Hbase import Tsv performance (slow import)

2012-10-23 Thread anil gupta
Yes, the uniqueId is not part of the csv file. In my bulk loader I use a combination of nodeId+processId+counter as UniqueID for each row. I have to use the uniqueId since the remaining part of the rowkey is not unique. I think there are two approaches to solve this problem: 1. Generate HFiles through MR an
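A minimal sketch of the nodeId+processId+counter scheme described above, assuming the hostname stands in for the node id and the OS pid for the process id. The class name, the `-` separator and the string format are illustrative choices, not taken from the actual bulk loader.

```python
import itertools
import os
import socket
from typing import Optional


class UniqueIdGenerator:
    """Builds ids of the form <nodeId>-<processId>-<counter>."""

    def __init__(self, node_id: Optional[str] = None, process_id: Optional[int] = None):
        # Assumption: hostname and pid identify the mapper process well enough.
        self.node_id = node_id if node_id is not None else socket.gethostname()
        self.process_id = process_id if process_id is not None else os.getpid()
        self._counter = itertools.count()  # 0, 1, 2, ... within this process

    def next_id(self) -> str:
        return "{}-{}-{}".format(self.node_id, self.process_id, next(self._counter))


if __name__ == "__main__":
    gen = UniqueIdGenerator()
    print(gen.next_id())
    print(gen.next_id())
```

Since the id depends on the process rather than on the data, a mapper that is re-run after a crash produces different ids for the same input rows, which matches what is said later in the thread.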

Re: Hbase import Tsv performance (slow import)

2012-10-23 Thread anil gupta
Anoop: Only thing is that some mappers crashed.. So then the MR fw will run that mapper again on the same data set.. Then the unique id will be different? Anil: Yes, for the same dataset also the UniqueId will be different. UniqueID does not depend on the data. Thanks, Anil Gupta On Tue, Oct 23, 20

Re: Hbase import Tsv performance (slow import)

2012-10-23 Thread Anoop John
I think as per your explanation of the need for a unique id it is okay.. No need to worry about data loss.. As long as you can make sure you make a unique id things are fine.. MR will make sure it runs the job on the whole data and the o/p is persisted in file.. Yes this file is HFile(s) only.. Then finally th

Re: Hbase import Tsv performance (slow import)

2012-10-23 Thread anil gupta
Yeah, we never used HBase client api(puts) for loading a batch of millions of records. Can you tell me by default where the o/p HFile(s) from MR job are stored in HDFS? On Tue, Oct 23, 2012 at 11:31 PM, Anoop John wrote: > I think as per your explanation of need for unique id it is okey.. No ne

Re: Hbase import Tsv performance (slow import)

2012-10-24 Thread Nick maillard
Thanks for your help. I have taken my replication down to 2, but if I am not mistaken replication also has the benefit of rendering the cluster more fault tolerant by duplicating info on different nodes, so that if one goes down data is not necessarily lost. In such a case I would like to keep it at least at 2.

Re: Hbase import Tsv performance (slow import)

2012-10-24 Thread Nick maillard
Hi John I have 42 map tasks capacity and running an avg tasks/nodes 28. when I check the map job details there are 80 tasks to complete. As i drill down on the different map tasks in task detail they all take a very long time (26 minutes) to complete. A lot of them fail as well. Fail info is "fai

Re: Hbase import Tsv performance (slow import)

2012-10-24 Thread Nick maillard
As I have written in a reply above, which is kind of lost in the thread: I have set dfs.replication to 2 but the process time has not changed at all. How could I change my configuration to avoid this hotspot issue you have talked about? As Kevin has advised I have also upped: hbase.hstore.block

Re: Hbase import Tsv performance (slow import)

2012-10-24 Thread Sonal Goyal
Hi Nick, Do you see anything in your tasktracker or datanode logs? Best Regards, Sonal Crux: Reporting for HBase Nube Technologies On Wed, Oct 24, 2012 at 3:45 PM, Nick maillard < nicolas.ma

Re: Hbase import Tsv performance (slow import)

2012-10-24 Thread Nick maillard
Looking at my task logs there is a big gap in time I do not understand. The task connects to zookeeper, creates the entries, and from 2012-10-24 12:25:24 to 2012-10-24 13:08:03 logs nothing. Doing map reduce I guess. 2012-10-24 12:25:23,323 INFO org.apache.zookeeper.ClientCnxn: Session establishment

Re: Hbase import Tsv performance (slow import)

2012-10-24 Thread Nick maillard
Hello everyone, Still looking into the issue. I have tried different tests and the results are surprising. If I put mapred.tasktracker.map.tasks.maximum: 28 I get a total of 84 tasks on my cluster and the process takes about 1h15 min, each task taking 1h10 minutes. The whole file being cut down in

Re: Hbase import Tsv performance (slow import)

2012-10-24 Thread Kevin O'dell
Nick, What versions are you using: HDFS HBase OS On Oct 24, 2012 10:36 AM, "Nick maillard" wrote: > Hello everyone > > Still looking in the issue. > I have tried different tests and the results are surprising. > If I put mapred.tasktracker.map.tasks.maximum: 28 > I get a total of 84 tasks on

Re: Hbase import Tsv performance (slow import)

2012-10-24 Thread Nick maillard
Hello Kevin I'm using : Hadoop 1.0.3 Hbase 0.94.2 OS:ubuntu 12.04

Re: Hbase import Tsv performance (slow import)

2012-10-24 Thread anil gupta
Hi Nick, How many hard drives do your slaves have? What is their RPM? How many mappers are run concurrently on a node? Did you turn off speculative execution? Have a look at disk i/o to see whether that is a bottleneck or not. MR is disk I/O bound so if you only have one disk on slave and you are running 5

Re: Hbase import Tsv performance (slow import)

2012-10-24 Thread nick maillard
hi anil I have one hard drive per slave. I have tested with 3 concurrent mappers and 28 concurrent mappers per slave. Both times the total time was about 1 hour; the only difference was the time each map took, respectively 40min and 1h10min. I have turned off speculative execution. I'll
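One way to read the "same total time with 3 or 28 mappers" result is a simple bottleneck model, sketched below. Every number in it (data size, disk and per-mapper throughput) is hypothetical and chosen only to show the shape of the effect; none of it is measured from this cluster.

```python
def import_minutes(total_mb: float, disk_mb_per_s: float,
                   mappers_per_node: int, mapper_mb_per_s: float) -> float:
    """Wall-clock minutes for one node to process total_mb under a shared-disk model."""
    # What the mappers could push if the disk were infinitely fast.
    demand = mappers_per_node * mapper_mb_per_s
    # The single disk caps the aggregate rate no matter how many mappers run.
    effective = min(disk_mb_per_s, demand)
    return total_mb / effective / 60.0


if __name__ == "__main__":
    # Hypothetical node: 180,000 MB of I/O, one disk at 50 MB/s, 20 MB/s per mapper.
    for mappers in (1, 3, 28):
        print(mappers, import_minutes(180_000, 50.0, mappers, 20.0))
```

In this model, once a few mappers saturate the one disk, adding more only makes each task slower while the total stays put, which is consistent with checking disk I/O first as suggested earlier in the thread.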

Re: Hbase import Tsv performance (slow import)

2012-10-24 Thread lars hofhansl
This is good advice, Kevin; we should add this to the HBase Reference Guide. From: Kevin O'dell To: user@hbase.apache.org Sent: Tuesday, October 23, 2012 10:47 AM Subject: Re: Hbase import Tsv performance (slow import) You will want to make sure your tab

Re: Hbase import Tsv performance (slow import)

2012-10-25 Thread Jonathan Bishop
Nicolas, I just went through the same exercise. There are many ways to get this to go faster, but eventually I decided that bulk loading is the best solution, as run times scaled with the number of machines in my cluster when I used that approach. One thing you can try is to turn off hbase's write ah

Re: Hbase import Tsv performance (slow import)

2012-10-25 Thread anil gupta
Hi Nicolas, As per my experience you won't get good performance if you run 3 map tasks simultaneously on one hard drive. That seems like a lot of I/O on one disk. HBase performs well when you have at least 5 nodes in cluster. So, running HBase on 3 nodes is not something you would do in prod. Than

Re: Hbase import Tsv performance (slow import)

2012-10-25 Thread anil gupta
@Jonathan, As per Anoop and Ram, WAL is not used with bulk loading so turning off WAL won't have any impact on performance. On Thu, Oct 25, 2012 at 1:33 PM, anil gupta wrote: > Hi Nicolas, > > As per my experience you wont get good performance if you run 3 Map task > simultaneously on one Hard D

RE: Hbase import Tsv performance (slow import)

2012-10-25 Thread Anoop Sam John
of write to HFile and upload at one shot, puts data into HTable calling put() method... -Anoop- From: anil gupta [anilgupt...@gmail.com] Sent: Friday, October 26, 2012 2:05 AM To: user@hbase.apache.org Subject: Re: Hbase import Tsv performance (slow imp