RE: Loading data, hbase slower than Hive?

Anoop Sam John Sun, 20 Jan 2013 22:37:41 -0800

@Mohammad
As he is using HFileOutputFormat, no put calls are made on HTable. In
this case the MR job creates the HFiles directly, without going through the
normal HBase write path. The HFiles are then loaded into the table regions
using the HRS API.
In this case the number of reducers will equal the number of table regions.
So Austin, you can check whether the table is properly presplit.
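
To illustrate the presplit suggestion, here is a minimal sketch of choosing
split keys. It assumes row keys carry a two-character hex salt prefix (00 to
ff) derived from a hash of the natural key. That key layout, the class name
PresplitSketch and both method names are assumptions for illustration, not
something taken from Austin's job. The split keys, converted to bytes, are the
kind of value HBaseAdmin.createTable(HTableDescriptor, byte[][]) accepts.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: the names and the hex-salt key layout are assumptions.
final class PresplitSketch {

    // One salt byte, written as two lowercase hex characters (00..ff).
    private static final int BUCKETS = 256;

    // Returns the (regions - 1) boundaries that cut the salt range 00..ff
    // into `regions` near-equal slices.
    static List<String> splitKeys(int regions) {
        if (regions < 1 || regions > BUCKETS) {
            throw new IllegalArgumentException("regions must be between 1 and " + BUCKETS);
        }
        List<String> keys = new ArrayList<>();
        for (int i = 1; i < regions; i++) {
            keys.add(String.format("%02x", i * BUCKETS / regions));
        }
        return keys;
    }

    // Prefixes the natural key with a salt derived from its hash, so that
    // sequential natural keys are spread across all salt buckets.
    static String saltedKey(String naturalKey) {
        int bucket = Math.floorMod(naturalKey.hashCode(), BUCKETS);
        return String.format("%02x-%s", bucket, naturalKey);
    }
}
```

For example, splitKeys(4) yields 40, 80 and c0, so four regions (and hence
four reducers in the bulk load) each cover a quarter of the salt range.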


-Anoop-
________________________________________
From: Mohammad Tariq [donta...@gmail.com]
Sent: Monday, January 21, 2013 12:01 PM
To: user@hbase.apache.org
Subject: Re: Loading data, hbase slower than Hive?

Apart from this, there are some additional tweaks that can improve
put performance, such as creating pre-split tables and using
put(List<Put> puts) instead of single put calls.
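
As a rough sketch of the batching idea, the helper below groups items into
fixed-size lists and hands each list to a sink. The class BatchingSketch, the
method sendInBatches and the Consumer-based sink are hypothetical stand-ins:
in a real client the sink would wrap a call such as HTable.put(List<Put>),
which throws IOException and would need handling. No HBase classes are used
here so that the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: illustrates batching only, with no HBase dependency.
final class BatchingSketch {

    // Sends items to the sink in lists of at most batchSize elements and
    // returns how many sink calls were made.
    static <T> int sendInBatches(Iterable<T> items, int batchSize, Consumer<List<T>> sink) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        List<T> buffer = new ArrayList<>(batchSize);
        int calls = 0;
        for (T item : items) {
            buffer.add(item);
            if (buffer.size() == batchSize) {
                sink.accept(new ArrayList<>(buffer)); // copy, so the sink may keep it
                buffer.clear();
                calls++;
            }
        }
        if (!buffer.isEmpty()) {
            sink.accept(new ArrayList<>(buffer)); // flush the final partial batch
            calls++;
        }
        return calls;
    }
}
```

With five items and a batch size of two, the sink is called three times
instead of five, which is the saving that a list-based put is after.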


Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Mon, Jan 21, 2013 at 11:46 AM, Austin Chungath <austi...@gmail.com> wrote:

> Anoop,
>
> I am using HFileOutputFormat. I am doing nothing but splitting the data
> from each row by the delimiter and sending it into their respective
> columns.
> Is there some kind of preprocessing or steps that I should do before this?
> As suggested I will look into the above solutions and let you guys know
> what the problem was. I might have to rethink the Rowkey design.
>
> Regards,
> Austin.
>
> On Mon, Jan 21, 2013 at 11:24 AM, Anoop Sam John <anoo...@huawei.com>
> wrote:
>
> > Austin,
> >         Are you using HFileOutputFormat or TableOutputFormat?
> >
> > -Anoop-
> > ________________________________________
> > From: Austin Chungath [austi...@gmail.com]
> > Sent: Monday, January 21, 2013 11:15 AM
> > To: user@hbase.apache.org
> > Subject: Re: Loading data, hbase slower than Hive?
> >
> > Thank you Tariq.
> > I will let you know how things went after I implement these suggestions.
> >
> > Regards,
> > Austin
> >
> > On Sun, Jan 20, 2013 at 2:42 AM, Mohammad Tariq <donta...@gmail.com>
> > wrote:
> >
> > > Hello Austin,
> > >
> > >           I am sorry for the late response.
> > >
> > > Asaf has made a very valid point. Rowkey design is crucial,
> > > especially if the data is going to be sequential (a time-series
> > > kind of thing). You may end up with a hotspotting problem. Use
> > > pre-split tables or hash the keys to avoid that. It'll also allow
> > > you to fetch the results faster.
> > >
> > > Warm Regards,
> > > Tariq
> > > https://mtariq.jux.com/
> > > cloudfront.blogspot.com
> > >
> > >
> > > On Sun, Jan 20, 2013 at 1:20 AM, Asaf Mesika <asaf.mes...@gmail.com>
> > > wrote:
> > >
> > > > Start by telling us your row key design.
> > > > Check whether your table's regions are pre-split.
> > > > I managed to get 25 MB/sec write throughput in HBase using 1 region
> > > > server. If your data is evenly spread you can get around 7 times that
> > > > in a 10 region server environment. That should mean 1 GB takes
> > > > about 6 sec (1024 MB at 175 MB/sec).
> > > >
> > > >
> > > > On Friday, January 18, 2013, praveenesh kumar wrote:
> > > >
> > > > > Hey,
> > > > > Can someone throw some pointers on what would be the best practice
> > > > > for bulk imports in HBase?
> > > > > That would be really helpful.
> > > > >
> > > > > Regards,
> > > > > Praveenesh
> > > > >
> > > > > On Thu, Jan 17, 2013 at 11:16 PM, Mohammad Tariq <donta...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Just to add to what all the heavyweights have said above, your MR
> > > > > > job may not be as efficient as the MR job corresponding to your
> > > > > > Hive query. You can enhance the performance by setting the mapred
> > > > > > config parameters wisely and by tuning your MR job.
> > > > > >
> > > > > > Warm Regards,
> > > > > > Tariq
> > > > > > https://mtariq.jux.com/
> > > > > > cloudfront.blogspot.com
> > > > > >
> > > > > >
> > > > > > On Thu, Jan 17, 2013 at 10:39 PM, ramkrishna vasudevan <
> > > > > > ramkrishna.s.vasude...@gmail.com> wrote:
> > > > > >
> > > > > > > Hive is more for batch processing and HBase is more for real-time data.
> > > > > > >
> > > > > > > Regards
> > > > > > > Ram
> > > > > > >
> > > > > > > On Thu, Jan 17, 2013 at 10:30 PM, Anoop John <anoop.hb...@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > In the case of Hive, data insertion means placing the file
> > > > > > > > under the table path in HDFS. HBase needs to read the data
> > > > > > > > and convert it into its own format (HFiles). MR is doing
> > > > > > > > this work, so it is clear why HBase will be slower. :)
> > > > > > > > As Michael said the read operation...
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > -Anoop-
> > > > > > > >
> > > > > > > > On Thu, Jan 17, 2013 at 10:14 PM, Austin Chungath <austi...@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > > Problem: Hive took 6 mins to load a data set; HBase took
> > > > > > > > > 1 hr 14 mins. It's a 20 GB data set, approx 230 million
> > > > > > > > > records. The data is in HDFS, in a single text file. The
> > > > > > > > > cluster is 11 nodes, 8 cores.
> > > > > > > > >
> > > > > > > > > I loaded this into Hive, partitioned by date, bucketed
> > > > > > > > > into 32 and sorted. Time taken was 6 mins.
> > > > > > > > >
> > > > > > > > > I loaded the same data into HBase, on the same cluster, by
> > > > > > > > > writing MapReduce code. It took 1 hr 14 mins. The cluster
> > > > > > > > > wasn't running anything else. Assuming that the code I
> > > > > > > > > wrote is good enough, what is it that makes HBase slower
> > > > > > > > > than Hive in loading the data?
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Austin
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
