Hi there-

On top of what everybody else said, for more info on rowkey design and
pre-splitting see http://hbase.apache.org/book.html#schema (as well as
other threads in this dist-list on that topic).
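
For example, something like this creates a table that is pre-split from the
start (a rough sketch against the 0.94-era Java client; the table name, column
family and split points are just placeholders, not recommendations):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class CreatePreSplitTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("mytable");
    desc.addFamily(new HColumnDescriptor("cf"));

    // Explicit split points: the table starts with 4 regions instead of 1,
    // so an evenly spread load isn't funneled through a single region server.
    byte[][] splits = new byte[][] {
        Bytes.toBytes("25"), Bytes.toBytes("50"), Bytes.toBytes("75")
    };
    admin.createTable(desc, splits);
    admin.close();
  }
}

The shell can do the same thing with the SPLITS option on 'create'.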

On 1/19/13 4:12 PM, "Mohammad Tariq" <donta...@gmail.com> wrote:

>Hello Austin,
>
>          I am sorry for the late response.
>
>Asaf has made a very valid point. Rowkey design is very crucial,
>especially if the data is going to be sequential (time-series kind of
>thing). You may end up with a hotspotting problem. Use pre-split tables
>or hash the keys to avoid that. It'll also allow you to fetch the
>results faster.
>
>Warm Regards,
>Tariq
>https://mtariq.jux.com/
>cloudfront.blogspot.com
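
To make Tariq's hashing suggestion concrete, here is a minimal sketch of
salting a sequential key (the bucket count and key layout are just assumptions
for illustration):

import org.apache.hadoop.hbase.util.Bytes;

public class SaltedRowKey {
  // Ideally matches the number of pre-split regions.
  private static final int BUCKETS = 16;

  // Prefix a sequential key (id + timestamp) with a salt bucket so
  // consecutive writes spread across regions instead of hotspotting one.
  public static byte[] rowKey(String sensorId, long timestamp) {
    String natural = sensorId + ":" + timestamp;
    int bucket = (natural.hashCode() & 0x7fffffff) % BUCKETS;
    // e.g. "07|sensor42:1358612345000"
    return Bytes.toBytes(String.format("%02d|%s", bucket, natural));
  }
}

The trade-off is that a scan over one logical key range now has to fan out
across all the salt buckets.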
>
>
>On Sun, Jan 20, 2013 at 1:20 AM, Asaf Mesika <asaf.mes...@gmail.com>
>wrote:
>
>> Start by telling us your row key design.
>> Look into pre-splitting your table regions.
>> I managed to get to 25 MB/sec write throughput in HBase using 1 region
>> server. If your data is evenly spread you can get around 7 times that
>> in a 10 region server environment, which should mean 1 GB takes about 4
>> seconds.
>>
>>
>> On Friday, January 18, 2013, praveenesh kumar wrote:
>>
>> > Hey,
>> > Can someone throw some pointers on what would be the best practice
>> > for bulk imports in HBase? That would be really helpful.
>> >
>> > Regards,
>> > Praveenesh
>> >
>> > On Thu, Jan 17, 2013 at 11:16 PM, Mohammad Tariq <donta...@gmail.com>
>> > wrote:
>> >
>> > > Just to add to whatever all the heavyweights have said above, your
>> > > MR job may not be as efficient as the MR job corresponding to your
>> > > Hive query. You can enhance the performance by setting the mapred
>> > > config parameters wisely and by tuning your MR job.
>> > >
>> > > Warm Regards,
>> > > Tariq
>> > > https://mtariq.jux.com/
>> > > cloudfront.blogspot.com
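
On the tuning side, if the MR job writes Puts through the normal client API,
the usual knobs look roughly like this (0.94-era API; the table name and the
values are illustrative only, not recommendations):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WritePathTuning {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Bigger client-side buffer so more Puts travel per RPC.
    conf.setLong("hbase.client.write.buffer", 8 * 1024 * 1024);

    HTable table = new HTable(conf, "mytable"); // placeholder table name
    table.setAutoFlush(false);                  // batch Puts on the client

    Put put = new Put(Bytes.toBytes("row-0001"));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value"));
    // Skipping the WAL speeds up a one-off import, but unflushed edits are
    // lost if a region server dies, so only do this for re-runnable loads.
    put.setWriteToWAL(false);
    table.put(put);

    table.flushCommits();
    table.close();
  }
}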
>> > >
>> > >
>> > > On Thu, Jan 17, 2013 at 10:39 PM, ramkrishna vasudevan <
>> > > ramkrishna.s.vasude...@gmail.com> wrote:
>> > >
>> > > > Hive is more for batch processing and HBase is more for real-time data.
>> > > >
>> > > > Regards
>> > > > Ram
>> > > >
>> > > > On Thu, Jan 17, 2013 at 10:30 PM, Anoop John
>> > > > <anoop.hb...@gmail.com> wrote:
>> > > >
>> > > > > In the case of Hive, data insertion just means placing the file
>> > > > > under the table path in HDFS. HBase needs to read the data and
>> > > > > convert it into its own format (HFiles); MR is doing this work.
>> > > > > So this makes it clear why HBase will be slower. :) As Michael
>> > > > > said, the read operation...
>> > > > >
>> > > > >
>> > > > >
>> > > > > -Anoop-
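
Since the thread started with bulk imports: the closest HBase gets to Hive's
'just move the file' behaviour is writing HFiles directly from the MR job and
then bulk-loading them, which skips the regular write path Anoop describes.
A rough sketch with the 0.94-era mapreduce classes (the table name, column
family and input parsing are placeholders):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadSketch {

  // Placeholder parsing: first tab-separated field is the row key,
  // the rest of the line is stored as a single cell.
  static class LineToPutMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().split("\t", 2);
      if (f.length < 2) {
        return; // skip malformed lines in this sketch
      }
      byte[] row = Bytes.toBytes(f[0]);
      Put put = new Put(row);
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(f[1]));
      ctx.write(new ImmutableBytesWritable(row), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "bulk-load-sketch");
    job.setJarByClass(BulkLoadSketch.class);
    job.setMapperClass(LineToPutMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    Path hfileDir = new Path(args[1]);
    FileOutputFormat.setOutputPath(job, hfileDir);

    // Wires in HFileOutputFormat plus a partitioner/reducer that line the
    // generated HFiles up with the (ideally pre-split) target table regions.
    HTable table = new HTable(conf, "mytable"); // placeholder table name
    HFileOutputFormat.configureIncrementalLoad(job, table);

    if (job.waitForCompletion(true)) {
      // Move the finished HFiles into the regions.
      new LoadIncrementalHFiles(conf).doBulkLoad(hfileDir, table);
    }
    table.close();
  }
}

LoadIncrementalHFiles then just moves the finished files into place, which is
why this route usually lands much closer to Hive's load times than pushing
every Put through the regular write path.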
>> > > > >
>> > > > > On Thu, Jan 17, 2013 at 10:14 PM, Austin Chungath
>> > > > > <austi...@gmail.com> wrote:
>> > > > >
>> > > > > > Hi,
>> > > > > > Problem: Hive took 6 mins to load a data set; HBase took 1 hr
>> > > > > > 14 mins. It's a 20 GB data set, approx. 230 million records.
>> > > > > > The data is in HDFS as a single text file. The cluster is 11
>> > > > > > nodes, 8 cores.
>> > > > > >
>> > > > > > I loaded this into Hive, partitioned by date, bucketed into 32
>> > > > > > buckets, and sorted. Time taken: 6 mins.
>> > > > > >
>> > > > > > I loaded the same data into HBase, on the same cluster, by
>> > > > > > writing a MapReduce job. It took 1 hr 14 mins. The cluster
>> > > > > > wasn't running anything else, and assuming that the code I
>> > > > > > wrote is good enough, what is it that makes HBase slower than
>> > > > > > Hive at loading the data?
>> > > > > >
>> > > > > > Thanks,
>> > > > > > Austin
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>

