Hi there,

On top of what everybody else said, for more info on rowkey design and pre-splitting, see http://hbase.apache.org/book.html#schema (as well as other threads on this dist-list on that topic).
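To make the "hash the keys" and "pre-split" suggestions from the quoted thread a bit more concrete, here is a rough sketch in plain Python (no HBase client involved). The bucket count, the one-byte salt prefix, and the function names `salted_key` and `split_points` are all assumptions made for illustration, not part of any HBase API:

```python
import hashlib
from typing import List

NUM_BUCKETS = 16  # assumption: one salt bucket per pre-split region


def salted_key(row_key: str, buckets: int = NUM_BUCKETS) -> bytes:
    """Prefix the row key with a one-byte bucket derived from a hash of the key.

    Sequential keys (e.g. timestamps) land in different buckets, so writes
    are spread over several regions instead of hammering the last one.
    """
    raw = row_key.encode("utf-8")
    bucket = hashlib.md5(raw).digest()[0] % buckets
    return bytes([bucket]) + raw


def split_points(buckets: int = NUM_BUCKETS) -> List[bytes]:
    """Region boundaries for a table pre-split on the salt byte.

    N buckets need N - 1 split keys: 0x01, 0x02, ..., N - 1.
    """
    return [bytes([b]) for b in range(1, buckets)]


if __name__ == "__main__":
    for ts in ("20130117-000001", "20130117-000002", "20130117-000003"):
        print(ts, "->", salted_key(ts).hex())
    print("split keys:", [p.hex() for p in split_points()])
```

The split keys would then be handed to HBase when the table is created (for example through the `createTable` overload that accepts split keys). The trade-off is that a scan over a time range has to be issued once per bucket, since consecutive keys no longer sit next to each other.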
On 1/19/13 4:12 PM, "Mohammad Tariq" <donta...@gmail.com> wrote:

>Hello Austin,
>
>I am sorry for the late response.
>
>Asaf has made a very valid point. Rowkey design is crucial, especially
>if the data is gonna be sequential (timeseries kinda thing). You may end
>up with a hotspotting problem. Use pre-split tables or hash the keys to
>avoid that. It'll also allow you to fetch the results faster.
>
>Warm Regards,
>Tariq
>https://mtariq.jux.com/
>cloudfront.blogspot.com
>
>
>On Sun, Jan 20, 2013 at 1:20 AM, Asaf Mesika <asaf.mes...@gmail.com>
>wrote:
>
>> Start by telling us your row key design.
>> Check for pre-splitting your table regions.
>> I managed to get to 25 MB/sec write throughput in HBase using 1 region
>> server. If your data is evenly spread you can get around 7 times that
>> in a 10 region server environment. Should mean that 1 gig should take
>> 4 sec.
>>
>>
>> On Friday, January 18, 2013, praveenesh kumar wrote:
>>
>> > Hey,
>> > Can someone throw some pointers on what would be the best practice
>> > for bulk imports in HBase?
>> > That would be really helpful.
>> >
>> > Regards,
>> > Praveenesh
>> >
>> > On Thu, Jan 17, 2013 at 11:16 PM, Mohammad Tariq <donta...@gmail.com>
>> > wrote:
>> >
>> > > Just to add to whatever all the heavyweights have said above, your
>> > > MR job may not be as efficient as the MR job corresponding to your
>> > > Hive query. You can enhance the performance by setting the mapred
>> > > config parameters wisely and by tuning your MR job.
>> > >
>> > > Warm Regards,
>> > > Tariq
>> > > https://mtariq.jux.com/
>> > > cloudfront.blogspot.com
>> > >
>> > >
>> > > On Thu, Jan 17, 2013 at 10:39 PM, ramkrishna vasudevan
>> > > <ramkrishna.s.vasude...@gmail.com> wrote:
>> > >
>> > > > Hive is more for batch and HBase is more for real-time data.
>> > > >
>> > > > Regards
>> > > > Ram
>> > > >
>> > > > On Thu, Jan 17, 2013 at 10:30 PM, Anoop John
>> > > > <anoop.hb...@gmail.com> wrote:
>> > > >
>> > > > > In case of Hive, data insertion means placing the file under
>> > > > > the table path in HDFS. HBase needs to read the data and
>> > > > > convert it into its own format (HFiles). MR is doing this
>> > > > > work. So this makes it clear that HBase will be slower. :)
>> > > > > As Michael said the read operation...
>> > > > >
>> > > > > -Anoop-
>> > > > >
>> > > > > On Thu, Jan 17, 2013 at 10:14 PM, Austin Chungath
>> > > > > <austi...@gmail.com> wrote:
>> > > > >
>> > > > > > Hi,
>> > > > > > Problem: Hive took 6 mins to load a data set, HBase took
>> > > > > > 1 hr 14 mins.
>> > > > > > It's a 20 GB data set, approx 230 million records. The data
>> > > > > > is in HDFS, in a single text file. The cluster is 11 nodes,
>> > > > > > 8 cores.
>> > > > > >
>> > > > > > I loaded this in Hive, partitioned by date and bucketed
>> > > > > > into 32 and sorted. Time taken is 6 mins.
>> > > > > >
>> > > > > > I loaded the same data into HBase, in the same cluster, by
>> > > > > > writing a MapReduce job. It took 1 hr 14 mins. The cluster
>> > > > > > wasn't running anything else, and assuming that the code
>> > > > > > that I wrote is good enough, what is it that makes HBase
>> > > > > > slower than Hive in loading the data?
>> > > > > >
>> > > > > > Thanks,
>> > > > > > Austin
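
As a sanity check on the numbers quoted in this thread, here is the arithmetic spelled out in a few lines of Python. It is only a back-of-the-envelope sketch: it assumes 1 GB = 1024 MB, reads the "25mb/sec" figure as megabytes per second, and takes the "around 7 times" scaling for 10 region servers at face value. The helper names are made up for this example.

```python
MB_PER_GB = 1024  # assumption: binary gigabytes


def mb_per_sec(size_gb: float, seconds: float) -> float:
    """Effective throughput for loading size_gb in the given time."""
    return size_gb * MB_PER_GB / seconds


def seconds_to_load(size_gb: float, rate_mb_per_sec: float) -> float:
    """Time needed to load size_gb at a sustained write rate."""
    return size_gb * MB_PER_GB / rate_mb_per_sec


# Figures reported in the thread: 20 GB, 6 min in Hive, 1 hr 14 min in HBase.
hive_rate = mb_per_sec(20, 6 * 60)
hbase_rate = mb_per_sec(20, 74 * 60)

# 25 MB/sec on one region server, "around 7 times that" on ten.
cluster_rate = 25 * 7

if __name__ == "__main__":
    print("Hive load:  %.1f MB/s" % hive_rate)
    print("HBase load: %.1f MB/s" % hbase_rate)
    print("1 GB at %d MB/s:  %.1f s" % (cluster_rate, seconds_to_load(1, cluster_rate)))
    print("20 GB at %d MB/s: %.0f s" % (cluster_rate, seconds_to_load(20, cluster_rate)))
```

Under those assumptions the HBase load ran at roughly 4.6 MB/s against roughly 57 MB/s for Hive. At 175 MB/s, 1 GB works out closer to 6 seconds than 4, and the full 20 GB to about 2 minutes, which is the gap that rowkey design and pre-splitting are meant to close.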