Thanks. I'm already turned off. :) Thanks for the quick advice, Amandeep & Ryan! (saw that 1M inserts/sec, impressive)
Otis

----- Original Message ----
> From: Ryan Rawson <[email protected]>
> To: [email protected]
> Sent: Wed, January 13, 2010 11:35:12 PM
> Subject: Re: MR on HDFS data inserted via HBase?
>
> Hey,
>
> It isn't just as simple as 'read HBase's files'. You will also need:
> - data that is only available in memory of the regionserver
> - merge multiple HFiles
> - do delete processing, etc, ie: reproduce the Regionserver read path
>
> Due to #1, I don't feel like this is a particularly fruitful avenue of
> approach.
>
> -ryan
>
> On Wed, Jan 13, 2010 at 8:28 PM, Otis Gospodnetic wrote:
> > Hello,
> >
> > ----- Original Message ----
> >
> >> From: Amandeep Khurana
> >
> >> HBase has its own file format. Reading data from it in your own job will
> >> not be trivial to write, but not impossible.
> >
> > You are referring to HTable, HFile, etc.?
> >
> >> Why would you want to use the underlying data files in the MR jobs? Any
> >> limitation in using the HBase api?
> >
> > Are you referring to writing a MR job that makes use of TableInputFormat
> > and TableOutputFormat as mentioned on
> > http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#sink
> > ?
> >
> > I think that would work.
> >
> > But I'd also like to be able to run Hive/Pig scripts over the data, and I
> > *think* neither support reading it from HBase. But they can obviously read
> > it from files in HDFS, that's why I was asking. But it sounds like anything
> > wanting to read HBase's data without going through the HBase's API and
> > reading from behind its back would have to know how to read from HFile &
> > friends?
> > (and again, I think/assume Hive and Pig don't know how to do that)
> >
> > Thanks,
> > Otis
> >
> >> On Wed, Jan 13, 2010 at 8:06 PM, Otis Gospodnetic <
> >> [email protected]> wrote:
> >>
> >> > Hello,
> >> >
> >> > If I import data into HBase, can I still run a hand-written MapReduce job
> >> > over that data in HDFS?
> >> > That is, not using TableInputFormat to read the data back out via HBase.
> >> >
> >> > Similarly, can one run Hive or Pig scripts against that data, but again,
> >> > without Hive or Pig reading the data via HBase, but rather getting to it
> >> > directly via HDFS? I'm asking because I'm wondering whether storing
> >> > data in HBase means I can no longer use Hive and Pig to run my ad-hoc
> >> > jobs.
> >> >
> >> > Thanks,
> >> > Otis
> >> > --
> >> > Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
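For anyone finding this thread later, Ryan's three points can be illustrated with a toy Python model of the read path. This is only a sketch and not HBase code: the `Cell` tuple, the `put`/`delete` helpers, `read_view`, and the plain lists standing in for the memstore and the HFiles are all hypothetical names made up for illustration, and real HBase delete markers and versioning are more involved than this.

```python
from collections import namedtuple

# Hypothetical cell record; real HBase KeyValues carry more fields than this.
Cell = namedtuple("Cell", ["row", "column", "timestamp", "value", "is_delete"])


def put(row, column, timestamp, value):
    """Build a normal cell."""
    return Cell(row, column, timestamp, value, False)


def delete(row, column, timestamp):
    """Build a delete marker (tombstone) for one column of one row."""
    return Cell(row, column, timestamp, None, True)


def read_view(memstore, hfiles):
    """Toy read path: merge the memstore with every file, keep the newest
    cell per (row, column), then hide anything whose newest cell is a delete."""
    latest = {}
    for source in [memstore, *hfiles]:
        for cell in source:
            key = (cell.row, cell.column)
            current = latest.get(key)
            newer = current is None or cell.timestamp > current.timestamp
            # Assumption for this sketch: on a timestamp tie the delete wins.
            tie_delete = (
                current is not None
                and cell.timestamp == current.timestamp
                and cell.is_delete
            )
            if newer or tie_delete:
                latest[key] = cell
    return {k: c.value for k, c in sorted(latest.items()) if not c.is_delete}


# Two flushed files plus data that so far exists only in memory.
hfile_old = [put("r1", "cf:a", 1, "v1"), put("r2", "cf:a", 1, "x")]
hfile_new = [put("r1", "cf:a", 2, "v2"), delete("r2", "cf:a", 3)]
memstore = [put("r3", "cf:a", 4, "only-in-memory")]

full = read_view(memstore, [hfile_old, hfile_new])  # what the regionserver would serve
naive = read_view([], [hfile_old])                  # what a job reading one file would see
```

Comparing `full` with `naive` shows the problem: the naive reader returns a stale value for `r1`, still sees the deleted `r2`, and misses `r3` entirely because it was never flushed to a file.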
