Hey,

It isn't as simple as 'read HBase's files'. You will also need to:
- read data that is only available in the memory of the regionserver
- merge multiple HFiles
- do delete processing, etc.

I.e., reproduce the regionserver read path.
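To make the points above concrete, here is a toy sketch of what "reproducing the read path" involves. It is not HBase code: `Cell`, `sort_key`, and `read_path` are hypothetical names, each row is assumed to hold a single column, and versions, column families, and TTLs are ignored. It only illustrates why reading one file on its own gives wrong answers.

```python
import heapq
from typing import Iterable, Iterator, List, NamedTuple, Optional, Tuple


class Cell(NamedTuple):
    """Hypothetical, simplified stand-in for a stored cell."""
    row: str
    timestamp: int
    is_delete: bool          # True for a delete marker (tombstone)
    value: Optional[str]


def sort_key(cell: Cell) -> Tuple[str, int, bool]:
    # Rows ascending, newest timestamp first; at equal timestamps
    # a delete marker sorts ahead of a put.
    return (cell.row, -cell.timestamp, not cell.is_delete)


def read_path(memstore: Iterable[Cell],
              hfiles: List[Iterable[Cell]]) -> Iterator[Tuple[str, str]]:
    """Merge in-memory cells with every file and apply delete markers."""
    sources = [sorted(memstore, key=sort_key)]
    sources += [sorted(f, key=sort_key) for f in hfiles]
    current_row = None
    resolved = False
    for cell in heapq.merge(*sources, key=sort_key):
        if cell.row != current_row:
            current_row, resolved = cell.row, False
        if resolved:
            continue          # older cell, already superseded or masked
        resolved = True       # the newest cell decides the row's fate
        if not cell.is_delete:
            yield cell.row, cell.value


# Assumed example data: two on-disk files plus unflushed in-memory writes.
hfile_old = [Cell("a", 1, False, "a1"), Cell("b", 1, False, "b1"),
             Cell("c", 1, False, "c1")]
hfile_new = [Cell("a", 2, False, "a2"), Cell("b", 2, True, None)]
memstore = [Cell("c", 3, False, "c3"), Cell("d", 3, False, "d3")]

result = list(read_path(memstore, [hfile_old, hfile_new]))
print(result)
```

Reading `hfile_old` alone would return a stale value for row "a", resurrect the deleted row "b", and miss row "d" entirely, since that row exists only in memory.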
Due to #1, I don't feel like this is a particularly fruitful avenue of approach.

-ryan

On Wed, Jan 13, 2010 at 8:28 PM, Otis Gospodnetic <[email protected]> wrote:
> Hello,
>
> ----- Original Message ----
>
>> From: Amandeep Khurana <[email protected]>
>
>> HBase has its own file format. Reading data from it in your own job will not
>> be trivial to write, but not impossible.
>
> You are referring to HTable, HFile, etc.?
>
>> Why would you want to use the underlying data files in the MR jobs? Any
>> limitation in using the HBase api?
>
> Are you referring to writing a MR job that makes use of TableInputFormat and
> TableOutputFormat as mentioned on
> http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#sink
> ?
>
> I think that would work.
>
> But I'd also like to be able to run Hive/Pig scripts over the data, and I
> *think* neither support reading it from HBase. But they can obviously read
> it from files in HDFS, that's why I was asking. But it sounds like anything
> wanting to read HBase's data without going through the HBase's API and
> reading from behind its back would have to know how to read from HFile &
> friends?
> (and again, I think/assume Hive and Pig don't know how to do that)
>
> Thanks,
> Otis
>
>> On Wed, Jan 13, 2010 at 8:06 PM, Otis Gospodnetic <
>> [email protected]> wrote:
>>
>> > Hello,
>> >
>> > If I import data into HBase, can I still run a hand-written MapReduce job
>> > over that data in HDFS?
>> > That is, not using TableInputFormat to read the data back out via HBase.
>> >
>> > Similarly, can one run Hive or Pig scripts against that data, but again,
>> > without Hive or Pig reading the data via HBase, but rather getting to it
>> > directly via HDFS? I'm asking because I'm wondering whether storing data
>> > in HBase means I can no longer use Hive and Pig to run my ad-hoc jobs.
>> >
>> > Thanks,
>> > Otis
>> > --
>> > Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
