Thanks.  I'm already turned off. :)  Thanks for the quick advice, Amandeep &
Ryan! (Saw that 1M inserts/sec; impressive.)

Otis

----- Original Message ----
> From: Ryan Rawson <[email protected]>
> To: [email protected]
> Sent: Wed, January 13, 2010 11:35:12 PM
> Subject: Re: MR on HDFS data inserted via HBase?
> 
> Hey,
> 
> It isn't as simple as 'read HBase's files'.  You would also need to:
> - read data that is only available in the regionserver's memory (the memstore)
> - merge multiple HFiles
> - do delete processing, etc., i.e. reproduce the regionserver read path
> 
> Due to the first point (the in-memory data), I don't feel this is a
> particularly fruitful approach.
> 
> -ryan
> 
> On Wed, Jan 13, 2010 at 8:28 PM, Otis Gospodnetic wrote:
> > Hello,
> >
> >
> > ----- Original Message ----
> >
> >> From: Amandeep Khurana 
> >
> >> HBase has its own file format. Reading data from it in your own job will
> >> not be trivial to write, but it is not impossible.
> >
> > You are referring to HTable, HFile, etc.?
> >
> >> Why would you want to use the underlying data files in the MR jobs? Any
> >> limitation in using the HBase api?
> >
> > Are you referring to writing a MR job that makes use of TableInputFormat
> > and TableOutputFormat, as mentioned at
> > http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#sink
> > ?
> >
> > I think that would work.
> >
> > But I'd also like to be able to run Hive/Pig scripts over the data, and I
> > *think* neither supports reading it from HBase.  They can obviously read it
> > from files in HDFS; that's why I was asking.  But it sounds like anything
> > wanting to read HBase's data without going through HBase's API, reading
> > from behind its back, would have to know how to read HFile & friends?
> > (And again, I think/assume Hive and Pig don't know how to do that.)
> >
> > Thanks,
> > Otis
> >
> >> On Wed, Jan 13, 2010 at 8:06 PM, Otis Gospodnetic
> >> <[email protected]> wrote:
> >>
> >> > Hello,
> >> >
> >> > If I import data into HBase, can I still run a hand-written MapReduce job
> >> > over that data in HDFS?
> >> > That is, not using TableInputFormat to read the data back out via HBase.
> >> >
> >> > Similarly, can one run Hive or Pig scripts against that data, but again,
> >> > without Hive or Pig reading the data via HBase, but rather getting to it
> >> > directly via HDFS?  I'm asking because I'm wondering whether storing
> >> > data in HBase means I can no longer use Hive and Pig to run my ad-hoc
> >> > jobs.
> >> >
> >> > Thanks,
> >> > Otis
> >> > --
> >> > Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
> >> >
> >> >
> >
> >
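
[Editor's note: for reference, the TableInputFormat approach discussed above can be sketched roughly as below. This is a hypothetical example against the HBase 0.20 mapreduce API, not code from this thread; the table name ("mytable") and class names are made up, and running it requires a live Hadoop/HBase cluster with the HBase jars on the job classpath.]

```java
import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

// Hypothetical row-counting MR job that reads through the HBase API
// (TableInputFormat under the hood), so memstore data, HFile merging,
// and delete processing are all handled by the regionserver read path.
public class RowCountExample {

  static class RowCountMapper extends TableMapper<Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private static final Text KEY = new Text("rows");

    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      // One Result per HBase row; just emit a count.
      context.write(KEY, ONE);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(HBaseConfiguration.create(), "hbase-row-count");
    job.setJarByClass(RowCountExample.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // fetch rows in batches for MR throughput
    scan.setCacheBlocks(false);  // full scans shouldn't pollute the block cache

    // Wires up TableInputFormat, the scan, and the mapper output types.
    TableMapReduceUtil.initTableMapperJob(
        "mytable", scan, RowCountMapper.class,
        Text.class, IntWritable.class, job);

    job.setNumReduceTasks(1);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Hive and Pig do not go through this path, which is why, as discussed above, they would be limited to data already flushed to HDFS unless they too reproduce the regionserver read path.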
