Hey,

It isn't as simple as just 'reading HBase's files'.  You would also need to:
1. get at data that is only available in the memory of the regionserver
2. merge multiple HFiles
3. do delete processing, etc., i.e. reproduce the regionserver read path

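Because of the first point in particular (recent writes may exist only in
regionserver memory, so the files alone are incomplete), I don't think this
is a particularly fruitful approach.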
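
To make the list above concrete, here is a toy sketch of what "reproducing the
read path" involves. This is not HBase code and uses no HBase API: the cell
tuples, the TOMBSTONE marker, and read_rows() are all made up for
illustration, and real delete handling (column, family, and version deletes)
is considerably more involved.

```python
import heapq
from itertools import groupby

# Hypothetical delete marker for this toy model (not an HBase type).
TOMBSTONE = None


def read_rows(memstore, hfiles):
    """Yield (row, value) for the newest live cell of each row.

    Each source is a list of (row, timestamp, value) tuples, already
    sorted by row ascending and timestamp descending.
    """
    sources = [memstore] + list(hfiles)
    # Merge all sorted sources so the newest cell of a row comes first.
    merged = heapq.merge(*sources, key=lambda c: (c[0], -c[1]))
    for row, cells in groupby(merged, key=lambda c: c[0]):
        _, _, value = next(cells)  # newest version of this row
        if value is not TOMBSTONE:  # a newer delete hides older values
            yield row, value


hfile_old = [("a", 1, "a1"), ("b", 1, "b1"), ("c", 1, "c1")]
hfile_new = [("a", 2, "a2"), ("b", 2, TOMBSTONE)]
memstore = [("c", 3, "c3"), ("d", 3, "d3")]  # not yet flushed to any file

full_view = list(read_rows(memstore, [hfile_old, hfile_new]))
files_only = list(read_rows([], [hfile_old, hfile_new]))
```

In this toy data, files_only still sees the stale value for row "c" and never
sees row "d" at all, because both exist only in the memstore. That gap is the
reason reading the files from behind HBase's back gives incomplete results.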

-ryan

On Wed, Jan 13, 2010 at 8:28 PM, Otis Gospodnetic
<[email protected]> wrote:
> Hello,
>
>
> ----- Original Message ----
>
>> From: Amandeep Khurana <[email protected]>
>
>> HBase has its own file format. Reading data from it in your own job will not
>> be trivial to write, but not impossible.
>
> You are referring to HTable, HFile, etc.?
>
>> Why would you want to use the underlying data files in the MR jobs? Any
>> limitation in using the HBase api?
>
> Are you referring to writing a MR job that makes use of TableInputFormat and 
> TableOutputFormat as mentioned on 
> http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#sink
>  ?
>
> I think that would work.
>
> But I'd also like to be able to run Hive/Pig scripts over the data, and I 
> *think* neither supports reading it from HBase.  But they can obviously read 
> it from files in HDFS, which is why I was asking.  It sounds like anything 
> that wants to read HBase's data from behind its back, without going through 
> HBase's API, would have to know how to read HFile & friends?
> (and again, I think/assume Hive and Pig don't know how to do that)
>
> Thanks,
> Otis
>
>> On Wed, Jan 13, 2010 at 8:06 PM, Otis Gospodnetic <
>> [email protected]> wrote:
>>
>> > Hello,
>> >
>> > If I import data into HBase, can I still run a hand-written MapReduce job
>> > over that data in HDFS?
>> > That is, not using TableInputFormat to read the data back out via HBase.
>> >
>> > Similarly, can one run Hive or Pig scripts against that data, but again,
>> > without Hive or Pig reading the data via HBase, but rather getting to it
>> > directly via HDFS?  I'm asking because I'm wondering whether storing data
>> > in HBase means I can no longer use Hive and Pig to run my ad-hoc jobs.
>> >
>> > Thanks,
>> > Otis
>> > --
>> > Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
>> >
>> >
>
>
