Hi Stack,
Thanks for taking a look!

We have an Apache license file in the root of the project; I am not sure
whether we need to put it in every file. Will check with the lawyers.

Regarding the first and last slices, the problem is that I have no way of
knowing the first and last key values, respectively. With the first slice
I can maybe cache the first key I see and use it together with the end of
the region to calculate the size of the keyspace. With the last region,
though, the max is infinity, so I can't really estimate how much I have
left until I have none. Do regions store any metadata that contains a
rough count of the number of records they hold? I guess they only keep
track of the byte size of the data, not the number of records per se.
Maybe I can get the total byte size of the region and calculate offsets
based on the size of the returned data? This would likely be wrong due to
pushed-down projections and filters, of course. Any other ideas? How do
people normally handle this when writing regular MR jobs that scan HBase
tables?
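
For slices with a known end key, one way to sketch the estimate (along the
lines of the BigDecimal suggestion in Stack's note below) is to treat row
keys as unsigned integers and interpolate. This is a rough sketch, not the
loader's actual code: the class KeyProgress and its method names are made
up for illustration, and it assumes keys are spread fairly evenly over the
range. For the first slice, the first key seen would be passed as start;
for the last slice (empty end key) it still gives up and returns 0.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.RoundingMode;
import java.util.Arrays;

// Hypothetical helper, for illustration only.
final class KeyProgress {

    // Right-pad the key with zero bytes to a common length and read it as
    // an unsigned big-endian integer, so keys of different lengths compare
    // in the same order as their raw bytes.
    static BigInteger toNumber(byte[] key, int len) {
        return new BigInteger(1, Arrays.copyOf(key, len));
    }

    // Fraction of the [start, end) key range covered once `current` is
    // reached. Returns 0 when the end key is empty (open-ended last slice).
    static float estimate(byte[] start, byte[] end, byte[] current) {
        if (end.length == 0) {
            return 0f; // no upper bound, so nothing to interpolate against
        }
        int len = Math.max(start.length, Math.max(end.length, current.length));
        BigInteger lo = toNumber(start, len);
        BigInteger range = toNumber(end, len).subtract(lo);
        BigInteger done = toNumber(current, len).subtract(lo);
        if (range.signum() <= 0 || done.signum() <= 0) {
            return 0f;
        }
        if (done.compareTo(range) >= 0) {
            return 1f;
        }
        return new BigDecimal(done)
                .divide(new BigDecimal(range), 4, RoundingMode.HALF_UP)
                .floatValue();
    }
}
```

With skewed keys the number will be off, but it should at least move
monotonically from 0 to 1 as the scan advances.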

I suspect this is actually a bit of a problem, btw -- since I don't report
the amount of remaining work for these slices accurately, and I (hopefully)
do a reasonable job for the ones where I can calculate the size of the
keyspace, speculative execution may get overeager with these two slices.

Will take care of the HBaseConfiguration thing.

PigCounterHelper just deals with some oddities of Hadoop counters (they may
not be available when you first try to increment a counter -- the helper
buffers increment requests until the reporter becomes available). Are HBase
counters special things or also just Hadoop counters under the covers?
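
To make the buffering idea concrete, here is a minimal sketch of that
pattern. It is not PigCounterHelper's actual code: BufferedCounters and
the CounterSink interface are hypothetical stand-ins (CounterSink plays
the role of the Hadoop reporter), and a null sink means "not available
yet".

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of buffering counter increments until a sink exists.
final class BufferedCounters {

    // Stand-in for whatever object actually records the counters.
    interface CounterSink {
        void increment(String group, String name, long delta);
    }

    // Pending deltas keyed by (group, name).
    private final Map<List<String>, Long> pending = new HashMap<>();

    // Record an increment. If the sink is not available yet (null), the
    // delta is buffered; otherwise buffered deltas are flushed first.
    void increment(CounterSink sink, String group, String name, long delta) {
        if (sink == null) {
            pending.merge(Arrays.asList(group, name), delta, Long::sum);
            return;
        }
        flush(sink);
        sink.increment(group, name, delta);
    }

    // Push every buffered delta to the sink and clear the buffer.
    private void flush(CounterSink sink) {
        for (Map.Entry<List<String>, Long> e : pending.entrySet()) {
            sink.increment(e.getKey().get(0), e.getKey().get(1), e.getValue());
        }
        pending.clear();
    }

    // Number of distinct counters that still have buffered deltas.
    int pendingCount() {
        return pending.size();
    }
}
```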

The lzo files are probably unrelated; there shouldn't be anything
LZO-specific in the HBase code. We are, in fact, lzo'ing hbase content in
the sense that that's the compression we have for HDFS, and I think HBase is
supposed to inherit that. But that's at the HDFS level. The ElephantBird
library is a grab-bag of all kinds of hadoop stuff, so you can find lots of
things unrelated to HBase in there, mostly dealing with protocol buffers and
LZO compression.

The only two HBase files are HBaseLoader and HBaseSlice.

Thanks again for checking it out!

-Dmitriy

On Mon, May 3, 2010 at 8:48 PM, Stack <st...@duboce.net> wrote:

> Hey Dmitry:
>
> I took a quick look.
>
> Your files are missing a copyright?
>
> I like your using of BinaryComparatory and the lte, gte, options in
> skipRegion setting up filters.
>
> Regards:
>
> "    // No way to know max.. just return 0. Sorry, reporting on the last slice is janky.
>    // So is reporting on the first slice, by the way -- it will start out too high, possibly at 100%.
>    if (endRow_.length==0) return 0;
> "
>
>
> ...if your keys are kinda regular, you might be able to do better in a
> slice.  See in Bytes where there are methods that do BigDecimal math.
> You can ask them to divide the slice.  Might work.  Then you could do
> progress (Looks like you are doing some later in the file -- does it
> work?).
>
> Try to use the same version of "   HBaseConfiguration conf = new
> HBaseConfiguration();" throughout rather than create a new one each
> time.  Can be more costly.
>
> Whats this?
>
> if (counterHelper_ == null) counterHelper_ = new PigCounterHelper();
>
> A pig counter? You don't want to use hbase counters?
>
> Whats the lzo stuff about?  It seems to be for loading files.  Are you
> lzo'ing your hbase content?
>
> Oh man ... base64'ing....
>
> There are two files w/ mention of hbase, is that right?
>
> St.Ack
>
>
>
> On Mon, May 3, 2010 at 12:23 PM, Dmitriy Ryaboy <dmit...@twitter.com>
> wrote:
> > Hi folks,
> > I recently rewrote the Pig HBase loader to work with binary data, push
> > down filters, and do other things that make it more versatile.
> > If you use, or plan to use, both Pig and HBase, please try it out, take a
> > look at the code, let me know what you think.  I am just starting to
> > learn about HBase, so I am especially interested to learn if there are
> > HBase capabilities I am not using and should be.
> >
> > The code is part of our "ElephantBird" project, here:
> >
> > http://github.com/kevinweil/elephant-bird/
> > and more specifically:
> >
> > http://github.com/kevinweil/elephant-bird/tree/master/src/java/com/twitter/elephantbird/pig/load/
> >
> > Thanks,
> > -Dmitriy
> >
>
