Hi Stack,

Thanks for taking a look! We have an Apache license file in the root of the project; I am not sure if we need to put it in every file. Will check with the lawyers.

Regarding the first and last slice, the problem is that I have no way of knowing what the first and last key values are, respectively. With the first slice I can maybe cache the first key I see, and use that in conjunction with the end of the region to calculate the size of the keyspace; but with the last region, the max is infinity, so I can't really estimate how much more I have left until I have none. Do regions store any metadata that contains a rough count of the number of records they hold? I guess they only keep track of the byte size of the data, not the number of records per se. Maybe I can get the total byte size of the region, and calculate offsets based on the size of the returned data? This would likely be wrong due to pushed-down projections and filters, of course. Any other ideas? How do people normally handle this when writing regular MR jobs that scan HBase tables?

I suspect this is actually a bit of a problem, btw -- since I don't report the amount of remaining work for these slices accurately, and I (hopefully) do a reasonable job for the ones where I can calculate the size of the keyspace, speculative execution may get overeager with these two slices.

Will take care of the HBaseConfiguration thing.

PigCounterHelper just deals with some oddities of Hadoop counters (they may not be available when you first try to increment a counter -- the helper buffers increment requests until the reporter becomes available). Are HBase counters special things, or also just Hadoop counters under the covers?

The lzo files are probably unrelated; there shouldn't be anything LZO-specific in the HBase code. We are, in fact, lzo'ing HBase content in the sense that that's the compression we have for HDFS, and I think HBase is supposed to inherit that. But that's at the HDFS level. The ElephantBird library is a grab-bag of all kinds of Hadoop stuff, so you can find lots of things unrelated to HBase in there, mostly dealing with protocol buffers and LZO compression.
The only 2 HBase files are HBaseLoader and HBaseSlice.

Thanks again for checking it out!

-Dmitriy

On Mon, May 3, 2010 at 8:48 PM, Stack <st...@duboce.net> wrote:

> Hey Dmitry:
>
> I took a quick look.
>
> Your files are missing a copyright?
>
> I like your using of BinaryComparatory and the lte, gte, options in
> skipRegion setting up filters.
>
> Regards:
>
> "    // No way to know max.. just return 0. Sorry, reporting on the last slice is janky.
>      // So is reporting on the first slice, by the way -- it will start out too high, possibly at 100%.
>      if (endRow_.length==0) return 0;
> "
>
> ...if your keys are kinda regular, you might be able to do better in a
> slice. See in Bytes where there are methods that do BigDecimal math.
> You can ask them to divide the slice. Might work. Then you could do
> progress (Looks like you are doing some later in the file -- does it
> work?).
>
> Try to use the same version of " HBaseConfiguration conf = new
> HBaseConfiguration();" throughout rather than create a new one each
> time. Can be more costly.
>
> Whats this?
>
> if (counterHelper_ == null) counterHelper_ = new PigCounterHelper();
>
> A pig counter? You don't want to use hbase counters?
>
> Whats the lzo stuff about? It seems to be for loading files. Are you
> lzo'ing your hbase content?
>
> Oh man ... base64'ing....
>
> There are two files w/ mention of hbase, is that right?
>
> St.Ack
>
> On Mon, May 3, 2010 at 12:23 PM, Dmitriy Ryaboy <dmit...@twitter.com> wrote:
> > Hi folks,
> > I recently rewrote the Pig HBase loader to work with binary data, push down
> > filters, and do other things that make it more versatile.
> > If you use, or plan to use, both Pig and HBase, please try it out, take a
> > look at the code, let me know what you think. I am just starting to learn
> > about HBase, so I am especially interested to learn if there are HBase
> > capabilities I am not using and should be.
> >
> > The code is part of our "ElephantBird" project, here:
> >
> > http://github.com/kevinweil/elephant-bird/
> > and more specifically:
> >
> > http://github.com/kevinweil/elephant-bird/tree/master/src/java/com/twitter/elephantbird/pig/load/
> >
> > Thanks,
> > -Dmitriy
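
One way to picture the progress-reporting idea discussed in this thread (interpolating the current row key between the slice's start and end keys) is the sketch below. It is only an illustration under stated assumptions, not the code in HBaseSlice and not an HBase API: the class `KeyspaceProgress` and its methods are hypothetical, keys are treated as unsigned big-endian numbers right-padded to a common length, an empty start row is treated as all 0x00 bytes, and an empty end row (the last region) is treated as all 0xFF bytes. That last assumption is a guess at the "infinity" bound, so progress on the last slice will tend to be under-reported when real keys cluster low in the keyspace.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.MathContext;
import java.util.Arrays;

/** Hypothetical helper; not part of HBase, Pig, or Elephant Bird. */
final class KeyspaceProgress {

    /** Right-pads a key to len bytes with the given fill byte. */
    static byte[] pad(byte[] key, int len, byte fill) {
        byte[] out = new byte[len];
        Arrays.fill(out, fill);
        System.arraycopy(key, 0, out, 0, Math.min(key.length, len));
        return out;
    }

    /** Interprets the bytes as an unsigned big-endian integer. */
    static BigInteger toUnsigned(byte[] b) {
        return new BigInteger(1, b);
    }

    /**
     * Estimates the fraction of [startRow, endRow) already scanned,
     * given the row key most recently returned by the scanner.
     * The result is clamped to [0, 1].
     */
    static float progress(byte[] startRow, byte[] endRow, byte[] currentRow) {
        int len = Math.max(currentRow.length,
                Math.max(startRow.length, endRow.length));
        if (len == 0) {
            return 0f; // nothing to interpolate over
        }
        // Assumption: an empty end row means "end of table", modeled as 0xFF...FF.
        byte endFill = endRow.length == 0 ? (byte) 0xFF : (byte) 0x00;

        BigInteger lo = toUnsigned(pad(startRow, len, (byte) 0x00));
        BigInteger hi = toUnsigned(pad(endRow, len, endFill));
        BigInteger cur = toUnsigned(pad(currentRow, len, (byte) 0x00));

        BigInteger range = hi.subtract(lo);
        if (range.signum() <= 0) {
            return 0f; // degenerate or inverted range
        }
        double frac = new BigDecimal(cur.subtract(lo))
                .divide(new BigDecimal(range), MathContext.DECIMAL64)
                .doubleValue();
        return (float) Math.max(0.0, Math.min(1.0, frac));
    }
}
```

For example, `KeyspaceProgress.progress(new byte[]{0x00}, new byte[]{0x10}, new byte[]{0x08})` places the current key halfway through the slice. Because the estimate depends only on key position, pushed-down projections and filters do not skew it, but it is only as good as the assumption that keys are spread roughly evenly over the range.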