Hi everyone,

HBase allows clients to load data by generating HFiles in a MapReduce job and then loading those HFiles into HBase by running the CompleteBulkLoad tool. We'd like to enable this behavior in Crunch.
Getting Crunch to generate HFiles as the result of a job is as simple as configuring the correct output format. The question of where and when to invoke the CompleteBulkLoad tool on those generated files is a little trickier.

I originally posed this question to just Josh, but on his suggestion I thought I'd open it up to the whole group. Josh's original response is below; it suggests adding a callback mechanism to Target. This sounds like a good idea to me. Does anyone else have thoughts or ideas on the issue?

Thanks!
-Kiyan

From Josh:

Are you asking from the Crunch perspective, or the HBase perspective? HBase has the CompleteBulkLoad tool, so I'm assuming you're asking about the right way to wire it up into Crunch?

http://hbase.apache.org/book/arch.bulk.load.html

It seems like we would want a callback on Targets that would notify them that the output they were interested in had been generated, so that they can do whatever subsequent processing they want on it, right? That could either be a hook on Target itself, or some sub-interface of Target that we might check for at the end of a job, but it seems like putting it on Target itself is the right approach. Is that what you guys were contemplating?

Also, feel free to put this on crunch-dev. I'm sure other folks will be interested even if they don't have a lot to contribute to the implementation discussion.
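
To make the callback idea a bit more concrete, here is a minimal sketch of the "hook on Target itself" option in plain Java. It is only an illustration under stated assumptions: the `Target` interface below is a simplified stand-in and not Crunch's actual interface, and every name in it (`onOutputGenerated`, `HFileTarget`, `PlainFileTarget`, `finishJob`) is hypothetical. The bulk-load step is a placeholder that just records the path; it does not call into HBase.

```java
import java.util.ArrayList;
import java.util.List;

public class TargetCallbackSketch {

    /** Simplified stand-in for an output target; NOT Crunch's actual Target interface. */
    interface Target {
        String outputPath();

        /** Hypothetical hook: called once the job has written this target's output. No-op by default. */
        default void onOutputGenerated() {}
    }

    /** A target that needs no post-processing and keeps the default no-op hook. */
    static class PlainFileTarget implements Target {
        private final String path;
        PlainFileTarget(String path) { this.path = path; }
        @Override public String outputPath() { return path; }
    }

    /** A target for generated HFiles; its hook is where a bulk load would be triggered. */
    static class HFileTarget implements Target {
        private final String path;
        final List<String> bulkLoaded = new ArrayList<>();
        HFileTarget(String path) { this.path = path; }
        @Override public String outputPath() { return path; }
        @Override public void onOutputGenerated() {
            // Placeholder: a real implementation would run HBase's CompleteBulkLoad step on `path` here.
            bulkLoaded.add(path);
        }
    }

    /** Stand-in for the end of a job: notify every target that its output now exists. */
    static void finishJob(List<Target> targets) {
        for (Target t : targets) {
            t.onOutputGenerated();
        }
    }

    public static void main(String[] args) {
        HFileTarget hfiles = new HFileTarget("/tmp/hfiles");
        List<Target> targets = new ArrayList<>();
        targets.add(new PlainFileTarget("/tmp/text"));
        targets.add(hfiles);
        finishJob(targets);
        System.out.println(hfiles.bulkLoaded);
    }
}
```

With the hook on the base interface, the job-completion code does not need to know which targets care about the callback. The sub-interface alternative would instead need a type check at the end of the job before invoking the hook.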
