On 09.03.2015 at 05:01, lars hofhansl wrote:
> Thanks for looking into this Wilm.
> I would honestly suggest just writing larger LOBs directly into HDFS and
> storing only the location in HBase.
> You can do that with a relatively simple protocol, with reasonable safety:
> 1. Write the metadata row into HBase.
> 2. Write the LOB into HDFS.
> 3. When the LOB was written, update the metadata row with the LOB's location.
> 4. Report success back to the client.

That would be a client-side approach, which would of course work, but which has some downsides (e.g. metadata and file getting out of sync, as you pointed out). On the other hand ... no large change to core HBase code ;). To make sure we mean the same thing, a rough sketch of that protocol follows below.
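Something like this, I think (a minimal sketch against the stock HBase 1.x client and HDFS APIs; the table name "lobs", the column family "lob", and the /lobs directory layout are made up for illustration):

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

import java.io.IOException;
import java.util.Base64;

public class BigPutClient {
    private static final byte[] CF  = Bytes.toBytes("lob");
    private static final byte[] LOC = Bytes.toBytes("location");

    // Steps 1-4 of the protocol: metadata row first, then the blob,
    // then the location update that marks the write as committed.
    public static void bigPut(Connection conn, FileSystem fs,
                              byte[] rowKey, byte[] lobBytes) throws IOException {
        try (Table table = conn.getTable(TableName.valueOf("lobs"))) {
            // 1. Write the metadata row, location still unset.
            Put meta = new Put(rowKey);
            meta.addColumn(CF, Bytes.toBytes("size"),
                    Bytes.toBytes((long) lobBytes.length));
            table.put(meta);

            // 2. Write the LOB into HDFS; the file name encodes the row key
            //    so a cleanup job can map files back to metadata rows.
            Path lobPath = new Path("/lobs/" +
                    Base64.getUrlEncoder().withoutPadding().encodeToString(rowKey));
            try (FSDataOutputStream out = fs.create(lobPath)) {
                out.write(lobBytes);
            }

            // 3. Update the metadata row with the LOB's location.
            Put update = new Put(rowKey);
            update.addColumn(CF, LOC, Bytes.toBytes(lobPath.toString()));
            table.put(update);
        }
        // 4. Returning without an exception reports success to the caller.
    }
}

BigGet would just be the reverse: read the metadata row, then either return the inline value (for the small case you mention below) or open the HDFS path stored in the location column.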
But of course this only solves the small-files problem (which I'm facing) halfway. With your 1 MB threshold, a mean "LOB" size of 5 MB, and the namenode limiting me to roughly 5M "larger" files, I end up at about 5M * 5 MB = 25 TB of raw LOB data, which isn't that large. Or 5M * 20 MB = 100 TB for a 10 MB threshold and a 20 MB mean, or 200 TB for a 10 MB threshold and doubled namenode RAM, etc. So I can catch the really small stuff, but I'm still constrained for "slightly larger MOBs" or "small LOBs". However, this is still way beyond my current application's needs, so the problem is more of an academic one :/.

> If the LOB is small... maybe < 1mb, you'd just write it into HBase as a value
> (preferably into a different column family)
>
> If the process fails at #2 or #3 you'd have an orphaned file in HDFS, but
> those are easy to find (metadata rows for which the location is unset, and
> older than - say - a few days)

I would run a MapReduce job over the file names and look each one up in HBase; if no committed metadata row is found, delete the file. But yeah, somehow in a client-side fashion. A sketch of that cleanup pass follows below.

> Your BigPut and BigGet could just be an API around this process.

Yup.
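The cleanup pass could look roughly like this (a single-process sketch; the real thing would be the MapReduce job over the same listing, and it assumes the file-name-encodes-the-row-key convention from the sketch above):

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

import java.io.IOException;
import java.util.Base64;

public class OrphanCleaner {
    private static final byte[] CF  = Bytes.toBytes("lob");
    private static final byte[] LOC = Bytes.toBytes("location");
    private static final long GRACE_MS = 3L * 24 * 60 * 60 * 1000; // "a few days"

    // Delete LOB files whose metadata row never got a committed location,
    // skipping young files whose writer may still be between steps 2 and 3.
    public static void clean(Connection conn, FileSystem fs) throws IOException {
        long cutoff = System.currentTimeMillis() - GRACE_MS;
        try (Table table = conn.getTable(TableName.valueOf("lobs"))) {
            for (FileStatus status : fs.listStatus(new Path("/lobs"))) {
                if (status.getModificationTime() > cutoff) {
                    continue; // too young, possibly an in-flight write
                }
                byte[] rowKey = Base64.getUrlDecoder()
                        .decode(status.getPath().getName());
                // Probe only the location column: a row without it means
                // step 3 never completed, so the file is an orphan.
                Get probe = new Get(rowKey);
                probe.addColumn(CF, LOC);
                if (!table.exists(probe)) {
                    fs.delete(status.getPath(), false);
                }
            }
        }
    }
}

The grace period is what makes this safe to run at any time; without it the cleaner could race a writer that is between steps 2 and 3.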
Since two independent developers gave the same answer, I'll drop the idea and pursue the client-side approach.

Thanks for the fast reply,
Wilm