Hi Andrew,
Thx for your replies.

I may give this indirection layer a try one day if someone does not pick it up before me :)

On 01/07/11 18:34, Andrew Purtell wrote:
One reasonable way to handle native storage of large objects in HBase would
be to introduce a layer of indirection.

Do you see this layer on the client or on the server side?


Client side.
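
OK. Just to check my understanding, here is roughly what I picture for the write path on the client side. This is only a sketch: the "chunk"/"index" column families, the "data" qualifier and the 2 MB chunk size are placeholders I made up, and it assumes the content-hash keying you describe below.

import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class LargeObjectWriter {
  private static final byte[] CHUNK_FAMILY = Bytes.toBytes("chunk"); // placeholder names
  private static final byte[] INDEX_FAMILY = Bytes.toBytes("index");
  private static final byte[] DATA = Bytes.toBytes("data");
  private static final int CHUNK_SIZE = 2 * 1024 * 1024;             // placeholder chunk size

  /** Split the value into chunks, store each chunk under its SHA-1 hash,
   *  then write one index row listing the chunk keys in order. */
  public void write(HTable table, byte[] objectKey, byte[] value) throws Exception {
    MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
    List<Put> chunkPuts = new ArrayList<Put>();
    Put indexPut = new Put(objectKey);
    int seq = 0;
    for (int off = 0; off < value.length; off += CHUNK_SIZE) {
      byte[] chunk = Arrays.copyOfRange(value, off, Math.min(off + CHUNK_SIZE, value.length));
      byte[] chunkKey = sha1.digest(chunk);              // content hash spreads chunks over the keyspace
      Put p = new Put(chunkKey);
      p.add(CHUNK_FAMILY, DATA, chunk);
      chunkPuts.add(p);
      indexPut.add(INDEX_FAMILY, Bytes.toBytes(seq++), chunkKey); // one index column per chunk, in order
    }
    table.put(chunkPuts);   // store the chunks first
    table.put(indexPut);    // then the index row pointing at them
  }
}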

I was also thinking about the "update" case: let's say we store a new version of
the large object which is smaller than the previous one (fewer chunks).
The previously created chunks will remain until the TimeToLive expires, but they
could potentially be removed earlier. Would the indirection layer be responsible
for this maintenance?


Yes.
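
OK. So the cleanup could look something like this. A simplified sketch of mine: it assumes chunks are never shared between objects (so an unreferenced chunk can be deleted outright) and that the indirection layer still has the old chunk list at hand when the new version is written.

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class LargeObjectMaintenance {

  /** Called after a new (smaller) version of an object has been written:
   *  deletes chunk rows that the old index referenced but the new one no
   *  longer does. Assumes chunks are not shared between objects. */
  public void removeOrphanedChunks(HTable table,
                                   List<byte[]> oldChunkKeys,
                                   List<byte[]> newChunkKeys) throws Exception {
    List<Delete> deletes = new ArrayList<Delete>();
    for (byte[] oldKey : oldChunkKeys) {
      boolean stillReferenced = false;
      for (byte[] newKey : newChunkKeys) {
        if (Bytes.equals(oldKey, newKey)) { stillReferenced = true; break; }
      }
      if (!stillReferenced) {
        deletes.add(new Delete(oldKey));   // chunk no longer pointed at by the index
      }
    }
    if (!deletes.isEmpty()) {
      table.delete(deletes);               // batch delete of orphaned chunk rows
    }
  }
}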

Store the chunks in a manner that gets good distribution in the keyspace,
maybe by SHA-1 hash of the content.

An alternative would be to append a "_chunk#" suffix to the original key.
I guess you prefer to distribute the chunks randomly across the available
regions?


Yes. This will increase the probability that a MultiAction<Get> of the chunks is 
parallelized over multiple region servers. That is beneficial for distributing load. Also, 
if most or all of the chunks are in the same region -- as would be the case with appending 
"_chunk#" to the key -- then performance will suffer because they will be retrieved 
serially.
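
Makes sense. For the read path I picture something like the following. Again only a sketch, reusing the placeholder names from above; it assumes a client with multi-get support (HTable.get(List<Get>)), so the chunk fetches can fan out across region servers in one round trip.

import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class LargeObjectReader {
  private static final byte[] CHUNK_FAMILY = Bytes.toBytes("chunk"); // placeholders, as before
  private static final byte[] INDEX_FAMILY = Bytes.toBytes("index");
  private static final byte[] DATA = Bytes.toBytes("data");

  /** Read the index row, then fetch all chunks with one multi-get and
   *  reassemble them in sequence order. */
  public byte[] read(HTable table, byte[] objectKey) throws Exception {
    Get indexGet = new Get(objectKey);
    indexGet.addFamily(INDEX_FAMILY);
    Result index = table.get(indexGet);
    if (index.isEmpty()) {
      return null;                                       // no such object
    }

    List<Get> chunkGets = new ArrayList<Get>();
    // index qualifiers are the chunk sequence numbers, values are the chunk keys
    for (Map.Entry<byte[], byte[]> e : index.getFamilyMap(INDEX_FAMILY).entrySet()) {
      Get g = new Get(e.getValue());
      g.addColumn(CHUNK_FAMILY, DATA);
      chunkGets.add(g);
    }
    Result[] chunks = table.get(chunkGets);              // results come back in request order

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (Result chunk : chunks) {
      out.write(chunk.getValue(CHUNK_FAMILY, DATA));
    }
    return out.toByteArray();
  }
}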

With "index", you mean a list of chunk keys?


Yes.


Storing the large ones in HDFS and simply keeping a pointer in HBase
allows us to benefit from HDFS streaming.
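
In a sketch, that could be as simple as the following (the "/blobs" path, the "meta" family and the "hdfs_path" qualifier are placeholders I made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class HdfsBackedStore {

  /** Write the large value to an HDFS file and keep only its path in HBase.
   *  Reads can then stream the file directly from HDFS. */
  public void store(Configuration conf, HTable table,
                    byte[] objectKey, byte[] value) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/blobs/" + Bytes.toStringBinary(objectKey));
    FSDataOutputStream out = fs.create(path);
    try {
      out.write(value);
    } finally {
      out.close();
    }
    Put p = new Put(objectKey);
    p.add(Bytes.toBytes("meta"), Bytes.toBytes("hdfs_path"), Bytes.toBytes(path.toString()));
    table.put(p);   // the HBase row holds only the pointer, not the data
  }
}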

I was wondering whether a StreamingPut (or StreamingGet) has already been
discussed?


The way HBase RPC currently works, it's not possible to stream data out of 
HBase. The objects that satisfy your Get or Scanner.next request are marshalled 
fully into the RPC response, which is sent all at once.

You could use the HBase REST gateway and thereby stream the response through. 
In that case your client-side access to the HBase cluster is via your favorite 
HTTP client library. But then your actions transit a gateway, which adds 
latency (and the gateway must buffer the objects fully in memory), and if you 
address resources in a RESTful manner there are HTTP transaction overheads 
to consider. This type of configuration would work best for supporting user-facing 
services that are RESTful in nature themselves: API services, websites.
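
For reference, consuming a cell through the REST gateway incrementally on the client could look roughly like this. Host, port, table and column names are made up, and as you say the gateway itself still buffers the full value; the client only avoids holding the whole response in memory at once.

import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class RestGatewayRead {

  /** Fetch one cell through the REST gateway and copy the response body to
   *  'out' as it arrives, instead of materializing the whole value in memory. */
  public void streamCell(OutputStream out) throws Exception {
    URL url = new URL("http://resthost:8080/mytable/myrow/cf:data"); // placeholder endpoint
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestProperty("Accept", "application/octet-stream");   // ask for the raw cell value
    InputStream in = conn.getInputStream();
    try {
      byte[] buf = new byte[8192];
      int n;
      while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);     // consume the response incrementally
      }
    } finally {
      in.close();
      conn.disconnect();
    }
  }
}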


Best regards,


   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via 
Tom White)



--
Eric
