Re: Optimizing Sentry to HDFS protocol

Alexander Kolbasov Tue, 27 Jun 2017 21:59:18 -0700

Lina, thanks for your comments!

> On Jun 27, 2017, at 9:33 PM, Na Li <[email protected]> wrote:
> 
> Sasha,
> 
> 1) "- Rather than streaming huge snapshots in a single message we should
>>  provide streaming protocol with smaller messages and later reassembly on
>>  the HDFS side."
> Based on https://thrift.apache.org/docs/concepts, Thrift transport can be raw 
> TCP or HTTP. HTTP is above TCP. TCP will cut application stream into blocks 
> that can fit into IP packets, which can fit into link layer frames. So 
> application (such as Sentry) does not need to handle such low level 
> processing, such as cutting stream into small messages and then reassemble 
> into stream again. What is the reason you want to do that? Did you see 
> performance issue? We can capture packets on the wire and see the exact 
> protocol stack of Thrift, and decide if we want to change configuration to 
> improve performance.

If the source data lives as a set of records in the DB, how can you send it 
with Thrift without first creating an in-memory image of the whole dataset? The 
idea is to be able to stream almost directly from the DB to the other side with 
very little memory consumption.

The current implementation builds the whole representation in memory as a set 
of not very efficient data structures, then serializes it into an in-memory 
buffer, so it needs twice as much memory, and only then sends it. The receiver 
has to do the same in reverse.
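
To make the idea concrete, here is a minimal sketch in Python (not Sentry 
code). It assumes a hypothetical table `paths` with a single `path` column and 
uses an in-memory SQLite database as a stand-in for the real DB. The names 
`stream_paths`, `demo_database` and `receive` are made up for illustration. The 
sender reads a bounded number of rows at a time, and the receiver reassembles 
the batches as they arrive.

```python
import sqlite3
from typing import Iterable, Iterator, List


def stream_paths(conn: sqlite3.Connection, batch_size: int) -> Iterator[List[str]]:
    """Yield path records in fixed-size batches instead of one huge snapshot."""
    # Hypothetical schema: a table `paths` with one text column `path`.
    cursor = conn.execute("SELECT path FROM paths ORDER BY path")
    while True:
        rows = cursor.fetchmany(batch_size)  # at most batch_size rows in memory
        if not rows:
            break
        yield [row[0] for row in rows]


def demo_database() -> sqlite3.Connection:
    """Build a tiny in-memory database standing in for the real store."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE paths (path TEXT)")
    conn.executemany(
        "INSERT INTO paths VALUES (?)",
        [(f"/user/hive/warehouse/db/t{i}",) for i in range(5)],
    )
    return conn


def receive(batches: Iterable[List[str]]) -> List[str]:
    """Receiver side: reassemble the batches into the full list of paths."""
    result: List[str] = []
    for batch in batches:
        result.extend(batch)
    return result
```

Each batch would become one small message, so the sender never holds more than 
`batch_size` records at once.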

> 
> 2) "  - Most of the information passed are long strings with common prefixes.
>> 
>>  We should be able to apply simple compression techniques (e.g. prefix
>>  compression) or even run a full compression on the data before sending."
> Based on http://thrift-tutorial.readthedocs.io/en/latest/thrift-stack.html, 
> Thrift supports compression. We can configure its protocol as TDenseProtocol 
> or TCompactProtocol.

Looking at the Compact protocol, it doesn’t provide real compression: it packs 
integers and field headers more tightly but leaves string contents as they 
are. And since we know what kind of data we are dealing with (sets of HDFS 
paths), we can provide very compact representations.
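
As a rough illustration of prefix compression on paths, here is a small Python 
sketch (again, not Sentry code; `prefix_encode` and `prefix_decode` are 
hypothetical names). Each path in the sorted list is stored as the number of 
leading characters it shares with the previous path plus the remaining suffix.

```python
import os
from typing import List, Tuple


def prefix_encode(paths: List[str]) -> List[Tuple[int, str]]:
    """Front-code sorted paths as (shared_prefix_length, suffix) pairs."""
    encoded: List[Tuple[int, str]] = []
    previous = ""
    for path in sorted(paths):
        # commonprefix compares character by character, not by path component.
        shared = len(os.path.commonprefix([previous, path]))
        encoded.append((shared, path[shared:]))
        previous = path
    return encoded


def prefix_decode(encoded: List[Tuple[int, str]]) -> List[str]:
    """Rebuild the sorted list of paths from the front-coded pairs."""
    paths: List[str] = []
    previous = ""
    for shared, suffix in encoded:
        path = previous[:shared] + suffix
        paths.append(path)
        previous = path
    return paths
```

For paths under the same warehouse directory, most entries shrink to a small 
integer and a short suffix.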

> 
> 3) "  - We should consider using non-thrift data structures for passing the
>> 
>>  info and just use Thrift as a transport mechanism."
> What is the reason you want to make this change? 
> Based on 
> https://stackoverflow.com/questions/9732381/why-thrift-why-not-http-rpcjsongzip,
>  Thrift has several benefits.

Because these structures are not memory-efficient. The Stack Overflow article 
you linked talks about the benefits of Thrift over HTTP. Here we are talking 
about the internal representation, while keeping the Thrift transport.
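
One possible shape for this, sketched in Python under the assumption that the 
payload would travel as a single opaque binary field in a Thrift message: the 
paths are packed into a length-prefixed byte string and compressed with zlib. 
The functions `pack_paths` and `unpack_paths` are hypothetical and only show 
the encoding, not the Thrift call itself.

```python
import struct
import zlib
from typing import List


def pack_paths(paths: List[str]) -> bytes:
    """Pack paths as length-prefixed UTF-8 records, then compress the result."""
    parts: List[bytes] = []
    for path in paths:
        raw = path.encode("utf-8")
        parts.append(struct.pack(">I", len(raw)))  # 4-byte big-endian length
        parts.append(raw)
    return zlib.compress(b"".join(parts))


def unpack_paths(payload: bytes) -> List[str]:
    """Reverse of pack_paths: decompress and split the records back out."""
    data = zlib.decompress(payload)
    paths: List[str] = []
    offset = 0
    while offset < len(data):
        (length,) = struct.unpack_from(">I", data, offset)
        offset += 4
        paths.append(data[offset:offset + length].decode("utf-8"))
        offset += length
    return paths
```

The receiver gets plain bytes from the transport and decodes them itself, so 
the in-memory representation on both sides is no longer tied to the generated 
Thrift classes.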

> 
> Thanks,
> 
> Lina
> 
> Sent from my iPhone
> 
>> On Jun 27, 2017, at 5:44 PM, Alexander Kolbasov <[email protected]> wrote:
>> 
>> Some food for thought.
>> 
>> Currently Sentry uses serialized Thrift structures to send a lot of
>> information from the Sentry Server to the HDFS namenode plugin for the HDFS
>> sync.
>> 
>> We should think of ways to optimize this protocol in several ways:
>> 
>> 
>>  - Rather than streaming huge snapshots in a single message we should
>>  provide streaming protocol with smaller messages and later reassembly on
>>  the HDFS side.
>>  - Most of the information passed are long strings with common prefixes.
>>  We should be able to apply simple compression techniques (e.g. prefix
>>  compression) or even run a full compression on the data before sending.
>>  - We should consider using non-thrift data structures for passing the
>>  info and just use Thrift as a transport mechanism.
>> 
>> - Sasha
