Clarification: I am talking about cases where messages can be several GB
in size.

On Jun 27, 2017, at 9:33 PM, Na Li <lina...@cloudera.com> wrote:

Sasha,

1) "Rather than streaming huge snapshots in a single message we should
provide a streaming protocol with smaller messages and later reassembly on
the HDFS side."

Based on https://thrift.apache.org/docs/concepts, the Thrift transport can be
raw TCP or HTTP (HTTP itself runs on top of TCP). TCP cuts the application
stream into segments that fit into IP packets, which in turn fit into
link-layer frames. So the application (such as Sentry) does not need to handle
such low-level processing as cutting the stream into small messages and then
reassembling it on the other side. What is the reason you want to do that? Did
you see a performance issue? We can capture packets on the wire, see the exact
protocol stack of Thrift, and decide if we want to change the configuration to
improve performance.


If the source data lives as a set of records in the DB, how can you send
it with Thrift without first creating an in-memory image of the whole
dataset? The idea is to be able to stream almost directly from the DB to the
other side with very little memory consumption.

The current implementation builds the whole representation in memory as a
set of not-very-efficient data structures, then serializes it into an
in-memory buffer (so you need twice as much memory), and then sends it. The
receiver has to do the opposite.
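
To make the idea concrete, here is a rough, hypothetical Java sketch (not
actual Sentry code) of chunked streaming: records are pulled from a cursor
(simulated here by an Iterator) and emitted in fixed-size batches, each of
which could be sent as an independent message instead of one giant snapshot:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch: read records lazily from a cursor and group them
// into fixed-size chunks, so no more than one chunk needs to be
// materialized for serialization at a time.
public class ChunkedSender {
    static final int CHUNK_SIZE = 1000; // tunable per-message size

    // Splits a stream of records into chunks of at most CHUNK_SIZE entries.
    public static List<List<String>> toChunks(Iterator<String> cursor) {
        List<List<String>> chunks = new ArrayList<>();
        List<String> current = new ArrayList<>(CHUNK_SIZE);
        while (cursor.hasNext()) {
            current.add(cursor.next());
            if (current.size() == CHUNK_SIZE) {
                chunks.add(current);
                current = new ArrayList<>(CHUNK_SIZE);
            }
        }
        if (!current.isEmpty()) {
            chunks.add(current); // final partial chunk
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 2500; i++) rows.add("/path/" + i);
        // 2500 rows with CHUNK_SIZE = 1000 -> chunks of 1000, 1000, 500
        System.out.println(toChunks(rows.iterator()).size()); // prints 3
    }
}
```

In a real protocol each chunk would be one Thrift message, with a sequence
number and an end-of-stream marker so the HDFS side can reassemble and detect
truncation.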


2) "Most of the information passed are long strings with common prefixes.
We should be able to apply simple compression techniques (e.g. prefix
compression) or even run a full compression on the data before sending."

Based on http://thrift-tutorial.readthedocs.io/en/latest/thrift-stack.html,
Thrift supports compression. We can configure its protocol as
TDenseProtocol or TCompactProtocol.


Looking at the Compact protocol, it doesn't provide real compression. And
since we know what kind of data we are dealing with (sets of HDFS paths), we
can provide a very compact representation.
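
For example, if the paths are sent sorted, a simple front-compression scheme
could work: each entry stores only how many characters it shares with the
previous path, plus the differing suffix. A rough, hypothetical Java sketch
(illustration only, not a proposed wire format):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of front (prefix) compression for a sorted list of
// HDFS paths: each entry records the length of the prefix shared with the
// previous path and only the remaining suffix characters.
public class PrefixCodec {

    // One compressed entry: shared-prefix length + remaining suffix.
    static final class Entry {
        final int shared;
        final String suffix;
        Entry(int shared, String suffix) { this.shared = shared; this.suffix = suffix; }
    }

    static int commonPrefixLen(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    static List<Entry> compress(List<String> sortedPaths) {
        List<Entry> out = new ArrayList<>();
        String prev = "";
        for (String p : sortedPaths) {
            int k = commonPrefixLen(prev, p);
            out.add(new Entry(k, p.substring(k)));
            prev = p;
        }
        return out;
    }

    static List<String> decompress(List<Entry> entries) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (Entry e : entries) {
            String p = prev.substring(0, e.shared) + e.suffix;
            out.add(p);
            prev = p;
        }
        return out;
    }
}
```

For deep, repetitive hierarchies like Hive warehouse paths, most of each path
collapses into a small integer, and a general-purpose compressor could still
be applied on top of this.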



3) "We should consider using non-thrift data structures for passing the
info and just use Thrift as a transport mechanism."

What is the reason you want to make this change?
Based on https://stackoverflow.com/questions/9732381/why-thrift-why-not-http-rpcjsongzip,
Thrift has several benefits.


Because these structures are not memory-efficient. The Stack Overflow post you
linked talks about the benefits of Thrift over HTTP. Here we are talking about
the internal representation, while keeping Thrift as the transport.


Thanks,

Lina

Sent from my iPhone

On Jun 27, 2017, at 5:44 PM, Alexander Kolbasov <ak...@cloudera.com> wrote:

Some food for thought.

Currently Sentry uses serialized Thrift structures to send a lot of
information from the Sentry Server to the HDFS namenode plugin for the HDFS
sync.

We should think of ways to optimize this protocol in several ways:


 - Rather than streaming huge snapshots in a single message we should
 provide a streaming protocol with smaller messages and later reassembly on
 the HDFS side.
 - Most of the information passed is long strings with common prefixes.
 We should be able to apply simple compression techniques (e.g. prefix
 compression) or even run a full compression on the data before sending.
 - We should consider using non-thrift data structures for passing the
 info and just use Thrift as a transport mechanism.
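
As a rough illustration of the last point (hypothetical code, not a concrete
proposal), the path list could be packed into one compact, length-prefixed
byte blob that a Thrift struct then carries as a single binary field:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: serialize the path list into an opaque byte blob
// using our own compact encoding, so Thrift only transports bytes and the
// in-memory representation is decoupled from generated Thrift structures.
public class BlobCodec {

    static byte[] encode(List<String> paths) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(paths.size());           // entry count header
        for (String p : paths) out.writeUTF(p); // length-prefixed UTF-8
        out.flush();
        return bos.toByteArray();
    }

    static List<String> decode(byte[] blob) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(blob));
        int n = in.readInt();
        List<String> paths = new ArrayList<>(n);
        for (int i = 0; i < n; i++) paths.add(in.readUTF());
        return paths;
    }
}
```

The encoding inside the blob can then evolve independently of the Thrift IDL
(e.g. adding prefix compression) without regenerating or re-deploying the
Thrift bindings on both sides.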

- Sasha
