Re: Optimizing Sentry to HDFS protocol

2017-06-28 Thread Brian Towles
On Tue, Jun 27, 2017 at 5:45 PM Alexander Kolbasov wrote:

>
>- Rather than streaming huge snapshots in a single message, we should
>provide a streaming protocol with smaller messages and later reassembly on
>the HDFS side.
>
[bt] If we are going to keep the current flow, then I think this is
going to be one of the better options.  Multiple calls chunking out the
paths at some optimized number per Thrift call (like 1k), with a "that's
all folks" call at the end, would allow the structure to be assembled on
the HDFS side; a rough sketch follows below.

But I feel that we really don't need to send all of the data over to the
HDFS side immediately.  I feel we could make on-demand calls (maybe on a
per-directory basis) from the HDFS client to Sentry, which would then
populate a cache on the HDFS node side. Updates would still be pushed as
they occur, hence a gradual warming of the cache. Yes, this would slow down
initial access on the first call, but since this is for directly
HDFS-managed and served paths, that first-call slowdown would be fairly
negligible, assuming the Sentry turnaround for the call is fairly quick.
This is essentially what Hive does, minus the cache and the pushed updates,
I believe. A rough sketch of such a cache follows below.
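A rough sketch of what that on-demand cache could look like on the
NameNode plugin side; SentryClient and AclEntry here are hypothetical
stand-ins for whatever client stub and ACL type the real plugin would use:

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Lazily populated per-directory cache: the first access for a directory
// pays one round trip to Sentry, later accesses are served locally, and
// pushed updates from Sentry overwrite entries as they arrive.
public class LazyPathAclCache {
    public interface SentryClient {
        List<AclEntry> fetchAclsForPath(String directory);
    }

    public static final class AclEntry { /* group/permission details elided */ }

    private final Map<String, List<AclEntry>> cache = new ConcurrentHashMap<>();
    private final SentryClient sentry;

    public LazyPathAclCache(SentryClient sentry) {
        this.sentry = sentry;
    }

    // One Sentry round trip on first access; cached thereafter.
    public List<AclEntry> aclsFor(String directory) {
        return cache.computeIfAbsent(directory, sentry::fetchAclsForPath);
    }

    // Pushed updates keep the cache current without a full snapshot.
    public void applyUpdate(String directory, List<AclEntry> acls) {
        cache.put(directory, acls);
    }
}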




>
>- Most of the information passed consists of long strings with common
>prefixes. We should be able to apply simple compression techniques (e.g.
>prefix compression) or even run a full compression pass on the data before
>sending.
>
>
I think if we are considering this, we should really look at passing a true
tree structure instead of trying to compress the data outright.  With a
tree structure, each path component is only listed once, in its place in
the tree; see the sketch below.
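For illustration, a bare-bones sketch of such a tree, keyed by path
component (hypothetical, not an existing Sentry structure):

import java.util.HashMap;
import java.util.Map;

// Each path component ("user", "hive", "warehouse", ...) appears exactly
// once as a key in its parent's map, so shared prefixes are never repeated.
public class PathTrie {
    private final Map<String, PathTrie> children = new HashMap<>();

    // Adding "/user/hive/warehouse/t1" creates four nested nodes; adding
    // "/user/hive/warehouse/t2" afterwards only adds the "t2" leaf.
    public void add(String path) {
        PathTrie node = this;
        for (String component : path.split("/")) {
            if (component.isEmpty()) continue;
            node = node.children.computeIfAbsent(component, k -> new PathTrie());
        }
    }
}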


>
>- We should consider using non-Thrift data structures for passing the
>info and just use Thrift as a transport mechanism.
>
I'm not sure why we would break protocol compatibility with something
custom.  I feel we can work around this; I'm not fully convinced we can,
but I think this should be a last resort.




> - Sasha
>
-- 
*Brian Towles* | Software Engineer
--


Re: Optimizing Sentry to HDFS protocol

2017-06-28 Thread Alexander Kolbasov
Clarification - I am talking about cases where messages can be several GB
in size.



Re: Optimizing Sentry to HDFS protocol

2017-06-27 Thread Alexander Kolbasov

Lina, thanks for your comments!

> On Jun 27, 2017, at 9:33 PM, Na Li  wrote:
> 
> Sasha,
> 
> 1) "- Rather then streaming huge snapshots in a single message we should
>>  provide streaming protocol with smaller messages and later reassembly on
>>  the HDFS side."
> Based on https://thrift.apache.org/docs/concepts, the Thrift transport can
> be raw TCP or HTTP. HTTP sits above TCP, and TCP will cut the application
> stream into blocks that fit into IP packets, which in turn fit into
> link-layer frames. So the application (such as Sentry) does not need to
> handle such low-level processing as cutting the stream into small messages
> and then reassembling them into a stream again. What is the reason you want
> to do that? Did you see a performance issue? We can capture packets on the
> wire, see the exact protocol stack of Thrift, and decide if we want to
> change the configuration to improve performance.

If the source data lives as a set of records in the DB, how can you send it
with Thrift without first creating an in-memory image of the whole dataset?
The idea is to be able to stream almost directly from the DB to the other
side with very little memory consumption.

The current implementation builds the whole representation in memory as a
set of not very efficient data structures, then serializes it into an
in-memory buffer (so you need twice as much memory), and then sends it. The
receiver needs to do the opposite. A rough sketch of the streaming
alternative follows below.
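To make the contrast concrete, here is a rough sketch of the streaming
side. The SENTRY_PATHS table name and the ChunkSender callback (standing in
for one Thrift call per batch) are assumptions for illustration:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Streams paths from the DB in bounded batches; at most one batch is
// ever held in memory, instead of the whole snapshot.
public class SnapshotStreamer {
    public interface ChunkSender {
        void send(List<String> paths);
    }

    private static final int BATCH_SIZE = 1000;

    public static void stream(Connection db, ChunkSender sender) throws SQLException {
        try (PreparedStatement stmt =
                 db.prepareStatement("SELECT path FROM SENTRY_PATHS")) {
            // Hint to the driver to fetch cursor-style instead of reading the
            // full result set up front (some drivers need extra settings).
            stmt.setFetchSize(BATCH_SIZE);
            try (ResultSet rs = stmt.executeQuery()) {
                List<String> batch = new ArrayList<>(BATCH_SIZE);
                while (rs.next()) {
                    batch.add(rs.getString(1));
                    if (batch.size() == BATCH_SIZE) {
                        sender.send(batch);
                        batch = new ArrayList<>(BATCH_SIZE);
                    }
                }
                if (!batch.isEmpty()) {
                    sender.send(batch); // final partial batch
                }
            }
        }
    }
}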

> 
> 2) "  - Most of the information passed are long strings with common prefixes.
>> 
>>  We should be able to apply simple compression techniques (e.g. prefix
>>  compression) or even run a full compression on the data before sending."
> Bas d on http://thrift-tutorial.readthedocs.io/en/latest/thrift-stack.html, 
> Thrift supports compression. We can configure its protocol as TDenseProtocol 
> or TCompactProtocol.

Looking at the Compact protocol, it doesn't provide real compression. And
since we know what kind of data we are dealing with (sets of HDFS paths), we
can provide very compact representations; see the sketch below.
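For example, a minimal sketch of prefix (front) compression over a sorted
path list; the "sharedLength:suffix" string encoding is purely illustrative,
not a proposed wire format:

import java.util.ArrayList;
import java.util.List;

// Each entry records how many leading characters it shares with the
// previous entry, followed by only the differing suffix.
public class PrefixCodec {
    public static List<String> encode(List<String> sortedPaths) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String path : sortedPaths) {
            int shared = commonPrefixLength(prev, path);
            out.add(shared + ":" + path.substring(shared));
            prev = path;
        }
        return out;
    }

    public static List<String> decode(List<String> encoded) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String entry : encoded) {
            int sep = entry.indexOf(':');
            int shared = Integer.parseInt(entry.substring(0, sep));
            String path = prev.substring(0, shared) + entry.substring(sep + 1);
            out.add(path);
            prev = path;
        }
        return out;
    }

    private static int commonPrefixLength(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }
}

With this, "/user/hive/warehouse/t1" followed by "/user/hive/warehouse/t2"
encodes as "0:/user/hive/warehouse/t1" and "22:2", so the shared
22-character prefix goes over the wire only once.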


> 
> 3) "  - We should consider using non-thrift data structures for passing the
>> 
>>  info and just use Thrift as a transport mechanism."
> What is the reason you want to make this change? 
> Based on 
> https://stackoverflow.com/questions/9732381/why-thrift-why-not-http-rpcjsongzip,
>  Thrift has several benefits.

Because these structures are not memory-efficient. The Stack Overflow
answer you link to talks about the benefits of Thrift over HTTP; here we are
talking about the internal representation, while keeping the Thrift
transport.

> 
> Thanks,
> 
> Lina
> 
> Sent from my iPhone
> 
>> On Jun 27, 2017, at 5:44 PM, Alexander Kolbasov  wrote:
>> 
>> Some food for thought.
>> 
>> Currently Sentry uses serialized Thrift structures to send a lot of
>> information from the Sentry Server to the HDFS namenode plugin for the HDFS
>> sync.
>> 
>> We should think of ways to optimize this protocol in several ways:
>> 
>> 
>>  - Rather than streaming huge snapshots in a single message, we should
>>  provide a streaming protocol with smaller messages and later reassembly on
>>  the HDFS side.
>>  - Most of the information passed consists of long strings with common
>>  prefixes. We should be able to apply simple compression techniques (e.g.
>>  prefix compression) or even run a full compression pass on the data
>>  before sending.
>>  - We should consider using non-Thrift data structures for passing the
>>  info and just use Thrift as a transport mechanism.
>> 
>> - Sasha



Re: Optimizing Sentry to HDFS protocol

2017-06-27 Thread Na Li
Sasha,

1) "- Rather then streaming huge snapshots in a single message we should
>   provide streaming protocol with smaller messages and later reassembly on
>   the HDFS side."
Based on https://thrift.apache.org/docs/concepts, Thrift transport can be raw 
TCP or HTTP. HTTP is above TCP. TCP will cut application stream into blocks 
that can fit into IP packets, which can fit into link layer frames. So 
application (such as Sentry) does not need to handle such low level processing, 
such as cutting stream into small messages and then reassemble into stream 
again. What is the reason you want to do that? Did you see performance issue? 
We can capture packets on the wire and see the exact protocol stack of Thrift, 
and decide if we want to change configuration to improve performance.


2) "  - Most of the information passed are long strings with common prefixes.
> 
>   We should be able to apply simple compression techniques (e.g. prefix
>   compression) or even run a full compression on the data before sending."
Bas d on http://thrift-tutorial.readthedocs.io/en/latest/thrift-stack.html, 
Thrift supports compression. We can configure its protocol as TDenseProtocol or 
TCompactProtocol.
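For reference, a minimal sketch of selecting the Compact protocol on a Java
Thrift client; the host, port, and framed transport are assumptions, and as
far as I know TDenseProtocol only ever existed in the C++ library:

import org.apache.thrift.protocol.TCompactProtocol;
import org.apache.thrift.protocol.TProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;
import org.apache.thrift.transport.TTransportException;

// Choosing the protocol is a one-line change at construction time; the
// generated client stub is then built on top of the returned protocol.
public class CompactProtocolExample {
    public static TProtocol open(String host, int port) throws TTransportException {
        TTransport transport = new TFramedTransport(new TSocket(host, port));
        transport.open();
        return new TCompactProtocol(transport);
    }
}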

3) "  - We should consider using non-thrift data structures for passing the
> 
>   info and just use Thrift as a transport mechanism."
What is the reason you want to make this change? 
Based on 
https://stackoverflow.com/questions/9732381/why-thrift-why-not-http-rpcjsongzip,
 Thrift has several benefits.

Thanks,

Lina

Sent from my iPhone

> On Jun 27, 2017, at 5:44 PM, Alexander Kolbasov  wrote:
> 
> Some food for thought.
> 
> Currently Sentry uses serialized Thrift structures to send a lot of
> information from the Sentry Server to the HDFS namenode plugin for the HDFS
> sync.
> 
> We should think of ways to optimize this protocol in several ways:
> 
> 
>   - Rather than streaming huge snapshots in a single message, we should
>   provide a streaming protocol with smaller messages and later reassembly on
>   the HDFS side.
>   - Most of the information passed consists of long strings with common
>   prefixes. We should be able to apply simple compression techniques (e.g.
>   prefix compression) or even run a full compression pass on the data
>   before sending.
>   - We should consider using non-Thrift data structures for passing the
>   info and just use Thrift as a transport mechanism.
> 
> - Sasha