Clarification - I am talking about cases where messages can be several GB in size.
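To illustrate the kind of chunked streaming I have in mind, here is a rough Python sketch (the function names are made up for illustration, this is not Sentry code): the sender groups paths into bounded-size messages instead of one multi-GB blob, and the receiver reassembles them.

```python
def chunk_paths(paths, max_chunk_bytes=1 << 20):
    """Group an iterable of path strings into chunks whose total UTF-8
    size stays under max_chunk_bytes, so no single message is huge."""
    chunk, size = [], 0
    for path in paths:
        n = len(path.encode("utf-8"))
        if chunk and size + n > max_chunk_bytes:
            yield chunk
            chunk, size = [], 0
        chunk.append(path)
        size += n
    if chunk:
        yield chunk

def reassemble(chunks):
    """Receiver side: concatenate the chunks back into the full list."""
    out = []
    for chunk in chunks:
        out.extend(chunk)
    return out
```

The point is that the sender only ever holds one chunk in memory, and each Thrift message stays small regardless of snapshot size.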
On Jun 27, 2017, at 9:33 PM, Na Li <lina...@cloudera.com> wrote:

> Sasha,
>
> 1) "- Rather than streaming huge snapshots in a single message we should provide a streaming protocol with smaller messages and later reassembly on the HDFS side."
>
> Based on https://thrift.apache.org/docs/concepts, the Thrift transport can be raw TCP or HTTP, and HTTP runs above TCP. TCP cuts the application stream into segments that fit into IP packets, which in turn fit into link-layer frames. So an application such as Sentry does not need to handle such low-level processing as cutting the stream into small messages and then reassembling it. What is the reason you want to do that? Did you see a performance issue? We can capture packets on the wire, see the exact protocol stack Thrift uses, and decide whether a configuration change would improve performance.

If the source data lives as a set of records in the DB, how can you send it with Thrift without first creating an in-memory image of the whole dataset? The idea is to stream almost directly from the DB to the other side with very little memory consumption. The current implementation builds the whole representation in memory as a set of not very efficient data structures, then serializes it into an in-memory buffer (so you need twice as much memory), and then sends it. The receiver needs to do the opposite.

> 2) "- Most of the information passed is long strings with common prefixes. We should be able to apply simple compression techniques (e.g. prefix compression) or even run a full compression on the data before sending."
>
> Based on http://thrift-tutorial.readthedocs.io/en/latest/thrift-stack.html, Thrift supports compression. We can configure its protocol as TDenseProtocol or TCompactProtocol.

Looking at the Compact protocol, it doesn't provide real compression. And since we know what kind of data we are dealing with (sets of HDFS paths), we can provide very compact representations.
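As an example of the prefix compression mentioned above, here is a small Python sketch (illustrative names, not actual Sentry code): with a sorted list of HDFS paths, each entry can be encoded as the length of the prefix it shares with the previous path plus the remaining suffix.

```python
def prefix_compress(sorted_paths):
    """Encode each path as (length of prefix shared with the previous
    path, remaining suffix). Sorting first makes neighbouring HDFS
    paths share long prefixes, so the suffixes stay short."""
    encoded, prev = [], ""
    for path in sorted_paths:
        n = 0
        limit = min(len(prev), len(path))
        while n < limit and prev[n] == path[n]:
            n += 1
        encoded.append((n, path[n:]))
        prev = path
    return encoded

def prefix_decompress(encoded):
    """Reverse the encoding on the receiving side."""
    paths, prev = [], ""
    for n, suffix in encoded:
        path = prev[:n] + suffix
        paths.append(path)
        prev = path
    return paths
```

A general-purpose compressor could still be run on top of this, but even alone it removes most of the redundancy in warehouse-style path sets.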
3) " - We should consider using non-thrift data structures for passing the info and just use Thrift as a transport mechanism." What is the reason you want to make this change? Based on https://stackoverflow.com/questions/9732381/why-thrift-why- not-http-rpcjsongzip, Thrift has several benefits. Because these structures are not memory-efficient.In your stack overflow article it talks about benefits of Thrift over HTTP. Here we are talking about internal representation, while keeping Thrift transpor Thanks, Lina Sent from my iPhone On Jun 27, 2017, at 5:44 PM, Alexander Kolbasov <ak...@cloudera.com> wrote: Some food for thought. Currently Sentry uses serialized Thrift structures to send a lot of information from the Sentry Server to the HDFS namenode plugin for the HDFS sync. We should think of ways to optimize this protocol in several ways: - Rather then streaming huge snapshots in a single message we should provide streaming protocol with smaller messages and later reassembly on the HDFS side. - Most of the information passed are long strings with common prefixes. We should be able to apply simple compression techniques (e.g. prefix compression) or even run a full compression on the data before sending. - We should consider using non-thrift data structures for passing the info and just use Thrift as a transport mechanism. - Sasha