Alex Levenson created PARQUET-33:
------------------------------------
Summary: Benchmark the assembly of thrift objects, and possibly
create a more efficient ReplayingTProtocol
Key: PARQUET-33
URL: https://issues.apache.org/jira/browse/PARQUET-33
Project: Parquet
Issue Type: Improvement
Components: parquet-mr
Reporter: Alex Levenson
Priority: Minor
The current implementation of parquet thrift creates an instance of TProtocol
for each value of each record and builds a stack of these events, which are
then replayed back to the TBase.
I'd be curious to benchmark this, and if it's slow, try building a
"ReplayingTProtocol" that instead of having a stack of TProtocol instances,
contains a primitive array of each type. As events are fed into this replaying
TProtocol, it would just add these primitives to its buffers, and then the
TBase would drain them. This would effectively let us stream the values into
the TBase without making an object allocation for each value.
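A rough sketch of what such primitive buffers might look like, in Java. The class name PrimitiveReplayBuffer and its methods are hypothetical and exist in neither parquet-mr nor Thrift. The method names only mirror the writeI32/readI32 style of TProtocol, and the sketch leaves out struct and field begin/end events, which a real replaying protocol would also have to record.

```java
// Hypothetical sketch, not parquet-mr or Thrift code: one fixed-size
// primitive array per value type, so buffering a value allocates nothing.
final class PrimitiveReplayBuffer {
    private final int[] ints;
    private final long[] longs;
    private final double[] doubles;

    // Separate write and read cursors for each type.
    private int intWrite, intRead;
    private int longWrite, longRead;
    private int doubleWrite, doubleRead;

    PrimitiveReplayBuffer(int capacityPerType) {
        ints = new int[capacityPerType];
        longs = new long[capacityPerType];
        doubles = new double[capacityPerType];
    }

    // Producer side: record assembly feeds values in.
    void writeI32(int value) {
        if (intWrite == ints.length) throw new IllegalStateException("i32 buffer full");
        ints[intWrite++] = value;
    }

    void writeI64(long value) {
        if (longWrite == longs.length) throw new IllegalStateException("i64 buffer full");
        longs[longWrite++] = value;
    }

    void writeDouble(double value) {
        if (doubleWrite == doubles.length) throw new IllegalStateException("double buffer full");
        doubles[doubleWrite++] = value;
    }

    // Consumer side: the object being built drains values in the order
    // it asks for them; each type is first-in, first-out.
    int readI32() {
        if (intRead == intWrite) throw new IllegalStateException("no buffered i32");
        return ints[intRead++];
    }

    long readI64() {
        if (longRead == longWrite) throw new IllegalStateException("no buffered i64");
        return longs[longRead++];
    }

    double readDouble() {
        if (doubleRead == doubleWrite) throw new IllegalStateException("no buffered double");
        return doubles[doubleRead++];
    }

    // True once every buffered value has been drained.
    boolean isEmpty() {
        return intRead == intWrite && longRead == longWrite && doubleRead == doubleWrite;
    }

    // Rewind all cursors so the arrays can be reused for the next record.
    void reset() {
        intWrite = intRead = 0;
        longWrite = longRead = 0;
        doubleWrite = doubleRead = 0;
    }
}
```

Because the reader already knows which type it expects next, a per-type queue is enough to preserve order in this sketch. Calling reset() between records reuses the same arrays.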
The buffers could be set to a certain size, and if they fill up (which they
shouldn't in most cases), the TBase could begin draining the protocol until it
is empty again, at which point the TProtocol can block the TBase from draining
further while the Parquet record assembly feeds it more events.
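One possible way to get that blocking behaviour is sketched below in Java. It assumes record assembly and the TBase read run on separate threads, which the proposal above does not specify, and it swaps in a standard bounded producer/consumer ring buffer over a primitive array. BlockingIntRing is a hypothetical name, and only int values are shown.

```java
// Hypothetical sketch: a fixed-size ring of ints with blocking put/take.
// Assumes one producer thread (record assembly) and one consumer thread
// (the object draining values); no per-value allocation takes place.
final class BlockingIntRing {
    private final int[] slots;
    private int head;   // index of the next value to take
    private int tail;   // index of the next free slot
    private int count;  // number of buffered values

    BlockingIntRing(int capacity) {
        slots = new int[capacity];
    }

    // Producer side: waits while the buffer is full.
    synchronized void put(int value) {
        while (count == slots.length) {
            awaitChange();
        }
        slots[tail] = value;
        tail = (tail + 1) % slots.length;
        count++;
        notifyAll(); // wake a consumer waiting on an empty buffer
    }

    // Consumer side: waits while the buffer is empty.
    synchronized int take() {
        while (count == 0) {
            awaitChange();
        }
        int value = slots[head];
        head = (head + 1) % slots.length;
        count--;
        notifyAll(); // wake a producer waiting on a full buffer
        return value;
    }

    // Must be called while holding this object's monitor.
    private void awaitChange() {
        try {
            wait();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }
}
```

With a capacity much smaller than a record, the producer simply stalls in put() until the consumer has made room, so memory stays bounded regardless of record size. Whether the thread handoff costs more than the allocations it saves is exactly what a benchmark would need to show.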
This is all moot if it turns out not to be a bottleneck though :)