[
https://issues.apache.org/jira/browse/PARQUET-33?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114634#comment-14114634
]
Dmitriy V. Ryaboy commented on PARQUET-33:
------------------------------------------
I have the same gut feeling as you, that the protocol stacking is costing us.
> Benchmark the assembly of thrift objects, and possibly create a more
> efficient ReplayingTProtocol
> -------------------------------------------------------------------------------------------------
>
> Key: PARQUET-33
> URL: https://issues.apache.org/jira/browse/PARQUET-33
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Reporter: Alex Levenson
> Priority: Minor
>
> The current implementation of parquet thrift creates an instance of TProtocol
> for each value of each record and builds a stack of these events, which are
> then replayed back to the TBase.
> I'd be curious to benchmark this, and if it's slow, try building a
> "ReplayingTProtocol" that, instead of having a stack of TProtocol instances,
> contains a primitive array for each type. As events are fed into this
> replaying TProtocol, it would just add these primitives to its buffers, and
> then the TBase would drain them. This would effectively let us stream the
> values into the TBase without making an object allocation for each value.
> The buffers could be set to a certain size, and if they fill up (which they
> shouldn't in most cases), the TBase could begin draining the protocol until it
> is empty again, at which point the TProtocol can block the TBase from
> draining further while the Parquet record assembly feeds it more events.
> This is all moot if it turns out not to be a bottleneck though :)
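
For illustration, the buffering idea in the description could look roughly like the sketch below. It rests on several assumptions: IntReplayBuffer and its methods are hypothetical names that exist in neither parquet-mr nor Thrift, it covers only one primitive type (int), and it signals "full" with a return value instead of actually blocking the consumer.

```java
import java.util.NoSuchElementException;

/**
 * Hypothetical sketch, not part of parquet-mr or Thrift: a fixed-size ring
 * buffer over a primitive int[] so that buffering a value allocates nothing.
 */
final class IntReplayBuffer {
    private final int[] values;
    private int head; // index of the next value to drain
    private int size; // number of values currently buffered

    IntReplayBuffer(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.values = new int[capacity];
    }

    /**
     * Producer side (record assembly). Returns false when the buffer is
     * full, so the caller can let the consumer drain before feeding more.
     */
    boolean offer(int value) {
        if (size == values.length) {
            return false;
        }
        values[(head + size) % values.length] = value;
        size++;
        return true;
    }

    /** Consumer side (the object being populated). Drains in FIFO order. */
    int poll() {
        if (size == 0) {
            throw new NoSuchElementException("buffer is empty");
        }
        int value = values[head];
        head = (head + 1) % values.length;
        size--;
        return value;
    }

    boolean isEmpty() {
        return size == 0;
    }
}
```

A fuller version would presumably keep one such buffer per primitive type plus a compact record of event order; whether that beats the current approach is exactly what the proposed benchmark would need to show.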