[ 
https://issues.apache.org/jira/browse/PARQUET-33?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114634#comment-14114634
 ] 

Dmitriy V. Ryaboy commented on PARQUET-33:
------------------------------------------

I have the same gut feeling as you, that the protocol stacking is costing us.

> Benchmark the assembly of thrift objects, and possibly create a more 
> efficient ReplayingTProtocol
> -------------------------------------------------------------------------------------------------
>
>                 Key: PARQUET-33
>                 URL: https://issues.apache.org/jira/browse/PARQUET-33
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>            Reporter: Alex Levenson
>            Priority: Minor
>
> The current implementation of parquet thrift creates an instance of TProtocol 
> for each value of each record and builds a stack of these events, which are 
> then replayed back to the TBase.
> I'd be curious to benchmark this, and if it's slow, try building a 
> "ReplayingTProtocol" that instead of having a stack of TProtocol instances, 
> contains a primitive array of each type. As events are fed into this 
> replaying TProtocol, it would just add these primitives to its buffers, and 
> then the TBase would drain them. This would effectively let us stream the 
> values into the TBase without making an object allocation for each value.
> The buffers could be set to a certain size, and if they fill up (which they 
> shouldn't in most cases), the TBase could begin draining the protocol until it 
> is empty again, at which point the TProtocol can block the TBase from 
> draining further while the parquet record assembly feeds it more events.
> This is all moot if it turns out not to be a bottleneck though :)
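
A minimal sketch of the buffering idea in the description above, under loud assumptions: PrimitiveEventBuffer and all of its methods are hypothetical names, not parquet-mr or Thrift APIs, and only three value types are shown. It keeps one primitive array per type plus a tag array that records arrival order, so feeding a value allocates nothing. A real ReplayingTProtocol would presumably wrap something like this behind the TProtocol read methods.

```java
// Hypothetical sketch; not part of parquet-mr or Thrift.
final class PrimitiveEventBuffer {
    // One tag per buffered value, in arrival order, so the reader knows
    // which primitive array holds the next value.
    private static final byte INT = 0, LONG = 1, DOUBLE = 2;

    private final byte[] tags;
    private final int[] ints;
    private final long[] longs;
    private final double[] doubles;

    // Separate write (W) and read (R) cursors for the tags and each array.
    private int tagW, tagR, intW, intR, longW, longR, dblW, dblR;

    PrimitiveEventBuffer(int capacity) {
        // Each array is sized for the worst case: every value of one type.
        tags = new byte[capacity];
        ints = new int[capacity];
        longs = new long[capacity];
        doubles = new double[capacity];
    }

    boolean isFull()  { return tagW == tags.length; }
    boolean isEmpty() { return tagR == tagW; }

    // Producer side: record assembly feeds values in without allocating.
    void writeI32(int v)       { push(INT);    ints[intW++] = v; }
    void writeI64(long v)      { push(LONG);   longs[longW++] = v; }
    void writeDouble(double v) { push(DOUBLE); doubles[dblW++] = v; }

    // Consumer side: the object being built drains values in the same order.
    int readI32()       { pop(INT);    return ints[intR++]; }
    long readI64()      { pop(LONG);   return longs[longR++]; }
    double readDouble() { pop(DOUBLE); return doubles[dblR++]; }

    // Once fully drained, rewind all cursors so the arrays can be reused.
    void reset() {
        if (!isEmpty()) throw new IllegalStateException("buffer not drained");
        tagW = tagR = intW = intR = longW = longR = dblW = dblR = 0;
    }

    private void push(byte tag) {
        if (isFull()) throw new IllegalStateException("buffer full; drain first");
        tags[tagW++] = tag;
    }

    private void pop(byte expected) {
        if (isEmpty()) throw new IllegalStateException("buffer empty");
        if (tags[tagR] != expected) throw new IllegalStateException("type mismatch");
        tagR++;
    }
}
```

In this sketch the "block the TBase from draining further" step is reduced to the isFull/isEmpty checks plus reset(): the producer stops when isFull() is true, the consumer reads until isEmpty(), and reset() hands the arrays back to the producer.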



--
This message was sent by Atlassian JIRA
(v6.2#6252)
