[ 
https://issues.apache.org/jira/browse/THRIFT-5231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17130703#comment-17130703
 ] 

Philipp Hausmann commented on THRIFT-5231:
------------------------------------------

This turns out to be significantly more complicated than I expected.

First, the confusion between binary and string values in the Thrift protocol 
seems to imply that the protocol needs to know which type is to be parsed next. 
For the JSON protocol, for example, this is necessary to distinguish a plain 
string from a base64-encoded value. This makes the usual design of parsing into 
an intermediate datatype first, as adopted e.g. by Aeson, infeasible.
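
To make the first point concrete, here is a minimal sketch of what 
"type-directed" decoding means for a JSON string. All names (`Expected`, 
`Value`, `decodeJsonString`) are hypothetical and not part of the Thrift 
Haskell library, and the base64 decoder is a deliberately small stand-in that 
only handles the standard alphabet with padding.

```haskell
module Main where

import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.List (elemIndex)

-- Hypothetical types for illustration; not the Thrift Haskell API.
data Expected = ExpectString | ExpectBinary

data Value = VString String | VBinary [Int]
  deriving (Eq, Show)

-- Standard base64 alphabet; a character's index is its 6-bit value.
alphabet :: String
alphabet = ['A'..'Z'] ++ ['a'..'z'] ++ ['0'..'9'] ++ "+/"

sextet :: Char -> Maybe Int
sextet ch = elemIndex ch alphabet

-- Decode padded base64 text into byte values; Nothing on malformed input.
decodeBase64 :: String -> Maybe [Int]
decodeBase64 [] = Just []
decodeBase64 [a, b, '=', '='] = do
  x <- sextet a
  y <- sextet b
  Just [(x `shiftL` 2) .|. (y `shiftR` 4)]
decodeBase64 [a, b, c, '='] = do
  x <- sextet a
  y <- sextet b
  z <- sextet c
  Just [ (x `shiftL` 2) .|. (y `shiftR` 4)
       , ((y .&. 15) `shiftL` 4) .|. (z `shiftR` 2) ]
decodeBase64 (a : b : c : d : rest) = do
  x <- sextet a
  y <- sextet b
  z <- sextet c
  w <- sextet d
  more <- decodeBase64 rest
  Just ( [ (x `shiftL` 2) .|. (y `shiftR` 4)
         , ((y .&. 15) `shiftL` 4) .|. (z `shiftR` 2)
         , ((z .&. 3) `shiftL` 6) .|. w ] ++ more )
decodeBase64 _ = Nothing

-- The same JSON string yields different values depending on the expected type.
decodeJsonString :: Expected -> String -> Maybe Value
decodeJsonString ExpectString s = Just (VString s)
decodeJsonString ExpectBinary s = VBinary <$> decodeBase64 s

main :: IO ()
main = do
  print (decodeJsonString ExpectString "aGk=")
  print (decodeJsonString ExpectBinary "aGk=")
```

The input `"aGk="` is a valid string and also valid base64, so a parser that 
first builds an untyped intermediate value has no way to pick the right 
constructor without being told the expected type.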

 

Second, the header format appears to entangle the protocol and transport 
layers, which makes it harder to separate them cleanly.

 

By the way, is there any documentation of what the current basic Thrift types 
are? There seem to be UTF8 and UTF16 types in the C++ implementation 
([https://github.com/apache/thrift/blob/master/lib/cpp/src/thrift/protocol/TProtocol.h#L179]
 ), but they are not mentioned in the documentation, nor are they present in 
e.g. the Java implementation.

> Improve Haskell parsing performance
> -----------------------------------
>
>                 Key: THRIFT-5231
>                 URL: https://issues.apache.org/jira/browse/THRIFT-5231
>             Project: Thrift
>          Issue Type: Improvement
>          Components: Haskell - Library
>    Affects Versions: 0.13.0
>            Reporter: Philipp Hausmann
>            Priority: Major
>         Attachments: Main.hs, parse_benchmark.html
>
>
> We are using Thrift for (de-)serializing some Kafka messages and noticed that 
> even at low throughput (1000 messages / second) a lot of CPU is used.
>  
> I did a small benchmark that just parses a single T_BINARY value. If I use 
> `readVal` for that, it takes ~3ms per iteration. If I instead run the 
> attoparsec parser directly, it only takes ~300ns. This is a difference of 4 
> orders of magnitude! Some difference is reasonable, since `readVal` involves 
> some IO and shuffling around of bytestrings, but the difference looks huge.
>  
> I strongly suspect the implementation of `runParser` is not optimal. 
> Basically, it runs the parser on 1 byte and, until it succeeds, appends 1 
> more byte and retries. This means that for a value of 1024 bytes, for 
> example, we try to parse it 1023 times. This seems rather inefficient.
>  
> I am not really sure how to best fix this. In principle, it makes sense to 
> feed bigger chunks to attoparsec and store the left-overs somewhere for the 
> next parse. However, if we store it in the transport or protocol we have to 
> implement it for each transport/protocol. Maybe an API change is necessary?
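
The chunk-feeding idea from the description can be sketched as follows. This 
is a toy model, not attoparsec and not the Thrift API: `Result` is a 
simplified stand-in for an incremental parse result, and `lengthPrefixed`, 
`runChunked` and `readAll` are hypothetical names. The driver hands whole 
transport chunks to the parser and returns the unconsumed input so that the 
next parse can start from it.

```haskell
module Main where

type Bytes = [Int]

-- Toy incremental result (assumption: a simplified stand-in for the
-- continuation-style result of an incremental parser).
data Result a
  = Partial (Bytes -> Result a)  -- needs more input
  | Done Bytes a                 -- leftover input and the parsed value

-- One length byte followed by that many payload bytes.
lengthPrefixed :: Bytes -> Result Bytes
lengthPrefixed input = case input of
  [] -> Partial lengthPrefixed
  (n : rest)
    | length rest >= n -> Done (drop n rest) (take n rest)
    | otherwise        -> Partial (\more -> lengthPrefixed (input ++ more))

-- Drive a parser from a list of transport chunks. Returns the value, the
-- leftover bytes of the last chunk used, and the chunks not yet touched.
runChunked :: (Bytes -> Result a) -> Bytes -> [Bytes] -> Maybe (a, Bytes, [Bytes])
runChunked parser leftover chunks = go (parser leftover) chunks
  where
    go (Done rest value) remaining = Just (value, rest, remaining)
    go (Partial k) (c : cs)        = go (k c) cs
    go (Partial _) []              = Nothing  -- transport ran dry mid-value

-- Parse values until the input is exhausted, threading the leftover along.
readAll :: Bytes -> [Bytes] -> [Bytes]
readAll leftover chunks =
  case runChunked lengthPrefixed leftover chunks of
    Just (value, rest, remaining) -> value : readAll rest remaining
    Nothing                       -> []

main :: IO ()
main = print (readAll [] [[2, 10, 20, 1], [30, 3, 7], [8, 9]])
```

Here the parser is resumed once per chunk rather than once per byte. The 
sketch threads the leftover through the caller; whether it should instead live 
in the transport or the protocol is exactly the open API question above.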



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
