Philipp Hausmann created THRIFT-5231:
----------------------------------------

             Summary: Improve Haskell parsing performance
                 Key: THRIFT-5231
                 URL: https://issues.apache.org/jira/browse/THRIFT-5231
             Project: Thrift
          Issue Type: Improvement
          Components: Haskell - Library
    Affects Versions: 0.13.0
            Reporter: Philipp Hausmann
         Attachments: Main.hs, parse_benchmark.html

We use Thrift to (de-)serialize some Kafka messages and noticed that it 
consumes a lot of CPU even at low throughput (1000 messages / second).

I ran a small benchmark that parses a single T_BINARY value. With `readVal` it 
takes ~3ms per iteration. If I instead run the attoparsec parser directly, it 
only takes ~300ns. That is a difference of 4 orders of magnitude! Some overhead 
is expected, since `readVal` involves some IO and shuffling around of 
bytestrings, but this gap looks far too large.

I strongly suspect the implementation of `runParser` is not optimal. Basically, 
it runs the parser on 1 byte and, until the parse succeeds, appends 1 more byte 
and retries. For example, for a value of 1024 bytes we try to parse it 1023 
times. This seems rather inefficient.
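
To make the effect of the chunk size concrete, here is a minimal sketch. It is 
not the actual `runParser` code from the Thrift library: `binaryField`, 
`chunkedSource` and `runChunked` are hypothetical names, and the 1-byte length 
prefix is a simplification of the real T_BINARY encoding. It only counts how 
often attoparsec has to ask for more input when the same message is handed 
over 1 byte at a time versus in one large chunk.

```haskell
module Main where

import qualified Data.Attoparsec.ByteString as A
import qualified Data.ByteString as BS
import Data.IORef (IORef, modifyIORef', newIORef, readIORef, writeIORef)

-- Hypothetical stand-in for a binary field: a 1-byte length prefix
-- followed by that many payload bytes (not the real Thrift wire format).
binaryField :: A.Parser BS.ByteString
binaryField = do
  len <- A.anyWord8
  A.take (fromIntegral len)

-- Build a refill action that hands out the input in chunks of at most
-- n bytes, together with a counter of how often it was called.
-- An empty chunk signals end of input to attoparsec.
chunkedSource :: Int -> BS.ByteString -> IO (IO BS.ByteString, IORef Int)
chunkedSource n input = do
  rest  <- newIORef input
  calls <- newIORef 0
  let refill = do
        modifyIORef' calls (+ 1)
        bs <- readIORef rest
        let (chunk, remaining) = BS.splitAt n bs
        writeIORef rest remaining
        pure chunk
  pure (refill, calls)

-- Parse one field from the input using chunks of size n. Returns the
-- parsed payload (if any) and the number of reads that were needed.
runChunked :: Int -> BS.ByteString -> IO (Maybe BS.ByteString, Int)
runChunked n input = do
  (refill, calls) <- chunkedSource n input
  firstChunk <- refill
  result <- A.parseWith refill binaryField firstChunk
  count <- readIORef calls
  pure (A.maybeResult result, count)

main :: IO ()
main = do
  let payload = BS.replicate 200 0x61
      message = BS.cons 200 payload
  (r1, c1) <- runChunked 1 message
  (r2, c2) <- runChunked 4096 message
  putStrLn ("1-byte chunks:  ok=" ++ show (r1 == Just payload) ++ ", reads=" ++ show c1)
  putStrLn ("4096-byte chunk: ok=" ++ show (r2 == Just payload) ++ ", reads=" ++ show c2)
```

Both runs yield the same payload, but the number of reads grows with the value 
size in the 1-byte case, while the large chunk needs a single read here.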

I am not really sure how to best fix this. In principle, it makes sense to feed 
bigger chunks to attoparsec and store the left-overs somewhere for the next 
parse. However, if we store them in the transport or protocol, we have to 
implement that for each transport/protocol. Maybe an API change is necessary?
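
One possible shape for the "bigger chunks plus left-overs" idea is sketched 
below. It is an assumption for discussion, not a proposed patch: `Buffered`, 
`readChunk`, `runBuffered` and the list-backed fake transport are all 
hypothetical, and the field parser again uses a simplified 1-byte length 
prefix. The left-overs live in a small wrapper next to the read action, so 
neither the transport nor the protocol has to know about them.

```haskell
module Main where

import qualified Data.Attoparsec.ByteString as A
import qualified Data.ByteString as BS
import Data.IORef (IORef, newIORef, readIORef, writeIORef)

-- Hypothetical wrapper: a read action that returns the next chunk
-- (empty on end of input) plus the unconsumed bytes of the last parse.
data Buffered = Buffered
  { leftover  :: IORef BS.ByteString
  , readChunk :: IO BS.ByteString
  }

-- Run a parser, starting with the stored left-overs (if any) and only
-- reading from the underlying source when attoparsec needs more input.
-- Whatever the parser did not consume is stored for the next call.
runBuffered :: Buffered -> A.Parser a -> IO (Either String a)
runBuffered buf p = do
  pending <- readIORef (leftover buf)
  initial <- if BS.null pending then readChunk buf else pure pending
  result  <- A.parseWith (readChunk buf) p initial
  case result of
    A.Done rest a  -> writeIORef (leftover buf) rest >> pure (Right a)
    A.Fail _ _ msg -> pure (Left msg)
    A.Partial _    -> pure (Left "unexpected end of input")

-- Simplified field: 1-byte length prefix, then the payload.
binaryField :: A.Parser BS.ByteString
binaryField = A.anyWord8 >>= A.take . fromIntegral

-- Fake transport for the example: hands out a fixed list of chunks.
listSource :: [BS.ByteString] -> IO Buffered
listSource chunks = do
  remaining <- newIORef chunks
  pendingRef <- newIORef BS.empty
  let next = do
        cs <- readIORef remaining
        case cs of
          []       -> pure BS.empty
          (c : cs') -> writeIORef remaining cs' >> pure c
  pure (Buffered pendingRef next)

main :: IO ()
main = do
  let msg n c = BS.cons n (BS.replicate (fromIntegral n) c)
  -- Two messages arrive together in a single chunk.
  buf <- listSource [msg 3 0x61 <> msg 2 0x62]
  first  <- runBuffered buf binaryField
  second <- runBuffered buf binaryField
  print first
  print second
```

The second call is served entirely from the stored left-overs. Whether such a 
wrapper fits the existing `Protocol`/`Transport` classes without an API change 
is exactly the open question above.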



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
