[ 
https://issues.apache.org/jira/browse/THRIFT-5231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17132703#comment-17132703
 ] 

Jens Geyer edited comment on THRIFT-5231 at 6/10/20, 8:41 PM:
--------------------------------------------------------------

https://github.com/apache/thrift/blob/master/lib/java/src/org/apache/thrift/protocol/TType.java

The C++ headers seem to need some cleanup. I could track it back to commit 
d42a2c2bf9630cfb4d9d49cbee1fc812e5e5777d, when the various string type 
constants were added. Some of the constants were changed again later in 
d97eb611202c25d2210c647f32d7e780cfe319a6.

These numerical constants have never been used AFAIK:

 * T_UTF8       = 16,
 * T_UTF16      = 17

This is the right type for strings:

 * T_STRING     = 11,   ...

And this one seems plain wrong. As per the whitepaper, all strings in Thrift are 
transmitted as UTF-8 across the wire, not UTF-7. So 11 should be T_UTF8, but 
IMHO all of these except T_STRING should be thrown out.

 * T_UTF7       = 11,

Maybe [~mcslee] wants to add more insights?



> Improve Haskell parsing performance
> -----------------------------------
>
>                 Key: THRIFT-5231
>                 URL: https://issues.apache.org/jira/browse/THRIFT-5231
>             Project: Thrift
>          Issue Type: Improvement
>          Components: Haskell - Library
>    Affects Versions: 0.13.0
>            Reporter: Philipp Hausmann
>            Priority: Major
>         Attachments: Main.hs, parse_benchmark.html
>
>
> We are using Thrift for (de-)serializing some Kafka messages and noticed that 
> a lot of CPU is used even at low throughput (1000 messages / second).
>  
> I did a small benchmark that just parses a single T_BINARY value. Using 
> `readVal` takes ~3ms per iteration. Running the attoparsec parser directly 
> takes only ~300ns. That is a difference of 4 orders of magnitude! Some 
> difference is reasonable, since `readVal` involves some IO and shuffling 
> around of bytestrings, but the gap looks huge.
>  
> I strongly suspect the implementation of `runParser` is not optimal. 
> Basically it runs the parser with 1 byte and, until it succeeds, appends 1 
> more byte and retries. This means that for a value of 1024 bytes, we try 
> to parse it e.g. 1023 times. This seems rather inefficient.
>  
> I am not really sure how best to fix this. In principle, it makes sense to 
> feed bigger chunks to attoparsec and store the leftovers somewhere for the 
> next parse. However, if we store them in the transport or protocol, we have to 
> implement that for each transport/protocol. Maybe an API change is necessary?
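
The chunked approach described in the issue could look roughly like the sketch below. It relies on attoparsec's incremental interface (`parse` and the `Partial`/`Done`/`Fail` results), so a failed attempt never restarts from the first byte. The names `binaryP` and `runChunked`, and the 4-byte big-endian length prefix, are assumptions made for illustration. This is not the Thrift library's actual `runParser`.

```haskell
import qualified Data.Attoparsec.ByteString as A
import qualified Data.ByteString as B
import Data.Bits (shiftL, (.|.))

-- Hypothetical stand-in for a binary value on the wire:
-- a 4-byte big-endian length followed by that many payload bytes.
binaryP :: A.Parser B.ByteString
binaryP = do
  lenBytes <- A.take 4
  let len = B.foldl' (\acc w -> (acc `shiftL` 8) .|. fromIntegral w) 0 lenBytes :: Int
  A.take len

-- Run a parser against a chunk source. Whenever the parser needs more
-- input, fetch a whole chunk and resume the suspended parse instead of
-- starting over. The caller keeps the returned leftover bytes and passes
-- them in as the initial input of the next parse.
runChunked
  :: Monad m
  => m B.ByteString   -- fetch the next chunk; an empty chunk means end of input
  -> A.Parser a
  -> B.ByteString     -- leftover bytes from the previous parse
  -> m (Either String (a, B.ByteString))
runChunked fetch p leftover = go (A.parse p leftover)
  where
    go (A.Done rest x)  = pure (Right (x, rest))
    go (A.Fail _ _ err) = pure (Left err)
    go (A.Partial k)    = fetch >>= go . k
```

Each input byte is consumed once, so the cost is linear in the size of the value rather than quadratic. The open question from the issue remains: something has to own the leftover bytes between calls, and this sketch simply hands them back to the caller. attoparsec also ships `parseWith`, which implements a very similar refill loop.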



