[ https://issues.apache.org/jira/browse/THRIFT-3175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056830#comment-15056830 ]

John R. Frank commented on THRIFT-3175:
---------------------------------------

[~roger.meier] [~dvirsky],  this "enhancement" is actually a problem.  Thrift 
messages that were previously valid are now invalid (!), and the only remedy 
is to [pin on v0.9.2 of the Python package, which we have now done in 
streamcorpus|https://github.com/diffeo/streamcorpus/commit/71a3d50e3c89e4284c1868a5d671c2919a7e586e]. 
It's not uncommon for an automatic data extractor to detect long lists, and 
expressing them in Thrift is a natural step.  We had no trouble creating 
such messages... and now they cannot be read with the latest version of the 
Python library!
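
For anyone else hitting this, the pin itself is a one-liner; a minimal 
sketch, assuming a pip-based setup (exact layout varies by project):

{code}
# requirements.txt: stay on the last release without the hard-coded
# container-size limit
thrift==0.9.2
{code}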

I can understand how big lists cause problems in the JVM, but fixing it that 
far down, in the transport layer, with a hard-coded limit is obviously a 
flawed solution.  

That said, we've been systematically moving away from Thrift to 
[CBOR|http://cbor.io] and, wherever practical, just JSON.  This max-list-size 
issue is going to accelerate that transition, which is sad.  As a former 
long-time fan of Thrift, I'd love to see Thrift move toward a more CBOR-like 
implementation, maybe even adopting CBOR as a wireline format.

Sorry to be a downer on this, and hopefully future iterations of Thrift can 
learn from these kinds of challenges.

> fastbinary.c python deserialize can cause huge allocations from garbage
> -----------------------------------------------------------------------
>
>                 Key: THRIFT-3175
>                 URL: https://issues.apache.org/jira/browse/THRIFT-3175
>             Project: Thrift
>          Issue Type: Bug
>          Components: Python - Library
>            Reporter: Dvir Volk
>            Assignee: Dvir Volk
>             Fix For: 0.9.3
>
>
> In the fastbinary Python deserializer, allocating a list is done like so:
> {code}
>     len = readI32(input);
>     if (!check_ssize_t_32(len)) {
>       return NULL;
>     }
>     ret = PyList_New(len);
> {code}
> The only validation of len is that it's under INT_MAX. I've encountered a 
> situation where, upon receiving garbage input, len was read as something 
> like 1 billion; the library treated this as valid input, allocated 
> gigabytes of RAM, and crashed the server. 
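> To make the failure mode concrete, here is a sketch of how four garbage 
> bytes become a huge length (the byte values are illustrative, not from a 
> real capture):
> {code}
> /* readI32 interprets the next four bytes as a big-endian i32 */
> unsigned char buf[4] = {0x3B, 0x9A, 0xCA, 0x00};
> int32_t len = (buf[0] << 24) | (buf[1] << 16) | (buf[2] << 8) | buf[3];
> /* len == 1000000000: it passes check_ssize_t_32, so PyList_New tries
>  * to allocate a billion list slots (~8 GB of pointers on 64-bit)
>  * before a single element has been read. */
> {code}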
> The quick fix I made was to limit list sizes to a sane value of a few 
> thousand, which more than suits my personal needs. 
> But IMO this should be dealt with properly. One approach that comes to mind 
> is to avoid pre-allocating the entire list when the declared size is large, 
> and instead grow it in smaller steps while reading the input. 
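> A minimal sketch of that incremental approach, assuming the surrounding 
> fastbinary decode loop (decode_item stands in for the per-element reader 
> and is not an actual function in fastbinary.c):
> {code}
>     len = readI32(input);
>     if (!check_ssize_t_32(len)) {
>       return NULL;
>     }
>     /* Start empty and append as elements actually arrive: a bogus
>      * length then fails on the first missing element instead of
>      * triggering a huge up-front allocation. */
>     ret = PyList_New(0);
>     if (ret == NULL) {
>       return NULL;
>     }
>     for (i = 0; i < len; i++) {
>       PyObject* item = decode_item(input);
>       if (item == NULL || PyList_Append(ret, item) < 0) {
>         Py_XDECREF(item);
>         Py_DECREF(ret);
>         return NULL;
>       }
>       Py_DECREF(item);  /* PyList_Append holds its own reference */
>     }
> {code}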



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
