nickva opened a new issue #659: max_document_size check is not accurate
URL: https://github.com/apache/couchdb/issues/659
 
 
   `max_document_size` currently checks document sizes based on the Erlang external term size of the jiffy-decoded document body. This makes sense because that is what is used to store the data on disk and what the CouchDB internals manipulate.
   
   However, the Erlang external term size is not always a good approximation of the size of the JSON-encoded data. Sometimes it can be way off (I've seen it 30% off), and it's hard for users to estimate or check the external term size beforehand. For example, if `max_document_size` is 1MB, CouchDB might reject a user's 600KB JSON document because the Erlang external term size of that document is greater than 1MB.
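   To make the gap concrete, here is a minimal sketch (the sample document is made up, and the actual check inside CouchDB may differ in detail) comparing the size of the JSON text with the external term size of the jiffy-decoded body:
    ```
    Json = <<"{\"name\": \"ok\", \"values\": [1, 2, 3, 1.5]}">>,
    Body = jiffy:decode(Json),              % EJSON term: {[{<<"name">>, <<"ok">>}, ...]}
    JsonSize = byte_size(Json),             % size of the JSON text the user sent
    TermSize = erlang:external_size(Body),  % the size the current limit is compared against
    io:format("json: ~B bytes, external term: ~B bytes~n", [JsonSize, TermSize]).
    ```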
   
   ## Possible Solutions
   
    * Do nothing. If the current size check is meant to throttle or limit disk usage, and disk usage is driven by the external term size (though in most cases compression is applied to it, so it will usually be less), then at that level it makes sense to keep it as is.
   
    * Re-encode the data using jiffy and check the size against that. That's a better check, but it will impact performance. It is also not an exact solution: the user's JSON encoder might insert more whitespace (say, as indentation or after commas), or use a different algorithm for encoding floating point numbers (scientific notation, exact floating point numbers without a decimal point, `5` instead of `5.0`, etc.), so the size would still be off. A sketch of this approach is shown after this list.
   
    * Like the above, but enhance jiffy to return an "encoded size" without actually doing the encoding. Or at least have it do the encoding internally and just return the byte size to Erlang, instead of a full term that would have to be thrown away.
   
    * Here is an attempt to do the size check in Erlang. Since jiffy is pretty quick, this might end up being slower than doing the full encoding in jiffy:
   ```
   encsize({[]}) ->
       2;  % opening { and closing }
   
   encsize({Props}) ->
       % 2 is because opening { and closing }
       % Inside the lc 2 is for : and , but counts an extra , at end
       % -1 is to subtract last , from lc part
       2 + lists:sum([encsize(K) + encsize(V) + 2 || {K, V} <- Props]) - 1;
   
   encsize([]) ->
       2;  % opening [ and closing ]
   
   encsize(List) when is_list(List) ->
       % 2 is for [ and ]
       % inside the lc 1 is for , but it counts one extra , for last element
       2 + lists:sum([encsize(V) + 1 || V <- List]) - 1;
   
   encsize(Float) when is_float(Float) ->
        % this doesn't match what jiffy does exactly, so it will be off
       byte_size(float_to_binary(Float, [{decimals, 16}, compact]));
   
    encsize(0) ->
        1;  % math:log10(0) is a badarith, and "0" is a single character

    encsize(Integer) when is_integer(Integer), Integer > 0 ->
        trunc(math:log10(Integer)) + 1;

    encsize(Integer) when is_integer(Integer), Integer < 0 ->
        % negate so math:log10/1 gets a positive argument;
        % + 2 is one for the last digit and one for the - character
        trunc(math:log10(-Integer)) + 2;
   
   encsize(Binary) when is_binary(Binary) ->
        % count escapes and assume they'd be escaped with one extra escape character
       Escapes = [<<"\b">>, <<"\f">>, <<"\n">>, <<"\r">>, <<"\t">>, <<"\\">>,
           <<"\"">>],
       MatchCount = length(binary:matches(Binary, Escapes)),
       % 2 is for open and closing "
       2 + byte_size(Binary) + MatchCount;
   
    encsize(Atom) when Atom =:= true; Atom =:= false; Atom =:= null ->
        % true, false and null are encoded bare, without quotes
        byte_size(atom_to_binary(Atom, utf8));

    encsize(Atom) when is_atom(Atom) ->
        encsize(atom_to_binary(Atom, utf8)).
   
   ```
      Maybe modify it to provide a check that always underestimates, so users get the benefit of the doubt. That is, whatever encoding they pick, find a not-too-expensive check that is accurate enough and always comes out smaller than their encoded size?
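   For the re-encode option above, here is a minimal sketch of what such a check could look like (the function and error names are made up for illustration, not CouchDB's actual code):
    ```
    check_json_size(Body, MaxSize) ->
        % re-encode the already-decoded body and measure the JSON text itself;
        % this pays for an extra jiffy:encode/1 on every write
        EncodedSize = erlang:iolist_size(jiffy:encode(Body)),
        case EncodedSize =< MaxSize of
            true -> ok;
            false -> {error, document_too_large}
        end.
    ```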
    
   
 