[ https://issues.apache.org/jira/browse/ARROW-4099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870756#comment-16870756 ]

Wes McKinney commented on ARROW-4099:
-------------------------------------

What we probably need to do is implement a global size bound on the output of 
{{PrettyPrint}} so that we bail out early when we hit a particular limit (e.g. 
around a megabyte). This is a pretty significant refactor of 
{{src/arrow/pretty_print.cc}}, since there are many functions that write 
directly into {{std::ostream}} without any size book-keeping. This isn't 
causing enough of a user problem to require us to fix it right now.
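
To make the idea concrete, here is a minimal C++ sketch of what such a 
size-bounded sink could look like. This is not the actual 
{{src/arrow/pretty_print.cc}} code; the names {{BoundedSink}}, 
{{kMaxPrettyPrintBytes}} and {{PrintChunk}} are hypothetical.

{code}
#include <algorithm>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

namespace {

// Global cap on pretty-print output, per the "around a megabyte" suggestion.
constexpr std::size_t kMaxPrettyPrintBytes = 1 << 20;

// All output goes through one sink that tracks the remaining byte budget and
// tells callers to stop once the budget is exhausted.
class BoundedSink {
 public:
  BoundedSink(std::ostream& out, std::size_t limit)
      : out_(out), remaining_(limit) {}

  // Writes as much of `s` as the budget allows; returns false once the cap is
  // hit so the caller can bail out early.
  bool Write(const std::string& s) {
    if (remaining_ == 0) return false;
    const std::size_t n = std::min(s.size(), remaining_);
    out_.write(s.data(), static_cast<std::streamsize>(n));
    remaining_ -= n;
    if (remaining_ == 0) {
      out_ << "...\n(output truncated)";
      return false;
    }
    return true;
  }

 private:
  std::ostream& out_;
  std::size_t remaining_;
};

// Hypothetical per-chunk printer: instead of writing into std::ostream
// directly, every helper writes through the sink and propagates its signal.
bool PrintChunk(const std::string& chunk_repr, BoundedSink* sink) {
  return sink->Write(chunk_repr) && sink->Write("\n");
}

}  // namespace

int main() {
  std::ostringstream repr;
  BoundedSink sink(repr, kMaxPrettyPrintBytes);

  // Simulate the pathological case from the issue: many ~1 MB elements.
  const std::vector<std::string> chunks(129, std::string(1 << 20, 'x'));
  for (const auto& chunk : chunks) {
    if (!PrintChunk(chunk, &sink)) break;  // bail out at the global limit
  }

  // Stays around 1 MB instead of hundreds of megabytes.
  std::cout << "repr length: " << repr.str().size() << " bytes\n";
  return 0;
}
{code}

Threading a sink like this through all of the existing helpers is exactly the 
book-keeping refactor described above.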

> [Python] Pretty printing very large ChunkedArray objects can use unbounded 
> memory
> ---------------------------------------------------------------------------------
>
>                 Key: ARROW-4099
>                 URL: https://issues.apache.org/jira/browse/ARROW-4099
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Wes McKinney
>            Priority: Major
>             Fix For: 0.14.0
>
>
> In working on ARROW-2970, I have the following dataset:
> {code}
> values = [b'x'] + [
>     b'x' * (1 << 20)
> ] * 2 * (1 << 10)
> arr = np.array(values)
> arrow_arr = pa.array(arr)
> {code}
> The object {{arrow_arr}} has 129 chunks, each element of which is 1MB of 
> binary. The repr for this object is over 600MB:
> {code}
> In [10]: rep = repr(arrow_arr)
> In [11]: len(rep)
> Out[11]: 637536258
> {code}
> There are probably a number of failsafes we can implement to avoid badness in 
> these pathological cases (which may not happen often, but given the kinds of 
> bug reports we are seeing, people do have datasets that look like this).



