[ https://issues.apache.org/jira/browse/ARROW-4099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870756#comment-16870756 ]
Wes McKinney commented on ARROW-4099:
-------------------------------------

What we probably need to do is implement a global size bound on the output of {{PrettyPrint}} so that we bail out early when we hit a particular limit (e.g. around a megabyte or so). This is a pretty significant refactor of {{src/arrow/pretty_print.cc}}, since many functions write directly into {{std::ostream}} without any size bookkeeping. This isn't causing enough of a user problem to require us to fix it right now.

> [Python] Pretty printing very large ChunkedArray objects can use unbounded
> memory
> ---------------------------------------------------------------------------------
>
>                 Key: ARROW-4099
>                 URL: https://issues.apache.org/jira/browse/ARROW-4099
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Wes McKinney
>            Priority: Major
>             Fix For: 0.14.0
>
>
> In working on ARROW-2970, I have the following dataset:
> {code}
> values = [b'x'] + [
>     b'x' * (1 << 20)
> ] * 2 * (1 << 10)
> arr = np.array(values)
> arrow_arr = pa.array(arr)
> {code}
> The object {{arrow_arr}} has 129 chunks, each element of which is 1MB of
> binary. The repr for this object is over 600MB:
> {code}
> In [10]: rep = repr(arrow_arr)
> In [11]: len(rep)
> Out[11]: 637536258
> {code}
> There's probably a number of failsafes we can implement to avoid badness in
> these pathological cases (which may not happen often, but given the kinds of
> bug reports we are seeing, people do have datasets that look like this)

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
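The "global size bound with early bail-out" idea from the comment could be sketched roughly as follows. This is a hypothetical illustration in Python (the actual change would live in C++ in {{src/arrow/pretty_print.cc}}); the names {{BoundedWriter}} and {{bounded_repr}} are invented for the sketch and are not Arrow APIs.

{code}
# Hypothetical sketch: a writer that tracks how many bytes it has emitted
# and stops (with a truncation marker) once a global limit is reached.
class BoundedWriter:
    def __init__(self, limit):
        self.limit = limit        # global size bound, e.g. ~1MB
        self.parts = []
        self.size = 0
        self.truncated = False

    def write(self, s):
        """Append s; return False once the limit has been hit."""
        if self.truncated:
            return False
        remaining = self.limit - self.size
        if len(s) > remaining:
            # Keep only what fits, mark the output as truncated.
            self.parts.append(s[:remaining])
            self.parts.append("...")
            self.truncated = True
            return False
        self.parts.append(s)
        self.size += len(s)
        return True

    def getvalue(self):
        return "".join(self.parts)


def bounded_repr(chunks, limit=1 << 20):
    """Pretty-print chunks, bailing out early when the bound is reached."""
    w = BoundedWriter(limit)
    for chunk in chunks:
        if not w.write(repr(chunk) + "\n"):
            break  # early bail-out: no further chunks are formatted
    return w.getvalue()
{code}

The point of the refactor is that every formatting function would write through such a bookkeeping layer instead of an unbounded {{std::ostream}}, so a pathological input like the 2GB array above could cost at most the bound, not 600MB+.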