GitHub user ericl opened a pull request:

    https://github.com/apache/spark/pull/13537

    [SPARK-15794] Should truncate toString() of very wide schemas

    ## What changes were proposed in this pull request?
    
    With very wide tables (e.g. thousands of fields), the toString() output is
    unreadable and often causes OOMs due to inefficient string processing. This
    patch truncates all struct and operator field lists at a user-configurable
    threshold to limit the impact on both performance and readability.
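
To make the idea of truncating a field list concrete, here is a minimal sketch in Java. It is not the code from this pull request: the class name `TruncatedFieldList`, the helper `truncatedString`, the `maxFields` parameter, and the "... N more fields" summary format are all assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; names and output format are hypothetical,
// not taken from the Spark patch.
final class TruncatedFieldList {

    // Joins the fields like String.join would, but once the list is longer
    // than maxFields it keeps only the first (maxFields - 1) entries and
    // replaces the rest with a short summary.
    static String truncatedString(
            List<String> fields, String start, String sep, String end, int maxFields) {
        if (fields.size() <= maxFields) {
            return start + String.join(sep, fields) + end;
        }
        int shown = Math.max(0, maxFields - 1);
        int dropped = fields.size() - shown;
        StringBuilder sb = new StringBuilder(start);
        for (int i = 0; i < shown; i++) {
            sb.append(fields.get(i)).append(sep);
        }
        sb.append("... ").append(dropped).append(" more fields").append(end);
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> fields = Arrays.asList("a", "b", "c", "d", "e");
        // Threshold below the list size: the tail is summarized.
        System.out.println(truncatedString(fields, "[", ", ", "]", 3));
        // Threshold at or above the list size: the list is printed in full.
        System.out.println(truncatedString(fields, "[", ", ", "]", 25));
    }
}
```

With a threshold of 3, the five-element list above is rendered as `[a, b, ... 3 more fields]`, so the cost of building the string is bounded by the threshold rather than by the schema width.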
    
    It would also be nice to optimize string generation to avoid this sort of
    O(n^2) slowdown entirely (i.e. use StringBuilder everywhere, including
    expressions), but that is probably too large a change for 2.0 at this point.
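
For readers unfamiliar with why repeated concatenation is quadratic, the following Java sketch contrasts the two styles. It shows the general pattern only and makes no claim about which Spark code paths behave this way; the class and method names (`FieldListJoin`, `joinNaive`, `joinWithBuilder`) and the `colN` field names are hypothetical.

```java
// Illustrative sketch only; names are hypothetical, not from Spark.
final class FieldListJoin {

    // Each += allocates a new String and copies everything accumulated so
    // far, so building a list of n fields copies O(n^2) characters in total.
    static String joinNaive(int n) {
        String s = "";
        for (int i = 0; i < n; i++) {
            if (i > 0) {
                s += ", ";
            }
            s += "col" + i;
        }
        return s;
    }

    // Appends go into one growable buffer, so the total work is
    // (amortized) linear in the length of the result.
    static String joinWithBuilder(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append("col").append(i);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Both styles produce the same text; only the cost differs.
        System.out.println(joinNaive(3));
        System.out.println(joinWithBuilder(3));
    }
}
```

Both methods return the same string; the difference only becomes visible as the number of fields grows into the thousands.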
    
    ## How was this patch tested?
    
    Added a microbenchmark that covers this case particularly well. I also ran
    the microbenchmark while varying the truncation threshold (numFields in the
    output below).
    
    ```
    numFields = 5
    wide shallowly nested struct field r/w:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    2000 wide x 50 rows (write in-mem)            2336 / 2558          0.0       23364.4       0.1X

    numFields = 25
    wide shallowly nested struct field r/w:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    2000 wide x 50 rows (write in-mem)            4237 / 4465          0.0       42367.9       0.1X

    numFields = 100
    wide shallowly nested struct field r/w:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    2000 wide x 50 rows (write in-mem)          10458 / 11223          0.0      104582.0       0.0X

    numFields = Infinity
    wide shallowly nested struct field r/w:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    [info]   java.lang.OutOfMemoryError: Java heap space
    ```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ericl/spark truncated-string

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13537.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13537
    
----
commit d16e0f3e22287a7f3779ed24239d84179602e30a
Author: Eric Liang <e...@databricks.com>
Date:   2016-06-07T00:56:06Z

    truncate strings

commit f4f4368d3550b864c6286ce04770990b41c6741c
Author: Eric Liang <e...@databricks.com>
Date:   2016-06-07T01:37:13Z

    Mon Jun  6 18:37:13 PDT 2016

commit 17f98d76aec40bc7c6b8c46925d4013f9bccd639
Author: Eric Liang <e...@databricks.com>
Date:   2016-06-07T01:43:24Z

    Mon Jun  6 18:43:24 PDT 2016

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
