Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16989#discussion_r117171649
  
    --- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala ---
    @@ -278,4 +278,21 @@ package object config {
             "spark.io.compression.codec.")
           .booleanConf
           .createWithDefault(false)
    +
    +  private[spark] val SHUFFLE_ACCURATE_BLOCK_THRESHOLD =
    +    ConfigBuilder("spark.shuffle.accurateBlkThreshold")
    +      .doc("When we compress the size of shuffle blocks in HighlyCompressedMapStatus, we will " +
    +        "record the size accurately if it's above the threshold specified by this config. This " +
    --- End diff --
    
    One edge case to consider is the situation where every shuffle block is _just_ over this threshold: in that case `HighlyCompressedMapStatus` won't really be doing any compression.
    
    Does it make sense to compare against the average block size and record accurately only the blocks that are more than some percent / threshold above that average? The number of such blocks will probably be smaller, and this might help avoid worst-case behavior or excessive bloating of the map output status sizes if someone were to set this configuration too low.

