[ https://issues.apache.org/jira/browse/SPARK-30712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032066#comment-17032066 ]
liupengcheng commented on SPARK-30712:
--------------------------------------

OK, thanks! [~hyukjin.kwon].

> Estimate sizeInBytes from file metadata for parquet files
> ---------------------------------------------------------
>
>                 Key: SPARK-30712
>                 URL: https://issues.apache.org/jira/browse/SPARK-30712
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: liupengcheng
>            Priority: Major
>
> Currently, Spark applies a fixed compressionFactor when calculating
> `sizeInBytes` for a `HadoopFsRelation`. This estimate is inaccurate, and a
> good `compressionFactor` is hard to choose; an underestimate can cause OOMs
> by making Spark pick a BroadcastHashJoin for a relation that is too large
> to broadcast.
> So I propose to use the rowCount stored in the parquet BlockMetadata to
> estimate the in-memory size, which is more accurate.
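For illustration, here is a minimal sketch of the proposed estimation: read
the per-row-group row counts from the parquet footer and multiply by a
per-row width derived from the Spark schema. The helper names and the use of
DataType.defaultSize as the row-width heuristic are assumptions for this
sketch, not the actual patch.

{code:scala}
import scala.collection.JavaConverters._

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import org.apache.spark.sql.types.StructType

// Rough per-row in-memory width: sum of Spark's built-in per-type size
// estimates (DataType.defaultSize, e.g. 8 bytes for a LongType column).
def estimateRowWidth(schema: StructType): Long =
  schema.fields.map(_.dataType.defaultSize.toLong).sum

// Estimate a parquet file's in-memory size from its footer metadata
// instead of fileSize * compressionFactor.
def estimateSizeInBytes(path: Path, schema: StructType, conf: Configuration): Long = {
  val reader = ParquetFileReader.open(HadoopInputFile.fromPath(path, conf))
  try {
    // Each block (row group) listed in the footer records its row count.
    val rowCount = reader.getFooter.getBlocks.asScala.map(_.getRowCount).sum
    rowCount * estimateRowWidth(schema)
  } finally {
    reader.close()
  }
}
{code}

The resulting estimate would feed the same planner decision as today's, i.e.
the comparison against spark.sql.autoBroadcastJoinThreshold that determines
whether a BroadcastHashJoin is chosen.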