GitHub user clockfly opened a pull request:

    https://github.com/apache/spark/pull/14973

    [SPARK-17356][SQL][1.6] Fix out of memory issue when generating JSON for 
TreeNode

    This is a backport of PR https://github.com/apache/spark/pull/14915 to 
branch 1.6.
    
    ## What changes were proposed in this pull request?
    
    The class `org.apache.spark.sql.types.Metadata` is widely used in MLlib to 
store ML attributes. `Metadata` is commonly stored in the `Alias` expression. 
    
    ```scala
    case class Alias(child: Expression, name: String)(
        val exprId: ExprId = NamedExpression.newExprId,
        val qualifier: Option[String] = None,
        val explicitMetadata: Option[Metadata] = None,
        override val isGenerated: java.lang.Boolean = false)
    ```
    
    `Metadata` can have a large memory footprint, since the number of 
attributes can be big (on the scale of millions). When `toJSON` is called on an 
`Alias` expression, the `Metadata` is also converted to a large JSON string. 
    If a plan contains many such `Alias` expressions, calling `toJSON` may 
trigger an out-of-memory error, since converting every `Metadata` reference to 
JSON takes a huge amount of memory.
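
    As a rough illustration (a sketch only, not the exact reproducer from the 
JIRA; the metadata size and DataFrame below are made up), a plan whose `Alias` 
expressions each carry sizeable `Metadata` can exhaust the driver heap during 
`toJSON`:

    ```scala
    // Sketch only: assumes a Spark 1.6 spark-shell where `sqlContext` is provided.
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.MetadataBuilder

    import sqlContext.implicits._

    // Build Metadata with many entries, similar in spirit to what MLlib
    // attribute groups attach to ML feature columns.
    val bigMetadata = {
      val builder = new MetadataBuilder()
      (1 to 1000000).foreach(i => builder.putLong(s"attr_$i", i.toLong))
      builder.build()
    }

    val df = Seq((1, 2)).toDF("a", "b")

    // Each aliased column drags the big Metadata into an Alias node of the plan.
    val projected = df.select(
      col("a").as("a1", bigMetadata),
      col("b").as("b1", bigMetadata))

    // Before this fix, converting the plan to JSON also serialized every
    // Metadata instance, which could blow up the driver's heap.
    val json = projected.queryExecution.analyzed.toJSON
    ```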
     
    With this PR, we skip scanning `Metadata` during JSON conversion. For a 
reproducer of the OOM and an analysis, please see the JIRA ticket 
https://issues.apache.org/jira/browse/SPARK-17356. 
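
    The sketch below shows one way such skipping can look (a simplified 
illustration, not necessarily the exact patch): the per-field JSON conversion 
special-cases `Metadata` and emits an empty JSON object instead of recursing 
into its contents.

    ```scala
    import org.apache.spark.sql.types.Metadata
    import org.json4s.JsonAST._

    // Simplified sketch of skipping Metadata during JSON conversion; the real
    // TreeNode code handles many more field types than shown here.
    def fieldToJson(value: Any): JValue = value match {
      // Emit an empty object so huge metadata maps are never serialized.
      case _: Metadata => JObject(Nil)
      case s: String   => JString(s)
      case i: Int      => JInt(i)
      case other       => JString(other.toString) // fallback for this sketch
    }
    ```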
    
    ## How was this patch tested?
    
    Existing tests.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/clockfly/spark json_oom_1.6

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14973.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14973
    
----
commit 0b22ec689a7534c555d1a3c623f817f788163ca3
Author: Sean Zhong <seanzh...@databricks.com>
Date:   2016-09-06T09:26:41Z

    [SPARK-17369][SQL] MetastoreRelation toJSON throws AssertException due to 
missing otherCopyArgs
    
    ## What changes were proposed in this pull request?
    
    `TreeNode.toJSON` requires a subclass to explicitly override `otherCopyArgs` 
to include its curried constructor arguments; otherwise it throws an 
AssertException reporting that the number of constructor argument values does 
not match the number of constructor argument names.
    
    The class `MetastoreRelation` has a curried constructor parameter 
`client: HiveClient`, but Spark forgot to add it to `otherCopyArgs`.
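
    The pattern looks roughly like the sketch below (a simplified, hypothetical 
relation, not the actual `MetastoreRelation` code): a constructor argument in a 
second parameter list is invisible to `productIterator`, so it must be surfaced 
through `otherCopyArgs` for `toJSON` (and `makeCopy`) to line argument values up 
with argument names.

    ```scala
    import org.apache.spark.sql.catalyst.expressions.Attribute
    import org.apache.spark.sql.catalyst.plans.logical.LeafNode

    // Hypothetical leaf plan with a curried constructor argument, mirroring
    // the MetastoreRelation situation in a simplified form.
    case class MyRelation(tableName: String)(val client: AnyRef) extends LeafNode {
      override def output: Seq[Attribute] = Nil

      // Curried arguments are not part of productIterator, so they must be
      // listed here; otherwise toJSON sees fewer argument values than names
      // and fails its assertion.
      override def otherCopyArgs: Seq[AnyRef] = client :: Nil
    }
    ```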
    
    ## How was this patch tested?
    
    Unit tests.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
