[jira] [Updated] (SPARK-19091) Implement more accurate statistics for LogicalRDD when child is a mapped ParallelCollectionRDD

Josh Rosen (JIRA) Thu, 05 Jan 2017 15:37:23 -0800

     [ 
https://issues.apache.org/jira/browse/SPARK-19091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Josh Rosen updated SPARK-19091:
-------------------------------
    Summary: Implement more accurate statistics for LogicalRDD when child is a 
mapped ParallelCollectionRDD   (was: Implement more accurate statistics for 
LogicalRDD when child is ParallelCollectionRDD )

> Implement more accurate statistics for LogicalRDD when child is a mapped 
> ParallelCollectionRDD 
> -----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-19091
>                 URL: https://issues.apache.org/jira/browse/SPARK-19091
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Josh Rosen
>
> The Catalyst optimizer uses LogicalRDD to represent scans from existing RDDs. 
> In general, it's hard to produce size estimates for arbitrary RDDs, which is 
> why the current implementation simply estimates these relations size using 
> the default size (see the TODO at (for Datasets created via 
> spark.createDataFrame()))
> In the special case where data has been parallelized with 
> {{sc.parallelize()}} we'll wind up with a ParallelCollectionRDD whose number 
> of elements is known. When we construct a LogicalRDD plan node in 
> {{SparkSession.createDataFrame()}} we'll probably be using an RDD which is a 
> one-to-one transformation of a parallel collection RDD (e.g. mapping an 
> encoder to convert case classes to internal rows). Thus we can have 
> LogicalRDD's statistics method pattern-match on cases where we have a 
> MappedPartitionsRDD stacked immediately on top of a ParallelCollectionRDD, 
> then use walk up the RDD parent chain to determine the number of elements and 
> we can combine this with the schema and a conservative "fudge factor" to 
> produce an over-estimate of the LogicalRDD's size which will be more accurate 
> than the default size.
> I believe that this will help us to avoid falling into pathologically bad 
> plans when joining tiny parallelize()d data sets against huge tables and have 
> one of my own production use-cases which would have benefitted directly from 
> such an optimization. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-19091) Implement more accurate statistics for LogicalRDD when child is a mapped ParallelCollectionRDD

Reply via email to