[ https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15151368#comment-15151368 ]

Max Seiden commented on SPARK-12449:
------------------------------------

Very interested in checking out that PR! It would also be prudent to have a 
holistic, high-level design for any work here, mostly to answer a few major 
questions. A sample of such questions:

+ Should there be a new trait for each new `sources.*` type, or a single trait 
that communicates capabilities to the planner (i.e., the CatalystSource design)?
      a) a new trait for each source could get unwieldy given the potential 
number of permutations
      b) a single, generic trait is powerful, but it puts a lot of burden on 
the implementer to cover more cases than they may want
 
+ Depending on the above, should source plans be a tree of operators or a list 
of operators to be applied in order?
      a) the first option is more natural, but it smells a lot like Catalyst -- 
not a bad thing if it's a separate, stable API though
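The two plan shapes can be sketched as follows. This is purely illustrative; none of these types exist in Spark, and the names (`PlanTree`, `FlatPlan`, etc.) are invented:

```scala
// Tree form: each operator owns its child -- in effect a small Catalyst.
sealed trait PlanTree
case class TableScan(table: String) extends PlanTree
case class FilterNode(predicate: String, child: PlanTree) extends PlanTree
case class ProjectNode(columns: Seq[String], child: PlanTree) extends PlanTree

// List form: a base scan plus operators to be applied in the given order.
sealed trait PlanStep
case class FilterStep(predicate: String) extends PlanStep
case class ProjectStep(columns: Seq[String]) extends PlanStep
case class FlatPlan(table: String, steps: Seq[PlanStep])

// A straight-pipeline tree can always be flattened into the list form,
// which suggests the list is a safe, stable subset of the tree.
def flatten(plan: PlanTree, acc: List[PlanStep] = Nil): FlatPlan = plan match {
  case TableScan(t)          => FlatPlan(t, acc)
  case FilterNode(p, child)  => flatten(child, FilterStep(p) :: acc)
  case ProjectNode(c, child) => flatten(child, ProjectStep(c) :: acc)
}
```

The converse does not hold: a tree can express shapes (e.g. joins with two children) that an in-order list cannot, which is the real content of the design question.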

+ The more that's pushed down via sources.Expressions, the more complex things 
may get for implementers
      a) for example, if Aliases are pushed down, there's a lot more 
opportunity for resolution bugs in the source implementation
      b) a definitive stance would be needed for expressions like UDFs or 
those dealing with complex types
      c) without a way to signal capabilities (implicitly or explicitly) to the 
planner, there'd likely need to be a way to "bail out"
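One way a "bail out" could look: the source partitions candidate expressions into those it can evaluate and a residual set the planner evaluates in Spark. Again a hypothetical sketch with invented names (`Expr`, `split`, `noUdfs`), not an existing API:

```scala
// A toy expression algebra standing in for sources.Expression.
sealed trait Expr
case class Column(name: String) extends Expr
case class Literal(value: Any) extends Expr
case class GreaterThan(left: Expr, right: Expr) extends Expr
case class Udf(name: String, args: Seq[Expr]) extends Expr // opaque to the source

// Partition pushdown candidates into "handled by the source" vs
// "left for Spark to evaluate" -- the residual set is the bail-out.
def split(exprs: Seq[Expr], canHandle: Expr => Boolean): (Seq[Expr], Seq[Expr]) =
  exprs.partition(canHandle)

// A conservative source might reject any expression containing a UDF,
// taking the definitive stance from (b) by construction.
def noUdfs(e: Expr): Boolean = e match {
  case Udf(_, _)         => false
  case GreaterThan(l, r) => noUdfs(l) && noUdfs(r)
  case _                 => true
}
```

The important property is that rejecting an expression is always safe: Spark still evaluates the residual set, so a source only ever opts into work it understands.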

> Pushing down arbitrary logical plans to data sources
> ----------------------------------------------------
>
>                 Key: SPARK-12449
>                 URL: https://issues.apache.org/jira/browse/SPARK-12449
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Stephan Kessler
>         Attachments: pushingDownLogicalPlans.pdf
>
>
> With the help of the DataSource API we can pull data from external sources 
> for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows 
> pushing down filters and projections, pruning unnecessary fields and rows 
> directly in the data source.
> However, data sources such as SQL engines are capable of doing even more 
> preprocessing, e.g., evaluating aggregates. This is beneficial because it 
> would reduce the amount of data transferred from the source to Spark. The 
> existing interfaces do not allow this kind of processing in the source.
> We propose adding a new interface {{CatalystSource}} that allows deferring 
> the processing of arbitrary logical plans to the data source. We already 
> presented the details at Spark Summit Europe 2015: 
> [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/]
> I will add a design document explaining the details.


