[ https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15151368#comment-15151368 ]
Max Seiden commented on SPARK-12449:
------------------------------------

Very interested in checking out that PR! It would also be prudent to have a holistic, high-level design for any work here, mostly to answer a few major questions. A sample of such questions:

+ Should there be a new trait for each new `sources.*` type, or a single trait that communicates capabilities to the planner (i.e. the CatalystSource design)?
  a) A new trait for each source could get unwieldy given the potential number of permutations.
  b) A single, generic trait is powerful, but it puts a lot of burden on the implementer to cover more cases than they may want.
+ Depending on the above, should source plans be a tree of operators, or a list of operators to be applied in order?
  a) The first option is more natural, but it smells a lot like Catalyst -- not a bad thing if it's a separate, stable API, though.
+ The more that's pushed down via sources.Expressions, the more complex things may get for implementers.
  a) For example, if Aliases are pushed down, there's a lot more opportunity for resolution bugs in the source implementation.
  b) A definitive stance would be needed for expressions like UDFs or those dealing with complex types.
  c) Without a way to signal capabilities (implicitly or explicitly) to the planner, there would likely need to be a way to "bail out".

> Pushing down arbitrary logical plans to data sources
> ----------------------------------------------------
>
>                 Key: SPARK-12449
>                 URL: https://issues.apache.org/jira/browse/SPARK-12449
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Stephan Kessler
>         Attachments: pushingDownLogicalPlans.pdf
>
> With the help of the DataSource API we can pull data from external sources for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows pushing down filters and projections, pruning unnecessary rows and fields directly in the data source.
> However, data sources such as SQL engines are capable of doing even more preprocessing, e.g., evaluating aggregates. This is beneficial because it would reduce the amount of data transferred from the source to Spark. The existing interfaces do not allow this kind of processing in the source.
> We propose to add a new interface {{CatalystSource}} that allows deferring the processing of arbitrary logical plans to the data source. We have already presented the details at Spark Summit Europe 2015: [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/]
> I will add a design document explaining the details.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
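To make the "single generic trait plus capability signaling plus bail-out" option from the comment above concrete, here is a minimal, self-contained sketch in plain Scala. It does not use Spark's actual APIs: all names (`LogicalOp`, `Capability`, `CatalystSourceLike`, `PlannerSketch`) and the SQL-fragment planning strategy are hypothetical stand-ins for illustration only, not a proposed implementation.

```scala
// Hypothetical sketch of the "single trait with capability signaling" design.
// None of these types exist in Spark; they only illustrate the shape of the idea.

// A tiny stand-in for a logical operator tree.
sealed trait LogicalOp
case class Scan(table: String) extends LogicalOp
case class Filter(pred: String, child: LogicalOp) extends LogicalOp
case class Aggregate(exprs: Seq[String], child: LogicalOp) extends LogicalOp

// Capabilities a source could declare to the planner.
sealed trait Capability
case object FilterPushdown extends Capability
case object AggregatePushdown extends Capability

// The single, generic trait: the source advertises what it supports and
// can still "bail out" at plan time by returning None.
trait CatalystSourceLike {
  def capabilities: Set[Capability]
  def plan(op: LogicalOp): Option[String] // e.g. a generated SQL fragment
}

// An example source that handles filters but not aggregates.
class FilterOnlySource extends CatalystSourceLike {
  val capabilities: Set[Capability] = Set(FilterPushdown)
  def plan(op: LogicalOp): Option[String] = op match {
    case Scan(t)          => Some(s"SELECT * FROM $t")
    case Filter(p, child) => plan(child).map(sql => s"$sql WHERE $p")
    case _: Aggregate     => None // bail out: this operator stays in Spark
  }
}

object PlannerSketch {
  // The planner either gets a pushed-down plan (Right) or keeps the
  // operator tree for local evaluation (Left).
  def pushDown(src: CatalystSourceLike, op: LogicalOp): Either[LogicalOp, String] =
    src.plan(op).toRight(op)

  def main(args: Array[String]): Unit = {
    val src = new FilterOnlySource
    println(pushDown(src, Filter("x > 1", Scan("t"))))
    println(pushDown(src, Aggregate(Seq("sum(x)"), Scan("t"))))
  }
}
```

The bail-out path (returning `None`) is what item c) above calls for: without it, a generic trait would force every implementer to handle every operator and expression the planner might hand them.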