I'm not looking for a one-off solution for a specific query that can be solved on the client side as you suggest, but rather a generic solution that can be implemented within the DataSource impl itself when it knows a sub-query can be pushed down into the engine. In other words, I'd like to intercept the query planning process so I can push computation down into the engine when it makes sense.
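Roughly, the kind of hook I have in mind is something like Spark's experimental planner extension point. A minimal sketch (assuming the Spark 2.0 SparkSession API; the PushDownToSolr name and the plan shapes it would match are hypothetical, and whether extraStrategies is even the right place for this is part of my question):

import org.apache.spark.sql.{SparkSession, Strategy}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

// Hypothetical strategy: recognize a sub-plan the data source engine can
// evaluate (e.g. filter + aggregate + sort + limit over a Solr relation)
// and plan it as a single scan that ships the whole sub-query to Solr.
object PushDownToSolr extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    // Match the supported LogicalPlan shapes here and return a custom
    // SparkPlan that issues one Solr query for the matched sub-tree.
    case _ => Nil // an empty Seq defers to Spark's built-in strategies
  }
}

val spark = SparkSession.builder().appName("pushdown-sketch").getOrCreate()
// Register the strategy so the planner consults it before the defaults.
spark.experimental.extraStrategies = Seq(PushDownToSolr)

On 1.x the equivalent hook hangs off SQLContext.experimental. What I'm unsure about is whether this is the intended way for a DataSource to take over a whole sub-query, or whether there is a better-supported mechanism.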
On Wed, Jul 27, 2016 at 8:04 AM, Marco Colombo <ing.marco.colo...@gmail.com> wrote:
> Why don't you create a filtered dataframe, register it as a temporary table and
> then use it in your query? You can also cache it, if multiple queries on the
> same inner query are requested.
>
>
> On Wednesday, July 27, 2016, Timothy Potter <thelabd...@gmail.com>
> wrote:
>>
>> Take this simple join:
>>
>> SELECT m.title as title, solr.aggCount as aggCount FROM movies m INNER
>> JOIN (SELECT movie_id, COUNT(*) as aggCount FROM ratings WHERE rating
>> >= 4 GROUP BY movie_id ORDER BY aggCount desc LIMIT 10) as solr ON
>> solr.movie_id = m.movie_id ORDER BY aggCount DESC
>>
>> I would like the ability to push the inner sub-query aliased as "solr"
>> down into the data source engine, in this case Solr, as it will
>> greatly reduce the amount of data that has to be transferred from
>> Solr into Spark. I would imagine this issue comes up frequently when the
>> underlying engine is a JDBC data source as well ...
>>
>> Is this possible? Of course, my example is a bit cherry-picked, so
>> determining whether a sub-query can be pushed down into the data source
>> engine is probably not a trivial task, but I'm wondering if Spark has
>> the hooks to allow me to try ;-)
>>
>> Cheers,
>> Tim
>>
>
>
> --
> Ing. Marco Colombo

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
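The client-side approach suggested in the quoted reply would look roughly like this in Spark 2.0 (a sketch only: the "solr" format and its options are placeholders for whatever connector is in use, and the aggregation still runs in Spark after all matching rows are transferred, which is the cost the reply at the top is trying to avoid):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.desc

val spark = SparkSession.builder().appName("client-side-variant").getOrCreate()

// Hypothetical Solr read; the format name and options depend on the connector.
val ratings = spark.read.format("solr").option("collection", "ratings").load()

val topMovies = ratings
  .filter("rating >= 4")
  .groupBy("movie_id")
  .count()
  .withColumnRenamed("count", "aggCount")
  .orderBy(desc("aggCount"))
  .limit(10)
  .cache() // reuse across multiple outer queries

topMovies.createOrReplaceTempView("solr")

// The outer join then references the pre-aggregated temp view
// (assumes a "movies" view has been registered the same way):
spark.sql(
  """SELECT m.title AS title, solr.aggCount AS aggCount
    |FROM movies m JOIN solr ON solr.movie_id = m.movie_id
    |ORDER BY aggCount DESC""".stripMargin)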