I'm not looking for a one-off solution for a specific query that can
be solved on the client side as you suggest, but rather a generic
solution that can be implemented within the DataSource impl itself
when it knows a sub-query can be pushed down into the engine. In other
words, I'd like to intercept the query planning process so that I can
push computation down into the engine when it makes sense.
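(As an aside: for JDBC sources in particular, I believe the reader's `dbtable` option will accept a parenthesized sub-query in place of a table name, which amounts to pushing the sub-query down by hand — but that doesn't help a custom DataSource.) To make the payoff concrete, here is a minimal, self-contained sketch using Python's stdlib sqlite3 as a stand-in for the backing engine; the table name, schema, and row counts are invented purely for illustration:

```python
import sqlite3

# Stand-in "engine" (sqlite here; Solr or a JDBC database in this thread's scenario).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (movie_id INTEGER, rating INTEGER)")
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?)",
    [(m, r) for m in range(100) for r in (3, 4, 5)],  # 300 rows total
)

# Naive plan: every matching row crosses the wire, aggregation happens client-side.
pulled = conn.execute(
    "SELECT movie_id FROM ratings WHERE rating >= 4"
).fetchall()

# Pushed-down plan: the engine does the GROUP BY; only the top 10 rows cross the wire.
pushed = conn.execute(
    """SELECT movie_id, COUNT(*) AS aggCount
       FROM ratings
       WHERE rating >= 4
       GROUP BY movie_id
       ORDER BY aggCount DESC
       LIMIT 10"""
).fetchall()

print(len(pulled), len(pushed))  # 200 rows transferred vs 10
```

The queries are the same shape as the join's inner sub-query below; the point is just the transfer-size gap between the two plans.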

On Wed, Jul 27, 2016 at 8:04 AM, Marco Colombo
<ing.marco.colo...@gmail.com> wrote:
> Why don't you create a filtered dataframe, register it as a temporary table
> and then use it in your query? You can also cache it, if multiple queries
> on the same inner query are requested.
>
>
> On Wednesday, July 27, 2016, Timothy Potter <thelabd...@gmail.com>
> wrote:
>>
>> Take this simple join:
>>
>> SELECT m.title as title, solr.aggCount as aggCount FROM movies m INNER
>> JOIN (SELECT movie_id, COUNT(*) as aggCount FROM ratings WHERE rating
>> >= 4 GROUP BY movie_id ORDER BY aggCount desc LIMIT 10) as solr ON
>> solr.movie_id = m.movie_id ORDER BY aggCount DESC
>>
>> I would like the ability to push the inner sub-query aliased as "solr"
>> down into the data source engine, in this case Solr as it will
>> greatly reduce the amount of data that has to be transferred from
>> Solr into Spark. I would imagine this issue comes up frequently if the
>> underlying engine is a JDBC data source as well ...
>>
>> Is this possible? Of course, my example is a bit cherry-picked so
>> determining if a sub-query can be pushed down into the data source
>> engine is probably not a trivial task, but I'm wondering if Spark has
>> the hooks to allow me to try ;-)
>>
>> Cheers,
>> Tim
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>
>
> --
> Ing. Marco Colombo
