I did an initial implementation. There are two things I assumed from the
start that, to my surprise, are not part of the predicate pushdown
API:

1) The fields in the SELECT clause are not pushed down to the predicate
pushdown API. I have many optimizations that allow fields to be filtered
out before the resulting object is serialized on the Accumulo tablet
server. How can I get the selection information from the execution plan?
I'm a little hesitant to implement the relation that exposes the logical
plan directly, because the comments note that it could change without
warning.

2) I'm surprised to find that the predicate pushdown filters get completely
removed when the WHERE clause contains anything more complex than simple
AND conditions. Using an OR caused the filter array passed into the
PrunedFilteredScan to be empty (see the sketch below).
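
For reference, here is roughly the shape of the relation I've implemented (a
simplified sketch: the class and field names are placeholders, the real code
is in [1] below, and the exact shape of these interfaces has varied a bit
across Spark versions):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.StructType

// Placeholder relation, only to show where the pushdown information arrives.
case class EventStoreRelation(override val schema: StructType)
                             (@transient val sqlContext: SQLContext)
  extends BaseRelation with PrunedFilteredScan {

  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] = {
    // requiredColumns is the pruned projection; filters is the set of
    // translated predicates, treated as a conjunction. With an OR in the
    // WHERE clause this array came through empty for me.
    println(s"columns=${requiredColumns.mkString(",")} filters=${filters.mkString(",")}")
    sqlContext.sparkContext.emptyRDD[Row] // the real version builds the Accumulo scan here
  }
}

So a query like SELECT key, value FROM events WHERE type = 'a' OR type = 'b'
(table and column names made up) prints an empty filter list, while replacing
the OR with an AND of simple predicates gives me one Filter per predicate.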


I have an example [1] of what I'm trying to accomplish.

[1]
https://github.com/calrissian/accumulo-recipes/blob/273/thirdparty/spark/src/main/scala/org/calrissian/accumulorecipes/spark/sql/EventStore.scala#L49


On Fri, Jan 16, 2015 at 10:17 PM, Corey Nolet <cjno...@gmail.com> wrote:

> Hao,
>
> Thanks so much for the links! This is exactly what I'm looking for. If I
> understand correctly, I can extend PrunedFilteredScan, PrunedScan, and
> TableScan, and I should be able to support all the SQL semantics?
>
> I'm a little confused about the Array[Filter] that is used with the
> filtered scan. I have the ability to perform pretty robust seeks in the
> underlying data sets in Accumulo. I have an inverted index, and I'm able to
> do intersections as well as unions, plus rich predicates that form a tree
> of alternating intersections and unions. If I understand correctly, the
> Array[Filter] is to be treated as an AND operator? Do OR operators get
> propagated through the API at all? I'm trying to do as much paring down of
> the dataset as possible on the individual tablet servers so that the data
> loaded into the Spark layer is minimal and really only used to perform
> joins, groupBys, sortBys, and other computations that would require the
> relations to be combined in various ways.
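>
> To make sure I'm reading the contract right, here's my (possibly wrong)
> mental model as a sketch; the attribute names are made up, and which nodes
> actually get produced may depend on the Spark version:
>
> import org.apache.spark.sql.sources._
>
> // WHERE type = 'a' AND time > 100 arrives as a conjunction: every element
> // of the array must hold for a row to be returned.
> val conjunction: Array[Filter] =
>   Array(EqualTo("type", "a"), GreaterThan("time", 100))
>
> // A disjunction would have to show up as a single Or node wrapping both
> // sides, assuming the sources API in this version has an Or filter at all.
> val disjunction: Array[Filter] =
>   Array(Or(EqualTo("type", "a"), EqualTo("type", "b")))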
>
> Thanks again for pointing me to this.
>
>
>
> On Fri, Jan 16, 2015 at 2:07 AM, Cheng, Hao <hao.ch...@intel.com> wrote:
>
>> The Data Source API probably works for this purpose.
>>
>> It supports column pruning and predicate pushdown:
>>
>>
>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
>>
>>
>>
>> Examples also can be found in the unit test:
>>
>>
>> https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/sources
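>>
>> For quick reference, the scan interfaces in that file look roughly like
>> this (paraphrased sketch; check the link above for the exact definitions
>> in your Spark version):
>>
>> import org.apache.spark.rdd.RDD
>> import org.apache.spark.sql.Row
>> import org.apache.spark.sql.sources.Filter
>>
>> trait TableScan {          // full table scan
>>   def buildScan(): RDD[Row]
>> }
>>
>> trait PrunedScan {         // column pruning only
>>   def buildScan(requiredColumns: Array[String]): RDD[Row]
>> }
>>
>> trait PrunedFilteredScan { // column pruning + predicate pushdown
>>   def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
>> }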
>>
>>
>>
>>
>>
>> *From:* Corey Nolet [mailto:cjno...@gmail.com]
>> *Sent:* Friday, January 16, 2015 1:51 PM
>> *To:* user
>> *Subject:* Spark SQL Custom Predicate Pushdown
>>
>>
>>
>> I have document storage services in Accumulo that I'd like to expose to
>> Spark SQL. I am able to push down predicate logic to Accumulo to have it
>> perform only the seeks necessary on each tablet server to grab the results
>> being asked for.
>>
>>
>>
>> I'm interested in using Spark SQL to push those predicates down to the
>> tablet servers. Where would I begin my implementation? Currently I have an
>> input format which accepts a "query object" that gets pushed down. How
>> would I extract this information from the HiveContext/SQLContext to be able
>> to push this down?
>>
>
>
