Link to the JIRA tracking the API for hooking external data stores into the planner: https://issues.apache.org/jira/browse/SPARK-3248
On Wed, Aug 27, 2014 at 1:56 PM, Reynold Xin <r...@databricks.com> wrote:

> Hi Rajendran,
>
> I'm assuming you have some concept of schema and you are intending to
> integrate with SchemaRDD instead of normal RDDs.
>
> More responses inline below.
>
> On Fri, Aug 22, 2014 at 2:21 AM, Rajendran Appavu <appra...@in.ibm.com>
> wrote:
>
>> I am new to the Spark source code and am looking to see if I can add
>> push-down support for Spark filters to the storage layer (in my case an
>> object store). I would like to consider how this can be done
>> generically for any store we might want to integrate with Spark, so I
>> am looking to learn which areas I should look into to support a new
>> data store in this context. Here are some of the questions I have to
>> start with:
>>
>> 1. Do we need to create a new RDD class for the new store that we want
>> to support? From the SparkContext we create an RDD, and the operations
>> on data, including filter, are performed through the RDD methods.
>
> You can create a new RDD type for a new storage system, and you can
> create a new table scan operator in sql to read it.
>
>> 2. When we specify the code for a filter task in the RDD.filter()
>> method, how does it get communicated to the Executor on the data node?
>> Does the Executor need to compile this code on the fly and execute it,
>> or how does it work? (I have looked at the code for some time but have
>> not yet figured this out, so I am looking for some pointers that can
>> help me get up to speed in this part of the code.)
>
> Right now the best way to do this is to hack the sql strategies, which
> do some predicate pushdown into the table scan:
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
>
> We are in the process of proposing an API that allows external data
> stores to hook into the planner. Expect a design proposal in early/mid
> Sept.
>
> Once that is in place, you wouldn't need to hack the planner anymore.
> It is a good idea to start prototyping by hacking the planner, and then
> migrate to the planner hook API once that is ready.
>
>> 3. How long does the Executor hold the memory, and how does it decide
>> when to release the memory/cache?
>
> Executors by default don't actually hold any data in memory. Spark
> requires explicit caching of data: only when rdd.cache() is called will
> Spark executors put the contents of that RDD in memory. The executor
> has a component called the BlockManager that does eviction based on
> LRU.
>
>> Thank you in advance.
>>
>> Regards,
>> Rajendran.
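To make the custom-RDD route from the thread concrete, here is a minimal
sketch of an RDD over an object store. ObjectStoreClient, listChunks, and
readChunk are invented placeholders for whatever API the store exposes; the
overrides (getPartitions, compute) are the actual RDD extension points.

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    // Hypothetical client for the object store; stands in for the real API.
    trait ObjectStoreClient extends Serializable {
      def listChunks(path: String): Seq[String]
      def readChunk(chunk: String): Iterator[String]
    }

    case class ObjectStorePartition(index: Int, chunk: String) extends Partition

    class ObjectStoreRDD(
        sc: SparkContext,
        client: ObjectStoreClient,
        path: String)
      extends RDD[String](sc, Nil) { // Nil: no parent RDDs

      // Driver side: one partition per chunk of the object store path.
      override protected def getPartitions: Array[Partition] =
        client.listChunks(path).zipWithIndex.map {
          case (chunk, i) => ObjectStorePartition(i, chunk)
        }.toArray

      // Executor side: stream the records for this partition's chunk.
      override def compute(split: Partition, context: TaskContext): Iterator[String] =
        client.readChunk(split.asInstanceOf[ObjectStorePartition].chunk)
    }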
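The predicate-pushdown hack Reynold points at boils down to one idea, shown
below independently of the actual SparkStrategies internals: split the
filters into those the store can evaluate and those Spark must still apply
after the scan. The Filter case classes and supportedByStore are invented
for illustration, not Spark SQL types.

    // Toy filter algebra; real code would use Spark SQL's expression trees.
    sealed trait Filter
    case class GreaterThan(column: String, value: Int) extends Filter
    case class Contains(column: String, substring: String) extends Filter

    // Hypothetical capability check: this store only handles comparisons.
    def supportedByStore(f: Filter): Boolean = f match {
      case _: GreaterThan => true
      case _              => false
    }

    // `pushed` goes into the storage scan request; `residual` becomes a
    // Spark filter operator layered on top of the scan.
    def planScan(filters: Seq[Filter]): (Seq[Filter], Seq[Filter]) =
      filters.partition(supportedByStore)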
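And a short illustration of the explicit-caching point from question 3:
executors hold nothing until cache() (or persist()) is called, the cached
blocks live in each executor's BlockManager, and unpersist() releases them.
The input path and local master are placeholders for this sketch.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("cache-demo").setMaster("local[*]"))

    val data = sc.textFile("input.txt") // lazy: nothing is read yet
    val cached = data.cache()           // mark for in-memory (MEMORY_ONLY) storage
    cached.count()                      // first action computes and caches the blocks
    cached.count()                      // now served from the executors' BlockManager
    cached.unpersist()                  // explicitly drop the cached blocks
    sc.stop()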