Link to the JIRA tracking the API for hooking external data stores into the planner: https://issues.apache.org/jira/browse/SPARK-3248
On Wed, Aug 27, 2014 at 1:56 PM, Reynold Xin <r...@databricks.com> wrote:

> Hi Rajendran,
>
> I'm assuming you have some concept of schema and you are intending to
> integrate with SchemaRDD instead of normal RDDs.
>
> More responses inline below.
>
> On Fri, Aug 22, 2014 at 2:21 AM, Rajendran Appavu <appra...@in.ibm.com>
> wrote:
>
>> I am new to the Spark source code and am looking to see if I can add
>> push-down support for Spark filters to the storage layer (in my case an
>> object store). I would like to consider how this can be done
>> generically for any store we might want to integrate with Spark, so I
>> am looking to learn which areas I should look into to support a new
>> data store in this context. Here are some of the questions I have to
>> start with:
>>
>> 1. Do we need to create a new RDD class for the new store that we want
>> to support? From the SparkContext we create an RDD, and the operations
>> on data, including filter, are performed through the RDD methods.
>
> You can create a new RDD type for a new storage system, and you can
> create a new table scan operator in sql to read it.
>
>> 2. When we specify the code for a filter task in the RDD.filter()
>> method, how does it get communicated to the Executor on the data node?
>> Does the Executor need to compile this code on the fly and execute it,
>> or how does it work? (I have looked at the code for some time but have
>> not yet figured this out, so I am looking for some pointers that can
>> help me get up to speed in this part of the code.)
>
> Right now the best way to do this is to hack the sql strategies, which
> do some predicate pushdown into the table scan:
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
>
> We are in the process of proposing an API that allows external data
> stores to hook into the planner. Expect a design proposal in early/mid
> Sept.
>
> Once that is in place, you wouldn't need to hack the planner anymore.
> It is a good idea to start prototyping by hacking the planner, and then
> migrate to the planner hook API once that is ready.
>
>> 3. How long does the Executor hold the memory, and how does it decide
>> when to release the memory/cache?
>
> Executors by default don't actually hold any data in memory. Spark
> requires explicit caching of data: only when rdd.cache() is called will
> Spark executors put the contents of that RDD in memory. The executor
> has a component called the BlockManager that does eviction based on
> LRU.
>
>> Thank you in advance.
>>
>> Regards,
>> Rajendran.
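To make the custom-RDD route from the thread concrete, here is a minimal
sketch of an RDD over an object store. ObjectStoreClient, listChunks, and
readChunk are invented placeholders for whatever API the store exposes; the
overrides (getPartitions, compute) are the actual RDD extension points.

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    // Hypothetical client for the object store; stands in for the real API.
    trait ObjectStoreClient extends Serializable {
      def listChunks(path: String): Seq[String]
      def readChunk(chunk: String): Iterator[String]
    }

    case class ObjectStorePartition(index: Int, chunk: String) extends Partition

    class ObjectStoreRDD(
        sc: SparkContext,
        client: ObjectStoreClient,
        path: String)
      extends RDD[String](sc, Nil) { // Nil: no parent RDDs

      // Driver side: one partition per chunk of the object store path.
      override protected def getPartitions: Array[Partition] =
        client.listChunks(path).zipWithIndex.map {
          case (chunk, i) => ObjectStorePartition(i, chunk)
        }.toArray

      // Executor side: stream the records for this partition's chunk.
      override def compute(split: Partition, context: TaskContext): Iterator[String] =
        client.readChunk(split.asInstanceOf[ObjectStorePartition].chunk)
    }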
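The predicate-pushdown hack Reynold points at boils down to one idea, shown
below independently of the actual SparkStrategies internals: split the
filters into those the store can evaluate and those Spark must still apply
after the scan. The Filter case classes and supportedByStore are invented
for illustration, not Spark SQL types.

    // Toy filter algebra; real code would use Spark SQL's expression trees.
    sealed trait Filter
    case class GreaterThan(column: String, value: Int) extends Filter
    case class Contains(column: String, substring: String) extends Filter

    // Hypothetical capability check: this store only handles comparisons.
    def supportedByStore(f: Filter): Boolean = f match {
      case _: GreaterThan => true
      case _              => false
    }

    // `pushed` goes into the storage scan request; `residual` becomes a
    // Spark filter operator layered on top of the scan.
    def planScan(filters: Seq[Filter]): (Seq[Filter], Seq[Filter]) =
      filters.partition(supportedByStore)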
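And a short illustration of the explicit-caching point from question 3:
executors hold nothing until cache() (or persist()) is called, the cached
blocks live in each executor's BlockManager, and unpersist() releases them.
The input path and local master are placeholders for this sketch.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("cache-demo").setMaster("local[*]"))

    val data = sc.textFile("input.txt") // lazy: nothing is read yet
    val cached = data.cache()           // mark for in-memory (MEMORY_ONLY) storage
    cached.count()                      // first action computes and caches the blocks
    cached.count()                      // now served from the executors' BlockManager
    cached.unpersist()                  // explicitly drop the cached blocks
    sc.stop()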