I have to clarify something… 
In SparkSQL, we can query against both immutable existing RDDs and 
Hive/HBase/MapRDB/<insert data source>, which are mutable. 
So we have to keep this in mind while we’re talking about secondary indexing. 
(It’s not just RDDs.)

I think the only advantage of being immutable is that once you generate and 
index the RDD, it’s not going to change, so the ‘correctness’ of the index 
(its referential integrity) is implicit. Here, the issue becomes how long the 
RDD will live. There is a cost to generating the index, which has to be 
weighed against its usefulness and the longevity of the underlying RDD. Since 
an RDD is typically tied to a single Spark context, building indexes may be 
cost-prohibitive.
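
Within a single Spark context you can already approximate this kind of index 
by hash-partitioning a pair RDD once and caching it, so that lookup() only 
touches the one partition the key hashes to. A minimal sketch in Scala (the 
data and partition count are illustrative):

    import org.apache.spark.{HashPartitioner, SparkContext}

    def indexedLookups(sc: SparkContext): Unit = {
      val pairs = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))

      // Paying the partition/cache cost up front only makes sense if the
      // RDD lives long enough to amortize it over many lookups.
      val indexed = pairs.partitionBy(new HashPartitioner(8)).cache()

      // lookup() consults the partitioner and scans a single partition.
      println(indexed.lookup(2)) // Seq("b")
    }

The same trade-off applies: the shuffle to partition the data is only worth it 
if the RDD outlives more than a handful of lookups.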

At the same time… if you are dealing with a large enough set of data, you will 
have I/O, both network and physical (disk). This is true of both Spark and 
in-memory RDBMSs, due to the footprint of the data along with the need to 
persist it.

But I digress. 

So in one scenario, we’re building our RDDs from a system that has indexing 
available. Is it safe to assume that SparkSQL will take advantage of the 
indexing in the underlying system? (Imagine sourcing data from an Oracle or DB2 
database in order to build RDDs.) If so, then we don’t have to worry about 
indexing.
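
My understanding (worth verifying) is that SparkSQL doesn’t use the source’s 
indexes directly, but the JDBC data source does push filters down to the 
database as a WHERE clause, so Oracle/DB2 is free to use its own index for the 
scan. A sketch, with the connection details and column name hypothetical:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("pushdown-sketch").getOrCreate()

    val orders = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL") // hypothetical
      .option("dbtable", "ORDERS")
      .option("user", "scott")
      .option("password", "tiger")
      .load()

    // The filter is pushed down as a WHERE clause; the database can then
    // use its index on CUSTOMER_ID. explain() lists it under PushedFilters.
    orders.filter("CUSTOMER_ID = 42").explain()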

In another scenario, we’re joining an RDD against a table in an RDBMS. Is it 
safe to assume that Spark will select the data from the database into an RDD 
prior to attempting the join? If so, the RDBMS table will use its index when 
that query executes. (Again, it’s an assumption…) Then you have two data sets 
that need to be joined, which leads to the third scenario…
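
That matches what I’d expect: the JDBC side is materialized into Spark first 
(with pushed-down filters letting the database use its index for the scan), 
and the join itself then runs inside Spark, not in the database. A sketch, 
with the table and column names hypothetical:

    import org.apache.spark.sql.SparkSession

    case class Event(customerId: Long, amount: Double)

    val spark = SparkSession.builder().appName("join-sketch").getOrCreate()
    val events = spark.createDataFrame(Seq(Event(42L, 9.99), Event(7L, 1.50)))

    val customers = spark.read
      .format("jdbc")
      .option("url", "jdbc:db2://dbhost:50000/SAMPLE") // hypothetical
      .option("dbtable", "CUSTOMERS")
      .load()

    // CUSTOMERS rows are pulled into the cluster; the join runs in Spark.
    val joined = events.join(customers, events("customerId") === customers("ID"))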

Joining two Spark RDDs. 
Going from memory, it’s a hash join. Here one RDD is used to create a hash 
table, which effectively implies an index on the hash key. So for joins, you 
wouldn’t need a secondary index? 
It wouldn’t provide any value, since the hash table gets built anyway. (And 
you would probably apply the filter as each row is inserted into the hash 
table, before the join.)
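
The build/probe structure I have in mind, in plain Scala (illustrative only, 
not Spark’s actual join implementation):

    // Build phase: hash the smaller side, applying the filter as rows are
    // inserted, so the hash table itself acts as the index on the join key.
    val small = Seq((1, "a"), (2, "b"), (3, "c"))
    val large = Seq((1, 1.0), (2, 2.0), (2, 2.5), (4, 4.0))

    val buildTable: Map[Int, String] =
      small.filter { case (k, _) => k != 3 }.toMap

    // Probe phase: stream the larger side, one O(1) lookup per row.
    val joined = large.flatMap { case (k, v) =>
      buildTable.get(k).map(s => (k, s, v))
    }
    // joined == Seq((1, "a", 1.0), (2, "b", 2.0), (2, "b", 2.5))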

Did I just answer my own question? 



> On May 30, 2016, at 10:58 AM, Mich Talebzadeh <mich.talebza...@gmail.com> 
> wrote:
> 
> Just a thought
> 
> Well, in Spark RDDs are immutable, which is an advantage compared to a 
> conventional IMDB like Oracle TimesTen, meaning concurrency is not an issue 
> for certain indexes.
> 
> The overriding optimisation (as there is no physical IO) has to be reducing 
> memory footprint and CPU demands, and using indexes may help for full-key 
> lookups. If I recall correctly, in-memory databases support hash indexes and 
> T-tree indexes, which are pretty common in these situations. But there is an 
> overhead in creating indexes on RDDs, and I presume in parallelizing those 
> indexes.
> 
> With regard to getting data from, say, an underlying table in Hive into a 
> temp table, then depending on the size of that temp table, one can debate an 
> index on that temp table.
> 
> The question is: what use case do you have in mind?
> 
> HTH
> 
> 
> Dr Mich Talebzadeh
> 
> LinkedIn: 
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> 
> http://talebzadehmich.wordpress.com
> 
> On 30 May 2016 at 17:08, Michael Segel <msegel_had...@hotmail.com> wrote:
> I’m not sure where to post this, since it’s a bit of a philosophical 
> question in terms of design and vision for Spark.
> 
> If we look at SparkSQL and performance… where does secondary indexing fit in?
> 
> The reason this is a bit awkward is that if you view Spark as querying RDDs 
> which are temporary, indexing doesn’t make sense until you consider your use 
> case and how long ‘temporary’ is.
> Then if you consider that your RDD result set could be based on querying 
> tables… and you could end up with an inverted table as an index… then 
> indexing could make sense.
> 
> Does it make sense to discuss this in user or dev email lists? Has anyone 
> given this any thought in the past?
> 
> Thx
> 
> -Mike
> 
> 