Just a thought

Well, in Spark, RDDs are immutable, which is an advantage compared to a
conventional IMDB (in-memory database) like Oracle TimesTen: an index built
over data that never changes needs no locking for updates, so concurrency
is not an issue for certain indexes.

The overriding optimisation (as there is no physical IO) has to be reducing
memory footprint and CPU demands, and using indexes may help for full key
lookups. If I recall correctly, in-memory databases support hash indexes and
T-tree indexes, which are pretty common in these situations. But there is an
overhead in creating indexes on RDDs and, I presume, in parallelizing those
indexes.
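
To make the hash-index idea a bit more concrete, here is a rough sketch in
plain Python. It is not Spark API: the tuples merely stand in for immutable
RDD partitions, and the names hash_partition, build_hash_index and lookup are
hypothetical, made up for illustration. It only shows that the index is built
once (the overhead) and that a full key lookup then touches a single partition.

```python
# Hypothetical sketch: plain Python tuples stand in for immutable RDD partitions.
from collections import defaultdict


def hash_partition(rows, num_partitions):
    """Assign each (key, value) row to a partition by hashing its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in rows:
        partitions[hash(key) % num_partitions].append((key, value))
    # Freeze each partition as a tuple to mimic RDD immutability.
    return [tuple(p) for p in partitions]


def build_hash_index(partition):
    """Build a key -> row positions map for one partition (one-off cost)."""
    index = defaultdict(list)
    for pos, (key, _) in enumerate(partition):
        index[key].append(pos)
    return dict(index)


def lookup(partitions, indexes, key):
    """Full key lookup: pick the partition by hash, then probe its index."""
    pid = hash(key) % len(partitions)
    return [partitions[pid][pos][1] for pos in indexes[pid].get(key, [])]


rows = [(1, "a"), (2, "b"), (3, "c"), (1, "d")]
partitions = hash_partition(rows, num_partitions=2)
# Each index is built once; since the data never changes it is never updated.
indexes = [build_hash_index(p) for p in partitions]
result = lookup(partitions, indexes, 1)
```

Because the partitions never change, the indexes need no locking, but each one
costs extra memory, which is the trade-off against footprint mentioned above.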

With regard to getting data from, say, an underlying table in Hive into an
RDD and then into a temp table: depending on the size of that temp table, one
can debate whether an index on that temp table is worthwhile.

The question is: what use case do you have in mind?

HTH


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 30 May 2016 at 17:08, Michael Segel <msegel_had...@hotmail.com> wrote:

> I’m not sure where to post this since it’s a bit of a philosophical
> question in terms of design and vision for Spark.
>
> If we look at SparkSQL and performance… where does secondary indexing fit
> in?
>
> The reason this is a bit awkward is that if you view Spark as querying
> RDDs, which are temporary, indexing doesn’t make sense until you consider
> your use case and how long ‘temporary’ is.
> Then if you consider your RDD result set could be based on querying
> tables… and you could end up with an inverted table as an index… then
> indexing could make sense.
>
> Does it make sense to discuss this in user or dev email lists? Has anyone
> given this any thought in the past?
>
> Thx
>
> -Mike
>
>
