Hi,

I asked this before but didn't receive any comments; with the
impending release of 1.5 I wanted to bring it up again.
Right now, Spark is very tightly coupled with OSS Hive & Hadoop, which
causes me a lot of work every time there is a new version, because I don't
run OSS Hive/Hadoop versions (and before you ask, I can't).

My question is: does Spark need to be so tightly coupled with these two?
Or, put differently, would it be possible to introduce a developer API
between Spark (up to and including e.g. SqlContext) and Hadoop (for the HDFS
bits) and Hive (e.g. HiveContext and beyond), and move the actual Hadoop & Hive
dependencies into plugins (e.g. separate maven modules)?
This would allow me to easily maintain my own Hive/Hadoop-ish integration
with our internal systems without ever having to touch Spark code.
I expect this could also allow, for instance, Hadoop vendors to provide their
own, more optimized implementations without Spark having to know about them.
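
To make this concrete, here is a rough, purely hypothetical sketch (the
trait and names are mine, not an existing Spark API) of what the Hive-facing
side of such a developer API could look like; Spark would program only
against the trait, and the OSS Hive integration would be just one plugin
module implementing it:

// Hypothetical SPI sketch -- not an existing Spark interface.
// Spark core/sql would depend only on this trait; the OSS Hive
// integration (and any vendor or internal variant) would live in its
// own maven module that implements it.
package org.apache.spark.sql.catalogspi

trait ExternalCatalogPlugin {
  /** List the tables visible in the given database. */
  def listTables(db: String): Seq[String]

  /** Fetch table metadata (schema, location, format) by name. */
  def getTable(db: String, table: String): CatalogTableInfo

  /** Release any connections or resources held by the plugin. */
  def close(): Unit
}

/** Minimal, plugin-agnostic table description passed across the API boundary. */
case class CatalogTableInfo(
    name: String,
    schema: Seq[(String, String)],   // (columnName, dataType)
    location: Option[String],
    format: Option[String])

An implementation could then be selected at runtime, e.g. via a class-name
config setting or java.util.ServiceLoader, so swapping in an internal or
vendor-specific integration would never require touching Spark code itself.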

cheers,
Tom
