Hi, I asked this before but didn't receive any comments, and with the impending release of 1.5 I wanted to bring it up again. Right now, Spark is very tightly coupled with OSS Hive and Hadoop, which causes me a lot of work every time a new version comes out, because I don't run the OSS Hive/Hadoop versions (and before you ask, I can't).
My question is: does Spark need to be so tightly coupled with these two? Or, put differently, would it be possible to introduce a developer API between Spark (up to and including e.g. SQLContext) and Hadoop (for the HDFS bits) and Hive (e.g. HiveContext and beyond), and move the actual Hadoop and Hive dependencies into plugins (e.g. separate maven modules)? This would let me maintain my own Hive/Hadoop-ish integration with our internal systems without ever having to touch Spark code. I expect it could also allow, for instance, Hadoop vendors to provide their own, more optimized implementations without Spark having to know about them. A rough sketch of what I mean is below. cheers, Tom
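To make the idea a bit more concrete, here is a rough sketch of what such a developer API could look like. This is purely illustrative: the trait names, method signatures, and the config key below are all made up, and nothing like this exists in Spark today.

```scala
// Hypothetical sketch only -- these traits do not exist in Spark.
// The idea: Spark programs against a small service-provider interface (SPI),
// and the Hive/Hadoop specifics live in separate maven modules that implement it.

import java.io.InputStream
import org.apache.spark.sql.types.StructType

// Assumed SPI for the "HiveContext and beyond" side: catalog/metastore access.
trait ExternalCatalogProvider {
  def listTables(database: String): Seq[String]
  def tableSchema(database: String, table: String): StructType
  def createTable(database: String, table: String, schema: StructType): Unit
}

// Assumed SPI for the HDFS-ish side: listing and reading file-based storage.
trait StorageProvider {
  def exists(path: String): Boolean
  def listFiles(path: String): Seq[String]
  def open(path: String): InputStream
}

// A vendor (or I, for our internal systems) would then ship a separate module, e.g.
//   class InternalMetastoreCatalog extends ExternalCatalogProvider { ... }
// and Spark would load it via configuration (key name is invented here), e.g.
//   spark.sql.catalog.provider=com.mycompany.InternalMetastoreCatalog
// so Spark itself would have no compile-time dependency on Hive, Hadoop,
// or any vendor-specific code.
```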