Github user rxin commented on the issue: https://github.com/apache/spark/pull/17723

I didn't read through the super long debate here, but I have a strong preference not to expose Hadoop APIs directly. I'm seeing more and more deployments out there that do not use Hadoop at all: connecting directly to cloud storage, to an on-premise object store, to Redis, to a NetApp appliance, to a message queue, or just running Spark on a laptop.

Hadoop's APIs were designed for a different, pre-Spark world. Serialization is painful to deal with (Configuration?), API-breaking changes are painful to deal with, and the size of the dependencies is painful to deal with (especially considering the single-node use cases, where ideally we'd just want a super trimmed-down jar). As you can see (although most of you that have chimed in here don't know much about the new components), the newer components (Spark SQL) do not expose Hadoop APIs.
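To illustrate the serialization pain point: Hadoop's `Configuration` does not implement `java.io.Serializable`, so it cannot travel inside Spark closures as-is; Spark works around this internally with a wrapper (`org.apache.spark.util.SerializableConfiguration`) that hand-rolls Java serialization. Below is a minimal sketch of that wrapper pattern. `FakeConfiguration` is a hypothetical stand-in used here so the example is self-contained without the Hadoop jars; the wrapper logic mirrors the general technique, not Spark's exact code.

```java
import java.io.*;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for org.apache.hadoop.conf.Configuration,
// which is Writable but NOT java.io.Serializable.
class FakeConfiguration {
    private final Map<String, String> props = new HashMap<>();
    void set(String k, String v) { props.put(k, v); }
    String get(String k) { return props.get(k); }
    // Writable-style raw serialization of the key/value pairs.
    void write(DataOutput out) throws IOException {
        out.writeInt(props.size());
        for (Map.Entry<String, String> e : props.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeUTF(e.getValue());
        }
    }
    void readFields(DataInput in) throws IOException {
        int n = in.readInt();
        for (int i = 0; i < n; i++) props.put(in.readUTF(), in.readUTF());
    }
}

// Sketch of the wrapper pattern: mark the non-serializable field
// transient and hand-roll writeObject/readObject so the config can
// be shipped to executors inside a serialized closure.
class SerializableConf implements Serializable {
    transient FakeConfiguration value;
    SerializableConf(FakeConfiguration value) { this.value = value; }
    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
        value.write(out);              // delegate to the Writable-style API
    }
    private void readObject(ObjectInputStream in)
            throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        value = new FakeConfiguration();
        value.readFields(in);          // rebuild the config on the other side
    }
}

public class Demo {
    public static void main(String[] args) throws Exception {
        FakeConfiguration conf = new FakeConfiguration();
        conf.set("fs.defaultFS", "file:///");
        // Round-trip through Java serialization, as a closure would.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);
        oos.writeObject(new SerializableConf(conf));
        oos.flush();
        SerializableConf round = (SerializableConf) new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray())).readObject();
        System.out.println(round.value.get("fs.defaultFS")); // file:///
    }
}
```

Every API that accepts a `Configuration` forces this kind of boilerplate onto callers, which is part of the argument against exposing Hadoop types directly.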