Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/455#issuecomment-45415893

So after looking at this more, I think it's great to add support for the SequenceFile format in Python; it will be super useful. This patch takes things a step further, though, and also introduces an API for users to plug in different types of Hadoop input formats. I'm a bit concerned about exposing that to users (which this patch does via the `Converter` stuff), because I'm sure people will start building on it, and it's not an interface I'd be happy supporting going forward. For one thing, it only applies to Python. For another, there are other outstanding community proposals for how to deal with things like HBase support (see #194).

In the near future (possibly 1.1) we're planning to standardize the way this works via SchemaRDD, and that will be automatically supported in Python. Basically, we'll have ways to read from and write to a SchemaRDD across several storage systems. So I just don't think it's a good idea to introduce a different pluggable mechanism here and encourage users to integrate at this level, and I wonder if we should just merge a version of this patch that only exposes sequenceFile and not the more general converter mechanism.

@mateiz @marmbrus curious to hear your thoughts as well
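For readers outside the review thread: the "pluggable converter" question is about letting users register code that maps Hadoop Writable key/value types to Python objects when reading a SequenceFile. The sketch below is purely illustrative — the class and function names are hypothetical stand-ins, not the patch's actual interface or PySpark's API — but it shows the shape of the extension point the comment is worried about committing to.

```python
# Hypothetical sketch of a pluggable Converter mechanism; all names here
# are illustrative, not the actual interface proposed in the patch.

class Converter:
    """Base interface: turn a raw Hadoop-side value into a Python object."""
    def convert(self, obj):
        raise NotImplementedError

class TextConverter(Converter):
    # e.g. org.apache.hadoop.io.Text -> str
    def convert(self, obj):
        return str(obj)

class IntWritableConverter(Converter):
    # e.g. org.apache.hadoop.io.IntWritable -> int
    def convert(self, obj):
        return int(obj)

def read_sequence_file(records, key_converter, value_converter):
    """Stand-in for a sequenceFile reader: apply user-supplied converters
    to each raw (key, value) pair coming from the JVM side."""
    return [(key_converter.convert(k), value_converter.convert(v))
            for k, v in records]

# Raw pairs as they might arrive from Hadoop (plain strings as stand-ins).
raw = [("alpha", "1"), ("beta", "2")]
pairs = read_sequence_file(raw, TextConverter(), IntWritableConverter())
print(pairs)  # [('alpha', 1), ('beta', 2)]
```

Once users subclass an interface like this, it becomes a public contract that has to be supported across releases — which is the maintenance concern the comment raises.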