Hi,

Over the past few months, I have seen a bunch of pull requests which extend the Spark API, most commonly RDD itself.
Most of them are either relatively niche specializations (which might not be useful in the general case) or idioms which can already be expressed, sometimes with a minor performance penalty, using the existing API. While all of them have non-zero value (hence the effort to contribute, which is gladly welcomed!), they extend the API in non-trivial ways and carry a maintenance cost, and we already have a pending effort to clean up our interfaces prior to 1.0.

I believe there is a need to keep the exposed API succinct, expressive and functional in Spark, while at the same time encouraging extensions and specializations within the Spark codebase so that other users can benefit from the shared contributions. One approach could be to start something akin to piggybank in Pig: a place to contribute user-generated specializations, helper utils, etc., bundled as part of Spark but not part of core itself.

Thoughts, comments?

Regards,
Mridul
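
P.S. To make the point about idioms concrete, here is a rough, hypothetical sketch (the names are mine, not from any actual PR) of how such a specialization could live in a contrib-style module as an enrichment over the existing API, instead of as a new method on RDD itself:

  import org.apache.spark.SparkContext._   // pair RDD implicits (needed on pre-1.3 Spark)
  import org.apache.spark.rdd.RDD
  import scala.reflect.ClassTag

  // Hypothetical contrib-style enrichment: lives outside core, and is
  // pulled in explicitly only by users who want it.
  object RddExtras {
    implicit class FrequencyOps[T: ClassTag](self: RDD[T]) {
      // Distributed value counts -- just an idiom over the existing API:
      // map to (value, 1) pairs, then reduceByKey.
      def valueCounts(): RDD[(T, Long)] =
        self.map(v => (v, 1L)).reduceByKey(_ + _)
    }
  }

  // Usage:
  //   import RddExtras._
  //   sc.parallelize(Seq("a", "b", "a")).valueCounts().collect()
  //   // -> Array((a,2), (b,1))

Users get the convenience method with one import, core stays small, and the shared code is still maintained and distributed with Spark.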