Re: Coding in the Spark ml "ecosystem" why is everything private?!

Sean Owen Mon, 29 Aug 2016 09:52:43 -0700

If something isn't public, then it could change across even
maintenance releases. Although you can indeed still access it in some
cases by writing code in the same package, you're taking some risk
that it will stop working across releases.
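
As a rough illustration of the "same package" trick, here is a minimal,
self-contained sketch. It does not use Spark at all: the packages
com.example.lib and com.example.app and the objects InternalUtils and
MyHelper are hypothetical stand-ins, and the assumption is that the Spark
member you want is restricted with a package qualifier in the same way.

```scala
// Toy stand-in for a library that hides a helper behind a package qualifier.
// Every name in this file is hypothetical and exists only for illustration.
package com.example.lib {
  // Accessible only from code inside com.example.lib and its subpackages.
  private[lib] object InternalUtils {
    def requireColumn(columns: Seq[String], name: String): Unit =
      require(columns.contains(name), s"Column $name does not exist.")
  }
}

// "Your" code. In practice it lives in your own module, but it declares
// itself to be in the library's package, which is what grants access.
package com.example.lib {
  object MyHelper {
    def validate(columns: Seq[String]): Boolean = {
      InternalUtils.requireColumn(columns, "features")
      true
    }
  }
}

package com.example.app {
  object Main {
    def main(args: Array[String]): Unit = {
      // Calling com.example.lib.InternalUtils directly from this package
      // would be rejected by the compiler; going through MyHelper works.
      println(com.example.lib.MyHelper.validate(Seq("label", "features")))
    }
  }
}
```

This compiles only as long as InternalUtils keeps its name and signature,
which is exactly the risk: nothing stops a non-public member from being
renamed or removed in the next release.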


If it's not public, the message is that you should build it yourself,
yes. For example, OpenHashSet is not, and will never be, meant for reuse.
You can use an equivalent from another library.

If there's a clear opportunity to expose something cleanly you can
bring it up for discussion. But it's never just a matter of making
something public. Making it public means committing others' time to
supporting it as-is for years. It would have to be worth it.

On Mon, Aug 29, 2016 at 5:46 PM, Thunder Stumpges
<thunder.stump...@gmail.com> wrote:
> Hi all,
>
> I'm not sure if this belongs here in users or over in dev as I guess it's
> somewhere in between. We have been starting to implement some machine
> learning pipelines, and it seemed from the documentation that Spark had a
> fairly well thought-out platform (see:
> http://spark.apache.org/docs/1.6.1/ml-guide.html )
>
> I liked the design of Transformers, Models, Estimators, Pipelines, etc.
> However as soon as we began attempting to code our first ones, we began
> running into one class or method after another that has been marked
> private... Some examples are:
>
> - SchemaUtils - (for validating schemas passed in and out, and adding output
> columns to DataFrames)
> - Loader / Saveable (traits / helpers for saving and loading models)
> - Several classes under 'collection' namespace like OpenHashSet /
> OpenHashMap
> - All of the underlying linear algebra Breeze details
> - Other classes specific to certain models. We are writing an alternative
> LDA Optimizer / Trainer and everything under LDAUtils is private.
>
> I'd like to ask what the expected approach is here. I see a few options,
> none of which seem appropriate:
>
> 1. Implement everything in the org.apache.spark.* namespaces to match
> package privates
>     - will this even work in our own modules ?
>     - we would be open to contributing some of our code back but not sure
> the project wants it
> 2. Implement our own versions of all of these things.
>    - lots of extra work for us, leads to unseen gotchas in implementations
> and other unforeseen issues
> 3. Copy classes into our namespace for use
>    - duplicates code, leads to code divergence as the main code is kept up to
> date.
>
> Thanks in advance for any recommendations on this frustrating issue.
> Thunder
>

