I understand all that, and I can respect a team's desire not to have to
support many little internal details of a system; but at the same time, I
am talking about valuable aspects of the platform that make it viable to
build components that work within the ecosystem. Let's put aside
OpenHashMap, as I see that it supports your point and there are other
libraries available.

However, if I implement my own versions of the model save/load traits, how
is that even interoperable with the Pipeline class? And for schema
validation, the DataType class doesn't even expose
"equalsIgnoreNullability"... so how would you check whether two types are
the same?
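For the record, the closest we have come is reimplementing the check
ourselves against the public org.apache.spark.sql.types API. A minimal
sketch (the TypeCompat name is ours, and it only recurses through arrays,
maps, and structs):

    import org.apache.spark.sql.types._

    // Nullability-insensitive DataType comparison built only on the
    // public org.apache.spark.sql.types API, since the equivalent
    // built-in check is not exposed.
    object TypeCompat {
      def sameTypeIgnoreNullability(a: DataType, b: DataType): Boolean =
        (a, b) match {
          case (ArrayType(ae, _), ArrayType(be, _)) =>
            sameTypeIgnoreNullability(ae, be)
          case (MapType(ak, av, _), MapType(bk, bv, _)) =>
            sameTypeIgnoreNullability(ak, bk) &&
              sameTypeIgnoreNullability(av, bv)
          case (StructType(af), StructType(bf)) =>
            af.length == bf.length && af.zip(bf).forall { case (f1, f2) =>
              f1.name == f2.name &&
                sameTypeIgnoreNullability(f1.dataType, f2.dataType)
            }
          case _ => a == b
        }
    }

That covers our cases, but it is exactly the kind of small utility I would
rather not have to own and keep in sync with Spark.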

There must be a balance between creating an environment that allows people
to relatively easily create components that work in said environment, and
having to "commit others' time to supporting it"...

It just feels like, in general, the balance sits far toward the "lock
down" end of the scale.
Thanks,
Thunder


On Mon, Aug 29, 2016 at 9:50 AM Sean Owen <so...@cloudera.com> wrote:

> If something isn't public, then it could change across even
> maintenance releases. Although you can indeed still access it in some
> cases by writing code in the same package, you're taking some risk
> that it will stop working across releases.
>
> If it's not public, the message is that you should build it yourself,
> yes. For example, OpenHashSet was never meant to be reused. You
> can use your own from a library.
>
> If there's a clear opportunity to expose something cleanly you can
> bring it up for discussion. But it's never just a matter of making
> something public. Making it public means committing others' time to
> supporting it as-is for years. It would have to be worth it.
>
> On Mon, Aug 29, 2016 at 5:46 PM, Thunder Stumpges
> <thunder.stump...@gmail.com> wrote:
> > Hi all,
> >
> > I'm not sure if this belongs here in users or over in dev, as I guess it's
> > somewhere in between. We have been starting to implement some machine
> > learning pipelines, and it seemed from the documentation that Spark had a
> > fairly well thought-out platform (see:
> > http://spark.apache.org/docs/1.6.1/ml-guide.html )
> >
> > I liked the design of Transformers, Models, Estimators, Pipelines, etc.
> > However, as soon as we began attempting to code our first ones, we ran
> > into one class or method after another that has been marked
> > private... Some examples are:
> >
> > - SchemaUtils (for validating schemas passed in and out, and adding
> > output columns to DataFrames)
> > - Loader / Saveable (traits / helpers for saving and loading models)
> > - Several classes under the 'collection' namespace, like OpenHashSet /
> > OpenHashMap
> > - All of the underlying Breeze linear algebra details
> > - Other classes specific to certain models. We are writing an
> > alternative LDA Optimizer / Trainer, and everything under LDAUtils is
> > private.
> >
> > I'd like to ask what the expected approach is here. I see a few options,
> > none of which seem appropriate:
> >
> > 1. Implement everything in the org.apache.spark.* namespaces to match
> > the package privates
> >     - will this even work in our own modules?
> >     - we would be open to contributing some of our code back, but
> > we're not sure the project wants it
> > 2. Implement our own versions of all of these things.
> >    - lots of extra work for us; leads to unseen gotchas in
> > implementations and other unforeseen issues
> > 3. Copy classes into our namespace for use
> >    - duplicates code; leads to divergence as the main code is kept up
> > to date.
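
(Interjecting inline to make option 1 concrete: a rough sketch of the
package trick. The helper object below is our own invention; it compiles
from a separate module because private[spark] members are visible to any
code declared under the org.apache.spark package tree. Of course, any such
member could change in any release.)

    // Declared inside Spark's package tree so that private[spark]
    // members such as SchemaUtils resolve at compile time.
    package org.apache.spark.ml.util

    import org.apache.spark.sql.types.{DataType, StructType}

    // Our own helper (the name is ours), delegating to Spark's
    // private[spark] SchemaUtils from inside the org.apache.spark
    // namespace.
    object OurSchemaHelpers {
      def requireColumn(schema: StructType, colName: String,
          dt: DataType): Unit =
        SchemaUtils.checkColumnType(schema, colName, dt)
    }

(So yes, to my own sub-question: it does work from our own modules, since
Scala enforces private[spark] by package membership rather than by jar
boundary; but it carries exactly the fragility Sean describes.)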
> >
> > Thanks in advance for any recommendations on this frustrating issue.
> > Thunder
> >
>
