Hello,
In developing new third-party pipeline components for Spark ML 1.4 (see
dl4j-spark-ml), I encountered a few gaps in the earlier effort to make the ML
Developer APIs public (SPARK-5995). I plan to file issues once we have
discussed them on this thread. Below is a list of types that are currently
private but would be better made public.
VectorUDT. To define a relation with a vector field, VectorUDT must be
instantiated.
SchemaUtils. Third-party pipeline components need to check column types and
append columns to a schema, just as the built-in components do.
Identifiable trait. The trait generates a unique identifier for the
associated pipeline component. Reusing it would keep identifiers in a
consistent format across built-in and third-party components.
ProbabilisticClassifier. Third-party classifiers should be able to reuse its
complex logic for computing only the selected output columns.
Shared Params (HasLabelCol, HasFeaturesCol). This is covered in SPARK-7146,
but I am reiterating it here.
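To illustrate the Identifiable item above, here is a minimal sketch of the
kind of code a third-party component ends up duplicating while the trait is
private. The names UidSketch, HasUid, and MyTransformer are hypothetical, and
the identifier format (prefix, underscore, 12 characters from a random UUID)
is an assumption for illustration, not necessarily the format Spark uses.

```scala
import java.util.UUID

// Hypothetical helper standing in for the private Identifiable machinery.
object UidSketch {
  // Assumed format: "<prefix>_<last 12 characters of a random UUID>".
  def randomUid(prefix: String): String =
    prefix + "_" + UUID.randomUUID().toString.takeRight(12)
}

// Hypothetical trait: each component carries one stable identifier.
trait HasUid {
  val uid: String
}

// Example third-party component. The primary constructor accepts an explicit
// uid; the no-arg constructor generates a fresh one.
class MyTransformer(override val uid: String) extends HasUid {
  def this() = this(UidSketch.randomUid("myTransformer"))
}
```

Every third-party library that writes its own version of this is free to pick
a different format, which is the inconsistency that a public trait would avoid.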
Thanks,
Eron Wright