Joseph K. Bradley created SPARK-20099:
-----------------------------------------

             Summary: Add transformSchema to pyspark.ml
                 Key: SPARK-20099
                 URL: https://issues.apache.org/jira/browse/SPARK-20099
             Project: Spark
          Issue Type: Improvement
          Components: ML, PySpark
    Affects Versions: 2.1.0
            Reporter: Joseph K. Bradley


Python's ML API currently lacks the PipelineStage abstraction.  The main
purpose of this abstraction is to provide transformSchema(), which checks a
Pipeline for failures early.
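
As a rough illustration of the kind of early check meant here, the sketch
below uses plain Python only.  It is not the pyspark.ml API: Stage, Tokenizer,
check_pipeline and the dict-based schema are hypothetical stand-ins chosen for
the example.

```python
from typing import Dict, List

# Column name -> type name. A simplified stand-in for a real schema object.
Schema = Dict[str, str]


class Stage:
    """Hypothetical stage interface used only for this sketch."""

    def transform_schema(self, schema: Schema) -> Schema:
        raise NotImplementedError


class Tokenizer(Stage):
    """Hypothetical stage: reads a string column, adds an array column."""

    def __init__(self, input_col: str, output_col: str) -> None:
        self.input_col = input_col
        self.output_col = output_col

    def transform_schema(self, schema: Schema) -> Schema:
        # Fail on a missing or mistyped input column without touching data.
        if schema.get(self.input_col) != "string":
            raise ValueError(
                f"Column {self.input_col!r} must exist and have type string"
            )
        if self.output_col in schema:
            raise ValueError(f"Column {self.output_col!r} already exists")
        result = dict(schema)
        result[self.output_col] = "array<string>"
        return result


def check_pipeline(stages: List[Stage], schema: Schema) -> Schema:
    """Thread the schema through every stage; raise on the first mismatch."""
    for stage in stages:
        schema = stage.transform_schema(schema)
    return schema
```

Calling check_pipeline() on the stage list before fitting would surface a
misnamed column up front rather than partway through execution.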

As mentioned in https://github.com/apache/spark/pull/17218, it would also be
useful in Python for checking Params in Python wrappers for Scala
implementations.  In these wrappers, transformSchema() would pass the Params
from Python to Scala, which could then validate the Param values.  This could
prevent late failures from bad Param settings during Pipeline execution,
while still letting us check Param values on the Scala side only.
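
A minimal sketch of that delegation, again in plain Python.  Everything here
is an assumption made for illustration: FakeScalaStage stands in for the JVM
object that a real wrapper would reach through Py4J, and the maxIter Param,
its bound, and the list-of-column-names schema are invented for the example.

```python
from typing import Any, Dict, List


class FakeScalaStage:
    """Stand-in for the Scala implementation; all validation lives here."""

    def __init__(self) -> None:
        self.params: Dict[str, Any] = {}

    def set(self, name: str, value: Any) -> None:
        self.params[name] = value

    def transformSchema(self, schema: List[str]) -> List[str]:
        # Hypothetical Param check, performed only on the "Scala" side.
        if self.params.get("maxIter", 0) < 0:
            raise ValueError("maxIter must be >= 0")
        return schema + ["prediction"]


class MaxIterWrapper:
    """Hypothetical Python wrapper that holds Params but never checks them."""

    def __init__(self, max_iter: int) -> None:
        self.max_iter = max_iter
        self._backend = FakeScalaStage()

    def transform_schema(self, schema: List[str]) -> List[str]:
        # Push the current Param values across, then let the backend validate.
        self._backend.set("maxIter", self.max_iter)
        return self._backend.transformSchema(schema)
```

With this shape, MaxIterWrapper(max_iter=-1).transform_schema(["features"])
raises immediately, and the Python side carries no validation logic of its
own.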

This issue is for adding transformSchema() to pyspark.ml.  If it's reasonable, 
we could create a PipelineStage abstraction.  But it'd probably be fine to add 
transformSchema() directly to Transformer and Estimator, rather than creating 
PipelineStage.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

