[ 
https://issues.apache.org/jira/browse/SPARK-14087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-14087.
----------------------------------
       Resolution: Resolved
    Fix Version/s: 2.0.0

This is no longer an issue: the PySpark wrapper class {{JavaModel}} calls 
{{_resetUid}} to brute-force update all UIDs in the model to that of the Java 
object.  This differs slightly from the Scala side, which overrides the UID 
value on construction.  I think it would be better to mimic that approach, but 
I'll close this since it's working now.
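As a rough illustration (minimal stand-in classes, not actual PySpark source), the brute-force reset amounts to overwriting the model's UID and re-parenting every param so ownership checks pass:

```python
# Hypothetical minimal sketch of the _resetUid-style fix; the class and
# attribute names here are stand-ins, not real PySpark internals.

class Param(object):
    def __init__(self, parent_uid, name):
        self.parent = parent_uid
        self.name = name

class Model(object):
    def __init__(self, uid):
        self.uid = uid
        # params are created owned by the model's original uid
        self.outputCol = Param(uid, "outputCol")
        self.params = [self.outputCol]

    def _reset_uid(self, new_uid):
        # brute-force update: overwrite the model uid and every
        # param's parent uid so ownership checks pass afterwards
        self.uid = new_uid
        for p in self.params:
            p.parent = new_uid
        return self

model = Model("CountVectorizerModel_4336a81ba742b2593fef")
model._reset_uid("CountVectorizer_4c8e9fd539542d783e66")
assert all(p.parent == model.uid for p in model.params)
```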

> PySpark ML JavaModel does not properly own params after being fit
> -----------------------------------------------------------------
>
>                 Key: SPARK-14087
>                 URL: https://issues.apache.org/jira/browse/SPARK-14087
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, PySpark
>            Reporter: Bryan Cutler
>            Assignee: Bryan Cutler
>            Priority: Minor
>             Fix For: 2.0.0
>
>         Attachments: feature.py
>
>
> When a PySpark model is created after fitting data, its UID is initialized to 
> the parent estimator's value.  Before this assignment, any params defined in 
> the model are copied from the class to the object in 
> {{Params._copy_params()}} and assigned a different parent UID.  This causes 
> PySpark to think the params are not owned by the model, which can lead to a 
> {{ValueError}} raised from {{Params._shouldOwn()}}, such as:
> {noformat}
> ValueError: Param Param(parent='CountVectorizerModel_4336a81ba742b2593fef', 
> name='outputCol', doc='output column name.') does not belong to 
> CountVectorizer_4c8e9fd539542d783e66.
> {noformat}
> I encountered this problem while working on SPARK-13967, where I tried to add 
> the shared params {{HasInputCol}} and {{HasOutputCol}} to 
> {{CountVectorizerModel}}.  See the attached file {{feature.py}} for the WIP.
> Using the modified {{feature.py}}, the sample code below shows the mix-up in 
> UIDs and produces the error above.
> {noformat}
> from pyspark import SparkContext
> from pyspark.sql import SQLContext
> from pyspark.ml.feature import CountVectorizer
> 
> sc = SparkContext(appName="count_vec_test")
> sqlContext = SQLContext(sc)
> df = sqlContext.createDataFrame(
>     [(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])],
>     ["label", "raw"])
> cv = CountVectorizer(inputCol="raw", outputCol="vectors")
> model = cv.fit(df)
> print(model.uid)
> for p in model.params:
>     print(str(p))
> model.transform(df).show(truncate=False)
> {noformat}
> output (the UIDs should match):
> {noformat}
> CountVectorizer_4c8e9fd539542d783e66
> CountVectorizerModel_4336a81ba742b2593fef__binary
> CountVectorizerModel_4336a81ba742b2593fef__inputCol
> CountVectorizerModel_4336a81ba742b2593fef__outputCol
> {noformat}
> In the Scala implementation, the model overrides the UID value, which the 
> Params use when they are constructed, so they all end up with the parent 
> estimator's UID.
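The two behaviors described in the quoted report can be contrasted with hypothetical minimal classes (sketched in Python for illustration; none of these names are actual Spark code):

```python
# Hypothetical minimal classes contrasting the two behaviors: params
# copied before a uid reassignment vs. a uid fixed at construction.

class Param(object):
    def __init__(self, parent_uid, name):
        self.parent = parent_uid
        self.name = name

# PySpark-style: params are created with the model's original uid, so a
# later uid reassignment (to the parent estimator's uid) orphans them.
class CopiedThenRenamedModel(object):
    def __init__(self, uid):
        self.uid = uid
        self.outputCol = Param(uid, "outputCol")

m = CopiedThenRenamedModel("CountVectorizerModel_4336a81ba742b2593fef")
m.uid = "CountVectorizer_4c8e9fd539542d783e66"  # reassigned after fit
assert m.outputCol.parent != m.uid  # mismatch: an ownership check would fail

# Scala-style: the uid is fixed at construction, so params pick up the
# parent estimator's uid from the start and ownership always matches.
class ConstructedWithUidModel(object):
    def __init__(self, uid):
        self.uid = uid
        self.outputCol = Param(self.uid, "outputCol")

m2 = ConstructedWithUidModel("CountVectorizer_4c8e9fd539542d783e66")
assert m2.outputCol.parent == m2.uid
```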



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
