[ https://issues.apache.org/jira/browse/SPARK-25124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joseph K. Bradley resolved SPARK-25124.
---------------------------------------
       Resolution: Fixed
    Fix Version/s: 2.3.2

Issue resolved by pull request 22228
[https://github.com/apache/spark/pull/22228]

> VectorSizeHint.size is buggy, breaking streaming pipeline
> ---------------------------------------------------------
>
>                 Key: SPARK-25124
>                 URL: https://issues.apache.org/jira/browse/SPARK-25124
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.3.1
>            Reporter: Timothy Hunter
>            Assignee: Huaxin Gao
>            Priority: Major
>              Labels: beginner, starter
>             Fix For: 2.4.0, 2.3.2
>
>
> Currently, when using {{VectorSizeHint().setSize(3)}} in an ML pipeline,
> transforming a stream raises a nondescript exception saying that the stream
> has not started. The underlying bug is that {{setSize}} and {{getSize}}
> do not {{return}} their values, so both evaluate to {{None}}:
> https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py#L3846
> How to reproduce, using the example in the doc:
> {code}
> from pyspark.ml.linalg import Vectors
> from pyspark.ml import Pipeline, PipelineModel
> from pyspark.ml.feature import VectorAssembler, VectorSizeHint
> data = [(Vectors.dense([1., 2., 3.]), 4.)]
> df = spark.createDataFrame(data, ["vector", "float"])
> sizeHint = VectorSizeHint(inputCol="vector", handleInvalid="skip").setSize(3)
> # Will fail
> vecAssembler = VectorAssembler(inputCols=["vector", "float"],
>                                outputCol="assembled")
> pipeline = Pipeline(stages=[sizeHint, vecAssembler])
> pipelineModel = pipeline.fit(df)
> pipelineModel.transform(df).head().assembled
> {code}
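
For readers unfamiliar with this failure mode, here is a minimal sketch of the pattern the issue describes: a setter and getter that do their work but omit {{return}}. It is plain Python and does not reproduce the pyspark source. The class SizeHintSketch and all of its method names are hypothetical, chosen only for illustration.

```python
class SizeHintSketch:
    """Hypothetical stand-in for a Param-style stage; NOT the pyspark class."""

    def __init__(self):
        self._size = None

    def set_size_buggy(self, value):
        # Bug pattern: the value is stored, but nothing is returned,
        # so the call evaluates to None and cannot be chained.
        self._size = value

    def get_size_buggy(self):
        # Bug pattern: the value is looked up but never returned.
        self._size

    def set_size(self, value):
        # Returning self keeps fluent chaining working.
        self._size = value
        return self

    def get_size(self):
        # Returning the stored value makes the getter usable.
        return self._size


# Chaining on the buggy setter binds None instead of the object.
broken = SizeHintSketch().set_size_buggy(3)

# Chaining on the corrected setter binds the configured object.
fixed = SizeHintSketch().set_size(3)
```

In this sketch, broken ends up bound to None rather than to a configured object, which matches how the reproduction above ends up with an unusable sizeHint stage after chaining {{setSize(3)}}.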