[ https://issues.apache.org/jira/browse/SPARK-41950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656097#comment-17656097 ]
Apache Spark commented on SPARK-41950:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/39467

> mlflow doctest fails for pandas API on Spark
> --------------------------------------------
>
>                 Key: SPARK-41950
>                 URL: https://issues.apache.org/jira/browse/SPARK-41950
>             Project: Spark
>          Issue Type: Test
>          Components: Pandas API on Spark
>    Affects Versions: 3.4.0
>            Reporter: Hyukjin Kwon
>            Priority: Major
>
> {code}
> File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 172, in pyspark.pandas.mlflow.load_model
> Failed example:
>     prediction_df
> Exception raised:
>     Traceback (most recent call last):
>       File "/usr/lib/python3.9/doctest.py", line 1336, in __run
>         exec(compile(example.source, filename, "single",
>       File "<doctest pyspark.pandas.mlflow.load_model[18]>", line 1, in <module>
>         prediction_df
>       File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13322, in __repr__
>         pdf = cast("DataFrame", self._get_or_create_repr_pandas_cache(max_display_count))
>       File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13313, in _get_or_create_repr_pandas_cache
>         self, "_repr_pandas_cache", {n: self.head(n + 1)._to_internal_pandas()}
>       File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13308, in _to_internal_pandas
>         return self._internal.to_pandas_frame
>       File "/__w/spark/spark/python/pyspark/pandas/utils.py", line 588, in wrapped_lazy_property
>         setattr(self, attr_name, fn(self))
>       File "/__w/spark/spark/python/pyspark/pandas/internal.py", line 1056, in to_pandas_frame
>         pdf = sdf.toPandas()
>       File "/__w/spark/spark/python/pyspark/sql/pandas/conversion.py", line 208, in toPandas
>         pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
>       File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 1197, in collect
>         sock_info = self._jdf.collectToPython()
>       File "/__w/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
>         return_value = get_return_value(
>       File "/__w/spark/spark/python/pyspark/sql/utils.py", line 209, in deco
>         raise converted from None
>     pyspark.sql.utils.PythonException:
>       An exception was thrown from the Python worker. Please see the stack trace below.
>     Traceback (most recent call last):
>       File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 829, in main
>         process()
>       File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 821, in process
>         serializer.dump_stream(out_iter, outfile)
>       File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 345, in dump_stream
>         return ArrowStreamSerializer.dump_stream(self, init_stream_yield_batches(), stream)
>       File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 86, in dump_stream
>         for batch in iterator:
>       File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 338, in init_stream_yield_batches
>         for series in iterator:
>       File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 519, in func
>         for result_batch, result_type in result_iter:
>       File "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", line 1253, in udf
>         yield _predict_row_batch(batch_predict_fn, row_batch_args)
>       File "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", line 1057, in _predict_row_batch
>         result = predict_fn(pdf)
>       File "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", line 1237, in batch_predict_fn
>         return loaded_model.predict(pdf)
>       File "/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", line 413, in predict
>         return self._predict_fn(data)
>       File "/usr/local/lib/python3.9/dist-packages/sklearn/linear_model/_base.py", line 355, in predict
>         return self._decision_function(X)
>       File "/usr/local/lib/python3.9/dist-packages/sklearn/linear_model/_base.py", line 338, in _decision_function
>         X = self._validate_data(X, accept_sparse=["csr", "csc", "coo"], reset=False)
>       File "/usr/local/lib/python3.9/dist-packages/sklearn/base.py", line 518, in _validate_data
>         self._check_feature_names(X, reset=reset)
>       File "/usr/local/lib/python3.9/dist-packages/sklearn/base.py", line 451, in _check_feature_names
>         raise ValueError(message)
>     ValueError: The feature names should match those that were passed during fit.
>     Feature names unseen at fit time:
>     - 0
>     - 1
>     Feature names seen at fit time, yet now missing:
>     - x1
>     - x2
> {code}
> https://github.com/apache/spark/actions/runs/3871715040/jobs/6600578830
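For reference, the sklearn side of the failure reproduces without Spark or mlflow. Below is a minimal sketch (assuming scikit-learn >= 1.0, which validates feature names at predict time): fit on columns named x1/x2, then predict on a DataFrame whose columns are the strings "0"/"1", mirroring what the pyfunc UDF handed the model in the doctest above.

{code}
import pandas as pd
from sklearn.linear_model import LinearRegression

# Fit on a DataFrame with named columns; scikit-learn records
# ["x1", "x2"] in feature_names_in_.
train = pd.DataFrame({"x1": [1.0, 2.0, 3.0], "x2": [2.0, 4.0, 6.0]})
y = pd.Series([0.0, 1.0, 2.0])
model = LinearRegression().fit(train, y)

# Predict on a batch whose columns are "0" and "1" instead of the
# names seen at fit time. scikit-learn raises:
#   ValueError: The feature names should match those that were
#   passed during fit.
batch = pd.DataFrame([[1.0, 2.0]], columns=["0", "1"])
model.predict(batch)
{code}

Renaming the batch columns to match (batch.rename(columns={"0": "x1", "1": "x2"})) or passing a plain ndarray (batch.to_numpy(), which skips the name check with only a warning) makes the same call succeed. This is only an illustrative repro; the actual doctest fix is in the pull request linked above.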