[ https://issues.apache.org/jira/browse/SPARK-39994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577319#comment-17577319 ]
Muhammad Kaleem Ullah commented on SPARK-39994:
-----------------------------------------------

Yes, I agree.

> How to write (save) PySpark dataframe containing vector column?
> ---------------------------------------------------------------
>
>                 Key: SPARK-39994
>                 URL: https://issues.apache.org/jira/browse/SPARK-39994
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.3.0
>            Reporter: Muhammad Kaleem Ullah
>            Priority: Major
>         Attachments: df.PNG, error.PNG
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> I'm trying to save the PySpark dataframe after transforming it using an ML
> Pipeline, but a weird error is triggered every time I write it. Here are
> the columns of this dataframe:
> |-- label: integer (nullable = true)
> |-- dest_index: double (nullable = false)
> |-- dest_fact: vector (nullable = true)
> |-- carrier_index: double (nullable = false)
> |-- carrier_fact: vector (nullable = true)
> |-- features: vector (nullable = true)
> The following error occurs when trying to save this dataframe, which
> contains vector data:
> {code:python}
> training.write.parquet("training_files.parquet", mode = "overwrite")
> {code}
> {noformat}
> Py4JJavaError: An error occurred while calling o440.parquet. :
> org.apache.spark.SparkException: Job aborted.
> at org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:638)
> at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:278)
> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186)
> at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
> at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
> at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125)
> at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
> at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
> at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
> at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
> at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
> at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
> at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
> ...
> {noformat}
>
> I tried several of the {{winutils}} builds for Hadoop from [this GitHub
> repository|https://github.com/cdarlint/winutils] but without much luck.
> Please help me in this regard. How can I save this dataframe so that I can
> read it in any other Jupyter notebook file? Feel free to ask any questions.
> Thanks

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org