[ https://issues.apache.org/jira/browse/SPARK-31450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17233054#comment-17233054 ]
Navin Viswanath commented on SPARK-31450:
-----------------------------------------

[~hvanhovell] [~dongjoon] While migrating some code from Spark 2.4 to Spark 3, I noticed that this ticket's change required a change in our code. We use the following process to go from a Thrift type T to an InternalRow (reading Thrift files on HDFS into a DataFrame):
# We construct a Spark schema by inspecting the Thrift metadata.
# We convert a Thrift object to a GenericRow, using the Thrift metadata to read columns.
# We then construct an ExpressionEncoder[Row] and use it to create an InternalRow as follows:
{code:java}
val schema: StructType = ... // infer thrift schema
val encoder: ExpressionEncoder[Row] = RowEncoder(schema)
val genericRow: GenericRow = toGenericRow(thriftObject, schema)
val internalRow: InternalRow = encoder.toRow(genericRow)
{code}
The above steps are used to implement
{code:java}
protected def buildReader(
    sparkSession: SparkSession,
    dataSchema: StructType,
    partitionSchema: StructType,
    requiredSchema: StructType,
    filters: Seq[Filter],
    options: Map[String, String],
    hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]
{code}
in the trait org.apache.spark.sql.execution.datasources.FileFormat, where we need an Iterator[InternalRow].

With the change in this ticket, I would have to replace
{code:java}
val internalRow: InternalRow = encoder.toRow(genericRow)
{code}
with
{code:java}
val serializer = encoder.createSerializer()
val internalRow: InternalRow = serializer(genericRow)
{code}
Since this is marked as an internal API in the PR, I was wondering whether there is a way to implement this so that it is compatible with both Spark 2.4 and Spark 3. My goal is to avoid a code change if possible. Since I know the schema of the Thrift type, it seems it should be possible to construct an InternalRow directly, but I don't see a way to do this in the code base.
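One possible way to keep a single source tree working against both versions is to pick the conversion method by reflection at runtime. The sketch below is only an illustration of that idea, not something from the ticket or the PR: `EncoderCompat` and `serializerFor` are hypothetical names, and the code assumes only what the comment above shows, namely that the Spark 3 encoder exposes a no-argument `createSerializer()` whose result can be applied like a function, and that the Spark 2.4 encoder exposes a one-argument `toRow`. It deliberately avoids Spark imports so that it compiles against either version.

```scala
// Hypothetical compatibility shim; not part of Spark.
object EncoderCompat {

  /** Returns a function that converts a value of type T to R using whichever
    * conversion method the given encoder object exposes at runtime.
    */
  def serializerFor[T, R](encoder: AnyRef): T => R = {
    val methods = encoder.getClass.getMethods

    methods.find(m => m.getName == "createSerializer" && m.getParameterCount == 0) match {
      case Some(create) =>
        // Spark 3 style: createSerializer() returns an object that is applied
        // like a function, so it is cast to a function and returned as-is.
        create.invoke(encoder).asInstanceOf[T => R]

      case None =>
        // Spark 2.4 style: call toRow(value) on the encoder itself.
        val toRow = methods
          .find(m => m.getName == "toRow" && m.getParameterCount == 1)
          .getOrElse(throw new NoSuchMethodException(
            s"${encoder.getClass.getName} has neither createSerializer() nor toRow(_)"))
        (value: T) => toRow.invoke(encoder, value.asInstanceOf[AnyRef]).asInstanceOf[R]
    }
  }
}
```

With the Spark types from the comment, the call would look like `EncoderCompat.serializerFor[Row, InternalRow](encoder)`. Because the ticket moves the encoder's stateful parts into the serializer, the returned function should probably be created once per task (for example inside the function returned by `buildReader`) rather than shared across threads. Exceptions thrown by the reflective `toRow` path arrive wrapped in `InvocationTargetException`.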
> Make ExpressionEncoder thread safe
> ----------------------------------
>
>                 Key: SPARK-31450
>                 URL: https://issues.apache.org/jira/browse/SPARK-31450
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Herman van Hövell
>            Assignee: Herman van Hövell
>            Priority: Major
>             Fix For: 3.0.0
>
>
> ExpressionEncoder is currently not thread-safe because it contains stateful
> objects that are required for converting objects to internal rows and vice
> versa. We have been working around this by (excessively) cloning
> ExpressionEncoders, which is not free. I propose that we move the stateful
> bits of the expression encoder into two helper classes that will take care of
> the conversions.

--
This message was sent by Atlassian Jira (v8.3.4#803005)