Custom file datasource implementations

Navin Viswanath Sun, 22 Nov 2020 10:36:30 -0800

Hi,
I’m looking into upgrading our Spark version from 2.4 to 3.0 and noticed that 
there was an API change that seems unavoidable. I wanted to check whether our 
implementation is the ideal way to do this and if this change is necessary.
We have files on HDFS in thrift format and implement custom datasources in V1 
by implementing a PartitionFile => InternalRow conversion(based on the 
FileFormat 
<https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala>
 trait). Here is how we do that currently:


Given a thrift type T,
val schema: StructType = ... // infer thrift schema for an object of type T
val encoder: ExpressionEncoder[Row] = RowEncoder(schema)
val genericRow: GenericRow = toGenericRow(thriftObject, schema) // converts a 
thrift object to a GenericRow

Once we have a GenericRow and an ExpressionEncoder, we used to do the following 
to produce an InternalRow:

val internalRow: InternalRow = encoder.toRow(genericRow)

With the change introduced in 
https://github.com/apache/spark/commit/fab4ca5156d5e1cc0e976c7c27b28a12fa61eb6d#diff-bcb03e1769f63da7a6ca53a3bb9c9d9f648d860520fa8de95b680c67d8c89fd9
 
<https://github.com/apache/spark/commit/fab4ca5156d5e1cc0e976c7c27b28a12fa61eb6d#diff-bcb03e1769f63da7a6ca53a3bb9c9d9f648d860520fa8de95b680c67d8c89fd9>

we now need:

val serializer = encoder.createSerializer()
val internalRow: InternalRow = serializer(genericRow)

I have a couple of questions about this:
1. Do we have to make this change in our sources going from 2.4 to 3.0? Since 
this change is marked as internal in the PR, I was wondering if this is not the 
ideal way to implement this.
2. Does moving to Datasource V2 avoid this change? From looking at the V2 API, 
it looks like I would still need to generate an InternalRow and an 
ExpressionEncoder appears to be the only way to do this, which implies I would 
still need this change.

My goal is to try and keep our source code the same going from 2.4 to 3.0 if 
possible.

Any help is appreciated!
Thanks,
Navin

Custom file datasource implementations

Reply via email to