Hello,

Context:

We are working on integrating HybridSource with different sources and
sinks. I have been working on a Parquet source that produces a
FileSource[T], so that it can be used within a HybridSource of type T.

The environment is Scala 2.12 and we are using the DataStream API. The
generic type “T” used in this email would be a Scala case class.

Problem:

Based on the documentation
<https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/connectors/datastream/formats/parquet/>,
it is recommended that you use ParquetColumnarRowInputFormat
<https://github.com/apache/flink/blob/a060302379a7a53fe10f699625e36245e6a90d50/flink-formats/flink-parquet/src/main/java/org/apache/flink/formats/parquet/ParquetColumnarRowInputFormat.java>
as the entry point to set up the source. Given that
ParquetColumnarRowInputFormat hard-codes RowData
<https://github.com/apache/flink/blob/a060302379a7a53fe10f699625e36245e6a90d50/flink-formats/flink-parquet/src/main/java/org/apache/flink/formats/parquet/ParquetColumnarRowInputFormat.java#L48>,
any other sources used in the same HybridSource would also need to
produce RowData - in my experience - and there is no way to convert a
FileSource[RowData] into a FileSource[T].
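
To illustrate, this is roughly what the setup looks like today. This is an
untested sketch: the ParquetColumnarRowInputFormat constructor arguments are
from my reading of the 1.14 API, and the schema and path are made up.

```scala
import org.apache.flink.connector.base.source.hybrid.HybridSource
import org.apache.flink.connector.file.src.{FileSource, FileSourceSplit}
import org.apache.flink.core.fs.Path
import org.apache.flink.formats.parquet.ParquetColumnarRowInputFormat
import org.apache.flink.table.data.RowData
import org.apache.flink.table.types.logical.{DoubleType, RowType, VarCharType}
import org.apache.hadoop.conf.Configuration

// The format is pinned to RowData, so everything downstream is too.
val rowType = RowType.of(new VarCharType(), new DoubleType())
val parquetFormat = new ParquetColumnarRowInputFormat[FileSourceSplit](
  new Configuration(), // Hadoop configuration
  rowType,             // projected schema
  500,                 // batch size
  false,               // isUtcTimestamp
  true                 // isCaseSensitive
)

val parquetSource: FileSource[RowData] =
  FileSource.forBulkFileFormat(parquetFormat, new Path("/data/parquet")).build()

// The builder infers HybridSource[RowData]; any source added with
// .addSource(...) must therefore also produce RowData, not our case class T.
val hybrid: HybridSource[RowData] = HybridSource.builder(parquetSource).build()
```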

An alternative I looked into was extending ParquetVectorizedInputFormat
<https://github.com/apache/flink/blob/a060302379a7a53fe10f699625e36245e6a90d50/flink-formats/flink-parquet/src/main/java/org/apache/flink/formats/parquet/ParquetVectorizedInputFormat.java>,
but I found the type restrictions difficult to reconcile.

Potential Solution:

Create an AbstractParquetBulkFormat, similar to the
AbstractAvroBulkFormat
<https://github.com/apache/flink/blob/a060302379a7a53fe10f699625e36245e6a90d50/flink-formats/flink-avro/src/main/java/org/apache/flink/formats/avro/AbstractAvroBulkFormat.java>
added in 1.15. We would be happy to contribute, but we want to understand
whether this is a direction Flink is willing to go before putting in the work!
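
To sketch what we are imagining: everything below is hypothetical -
AbstractParquetBulkFormat does not exist today, and the class name,
constructor, and convert method are made up, modeled on the converter
approach AbstractAvroBulkFormat uses for GenericRecord => T.

```scala
import org.apache.flink.table.data.RowData
import org.apache.flink.table.types.logical.RowType
import org.apache.hadoop.conf.Configuration

// Example case class standing in for T.
case class Sensor(id: String, value: Double)

// Hypothetical subclass of the proposed AbstractParquetBulkFormat: the base
// class would own the vectorized Parquet reading, and the subclass would
// only convert each decoded row into the user type.
class SensorParquetFormat(rowType: RowType, hadoopConf: Configuration)
    extends AbstractParquetBulkFormat[Sensor](rowType, hadoopConf) {

  override def convert(row: RowData): Sensor =
    Sensor(row.getString(0).toString, row.getDouble(1))
}

// A FileSource[Sensor] would then fall out directly, e.g.:
// FileSource.forBulkFileFormat(new SensorParquetFormat(rowType, conf), path).build()
```

That would let the Parquet side of a HybridSource line up with, say, a
KafkaSource producing the same case class, without forcing everything to
RowData.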

Questions:


   1. Based on the current implementation of Parquet within Flink, is it
      correct to say that the only entry point for Parquet is
      ParquetColumnarRowInputFormat
      <https://github.com/apache/flink/blob/a060302379a7a53fe10f699625e36245e6a90d50/flink-formats/flink-parquet/src/main/java/org/apache/flink/formats/parquet/ParquetColumnarRowInputFormat.java>?

      1a. Is there any way to convert a FileSource[RowData] into a
      FileSource[T]?

   2. Would the potential solution listed above be an implementation that
      Flink would be interested in integrating?

      2a. If not, do you have an example of Parquet being used in a
      HybridSource along with a Kafka source?


Thanks!
Ryan van Huuksloot
Data Developer | Data Platform Engineering | Streaming Capabilities
