Hello,

I'm trying to write an RDD[T], where T is a protobuf message, to Parquet
in Scala.
I'm wondering what the best option is for doing this, and I'd appreciate
your insights.
So far, I see two possibilities:
- use the PairRDD method *saveAsNewAPIHadoopFile*; I guess I need to
call *ParquetOutputFormat.setWriteSupportClass* and
*ProtoParquetOutputFormat.setProtobufClass* beforehand (first sketch
below). But in that case, I'm not sure I have much control over how the
files are partitioned into different folders on the file system.
- or convert the RDD to a DataFrame and then use *write.parquet*; in that
case, I have more control, and can for instance rely on *partitionBy* to
arrange the files in different folders (second sketch below). But I'm not
sure there is a built-in way to convert an RDD of protobuf messages to a
DataFrame in Spark? I would need to rely on this:
https://github.com/saurfang/sparksql-protobuf.
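To make the first option concrete, here is roughly what I have in mind,
assuming parquet-mr's protobuf support (MyMessage is a placeholder for my
protoc-generated class, and the exact package names may differ depending
on the parquet version), e.g. from the spark-shell:

    import org.apache.hadoop.mapreduce.Job
    import org.apache.parquet.hadoop.ParquetOutputFormat
    import org.apache.parquet.proto.{ProtoParquetOutputFormat, ProtoWriteSupport}

    // assuming sc: SparkContext and rdd: RDD[MyMessage] already exist
    val job = Job.getInstance(sc.hadoopConfiguration)
    // serialise the value side of each pair with the protobuf write support
    ParquetOutputFormat.setWriteSupportClass(job, classOf[ProtoWriteSupport[MyMessage]])
    ProtoParquetOutputFormat.setProtobufClass(job, classOf[MyMessage])

    // ParquetOutputFormat only writes values, so the key is just Void
    rdd.map(msg => (null.asInstanceOf[Void], msg))
      .saveAsNewAPIHadoopFile(
        "/output/events",
        classOf[Void],
        classOf[MyMessage],
        classOf[ParquetOutputFormat[MyMessage]],
        job.getConfiguration)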
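And for the second option, a rough sketch of the DataFrame route if I map
the protobuf fields to a case class by hand (EventRow and its fields are
just placeholders mirroring my message; the sparksql-protobuf library
above would presumably spare me this manual mapping):

    import org.apache.spark.sql.SQLContext

    // placeholder case class mirroring the protobuf fields
    case class EventRow(userId: String, country: String, payload: String)

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // hand-written mapping from protobuf getters to the case class
    val df = rdd.map(m => EventRow(m.getUserId, m.getCountry, m.getPayload)).toDF()

    // partitionBy lays the files out under .../country=XX/ subfolders
    df.write.partitionBy("country").parquet("/output/events")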

What do you think?
Kind regards,
David
