rangadi commented on code in PR #41192: URL: https://github.com/apache/spark/pull/41192#discussion_r1197260139
########## connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/functions.scala: ########## @@ -148,8 +212,38 @@ object functions { messageName: String, descFilePath: String, options: java.util.Map[String, String]): Column = { + val fileContent = ProtobufUtils.readDescriptorFileContent(descFilePath) Review Comment: Didn't want to carry the file path further down in those classes. This avoids even accidental read of those files. Let me know if you see any advantages. It might simplify changes to test suite files in this PR, but this is just one time cost. ########## connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/CatalystDataToProtobuf.scala: ########## @@ -26,14 +26,14 @@ import org.apache.spark.sql.types.{BinaryType, DataType} private[protobuf] case class CatalystDataToProtobuf( child: Expression, messageName: String, - descFilePath: Option[String] = None, + fileDescriptorSetBytesOpt: Option[Array[Byte]] = None, Review Comment: @advancedxy suggested using broadcast for distributing this buffer ([in the comment here](https://github.com/apache/spark/pull/41192#issuecomment-1551260315)): Moving that discussion to code comment since it has better threading. @advancedxy most of the time size would be in kilobytes. It is possible some customers have huge protobuf repo and they bundled everything. Using broadcast is a good idea. I need to read up more on it. Can we do that as an implementation detail (so that users don't need be aware of it). We also need to make sure it is cleaned up. I will add a comment. Do you think we can do that an enhancement later? If we do it as implementation detail, it should work fine with Spark-Connect too. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org