advancedxy commented on PR #41192: URL: https://github.com/apache/spark/pull/41192#issuecomment-1551260315
I believe one main problem with carrying the byte buffer is that it is serialized and deserialized whenever tasks are scheduled. When the `FileDescriptorSet` is large, or when many protobuf functions are used, the task size grows and adds scheduling overhead. Carrying just the file path would be much more lightweight. To address that problem we would normally broadcast the byte buffer, but that may not work well with Spark Connect? Do you think it's necessary to give users the option to pass a descriptor file path, a.k.a. the current behavior?
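
To make the size concern concrete, here is a rough sketch that compares the serialized size of per-function state when it embeds the descriptor bytes versus when it holds only a path. It uses Python's `pickle` as a stand-in for task serialization, which is an assumption: Spark serializes tasks differently, and the helper names, the `Event` message name, the path, and the 1 MiB descriptor size are all hypothetical. It only illustrates the scaling argument, not real Spark numbers.

```python
import pickle


def task_payload_with_bytes(descriptor_bytes: bytes, num_functions: int) -> bytes:
    """Serialize state where every protobuf function embeds the descriptor bytes.

    Each function gets its own copy of the buffer (assumption: the copies are
    not shared), so the payload grows with size * num_functions.
    """
    state = [
        # bytes(bytearray(...)) forces a distinct object per function so that
        # pickle's memoization does not deduplicate the buffers.
        {"message": "Event", "descriptor": bytes(bytearray(descriptor_bytes))}
        for _ in range(num_functions)
    ]
    return pickle.dumps(state)


def task_payload_with_path(descriptor_path: str, num_functions: int) -> bytes:
    """Serialize state where every protobuf function holds only a file path."""
    state = [
        {"message": "Event", "descriptor_path": descriptor_path}
        for _ in range(num_functions)
    ]
    return pickle.dumps(state)


# Hypothetical inputs: a 1 MiB descriptor set used by 3 protobuf functions.
descriptor = b"\x00" * (1024 * 1024)
with_bytes = task_payload_with_bytes(descriptor, num_functions=3)
with_path = task_payload_with_path("/path/to/descriptor.desc", num_functions=3)

print(f"payload carrying bytes: {len(with_bytes)} bytes")
print(f"payload carrying path:  {len(with_path)} bytes")
```

In this sketch the path-based payload stays small no matter how large the descriptor is, while the bytes-based payload grows with both the descriptor size and the number of functions. Broadcasting the buffer would avoid shipping it with every task, at the cost of depending on broadcast support in the client.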