andreAmorimF commented on PR #55689: URL: https://github.com/apache/spark/pull/55689#issuecomment-4501395404
> @andreAmorimF I don't think this should be added as a Spark Connect API for a couple of reasons:
>
> * Spark Connect is supposed to be engine agnostic. Leaking execution details into the API is not really desirable.
> * AFAICT the only reason why this would be needed is because you want to modify parallelism at some stage in the plan. At the end of the day this should be an engine problem, and we should try to fix it there.

Hi @hvanhovell, thanks for the reply. I think the API already exposes execution details (e.g. `repartition`), and I wonder whether that API will also be removed eventually. So far, one great thing about using Spark for me has been the ability to tune this close to the client, and I don't see any current work streams to do the same on the engine side. Would you be willing to accept contributions in this direction? I think giving the engine some heuristics for estimating a query's partitions via a configuration parameter (e.g. MAX_SIZE_OF_INPUTS or MIN_SIZE_OF_INPUTS) could be a start toward what I want.
