nickstanishadb commented on code in PR #44678: URL: https://github.com/apache/spark/pull/44678#discussion_r1453648020
########## python/pyspark/sql/udtf.py: ##########

@@ -133,12 +133,28 @@ class AnalyzeResult:
         If non-empty, this is a sequence of expressions that the UDTF is specifying for
         Catalyst to sort the input TABLE argument by. Note that the 'partitionBy' list
         must also be non-empty in this case.
+    acquireExecutionMemoryMbRequested: long
+        If this is not None, this represents the amount of memory in megabytes that the
+        UDTF should request from each Spark executor that it runs on. The UDTF then takes
+        responsibility to use at most this much memory, including all allocated objects.
+        The purpose of this functionality is to prevent executors from crashing by
+        running out of memory due to the extra memory consumed by the UDTF's 'eval',
+        'terminate', and 'cleanup' methods. Spark will then call
+        'TaskMemoryManager.acquireExecutionMemory' with the requested number of
+        megabytes.
+    acquireExecutionMemoryMbActual: long
+        If there is a task context available, Spark will assign this field to the number
+        of megabytes returned from the call to the
+        'TaskMemoryManager.acquireExecutionMemory' method, as consumed by the UDTF's
+        '__init__' method. Therefore, its 'eval', 'terminate', and 'cleanup' methods
+        will know it thereafter and can ensure to bound memory usage to at most this
+        number. Note that assigning a value to this field in the UDTF's 'analyze' method
+        has no effect; it will be overwritten.
     """

Review Comment:
   Makes sense! What do you think about this being difficult to set, especially for UDTF developers? If you think the test I did with `pyUdtfMemProfile` is a reasonable estimate, what do we think of setting a global `minMemoryMb` to something like 100 MB? I think that could make the manual memory assignment less prone to user error. It's probably also a good idea to have a floor at some level, so that the number of UDTFs simultaneously running on an executor has a hard ceiling.

-- 
This is an automated message from the Apache Git Service.
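For illustration, here is a rough sketch of how a UDTF might use the two proposed fields to bound its own buffering. This is a hypothetical, self-contained example: the `AnalyzeResult` below is a minimal stand-in for the real `pyspark.sql.udtf.AnalyzeResult` (showing only the two fields from this PR), `BufferingUDTF` is an invented class, and the API itself is still under review, so names and semantics may change.

```python
from dataclasses import dataclass
from typing import Optional

# Minimal stand-in for pyspark.sql.udtf.AnalyzeResult, showing only the two
# memory fields proposed in this PR (the real class also carries the schema,
# partitionBy, orderBy, etc.).
@dataclass
class AnalyzeResult:
    acquireExecutionMemoryMbRequested: Optional[int] = None
    acquireExecutionMemoryMbActual: Optional[int] = None

class BufferingUDTF:
    """Hypothetical UDTF that buffers rows, capping the buffer at the memory
    Spark actually granted (acquireExecutionMemoryMbActual)."""

    @staticmethod
    def analyze() -> AnalyzeResult:
        # Ask Spark to reserve 100 MB of execution memory per executor.
        return AnalyzeResult(acquireExecutionMemoryMbRequested=100)

    def __init__(self, analyze_result: AnalyzeResult):
        # Per the docstring, Spark would overwrite this field with the granted
        # amount before __init__ runs; fall back to a small floor if None.
        granted_mb = analyze_result.acquireExecutionMemoryMbActual or 16
        self._budget_bytes = granted_mb * 1024 * 1024
        self._used_bytes = 0
        self._buffer: list = []

    def eval(self, row: bytes) -> None:
        # Flush (in a real UDTF: spill or emit) before the buffered rows
        # would exceed the granted budget.
        if self._used_bytes + len(row) > self._budget_bytes:
            self._flush()
        self._buffer.append(row)
        self._used_bytes += len(row)

    def _flush(self) -> None:
        self._buffer.clear()
        self._used_bytes = 0

# Simulate Spark granting only 1 MB of the 100 MB requested.
result = BufferingUDTF.analyze()
result.acquireExecutionMemoryMbActual = 1
udtf = BufferingUDTF(result)
for _ in range(3000):
    udtf.eval(b"x" * 1024)  # 1 KB rows, ~3 MB total
assert udtf._used_bytes <= 1024 * 1024  # never exceeds the 1 MB grant
```

A global floor like the `minMemoryMb` suggested above would slot in naturally as a lower bound applied to `acquireExecutionMemoryMbRequested` before the grant is made.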
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org