dtenedor commented on code in PR #43204:
URL: https://github.com/apache/spark/pull/43204#discussion_r1348084412
##########
python/pyspark/sql/udtf.py:
##########
@@ -107,12 +107,20 @@ class AnalyzeResult:
         If non-empty, this is a sequence of columns that the UDTF is specifying for Catalyst to
         sort the input TABLE argument by. Note that the 'partition_by' list must also be non-empty
         in this case.
+    prepare_buffer: str
+        If non-empty, this string represents state computed once within the 'analyze' method to be
+        propagated to each instance of the UDTF class at the time of its creation, using its
+        'prepare' method. The format of this buffer is opaque and known only to the data source.
+        Common use cases include serializing protocol buffers or JSON configurations into this
+        buffer so that potentially expensive initialization work done at 'analyze' time does not
+        need to be recomputed later.
     """

     schema: StructType
     with_single_partition: bool = False
     partition_by: Sequence[PartitioningColumn] = field(default_factory=tuple)
     order_by: Sequence[OrderingColumn] = field(default_factory=tuple)
+    prepare_buffer: str = ""

Review Comment:
   We decided to remove this and just pickle the `AnalyzeResult` instance itself. So this `AnalyzeResult` class doesn't need to change.

##########
python/pyspark/sql/udtf.py:
##########
@@ -107,12 +107,20 @@ class AnalyzeResult:
         If non-empty, this is a sequence of columns that the UDTF is specifying for Catalyst to
         sort the input TABLE argument by. Note that the 'partition_by' list must also be non-empty
         in this case.
+    prepare_buffer: str

Review Comment:
   We decided to remove this and just pickle the `AnalyzeResult` instance itself. So this `AnalyzeResult` class doesn't need to change.

##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/PythonUDF.scala:
##########
@@ -167,22 +169,26 @@ abstract class UnevaluableGenerator extends Generator {
  * @param udfDeterministic true if this function is deterministic wherein it returns the same result
  *                         rows for every call with the same input arguments
  * @param resultId unique expression ID for this function invocation
- * @param pythonUDTFPartitionColumnIndexes holds the indexes of the TABLE argument to the Python
- *                                         UDTF call, if applicable
+ * @param pythonUDTFPartitionColumnIndexes holds the zero-based indexes of the projected results of
+ *                                         all PARTITION BY expressions within the TABLE argument of
+ *                                         the Python UDTF call, if applicable
  * @param analyzeResult holds the result of the polymorphic Python UDTF 'analyze' method, if the
  *                      UDTF defined one
  */
 case class PythonUDTF(
     name: String,
     func: PythonFunction,
-    elementSchema: StructType,
+    analyzeResult: PythonUDTFAnalyzeResult,

Review Comment:
   We talked offline and now we need it because we need to pickle and send it back to the UDTF.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

---------------------------------------------------------------------
For additional commands, e-mail: reviews-h...@spark.apache.org