dtenedor commented on code in PR #43204:
URL: https://github.com/apache/spark/pull/43204#discussion_r1348084412
##########
python/pyspark/sql/udtf.py:
##########
@@ -107,12 +107,20 @@ class AnalyzeResult:
         If non-empty, this is a sequence of columns that the UDTF is specifying for Catalyst to
         sort the input TABLE argument by. Note that the 'partition_by' list must also be non-empty
         in this case.
+    prepare_buffer: str
+        If non-empty, this string represents state computed once within the 'analyze' method to be
+        propagated to each instance of the UDTF class at the time of its creation, using its
+        'prepare' method. The format of this buffer is opaque and known only to the data source.
+        Common use cases include serializing protocol buffers or JSON configurations into this
+        buffer so that potentially expensive initialization work done at 'analyze' time does not
+        need to be recomputed later.
     """

     schema: StructType
     with_single_partition: bool = False
     partition_by: Sequence[PartitioningColumn] = field(default_factory=tuple)
     order_by: Sequence[OrderingColumn] = field(default_factory=tuple)
+    prepare_buffer: str = ""

Review Comment:
   We decided to remove this and just pickle the `AnalyzeResult` instance itself. So this `AnalyzeResult` class doesn't need to change.

##########
python/pyspark/sql/udtf.py:
##########
@@ -107,12 +107,20 @@ class AnalyzeResult:
         If non-empty, this is a sequence of columns that the UDTF is specifying for Catalyst to
         sort the input TABLE argument by. Note that the 'partition_by' list must also be non-empty
         in this case.
+    prepare_buffer: str

Review Comment:
   We decided to remove this and just pickle the `AnalyzeResult` instance itself. So this `AnalyzeResult` class doesn't need to change.

##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/PythonUDF.scala:
##########
@@ -167,22 +169,26 @@ abstract class UnevaluableGenerator extends Generator {
  * @param udfDeterministic true if this function is deterministic wherein it returns the same result
  *                         rows for every call with the same input arguments
  * @param resultId unique expression ID for this function invocation
- * @param pythonUDTFPartitionColumnIndexes holds the indexes of the TABLE argument to the Python
- *                                         UDTF call, if applicable
+ * @param pythonUDTFPartitionColumnIndexes holds the zero-based indexes of the projected results of
+ *                                         all PARTITION BY expressions within the TABLE argument of
+ *                                         the Python UDTF call, if applicable
  * @param analyzeResult holds the result of the polymorphic Python UDTF 'analyze' method, if the
  *                      UDTF defined one
  */
 case class PythonUDTF(
     name: String,
     func: PythonFunction,
-    elementSchema: StructType,
+    analyzeResult: PythonUDTFAnalyzeResult,

Review Comment:
   We talked offline and now we need it because we need to pickle and send it back to the UDTF.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

---------------------------------------------------------------------
For additional commands, e-mail: reviews-h...@spark.apache.org