dtenedor opened a new pull request, #43204:
URL: https://github.com/apache/spark/pull/43204

   ### What changes were proposed in this pull request?
   
   This PR adds a Python UDTF API that lets the `analyze` method return a buffer, which is then consumed each time an instance of the UDTF class is created.
   
   * The `AnalyzeResult` class now contains a new string field `prepare_buffer`.
   * If it is assigned a non-empty value, the UDTF should define another method, `prepare`, that accepts the string as its argument; this method is called after `__init__` whenever an instance of the class is created.
   * The format of the buffer is opaque and known only to the UDTF. Common use cases include serializing protocol buffers or JSON objects into the buffer to help organize its contents.
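
   The JSON use case mentioned above can be sketched in plain Python, without Spark. This is only an illustration under stated assumptions: `AnalyzeResultSketch`, `WordFilterUDTF`, and `make_instance` are hypothetical names that do not come from this PR, `analyze` takes plain values rather than Spark's argument objects, and `make_instance` merely mimics the call order described above (`__init__`, then `prepare` when the buffer is non-empty).

```python
import json
from dataclasses import dataclass


@dataclass
class AnalyzeResultSketch:
    """Hypothetical stand-in for AnalyzeResult; only the new field is modeled."""

    prepare_buffer: str = ""


class WordFilterUDTF:
    """Hypothetical UDTF-shaped class; it is not registered with Spark here."""

    def __init__(self):
        self._stop_words = set()
        self._min_length = 0

    @staticmethod
    def analyze(stop_words_csv: str, min_length: int) -> AnalyzeResultSketch:
        # Parse the arguments once and pack the result into a JSON string.
        words = {w.strip().lower() for w in stop_words_csv.split(",") if w.strip()}
        config = {"stop_words": sorted(words), "min_length": min_length}
        return AnalyzeResultSketch(prepare_buffer=json.dumps(config))

    def prepare(self, buffer: str) -> None:
        # Only the UDTF knows the buffer format, so it decodes the JSON itself.
        config = json.loads(buffer)
        self._stop_words = set(config["stop_words"])
        self._min_length = config["min_length"]

    def eval(self, word: str):
        # Emit the word only if it passes both filters restored in prepare.
        if len(word) >= self._min_length and word.lower() not in self._stop_words:
            yield (word,)


def make_instance(result: AnalyzeResultSketch) -> WordFilterUDTF:
    # Assumed call order: __init__ first, then prepare if the buffer is non-empty.
    instance = WordFilterUDTF()
    if result.prepare_buffer:
        instance.prepare(result.prepare_buffer)
    return instance
```

   Using JSON here keeps the buffer a plain string, as the new field requires, while still letting `prepare` rebuild structured state such as a set and an integer.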
   
   For example, this UDTF accepts a constant scalar string argument, then 
assigns this value to the buffer.
   
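   ```
   from pyspark.sql import Row
   from pyspark.sql.functions import udtf
   from pyspark.sql.types import IntegerType, StringType, StructType
   from pyspark.sql.udtf import AnalyzeResult

   @udtf
   class TestUDTF:
       def __init__(self):
           self._total = 0
           self._buffer = None
   
       @staticmethod
       def analyze(argument, _):
           # Pass the constant string argument through as the buffer.
           return AnalyzeResult(
               schema=StructType().add("total", IntegerType()).add("buffer", StringType()),
               prepare_buffer=argument.value,
               with_single_partition=True)
   
       def prepare(self, buffer):
           # Called after __init__ with the buffer returned by analyze.
           self._buffer = buffer
           self._total = len(buffer)
   
       def eval(self, argument, row: Row):
           # Count one more for each input row.
           self._total += 1
   
       def terminate(self):
           yield self._total, self._buffer
   ```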
   
   ### Why are the changes needed?
   
   In this way, the UDTF can perform potentially expensive initialization logic in the `analyze` method just once and return the result of that initialization, rather than repeating the initialization in `eval`.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, see above.
   
   ### How was this patch tested?
   
   This PR adds new unit test coverage.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

