zero323 commented on issue #27165: [SPARK-28264][PYTHON][SQL] Support type hints in pandas UDF and rename/move inconsistent pandas UDF types
URL: https://github.com/apache/spark/pull/27165#issuecomment-574124028
 
 
   @BryanCutler 
   
   > Using type hints can be a clear way to define the inputs/output of a UDF, 
but it might be problematic to be the only way a user can make a PandasUDF. 
What about if the user has a function they cannot modify to include type hints, 
or using lambda/partial functions? For these cases, it makes sense to use a 
wrapper like before, e.g. pandas_udf(func).
   
   While I agree with the conclusion (and there is no immediate plan to deprecate the current style anyway) and had similar concerns myself, these concerns are mostly unjustified. Since the suggested changes don't depend on static analysis, the user is free to provide annotations at runtime:
   
   ```python
    import pandas as pd

    from pyspark.sql.pandas.typehints import infer_eval_type
    from inspect import signature
   
   def f(x: pd.Series, y: pd.Series) -> pd.Series:
       return x + y
   
   infer_eval_type(signature(f))
   ## 200
   
   def g(x, y):
       return x + y
   
   infer_eval_type(signature(g))
   ## ValueError 
   ## ...
    ## ValueError: Type hints for all parameters should be specified; however, got (x, y)
   
    g.__annotations__ = {
        'x': pd.Series,
        'y': pd.Series,
        'return': pd.Series
    }
   
   infer_eval_type(signature(g))
   ## 200
   ```
   
   The same approach can be used with lambdas, and inlined with a simple wrapper like this
   
   ```python
   def with_annotations(f, annotations):
        f.__annotations__ = annotations
        return f
   ```
   
   though given the design of the Pandas API, lambdas are hardly useful in practice.
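
   For illustration, reusing the imports from the first snippet together with the wrapper above:

   ```python
   add = with_annotations(
       lambda x, y: x + y,
       {'x': pd.Series, 'y': pd.Series, 'return': pd.Series}
   )

   infer_eval_type(signature(add))
   ## 200
   ```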
   
   As for partial functions - I am aware of more than one popular implementation, but if we're talking about `functools.partial`, `infer_eval_type` could easily be adjusted to eliminate `functools.partial.args` and `functools.partial.keywords` before proceeding.
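
   A rough sketch of what that adjustment could look like (an illustration of the idea, not what the patch currently does; `signature_without_bound` is a made-up helper):

   ```python
   import pandas as pd

   from functools import partial
   from inspect import signature, Signature
   from pyspark.sql.pandas.typehints import infer_eval_type

   def h(x: pd.Series, y: pd.Series, scale: int) -> pd.Series:
       return (x + y) * scale

   p = partial(h, scale=2)

   def signature_without_bound(p: partial) -> Signature:
       # Unwrap the partial and drop the parameters already bound through
       # p.args / p.keywords before handing the signature to infer_eval_type.
       sig = signature(p.func)
       params = list(sig.parameters.values())
       params = params[len(p.args):]                             # positionally bound
       params = [q for q in params if q.name not in p.keywords]  # keyword bound
       return sig.replace(parameters=params)

   infer_eval_type(signature_without_bound(p))
   ## 200
   ```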
   
   >  I'd also worry that forcing type hints might be off-putting to some 
users, since they are not that widely used or optional. 
   
   This proposal hardly forces usage of type hints; if anything, it makes using them easier. It should rather be seen as syntactic sugar, utilizing a long-established mechanism.
   
   Personally I don't have strong feelings about this (concerns raised in the JIRA ticket and design document notwithstanding), but it should be pointed out that the mechanisms used here are already utilized in the Python standard library, so they cannot be seen as particularly exotic.
   
   > I'm also not too sure about changing some of the PandasUDFs to use regular 
functions, like with df.groupby.apply(udf).
   
   As mentioned by @HyukjinKwon, I've agitated for this change and I don't want to repeat all the points I've raised before, but I think it is important to address these two points:
   
   > I think it makes things less consistent by not using a pandas_udf for 
everything, 
   
   IMHO this is consistency from the perspective of a developer, not a user. It comes at a significant price (a growing number of execution types that the user has to be aware of) and doesn't make much sense if you take a look at the types (please refer to the design document for a more detailed argument).
   
   We do have methods with similar semantics in Scala ("strongly" typed `Dataset` transformations), in SparkR (already mentioned), and in 3rd-party extensions (`sparklyr`'s `spark_apply`). None of them requires end users to be aware of the nitty-gritty details of the internal execution model.
   
   It is a bit like saying that `KeyValueGroupedDataset.mapGroups` should be usable only as long as the user provides the corresponding `LogicalPlan` node. From where I stand, forcing a udf (with all the related complexity) is more of a leaky abstraction than a consistent design.
   
   > and it could be inconvenient for the user to keep specifying the schema as 
an argument for multiple calls,
   
   For n-to-m `pd.DataFrame` -> `pd.DataFrame` transformations, users are as likely to have a fixed schema as one that varies with the input Spark `DataFrame` (for example, column additions or all-column aggregations). In such cases the alternative is to redefine the `udf` for each possible input. Arguably this is at least as inconvenient and, in many cases, hard to manage (as udf definitions have to be mixed with pipelines to be able to access the runtime schema).
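
   To illustrate the varying-schema case (method and parameter names here are only indicative of the grouped-map style API discussed in this thread):

   ```python
   import pandas as pd

   # An all-column aggregation: the output schema mirrors whatever columns
   # the input happens to have, so it cannot be fixed once at definition time.
   def mean_all(pdf: pd.DataFrame) -> pd.DataFrame:
       return pdf.mean().to_frame().T

   # With the schema passed as a call-time argument, the same function can be
   # reused for any input, deriving the schema from the input DataFrame:
   #
   #   df.groupby("key").applyInPandas(mean_all, schema=df.drop("key").schema)
   #
   # Baking the schema into a pandas_udf instead means a separate udf definition
   # for every distinct input schema, interleaved with the pipeline code.
   ```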
   
   
   
   
