deanm0000 commented on PR #1074:
URL:
https://github.com/apache/datafusion-python/pull/1074#issuecomment-2741334821
I wrote this script to help produce this PR. I wanted to start with just the
functions which have a single Expr input that return an Expr. The script writes
all those defs to a file and then I copied that over to the source and let the
linter fix the bad formatting that my code created.
```python
from inspect import signature, Parameter
from datafusion import functions as f, Expr
from types import FunctionType
from pathlib import Path

# Names exposed by datafusion.functions that are not already attributes of Expr.
funcs_not_exprs = set(dir(f)) - set(dir(Expr))
funcs = []
for fun in funcs_not_exprs:
    if isinstance(getattr(f, fun), FunctionType):
        funcs.append(fun)

# Buckets for each signature shape.
expr_in_out = {
    "one_in_out": [],
    "multi_expr": [],
    "other_ins": [],
    "other_return": [],
    "other": [],
}
for fun in funcs:
    sig = signature(getattr(f, fun))
    params = sig.parameters
    return_annotation = sig.return_annotation
    # Annotations are compared as strings, not as classes.
    if return_annotation != "Expr":
        expr_in_out["other_return"].append(fun)
        continue
    all_expr = True
    no_star = True
    for name, param in params.items():
        if param.annotation != "Expr":
            all_expr = False
            break
        if param.kind in (Parameter.VAR_POSITIONAL, Parameter.KEYWORD_ONLY):
            no_star = False
    if len(params) == 1 and all_expr and no_star:
        expr_in_out["one_in_out"].append(fun)
    elif len(params) > 1 and all_expr:
        expr_in_out["multi_expr"].append(fun)
    elif len(params) > 1 and not all_expr:
        expr_in_out["other_ins"].append(fun)
    else:
        expr_in_out["other"].append(fun)

# Write a method wrapper for every single-Expr-in, Expr-out function.
expr_defs = Path("./expr_defs.py")
with expr_defs.open("w") as ff:
    for fun in expr_in_out["one_in_out"]:
        ff.write(f"    def {fun}(self) -> Expr:\n")
        docstring = getattr(f, fun).__doc__
        if docstring is not None:
            ff.write('        """')
            docstring = docstring.strip()
            ff.write(docstring)
            ff.write('\n        """\n')
        ff.write(f"        return F.{fun}(self)\n")
```
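The classification step compares annotations to the string `"Expr"`, which only matches when the annotations are stored as strings (for example, under postponed evaluation). As a minimal sketch of the same single-in, single-out check that tolerates both string and class annotations, the following uses a hypothetical stand-in `Expr` class and two toy functions; `annotation_name` and `is_one_in_out` are illustrative names, not part of datafusion.

```python
from inspect import Parameter, signature


class Expr:
    """Hypothetical stand-in for datafusion.Expr."""


def tan(arg: "Expr") -> "Expr":
    # String annotations, as seen under postponed evaluation.
    return arg


def sqrt(arg: Expr) -> Expr:
    # Class annotations, as seen without postponed evaluation.
    return arg


def concat(*args: "Expr") -> "Expr":
    # Variadic input, so it should not become a zero-argument method.
    return args[0]


def annotation_name(annotation: object) -> str:
    """Return a comparable name for a string or class annotation."""
    if isinstance(annotation, str):
        return annotation
    return getattr(annotation, "__name__", repr(annotation))


def is_one_in_out(func) -> bool:
    """True if func takes exactly one plain Expr parameter and returns Expr."""
    sig = signature(func)
    params = list(sig.parameters.values())
    return (
        annotation_name(sig.return_annotation) == "Expr"
        and len(params) == 1
        and annotation_name(params[0].annotation) == "Expr"
        and params[0].kind
        not in (Parameter.VAR_POSITIONAL, Parameter.KEYWORD_ONLY)
    )
```

With these toy functions, `tan` and `sqrt` qualify and `concat` does not, which mirrors the `one_in_out` bucket in the script.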
Before I do tests for all of them, I wanted to put this in the world for
feedback.
One additional idea that wasn't in the original PR would be to create
namespaces that group functions by category, so instead of `col("a").tan()`
it'd be `col("a").trig.tan()`, `col("b").list.length()`,
`col("c").str.reverse()`, `col("d").dt.to_timestamp()`. That keeps the root of
Expr from offering too many functions to choose from, and it puts similar
functions together. If someone is working with datetimes, every function they
look through in the datetime namespace is a datetime function, and there
wouldn't be datetime functions clogging up the root of Expr. Same for
trig, strings, lists, arrays, and anything else that deserves a namespace.
That, of course, requires some more manual effort to categorize the
functions.
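As a rough sketch of how such namespaces could be wired up, the following uses properties that return small accessor objects. Everything here is hypothetical: `Expr` is a toy stand-in that only records the call as text, and `TrigNamespace` and `StringNamespace` are illustrative names, not existing datafusion classes.

```python
class Expr:
    """Toy stand-in for datafusion.Expr; it only records the call as text."""

    def __init__(self, text: str) -> None:
        self.text = text

    @property
    def trig(self) -> "TrigNamespace":
        # Accessor grouping trigonometric functions.
        return TrigNamespace(self)

    @property
    def str(self) -> "StringNamespace":
        # Accessor grouping string functions.
        return StringNamespace(self)


class TrigNamespace:
    def __init__(self, expr: Expr) -> None:
        self._expr = expr

    def tan(self) -> Expr:
        return Expr(f"tan({self._expr.text})")


class StringNamespace:
    def __init__(self, expr: Expr) -> None:
        self._expr = expr

    def reverse(self) -> Expr:
        return Expr(f"reverse({self._expr.text})")


def col(name: str) -> Expr:
    """Toy column constructor."""
    return Expr(name)
```

In this sketch `col("a").trig.tan()` works while `Expr` itself has no `tan` attribute, so the root of the class stays small.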
Looking further ahead, for functions that take extra Expr inputs, like
`levenshtein`, I would also put in a convenience check so that a plain string
is treated as a column name. The function would be:
```python
def levenshtein(self, string2: Expr | str) -> Expr:
    if isinstance(string2, str):
        string2 = col(string2)
    return F.levenshtein(self, string2)
```
That would be consistent with polars with respect to literals: a bare string
means a column, so if someone wanted the levenshtein distance against a literal
they'd have to use `lit("other_string")` rather than passing "other_string"
directly.
For functions that take a number as the second argument, the number would be
used directly as a literal, such as:
```python
def pow(self, exponent: Expr | int | float) -> Expr:
    if isinstance(exponent, (int, float)):
        exponent = lit(exponent)
    return F.pow(self, exponent)
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]