Hi,
This is a bit of a shot in the dark, so to speak. I would like to use the standard deviation (std) function offered by NumPy in PySpark. I am using Spark SQL for now; the code is below:

sqltext = f"""
SELECT
    rs.Customer_ID
  , rs.Number_of_orders
  , rs.Total_customer_amount
  , rs.Average_order
  , rs.Standard_deviation
FROM
(
  SELECT
      cust_id              AS Customer_ID
    , COUNT(amount_sold)   AS Number_of_orders
    , SUM(amount_sold)     AS Total_customer_amount
    , AVG(amount_sold)     AS Average_order
    , STDDEV(amount_sold)  AS Standard_deviation
  FROM {DB}.{table}
  GROUP BY cust_id
  HAVING SUM(amount_sold) > 94000
     AND AVG(amount_sold) < STDDEV(amount_sold)
) rs
ORDER BY 3 DESC
"""
spark.sql(sqltext)

Now, if I wanted to use a UDF based on NumPy's std function instead, I could do:

import numpy as np
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import DoubleType

# wrap numpy's std as a Spark UDF returning a double
udf = UserDefinedFunction(np.std, DoubleType())

How can I use that UDF with Spark SQL? I gather this is only possible through the functional (DataFrame) API?

Thanks,

Mich

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
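P.S. For what it is worth, here is a minimal sketch of what I had in mind, assuming spark.udf.register is the right way to make the function callable by name from SQL. The name np_std and the use of collect_list are my own guesses: np.std needs the whole sequence of values per group, not one scalar per row.

import numpy as np
from pyspark.sql.types import DoubleType

# spark is the existing SparkSession from the snippet above

# Register the numpy-backed function under a name SQL can call.
# float() turns numpy.float64 into a plain Python float so it
# maps cleanly onto DoubleType.
spark.udf.register("np_std", lambda xs: float(np.std(xs)), DoubleType())

# np.std works on a sequence, so collect each customer's amounts
# into an array first and pass that array to the UDF.
df = spark.sql(f"""
SELECT cust_id,
       np_std(collect_list(amount_sold)) AS Standard_deviation
FROM {DB}.{table}
GROUP BY cust_id
""")
df.show()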