Hi,

This is a shot in the dark, so to speak.


I would like to use the standard deviation function (std) offered by NumPy
in PySpark. I am using Spark SQL for now.


The code is as below:


  sqltext = f"""

  SELECT

          rs.Customer_ID

        , rs.Number_of_orders

        , rs.Total_customer_amount

        , rs.Average_order

        , rs.Standard_deviation

  FROM

  (

        SELECT cust_id AS Customer_ID,

        COUNT(amount_sold) AS Number_of_orders,

        SUM(amount_sold) AS Total_customer_amount,

        AVG(amount_sold) AS Average_order,

      *  STDDEV(amount_sold) AS Standard_deviation*

        FROM {DB}.{table}

        GROUP BY cust_id

        HAVING SUM(amount_sold) > 94000

        AND AVG(amount_sold) < STDDEV(amount_sold)

  ) rs

  ORDER BY

          3 DESC

  """

  spark.sql(sqltext)

Now, if I wanted to use a UDF based on NumPy's std function, I could do:

import numpy as np
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import DoubleType

# wrap NumPy's std as a PySpark UDF that returns a double
udf = UserDefinedFunction(np.std, DoubleType())

How can I use that UDF with Spark SQL? I gather this is only possible
through functional programming (the DataFrame API)?
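
What I have in mind is something like the sketch below, though I am not sure
it is the right approach. It assumes spark.udf.register will accept a plain
Python callable, and that the per-customer values are first gathered with
collect_list so that np.std has an array to work on (a scalar UDF is applied
row by row, so on its own it cannot aggregate). Note also that Spark's STDDEV
is the sample standard deviation (STDDEV_SAMP), whereas np.std defaults to
the population standard deviation (ddof=0), so the two will not agree exactly.

import numpy as np
from pyspark.sql.types import DoubleType

# register the function under a name visible to Spark SQL;
# float() avoids handing a numpy.float64 back to Spark
spark.udf.register("np_std", lambda xs: float(np.std(xs)), DoubleType())

# collect_list gathers each customer's amounts into an array for np_std;
# assumes amount_sold is a double (CAST it first if it is a DECIMAL)
sqltext = f"""
SELECT cust_id AS Customer_ID,
       np_std(collect_list(amount_sold)) AS Standard_deviation
FROM {DB}.{table}
GROUP BY cust_id
"""
spark.sql(sqltext)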

Thanks,

Mich









