[jira] [Updated] (SPARK-21404) Simple Vectorized Python UDFs using Arrow

Reynold Xin (JIRA) Sat, 07 Oct 2017 00:18:08 -0700

     [ https://issues.apache.org/jira/browse/SPARK-21404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Reynold Xin updated SPARK-21404:
--------------------------------
    Issue Type: Sub-task  (was: Improvement)
        Parent: SPARK-22216

> Simple Vectorized Python UDFs using Arrow
> -----------------------------------------
>
>                 Key: SPARK-21404
>                 URL: https://issues.apache.org/jira/browse/SPARK-21404
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark, SQL
>    Affects Versions: 2.3.0
>            Reporter: Bryan Cutler
>
> Using Arrow, Python UDFs can be evaluated in vectorized form by passing the 
> column data as a pandas.Series.  This offers a performance gain by 
> computing the return column data in one operation instead of iterating over 
> each row to calculate a single element and appending it to a list, as is 
> currently done.  The existing Python UDF API, which specifies the return 
> type, can be used to implement this.  Since not all functions can be 
> vectorized, there would need to be a way to enable this optimization, such 
> as a SQLConf.
> This is designed as a preliminary step for the existing SPIP: Vectorized UDFs 
> in Python (SPARK-21190) and could be used as a basis for whatever expanded 
> API is decided upon there.
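
The difference between the two evaluation styles described above can be sketched in plain pandas, without Spark or Arrow. This is only an illustration under stated assumptions: the helper names (eval_row_at_a_time, eval_vectorized, plus_one_row, plus_one_vectorized) are hypothetical and are not part of the PySpark API or of this issue's patch.

```python
import pandas as pd


def plus_one_row(x):
    # Row-at-a-time style: the function receives one scalar per call.
    return x + 1


def plus_one_vectorized(s: pd.Series) -> pd.Series:
    # Vectorized style: the function receives the whole column as a Series
    # and returns a Series of the same length.
    return s + 1


def eval_row_at_a_time(func, column: pd.Series) -> pd.Series:
    # Hypothetical stand-in for the current behaviour: call the function once
    # per row and append each result to a list.
    results = []
    for value in column:
        results.append(func(value))
    return pd.Series(results)


def eval_vectorized(func, column: pd.Series) -> pd.Series:
    # Hypothetical stand-in for the proposed behaviour: a single call computes
    # the entire return column.
    return func(column)


column = pd.Series([1.0, 2.0, 3.0])
row_result = eval_row_at_a_time(plus_one_row, column)
vec_result = eval_vectorized(plus_one_vectorized, column)
```

Both paths produce the same values; the vectorized path avoids the per-row Python call and list append. A function that cannot accept a whole Series (for example, one that branches on a scalar value) would fail in the vectorized path, which is why the description asks for a way to opt in.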



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
