[jira] [Updated] (SPARK-21190) SPIP: Vectorized UDFs for Python

Reynold Xin (JIRA) Fri, 23 Jun 2017 00:50:35 -0700

     [ https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Reynold Xin updated SPARK-21190:
--------------------------------
    Description: 
*Background and Motivation*
 
Python is one of the most popular programming languages among Spark users. 
Spark currently exposes a row-at-a-time interface for defining and executing 
user-defined functions (UDFs). This introduces high overhead in serialization 
and deserialization, and also makes it difficult to leverage Python libraries 
that are written in native code. This proposal advocates introducing new APIs 
to support vectorized UDFs in Python, in which a block of data is transferred 
over to Python in a columnar format for execution.
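
To illustrate the difference, here is a minimal sketch in plain Python. It is 
not the proposed API (that part is still to be written) and it does not use 
Spark. All function names are hypothetical, and pickle merely stands in for 
whatever serialization happens between the JVM and the Python worker. It only 
shows that a block-oriented UDF needs far fewer serialization round trips 
than a row-at-a-time UDF for the same result.

```python
import pickle


def plus_one_row(x):
    # Row-at-a-time UDF (hypothetical): invoked once per value.
    return x + 1


def plus_one_block(xs):
    # Block-oriented UDF (hypothetical): invoked once per block of values.
    return [x + 1 for x in xs]


def run_row_at_a_time(udf, column):
    # Every value is serialized and deserialized on its own, in both directions.
    out = []
    round_trips = 0
    for value in column:
        arg = pickle.loads(pickle.dumps(value))
        out.append(pickle.loads(pickle.dumps(udf(arg))))
        round_trips += 1
    return out, round_trips


def run_vectorized(udf, column, block_size):
    # A whole block of the column is serialized at once, in both directions.
    out = []
    round_trips = 0
    for start in range(0, len(column), block_size):
        block = pickle.loads(pickle.dumps(column[start:start + block_size]))
        out.extend(pickle.loads(pickle.dumps(udf(block))))
        round_trips += 1
    return out, round_trips


column = list(range(10))
row_result, row_trips = run_row_at_a_time(plus_one_row, column)
block_result, block_trips = run_vectorized(plus_one_block, column, block_size=4)
print(row_trips, block_trips)
```

In this sketch both paths produce the same values; only the number of 
round trips differs. A real implementation would also let the block function 
hand the data to native-code libraries directly, which the row-at-a-time 
interface makes difficult.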
 
 
*Target Personas*

Data scientists, data engineers, library developers.
 

*Goals*

... todo ...
 

*Non-Goals*

- Define block-oriented UDFs in languages other than Python.
- Define aggregate UDFs.
 
 
*Proposed API Changes*
 
... todo ...
 
 
 
*Optional Design Sketch*
The implementation should be pretty straightforward and is not a huge concern 
at this point. I’m more concerned about getting proper feedback on the API design.
 
 
*Optional Rejected Designs*
See above.
 
 
 
 


> SPIP: Vectorized UDFs for Python
> --------------------------------
>
>                 Key: SPARK-21190
>                 URL: https://issues.apache.org/jira/browse/SPARK-21190
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark, SQL
>    Affects Versions: 2.2.0
>            Reporter: Reynold Xin
>              Labels: SPIP
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
