[jira] [Updated] (SPARK-8632) Poor Python UDF performance because of RDD caching

2015-09-19 Thread Reynold Xin (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-8632:
---
Target Version/s: 1.6.0, 1.5.1  (was: 1.6.0)

> Poor Python UDF performance because of RDD caching
> --
>
> Key: SPARK-8632
> URL: https://issues.apache.org/jira/browse/SPARK-8632
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.0
>Reporter: Justin Uang
>Assignee: Davies Liu
>
> {quote}
> We have been running into performance problems using Python UDFs with
> DataFrames at large scale.
> From the implementation of BatchPythonEvaluation, it looks like the goal was
> to reuse the PythonRDD code. It caches the entire child RDD so that it can
> make two passes over the data: one to feed the PythonRDD, and one to join the
> Python lambda results with the original rows (which may contain Java objects
> that should be passed through).
> In addition, it caches all the columns, even the ones that don't need to be
> processed by the Python UDF. In the case I was working with, I had a
> 500-column table, I wanted to use a Python UDF on one column, and it ended up
> caching all 500 columns.
> {quote}
> http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html
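
The two sketches below are editor-added illustrations of the pattern described above; all names, sizes, and data are hypothetical and not taken from the report. The first shows the user-facing shape of the problem: a Python UDF applied to a single column of a wide DataFrame, which in Spark 1.4 is planned through BatchPythonEvaluation.

{code:python}
# Illustrative reproduction sketch (hypothetical names and sizes):
# one Python UDF that reads a single column of a wide DataFrame.
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row
from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType

sc = SparkContext(appName="python-udf-caching-sketch")
sqlContext = SQLContext(sc)

# A wide table: 500 columns, but only c0 is consumed by the UDF.
n_cols = 500
rows = [Row(**{"c%d" % i: r * n_cols + i for i in range(n_cols)})
        for r in range(100)]
df = sqlContext.createDataFrame(rows)

double_it = udf(lambda x: x * 2, IntegerType())

# Per the report, evaluating this plan in 1.4 caches the entire child RDD
# (all 500 columns), even though the UDF only needs c0.
result = df.withColumn("c0_doubled", double_it(col("c0")))
result.explain()
result.count()
{code}

The second sketch is a simplified, RDD-level picture of the two-pass pattern the description refers to: cache the child, traverse it once to feed the Python worker, then traverse it again to zip the results back onto the original rows. It is not the actual BatchPythonEvaluation code, only an approximation of the data flow being criticized.

{code:python}
# Simplified data-flow sketch (not the real implementation), continuing
# from the DataFrame built above.
child = df.rdd.cache()                          # full rows cached: every column

udf_inputs = child.map(lambda row: row.c0)      # pass 1: extract the UDF's input
udf_outputs = udf_inputs.map(lambda x: x * 2)   # stand-in for the Python worker round trip

# Pass 2: walk the cached child again to re-attach the Python results.
joined = child.zip(udf_outputs).map(lambda pair: pair[0] + (pair[1],))
joined.count()
{code}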






[jira] [Updated] (SPARK-8632) Poor Python UDF performance because of RDD caching

2015-09-19 Thread Reynold Xin (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-8632:
---
Priority: Blocker  (was: Major)




[jira] [Updated] (SPARK-8632) Poor Python UDF performance because of RDD caching

2015-08-03 Thread Michael Armbrust (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-8632:

Target Version/s: 1.6.0  (was: 1.5.0)




[jira] [Updated] (SPARK-8632) Poor Python UDF performance because of RDD caching

2015-07-02 Thread Michael Armbrust (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-8632:

Target Version/s: 1.5.0




[jira] [Updated] (SPARK-8632) Poor Python UDF performance because of RDD caching

2015-07-02 Thread Michael Armbrust (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-8632:

Shepherd: Davies Liu
