[jira] [Comment Edited] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support

2019-04-20 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16822517#comment-16822517
 ] 

Xiangrui Meng edited comment on SPARK-27396 at 4/20/19 5:03 PM:


[~revans2] Thanks for clarifying the proposal! If your primary goal is ETL, it 
would be nice to state that clearly in the SPIP. Here are the parts that 
confused me:

{quote}
Q1.

Allow for simple data exchange with other systems, DL/ML libraries, pandas, 
etc. by having clean APIs to transform the columnar data into an Apache Arrow 
compatible layout.

Q5

Anyone who wants to experiment with accelerated computing, either on a CPU or 
GPU, will benefit as it provides a supported way to make this work instead of 
trying to hack something in that was not available before.

Q8

The first check for success would be to successfully transition the existing 
columnar code over to using this.  That would be transforming and sending the 
data to python for processing by Pandas UDFs.
{quote}

No "ETL" was mentioned in the SPIP itself.

And if you want to propose an Arrow-compatible interface, I don't think "end 
users" like data engineers/scientists will want to use it to write UDFs. 
Correct me if I'm wrong, but your target personas seem to be developers who 
understand the Arrow format and are able to plug in vectorized code. If so, it 
would be nice to make that clear in the SPIP too.
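
To make the persona point concrete: even the friendliest Arrow exchange today 
looks roughly like the sketch below (only public pyarrow/pandas calls; the data 
is made up). This is natural for a systems developer, but it is not something a 
typical data engineer/scientist wants to write.

{code:python}
import pandas as pd
import pyarrow as pa

pdf = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0]})

# pandas -> Arrow: columns become contiguous, Arrow-format buffers
batch = pa.RecordBatch.from_pandas(pdf)

# Vectorized code works on whole columns, not row by row
x = batch.column(0)  # a pyarrow.Array backed by a contiguous buffer

# Arrow -> pandas for downstream DL/ML libraries
pdf2 = batch.to_pandas()
{code}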

I'm +1 on the columnar format. My doubt is how much benefit ETL use cases gain 
from making it public. Your mid-term success criterion (Pandas UDFs) can be 
achieved without exposing any public API, and as I mentioned, the current 
bottleneck for Pandas UDFs is not the data conversion but the data pipelining 
and per-batch overhead. You didn't give a concrete example of the final 
success. In Q4 you mentioned that "we have already completed a proof of concept 
that shows columnar processing can be efficiently done in Spark". Could you say 
more about the POC, and what do you mean by "in Spark"?
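
For reference, the current Pandas UDF path already hides the Arrow conversion 
from users, and the batch size is tunable. A minimal sketch (standard PySpark 
2.4-era APIs; the names and numbers are illustrative):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()

# Each UDF invocation sees at most this many rows as one Arrow batch;
# shrinking it increases the per-batch overhead discussed above.
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "10000")

@pandas_udf("double", PandasUDFType.SCALAR)
def plus_one(v):
    # v is a pandas.Series holding one Arrow batch; the work itself is
    # vectorized, so pipelining batches to the Python worker and the
    # per-batch overhead dominate the cost, not the conversion.
    return v + 1

df = spark.range(1000000).selectExpr("cast(id as double) as v")
df.select(plus_one("v")).count()
{code}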

Could you provide a concrete ETL use case that benefits from this *public API*, 
uses vectorization, and significantly boosts performance?
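
This is the kind of side-by-side I have in mind (illustrative only, not a 
benchmark): the same transform written row-at-a-time versus vectorized, where 
the columnar path should win.

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, pandas_udf, PandasUDFType
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000000).selectExpr("cast(id as double) as v")

# Row-at-a-time: one Python call per row, heavy per-row serialization.
@udf(DoubleType())
def scale_row(v):
    return v * 2.0

# Vectorized: one Python call per Arrow batch, columnar arithmetic.
@pandas_udf("double", PandasUDFType.SCALAR)
def scale_vec(v):
    return v * 2.0

df.select(scale_row("v")).count()  # row-based path
df.select(scale_vec("v")).count()  # columnar path
{code}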


> SPIP: Public APIs for extended Columnar Processing Support
> --
>
> Key: SPARK-27396
> URL: https://issues.apache.org/jira/browse/SPARK-27396
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Robert Joseph Evans
>Priority: Major
>
> *Q1.* What are you trying to do? Articulate your objectives using absolutely 
> no jargon.
>  
> The Dataset/DataFrame API in Spark currently only exposes to users one row at 
> a time when processing data.  The goals of this are to 
>  
>  # Expose to end users a new option of processing the data in a columnar 
> format, multiple rows at a time, with the data organized into contiguous 
> arrays in memory. 
>  # Make any transitions between the columnar memory layout and a row based 
> layout transparent to the end user.
>  # Allow for simple data exchange with other systems, DL/ML libraries, 
> pandas, etc. by having clean APIs to transform the columnar data into an 
> Apache Arrow compatible layout.