[jira] [Comment Edited] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16822517#comment-16822517 ]

Xiangrui Meng edited comment on SPARK-27396 at 4/20/19 5:03 PM:

[~revans2] Thanks for clarifying the proposal! If your primary goal is ETL, it would be nice to state it clearly in the SPIP. Here are the parts that confused me:

{quote}
Q1. Allow for simple data exchange with other systems, DL/ML libraries, pandas, etc. by having clean APIs to transform the columnar data into an Apache Arrow compatible layout.

Q5 Anyone who wants to experiment with accelerated computing, either on a CPU or GPU, will benefit as it provides a supported way to make this work instead of trying to hack something in that was not available before.

Q8 The first check for success would be to successfully transition the existing columnar code over to using this. That would be transforming and sending the data to python for processing by Pandas UDFs.
{quote}

No "ETL" was mentioned in the SPIP itself. And if you want to propose an interface that is Arrow compatible, I don't think "end users" like data engineers/scientists want to use it to write UDFs. Correct me if I'm wrong, but your target personas are developers who understand the Arrow format and are able to plug in vectorized code. If so, it would be nice to make it clear in the SPIP too.

I'm +1 on columnar format. My doubt is how much benefit we get for ETL use cases from making it public. Your mid-term success (Pandas UDF) can be achieved without exposing any public API. And I mentioned that the current issue with Pandas UDFs is not the data conversion but data pipelining and per-batch overhead.

You didn't give a concrete example for the final success. In Q4, you mentioned "we have already completed a proof of concept that shows columnar processing can be efficiently done in Spark". Could you say more about the POC and what you mean by "in Spark"?

Could you provide a concrete ETL use case that benefits from this *public API*, does vectorization, and significantly boosts the performance?

was (Author: mengxr):

[~revans2] Thanks for clarifying the proposal! If your primary goal is ETL, it would be nice to state it clearly in the SPIP. Here are the parts that confused me:

{quote}
Q1. Allow for simple data exchange with other systems, DL/ML libraries, pandas, etc. by having clean APIs to transform the columnar data into an Apache Arrow compatible layout.

Q5 Anyone who wants to experiment with accelerated computing, either on a CPU or GPU, will benefit as it provides a supported way to make this work instead of trying to hack something in that was not available before.

Q8 The first check for success would be to successfully transition the existing columnar code over to using this. That would be transforming and sending the data to python for processing by Pandas UDFs.
{quote}

And if you want to propose an interface that is Arrow compatible, I don't think "end users" like data engineers/scientists want to use it to write UDFs. Correct me if I'm wrong, but your target personas are developers who understand the Arrow format and are able to plug in vectorized code. If so, it would be nice to make it clear in the SPIP too.

I'm +1 on columnar format. My doubt is how much benefit we get for ETL use cases from making it public. Your mid-term success (Pandas UDF) can be achieved without exposing any public API. And I mentioned that the current issue with Pandas UDFs is not the data conversion but data pipelining and per-batch overhead.

You didn't give a concrete example for the final success. In Q4, you mentioned "we have already completed a proof of concept that shows columnar processing can be efficiently done in Spark". Could you say more about the POC and what you mean by "in Spark"?

Could you provide a concrete ETL use case that benefits from this *public API*, does vectorization, and significantly boosts the performance?

> SPIP: Public APIs for extended Columnar Processing Support
> --
>
> Key: SPARK-27396
> URL: https://issues.apache.org/jira/browse/SPARK-27396
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Robert Joseph Evans
> Priority: Major
>
> *Q1.* What are you trying to do? Articulate your objectives using absolutely
> no jargon.
>
> The Dataset/DataFrame API in Spark currently only exposes to users one row at
> a time when processing data. The goals of this are to
>
> # Expose to end users a new option of processing the data in a columnar
> format, multiple rows at a time, with the data organized into contiguous
> arrays in memory.
> # Make any transitions between the columnar memory layout and a row based
> layout
[jira] [Comment Edited] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16822517#comment-16822517 ]

Xiangrui Meng edited comment on SPARK-27396 at 4/20/19 5:02 PM:

[~revans2] Thanks for clarifying the proposal! If your primary goal is ETL, it would be nice to state it clearly in the SPIP. Here are the parts that confused me:

{quote}
Q1. Allow for simple data exchange with other systems, DL/ML libraries, pandas, etc. by having clean APIs to transform the columnar data into an Apache Arrow compatible layout.

Q5 Anyone who wants to experiment with accelerated computing, either on a CPU or GPU, will benefit as it provides a supported way to make this work instead of trying to hack something in that was not available before.

Q8 The first check for success would be to successfully transition the existing columnar code over to using this. That would be transforming and sending the data to python for processing by Pandas UDFs.
{quote}

And if you want to propose an interface that is Arrow compatible, I don't think "end users" like data engineers/scientists want to use it to write UDFs. Correct me if I'm wrong, but your target personas are developers who understand the Arrow format and are able to plug in vectorized code. If so, it would be nice to make it clear in the SPIP too.

I'm +1 on columnar format. My doubt is how much benefit we get for ETL use cases from making it public. Your mid-term success (Pandas UDF) can be achieved without exposing any public API. And I mentioned that the current issue with Pandas UDFs is not the data conversion but data pipelining and per-batch overhead.

You didn't give a concrete example for the final success. In Q4, you mentioned "we have already completed a proof of concept that shows columnar processing can be efficiently done in Spark". Could you say more about the POC and what you mean by "in Spark"?

Could you provide a concrete ETL use case that benefits from this *public API*, does vectorization, and significantly boosts the performance?

was (Author: mengxr):

[~revans2] Thanks for clarifying the proposal! If your primary goal is ETL, it would be nice to state it clearly in the SPIP. Here are the parts that confused me:

{quote}
Q1. Allow for simple data exchange with other systems, DL/ML libraries, pandas, etc. by having clean APIs to transform the columnar data into an Apache Arrow compatible layout.

Q5 Anyone who wants to experiment with accelerated computing, either on a CPU or GPU, will benefit as it provides a supported way to make this work instead of trying to hack something in that was not available before.

Q8 The first check for success would be to successfully transition the existing columnar code over to using this. That would be transforming and sending the data to python for processing by Pandas UDFs.
{quote}

And if you want to propose an interface that is Arrow compatible, I don't think "end users" like data engineers/scientists want to use it to write UDFs. Correct me if I'm wrong, but your target personas are developers who understand the Arrow format and are able to plug in vectorized code. If so, it would be nice to make it clear in the SPIP too.

I'm +1 on columnar format. My doubt is how much benefit we get for ETL use cases from making it public. Your mid-term success (Pandas UDF) can be achieved without exposing any public API. And I mentioned that the current issue with Pandas UDFs is not the data conversion but data pipelining and per-batch overhead.

You didn't give a concrete example for the final success. In Q4, you mentioned "we have already completed a proof of concept that shows columnar processing can be efficiently done in Spark". Could you say more about the POC and what you mean by "in Spark"?

Could you provide a concrete ETL use case that benefits from this public API, does vectorization, and significantly boosts the performance?

> SPIP: Public APIs for extended Columnar Processing Support
> --
>
> Key: SPARK-27396
> URL: https://issues.apache.org/jira/browse/SPARK-27396
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Robert Joseph Evans
> Priority: Major
>
> *Q1.* What are you trying to do? Articulate your objectives using absolutely
> no jargon.
>
> The Dataset/DataFrame API in Spark currently only exposes to users one row at
> a time when processing data. The goals of this are to
>
> # Expose to end users a new option of processing the data in a columnar
> format, multiple rows at a time, with the data organized into contiguous
> arrays in memory.
> # Make any transitions between the columnar memory layout and a row based
> layout transparent to the end user.
> # Allow for
[jira] [Comment Edited] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16822517#comment-16822517 ]

Xiangrui Meng edited comment on SPARK-27396 at 4/20/19 5:01 PM:

[~revans2] Thanks for clarifying the proposal! If your primary goal is ETL, it would be nice to state it clearly in the SPIP. Here are the parts that confused me:

{quote}
Q1. Allow for simple data exchange with other systems, DL/ML libraries, pandas, etc. by having clean APIs to transform the columnar data into an Apache Arrow compatible layout.

Q5 Anyone who wants to experiment with accelerated computing, either on a CPU or GPU, will benefit as it provides a supported way to make this work instead of trying to hack something in that was not available before.

Q8 The first check for success would be to successfully transition the existing columnar code over to using this. That would be transforming and sending the data to python for processing by Pandas UDFs.
{quote}

And if you want to propose an interface that is Arrow compatible, I don't think "end users" like data engineers/scientists want to use it to write UDFs. Correct me if I'm wrong, but your target personas are developers who understand the Arrow format and are able to plug in vectorized code. If so, it would be nice to make it clear in the SPIP too.

I'm +1 on columnar format. My doubt is how much benefit we get for ETL use cases from making it public. Your mid-term success (Pandas UDF) can be achieved without exposing any public API. And I mentioned that the current issue with Pandas UDFs is not the data conversion but data pipelining and per-batch overhead.

You didn't give a concrete example for the final success. In Q4, you mentioned "we have already completed a proof of concept that shows columnar processing can be efficiently done in Spark". Could you say more about the POC and what you mean by "in Spark"?

Could you provide a concrete ETL use case that benefits from this public API, does vectorization, and significantly boosts the performance?

was (Author: mengxr):

[~revans2] Thanks for clarifying the proposal! If your primary goal is ETL, it would be nice to state it clearly in the SPIP. Here are the parts that confused me:

{quote}
Q1. Allow for simple data exchange with other systems, DL/ML libraries, pandas, etc. by having clean APIs to transform the columnar data into an Apache Arrow compatible layout.

Q5 Anyone who wants to experiment with accelerated computing, either on a CPU or GPU, will benefit as it provides a supported way to make this work instead of trying to hack something in that was not available before.

Q8 The first check for success would be to successfully transition the existing columnar code over to using this. That would be transforming and sending the data to python for processing by Pandas UDFs.
{quote}

And if you want to propose an interface that is Arrow compatible, I don't think "end users" like data engineers/scientists want to use it to write UDFs. Correct me if I'm wrong, but your target personas are developers who understand the Arrow format and are able to plug in vectorized code. If so, it would be nice to make it clear in the SPIP too.

I'm +1 on columnar format. My doubt is how much benefit we get for ETL use cases from making it public. Your mid-term goal (Pandas UDF) can be achieved without exposing any public API. And I mentioned that the current issue with Pandas UDFs is not the data conversion but data pipelining and per-batch overhead.

You didn't give a concrete example for the final success. In Q4, you mentioned "we have already completed a proof of concept that shows columnar processing can be efficiently done in Spark". Could you say more about the POC and what you mean by "in Spark"?

Could you provide a concrete ETL use case that benefits from this public API, does vectorization, and significantly boosts the performance?

> SPIP: Public APIs for extended Columnar Processing Support
> --
>
> Key: SPARK-27396
> URL: https://issues.apache.org/jira/browse/SPARK-27396
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Robert Joseph Evans
> Priority: Major
>
> *Q1.* What are you trying to do? Articulate your objectives using absolutely
> no jargon.
>
> The Dataset/DataFrame API in Spark currently only exposes to users one row at
> a time when processing data. The goals of this are to
>
> # Expose to end users a new option of processing the data in a columnar
> format, multiple rows at a time, with the data organized into contiguous
> arrays in memory.
> # Make any transitions between the columnar memory layout and a row based
> layout transparent to the end user.
> # Allow for simple