+1

Thanks,
Dongjoon.
On Fri, May 24, 2019 at 17:03 DB Tsai <dbt...@dbtsai.com.invalid> wrote:

> +1 on exposing the APIs for columnar processing support.
>
> I understand that the scope of this SPIP doesn't cover AI / ML
> use cases. But I saw a good performance gain when I converted data
> from rows to columns to leverage SIMD architectures in a POC ML
> application.
>
> With the exposed columnar processing support, I can imagine that the
> heavy-lifting parts of ML applications (such as computing the
> objective functions) can be written as columnar expressions that
> leverage SIMD architectures to get a good speedup.
>
> Sincerely,
>
> DB Tsai
> ----------------------------------------------------------
> Web: https://www.dbtsai.com
> PGP Key ID: 42E5B25A8F7A82C1
>
> On Wed, May 15, 2019 at 2:59 PM Bobby Evans <reva...@gmail.com> wrote:
> >
> > It would allow for the columnar processing to be extended through
> > the shuffle. So if I were doing, say, an FPGA-accelerated extension,
> > it could replace the ShuffleExchangeExec with one that can take a
> > ColumnarBatch as input instead of a Row. The extended version of the
> > ShuffleExchangeExec could then do the partitioning on the incoming
> > batch and, instead of producing a ShuffledRowRDD for the exchange, it
> > could produce something like a ShuffleBatchRDD that would let the
> > serializing and deserializing happen in a column-based format for a
> > faster exchange, assuming that columnar processing is also happening
> > after the exchange. This is just like providing a columnar version
> > of any other Catalyst operator, except that in this case it is a
> > somewhat more complex operator.
> >
> > On Wed, May 15, 2019 at 12:15 PM Imran Rashid <iras...@cloudera.com.invalid> wrote:
> >>
> >> Sorry I am late to the discussion here. The JIRA mentions using
> >> these extensions for dealing with shuffles; can you explain that
> >> part? I don't see how you would use this to change shuffle behavior
> >> at all.
> >>
> >> On Tue, May 14, 2019 at 10:59 AM Thomas Graves <tgra...@apache.org> wrote:
> >>>
> >>> Thanks for replying. I'll extend the vote until May 26th to allow
> >>> feedback from you and from other people who haven't had time to
> >>> look at it.
> >>>
> >>> Tom
> >>>
> >>> On Mon, May 13, 2019 at 4:43 PM Holden Karau <hol...@pigscanfly.ca> wrote:
> >>> >
> >>> > I'd like to ask for this vote period to be extended. I'm
> >>> > interested, but I don't have the cycles to review it in detail
> >>> > and make an informed vote until the 25th.
> >>> >
> >>> > On Tue, May 14, 2019 at 1:49 AM Xiangrui Meng <m...@databricks.com> wrote:
> >>> >>
> >>> >> My vote is 0. Since the updated SPIP focuses on ETL use cases,
> >>> >> I don't feel strongly about it. I would still suggest doing the
> >>> >> following:
> >>> >>
> >>> >> 1. Link the POC mentioned in Q4 so people can verify the POC
> >>> >>    result.
> >>> >> 2. List the public APIs we plan to expose in Appendix A. I did
> >>> >>    a quick check. Besides ColumnarBatch and ColumnVector, we
> >>> >>    also need to make the following public. People who are
> >>> >>    familiar with SQL internals should help assess the risk.
> >>> >>    * ColumnarArray
> >>> >>    * ColumnarMap
> >>> >>    * unsafe.types.CalendarInterval
> >>> >>    * ColumnarRow
> >>> >>    * UTF8String
> >>> >>    * ArrayData
> >>> >>    * ...
> >>> >> 3. I still feel that using Pandas UDF as the mid-term success
> >>> >>    doesn't match the purpose of this SPIP. It does make some
> >>> >>    code cleaner, but I guess that for ETL use cases it won't
> >>> >>    bring much value.
> >>> >>
> >>> > --
> >>> > Twitter: https://twitter.com/holdenkarau
> >>> > Books (Learning Spark, High Performance Spark, etc.):
> >>> > https://amzn.to/2MaRAG9
> >>> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
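
For anyone following the shuffle question quoted above, here is a rough sketch of the idea of partitioning an incoming batch while keeping it in columnar form. It is plain Python, not Spark code, and it is not taken from the SPIP or the POC. The names `Batch` and `partition_batch` are hypothetical, and a batch is assumed to be a dict mapping column names to equal-length lists.

```python
from typing import Dict, List

# Hypothetical stand-in for a columnar batch: column name -> list of values.
# All columns are assumed to have the same length.
Batch = Dict[str, list]


def partition_batch(batch: Batch, key: str, num_partitions: int) -> List[Batch]:
    """Split one columnar batch into per-partition columnar batches."""
    # Decide the target partition for every row once, using only the key column.
    targets = [hash(value) % num_partitions for value in batch[key]]

    # One empty output batch per partition, with the same columns as the input.
    out: List[Batch] = [{name: [] for name in batch} for _ in range(num_partitions)]

    # Scatter column by column, so rows are never materialized as objects.
    for name, column in batch.items():
        for value, target in zip(column, targets):
            out[target][name].append(value)
    return out


batch = {"id": [0, 1, 2, 3, 4], "score": [0.5, 1.5, 2.5, 3.5, 4.5]}
parts = partition_batch(batch, key="id", num_partitions=2)
```

Each output batch is still column-oriented, so a serializer on the write side and an operator on the read side could both work on whole columns, which is the property the quoted reply is after. A real exchange would also have to handle serialization, spilling, and nulls, none of which are shown here.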