Thanks, guys <3. FYI, I opened PRs for vectorized collect() and dapply() as well. In my tests, they improve performance by 1500%+ and 4600%+, respectively.
https://github.com/apache/spark/pull/23760
https://github.com/apache/spark/pull/23787

On Mon, Feb 11, 2019 at 4:45 AM, Felix Cheung <felixcheun...@hotmail.com> wrote:
> This is super awesome!
>
> ------------------------------
> *From:* Shivaram Venkataraman <shiva...@eecs.berkeley.edu>
> *Sent:* Saturday, February 9, 2019 8:33 AM
> *To:* Hyukjin Kwon
> *Cc:* dev; Felix Cheung; Bryan Cutler; Liang-Chi Hsieh; Shivaram Venkataraman
> *Subject:* Re: Vectorized R gapply[Collect]() implementation
>
> Those speedups look awesome! Great work, Hyukjin!
>
> Thanks
> Shivaram
>
> On Sat, Feb 9, 2019 at 7:41 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
> >
> > Guys, as a continuation of the Arrow optimization for R DataFrame to Spark DataFrame conversion,
> > I am trying a vectorized gapply[Collect] implementation as an experiment, similar to vectorized Pandas UDFs.
> >
> > It brought an 820%+ performance improvement. See
> > https://github.com/apache/spark/pull/23746
> >
> > Please come and take a look if you're interested in R APIs :D. I have already cc'ed some people I know, but please come, review, and discuss both the Spark side and the Arrow side.
> >
> > This Arrow optimization work is being tracked under
> > https://issues.apache.org/jira/browse/SPARK-26759. Please feel free to take a task if any of you is interested.
> >
> > Thanks.