[ https://issues.apache.org/jira/browse/SPARK-26858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16772570#comment-16772570 ]
Hyukjin Kwon edited comment on SPARK-26858 at 2/20/19 3:02 AM:
---------------------------------------------------------------

Yes, that matches what I was thinking. Thanks for the clarification, [~bryanc]. The discussion is getting long, so to summarize:

- Arrow requires the schema to be known up front in order to write its output.
- The schema is only known after the R native function has executed.
- This might be worked around by, for instance, setting the DataFrame's output type to {{binary}} and then constructing the Arrow streaming format by lazily providing the schema (for its batches or {{Schema}}). From what I tried, this is quite hacky, since it basically means the {{Schema}} has to be sent from the executor to the driver (to be used later in {{collect()}}).

(One other possibility for handling batches without a {{Schema}} is to send the Arrow batches one by one, deserialize each into a {{RecordBatch}} instance, and then construct an Arrow table from them. This is quite different from how the Python side does it, and also hacky. See the sketch at the end of this message.)

> Vectorized gapplyCollect, Arrow optimization in native R function execution
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-26858
>                 URL: https://issues.apache.org/jira/browse/SPARK-26858
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SparkR, SQL
>    Affects Versions: 3.0.0
>            Reporter: Hyukjin Kwon
>            Assignee: Hyukjin Kwon
>            Priority: Major
>
> Unlike gapply, gapplyCollect requires additional ser/de steps because it can
> omit the schema, so Spark SQL doesn't know the return type before the
> execution actually happens.
> In the original code path, this is handled by using a binary schema. Once
> gapply is done (SPARK-26761), we can mimic that approach in vectorized
> gapply to support gapplyCollect.
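For reference, here is a minimal pyarrow sketch of the batch-by-batch idea discussed above. It is an illustration under assumptions, not what SparkR actually implements: it assumes each executor emits a self-contained Arrow IPC stream (schema header plus record batches) per partition, since a bare record batch cannot be deserialized without its schema, and {{payloads}} is a hypothetical list of such byte buffers collected on the driver.

{code:python}
import pyarrow as pa

def reassemble_table(payloads):
    # `payloads` is a hypothetical list of byte buffers; each is
    # assumed to be a complete Arrow IPC stream (schema header plus
    # one or more record batches), because a bare batch cannot be
    # deserialized without its schema.
    batches = []
    for buf in payloads:
        # open_stream reads the embedded schema and then yields
        # RecordBatch instances one by one.
        reader = pa.ipc.open_stream(buf)
        batches.extend(reader)
    # Table.from_batches requires all batches to share one schema,
    # which is exactly the invariant this workaround must maintain.
    return pa.Table.from_batches(batches)
{code}

Having to ship a redundant copy of the schema with every partition is part of what makes this approach hacky compared to a single stream that provides the schema once up front.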