If you had a persistent, off-heap buffer of Arrow data on each executor, and you could get an iterator over that buffer from inside a task, then you could conceivably define an RDD over it by extending RDD and returning that iterator from the compute method. If you want a Dataset or DataFrame, though, it's going to be tough to avoid copying the data. Spark will copy the data into InternalRows unless your RDD is an RDD[InternalRow] and you create a BaseRelation for it that specifies needsConversion = false. It might be possible to implement InternalRow over your Arrow buffer, but I'm still fuzzy on whether or not that would prevent copying/marshaling of the data. Maybe one of the actual Spark SQL contributors will chime in with deeper knowledge.
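To make the "iterator over that buffer" part a bit more concrete, here is a rough sketch of my own, not anything Spark or Arrow provides. It assumes a deliberately simplified layout (a run of fixed-width 32-bit ints in a direct ByteBuffer), which is not the real Arrow format, and the class name DirectIntBufferIterator is hypothetical. It uses only the JDK, so the Spark side (a custom RDD whose compute method hands back an iterator like this) is left out.

```java
import java.nio.ByteBuffer;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical stand-in for an Arrow-backed column: fixed-width 32-bit ints
// stored back to back in a direct (off-heap) ByteBuffer. This is NOT the
// real Arrow layout; it only illustrates iterating off-heap memory in place.
final class DirectIntBufferIterator implements Iterator<Integer> {
    private final ByteBuffer view;

    DirectIntBufferIterator(ByteBuffer source) {
        // duplicate() shares the underlying memory but keeps its own
        // position and limit, so the buffer contents are not copied and the
        // caller's buffer is left untouched. duplicate() resets the byte
        // order to big-endian, so carry the source's order over explicitly.
        this.view = source.duplicate().order(source.order());
    }

    @Override
    public boolean hasNext() {
        return view.remaining() >= Integer.BYTES;
    }

    @Override
    public Integer next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        // Reads four bytes straight from the shared memory and advances
        // only this view's position.
        return view.getInt();
    }
}
```

Note that each value is still boxed into an Integer as it is read, so this avoids copying the buffer itself, not the individual values. That is roughly the same concern I have about InternalRow above.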
Jeremy

On Fri, Aug 5, 2016 at 12:43 PM, Holden Karau <hol...@pigscanfly.ca> wrote:

> Spark does not currently support Apache Arrow - probably a good place to
> chat would be on the Arrow mailing list where they are making progress
> towards unified JVM & Python/R support which is sort of a precondition of a
> functioning Arrow interface between Spark and Python.
>
> On Fri, Aug 5, 2016 at 12:40 PM, jpivar...@gmail.com <jpivar...@gmail.com>
> wrote:
>
>> In a few earlier posts [1
>> <http://apache-spark-developers-list.1001551.n3.nabble.com/Tungsten-off-heap-memory-access-for-C-libraries-td13898.html>]
>> [2
>> <http://apache-spark-developers-list.1001551.n3.nabble.com/How-to-access-the-off-heap-representation-of-cached-data-in-Spark-2-0-td17701.html>],
>> I asked about moving data from C++ into a Spark data source (RDD,
>> DataFrame, or Dataset). The issue is that even the off-heap cache might not
>> have a stable representation: it might change from one version to the next.
>>
>> I recently learned about Apache Arrow, a data layer that Spark currently or
>> will someday share with Pandas, Impala, etc. Suppose that I can fill a
>> buffer (such as a direct ByteBuffer) with Arrow-formatted data, is there an
>> easy--- or even zero-copy--- way to use that in Spark? Is that an API that
>> could be developed?
>>
>> I'll be at the KDD Spark 2.0 tutorial on August 15. Is that a good place to
>> ask this question?
>>
>> Thanks,
>> -- Jim