Re: Apache Arrow data in buffer to RDD/DataFrame/Dataset?

Jeremy Smith Fri, 05 Aug 2016 13:15:02 -0700

If you had a persistent, off-heap buffer of Arrow data on each executor,
and you could get an iterator over that buffer from inside a task, then
you could conceivably define an RDD over it by extending RDD and
returning the iterator from the compute method.  If you want to make a
Dataset or DataFrame, though, it's going to be tough to avoid copying the
data.  Spark will copy the data into InternalRows unless your RDD
is an RDD[InternalRow] and you create a BaseRelation for it that specifies
needConversion = false.  It might be possible to implement InternalRow
over your Arrow buffer, but I'm still fuzzy on whether or not that would
prevent copying/marshaling of the data.  Maybe one of the actual
contributors on Spark SQL will chime in with deeper knowledge.
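
For what it's worth, the first part (an RDD that just hands back an
iterator from compute) might look roughly like the sketch below.
BufferPartition, BufferRDD, and openBuffer are made-up names for
illustration, not Spark APIs.  openBuffer stands in for whatever code would
locate the executor-resident buffer and wrap it in an iterator; nothing
here actually reads Arrow.

```scala
import scala.reflect.ClassTag

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition type: one partition per executor-resident buffer.
// Spark only requires that a Partition expose its index.
final class BufferPartition(val index: Int) extends Partition

// Hypothetical RDD with no parent dependencies (hence `Nil`).
// `openBuffer` is an assumed hook: given a partition index, it returns an
// iterator over a buffer that already lives on the executor running the task.
class BufferRDD[T: ClassTag](
    sc: SparkContext,
    numPartitions: Int,
    openBuffer: Int => Iterator[T]
) extends RDD[T](sc, Nil) {

  // Tells the scheduler how many tasks to launch, one per partition.
  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numPartitions)(i => new BufferPartition(i))

  // Runs inside the task; it only returns the iterator and copies nothing.
  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    openBuffer(split.index)
}
```

Because the function is shipped along with the task, it has to be
serializable, and it should look the buffer up on the executor rather than
capture it on the driver.  Overriding getPreferredLocations would be the
usual way to hint which host holds which buffer.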


Jeremy

On Fri, Aug 5, 2016 at 12:43 PM, Holden Karau <hol...@pigscanfly.ca> wrote:

> Spark does not currently support Apache Arrow - probably a good place to
> chat would be on the Arrow mailing list where they are making progress
> towards unified JVM & Python/R support which is sort of a precondition of a
> functioning Arrow interface between Spark and Python.
>
> On Fri, Aug 5, 2016 at 12:40 PM, jpivar...@gmail.com <jpivar...@gmail.com>
> wrote:
>
>> In a few earlier posts [1] [2], I asked about moving data from C++ into a
>> Spark data source (RDD, DataFrame, or Dataset). The issue is that even the
>> off-heap cache might not have a stable representation: it might change
>> from one version to the next.
>>
>> [1] http://apache-spark-developers-list.1001551.n3.nabble.com/Tungsten-off-heap-memory-access-for-C-libraries-td13898.html
>> [2] http://apache-spark-developers-list.1001551.n3.nabble.com/How-to-access-the-off-heap-representation-of-cached-data-in-Spark-2-0-td17701.html
>>
>> I recently learned about Apache Arrow, a data layer that Spark currently
>> or will someday share with Pandas, Impala, etc. Suppose that I can fill a
>> buffer (such as a direct ByteBuffer) with Arrow-formatted data, is there
>> an easy, or even zero-copy, way to use that in Spark? Is that an API that
>> could be developed?
>>
>> I'll be at the KDD Spark 2.0 tutorial on August 15. Is that a good place
>> to ask this question?
>>
>> Thanks,
>> -- Jim
>>
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>

Re: Apache Arrow data in buffer to RDD/DataFrame/Dataset?
