Apache Arrow data in buffer to RDD/DataFrame/Dataset?

2016-08-05 Thread jpivar...@gmail.com
In a few earlier posts [1] [2], I asked about moving data from C++ into a
Spark data source (RDD, DataFrame, or Dataset). The issue is that even the
off-heap cache might not have a stable representation: it might change from
one version to the next.

I recently learned about Apache Arrow, a data layer that Spark currently or
will someday share with Pandas, Impala, etc. Suppose that I can fill a
buffer (such as a direct ByteBuffer) with Arrow-formatted data: is there an
easy, or even zero-copy, way to use that in Spark? Is that an API that
could be developed?
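To make the hand-off concrete, here is a minimal self-contained sketch of the pattern I have in mind: a producer (imagine C++ through JNI) fills a direct ByteBuffer with one column of little-endian doubles, and a consumer views the same off-heap memory without copying it. The names and layout here are illustrative, not Arrow's actual API; the open question is whether Spark could consume such a buffer directly.

```scala
import java.nio.{ByteBuffer, ByteOrder}

// Hypothetical producer side: a direct (off-heap) buffer filled with one
// column of little-endian doubles, Arrow-style contiguous layout.
val numRows = 4
val buf = ByteBuffer.allocateDirect(numRows * 8).order(ByteOrder.LITTLE_ENDIAN)
Seq(1.1, 2.2, 3.3, 4.4).foreach(buf.putDouble)
buf.flip()

// Consumer side: view the same memory as doubles; no copy of the underlying
// buffer is made. This is the buffer I would hope to hand to Spark.
val view = buf.asDoubleBuffer()
val col = new Array[Double](numRows)
view.get(col)
println(col.toSeq)
```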

I'll be at the KDD Spark 2.0 tutorial on August 15. Is that a good place to
ask this question?

Thanks,
-- Jim




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Apache-Arrow-data-in-buffer-to-RDD-DataFrame-Dataset-tp18563.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org



Re: How to access the off-heap representation of cached data in Spark 2.0

2016-05-29 Thread jpivar...@gmail.com
Okay, that makes sense.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/How-to-access-the-off-heap-representation-of-cached-data-in-Spark-2-0-tp17701p17723.html



Re: How to access the off-heap representation of cached data in Spark 2.0

2016-05-29 Thread jpivar...@gmail.com
Thanks Jacek and Kazuaki,

I guess I got the wrong impression about the C++ API from somewhere; I think
I read it in a JIRA wish list. If the byte array is accessed through
sun.misc.Unsafe, that's what I mean by off-heap. I found the Platform class,
which provides uniform access to the bytes behind Unsafe.
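For concreteness, here is a small self-contained sketch of what I mean by off-heap access through sun.misc.Unsafe, using only the standard JDK (my understanding is that Spark's Platform class wraps essentially these calls; this is a sketch, not Spark code). Unsafe is not public API, so it is obtained reflectively:

```scala
// Reflectively grab the Unsafe singleton (not public API; works on JDK 8).
val field = Class.forName("sun.misc.Unsafe").getDeclaredField("theUnsafe")
field.setAccessible(true)
val unsafe = field.get(null).asInstanceOf[sun.misc.Unsafe]

// Allocate 16 bytes off the JVM heap, write two longs, read them back, free.
val addr = unsafe.allocateMemory(16)
unsafe.putLong(addr, 42L)
unsafe.putLong(addr + 8, 7L)
val first  = unsafe.getLong(addr)
val second = unsafe.getLong(addr + 8)
unsafe.freeMemory(addr)
println((first, second))
```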

Thanks for pointing me to CachedBatch (private). If I find a way to provide
access by modifying Spark source, can I just submit a pull request, or do I
need to be a recognized Spark developer? If so, is there a process for
becoming one?

-- Jim




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/How-to-access-the-off-heap-representation-of-cached-data-in-Spark-2-0-tp17701p17721.html



Re: How to access the off-heap representation of cached data in Spark 2.0

2016-05-28 Thread jpivar...@gmail.com
Is this not the place to ask such questions? Where can I get a hint as to how
to access the new off-heap cache, or C++ API, if it exists? I'm willing to
do my own research, but I have to have a place to start. (In fact, this is
the first step in that research.)

Thanks,
-- Jim




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/How-to-access-the-off-heap-representation-of-cached-data-in-Spark-2-0-tp17701p17717.html



How to access the off-heap representation of cached data in Spark 2.0

2016-05-26 Thread jpivar...@gmail.com
Following up on an earlier thread, I would like to access the off-heap
representation of cached data in Spark 2.0, in order to see how Spark might
be linked to physics software written in C and C++.
I'm willing to do exploration on my own, but could somebody point me to a
place to start? I have downloaded the 2.0 preview and created a persisted
Dataset:
import scala.util.Random

case class Muon(px: Double, py: Double) {
  def pt = Math.sqrt(px*px + py*py)
}

val rdd = sc.parallelize(0 until 1 map {x =>
    Muon(Random.nextGaussian, Random.nextGaussian)
  }, 10)
val df = rdd.toDF
val ds = df.as[Muon]
ds.persist()
So I have a Dataset in memory, and if I understand the blog articles
correctly, it's in off-heap memory (sun.misc.Unsafe). Is there any way I
could get a pointer to that data that I could explore with BridJ? Any hints
on how it's stored? Like, could I get started through some Djinni calls or
something?
Thanks!
-- Jim




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/How-to-access-the-off-heap-representation-of-cached-data-in-Spark-2-0-tp17701.html

Re: Tungsten off heap memory access for C++ libraries

2016-04-28 Thread jpivar...@gmail.com
jpivar...@gmail.com wrote
> P.S. Concerning Java/C++ bindings, there are many. I tried JNI, JNA,
> BridJ, and JavaCPP personally, but in the end picked JNA because of its
> (comparatively) large user base. If Spark will be using Djinni, that could
> be a symmetry-breaking consideration and I'll start using it for
> consistency, maybe even interoperability.

I think I misunderstood what Djinni is. JNA, BridJ, and JavaCPP provide
access to untyped bytes (except for common cases like java.lang.String), but
it looks like Djinni goes further and provides a type mapping: exactly the
"serialization format" or "layout of bytes" that I was asking about.

Is it safe to say that when Spark has off-heap caching, it will be in the
format specified by Djinni? If I work to integrate ROOT with Djinni, will
this be a major step toward integrating it with Spark 2.0?

Even if the above answers my first question, I'd still like to know if the
new Spark API will allow RDDs to be /filled/ from the C++ side, as a data
source, rather than a derived dataset.




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Tungsten-off-heap-memory-access-for-C-libraries-tp13898p17388.html



Re: Tungsten off heap memory access for C++ libraries

2016-04-28 Thread jpivar...@gmail.com
Hi,

I'm coming from the particle physics community and I'm also very interested
in the development of this project. We have a huge C++ codebase and would
like to start using the higher-level abstractions of Spark in our data
analyses. To this end, I've been developing code that copies data from our
C++ framework, ROOT, into Scala:

https://github.com/diana-hep/rootconverter/tree/master/scaroot-reader
  

(Worth noting: the ROOT file format is too complex for a complete rewrite in
Java or Scala to be feasible. ROOT readers in Java and even Javascript
exist, but they only handle simple cases.)

I have a variety of options for how to lay out the bytes during this
transfer, and in all cases fill the constructor arguments of Scala classes
using macros. When I learned that you're moving the Spark data off-heap (at
the same time as I'm struggling to move it on-heap), I realized that you
must have chosen a serialization format for that data, and I should be using
/that/ serialization format.

Even though it's early, do you have any designs for that serialization
format? Have you picked a standard one? Most of the options, such as Avro,
don't make a lot of sense because they pack integers to minimize the number
of bytes, rather than laying them out for efficient access (including any
byte-alignment considerations).
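To illustrate the distinction, here is a self-contained sketch contrasting the two layouts: a zig-zag varint encoding (the same idea Avro's binary encoding uses, though this is my toy version, not Avro's implementation) is compact but must be decoded sequentially, while a fixed-width layout spends 8 bytes per long but lets you read element i directly at offset 8*i:

```scala
import java.nio.{ByteBuffer, ByteOrder}

// Toy zig-zag varint (Avro-style): small magnitudes -> few bytes, but
// values must be decoded one after another to find element boundaries.
def zigZagVarint(n: Long): Array[Byte] = {
  var v = (n << 1) ^ (n >> 63)  // zig-zag maps small +/- values to small codes
  val out = scala.collection.mutable.ArrayBuffer[Byte]()
  while ((v & ~0x7fL) != 0) { out += ((v & 0x7f) | 0x80).toByte; v >>>= 7 }
  out += v.toByte
  out.toArray
}

// Fixed-width layout: constant 8 bytes per value, aligned random access.
val values = Array(3L, -1L, 1000L)
val fixed = ByteBuffer.allocate(values.length * 8).order(ByteOrder.LITTLE_ENDIAN)
values.foreach(fixed.putLong)

println(zigZagVarint(3L).length)     // packed: 1 byte
println(zigZagVarint(1000L).length)  // packed: 2 bytes
println(fixed.getLong(8 * 2))        // direct read of element 2: 1000
```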

Also, are there any plans for an API that /fills/ an RDD or Dataset from the
C++ side, as I'm trying to do?

Thanks,
-- Jim


P.S. Concerning Java/C++ bindings, there are many. I tried JNI, JNA, BridJ,
and JavaCPP personally, but in the end picked JNA because of its
(comparatively) large user base. If Spark will be using Djinni, that could
be a symmetry-breaking consideration and I'll start using it for
consistency, maybe even interoperability.




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Tungsten-off-heap-memory-access-for-C-libraries-tp13898p17387.html