The Tensor type in the C++ API is a stand-alone object, as far as I can
tell; Phillip and I were unable to construct an Arrow column of them. I
agree that it’s a good starting point. One interpretation of what I’m
suggesting is that we take it as the reference implementation, add it to
the spec, and write the Java implementation.
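
For concreteness, here is roughly the shape of the problem in the Python
bindings (a minimal sketch; pyarrow.Tensor.from_numpy is real, but the
failing column construction is paraphrased from memory):

    import numpy as np
    import pyarrow as pa

    # A Tensor can be built from an ndarray, but it is a stand-alone
    # value object, not an Arrow DataType:
    t = pa.Tensor.from_numpy(np.arange(12, dtype=np.float64).reshape(3, 4))

    # Because no tensor DataType exists, there is no field/schema you
    # can declare for it, and hence no way to build a column of tensors:
    # pa.field("tensors", ???)  # nothing to put here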
On Mon, Apr 9, 2018 at 17:30 Li Jin <ice.xell...@gmail.com> wrote:

> As far as I know, there is already an implementation of a tensor type in
> C++/Python. Should we just finalize the spec and add an implementation in
> Java?
>
> On the Spark side, it's probably more complicated, as Vector and Matrix
> are not "first-class" types in Spark SQL. Spark ML implements them as
> UDTs (user-defined types), so it's not clear how to make the Spark/Arrow
> converter work with them.
>
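> For reference, here is a quick way to see what the UDT actually stores
> (a sketch assuming PySpark 2.x with pyspark.ml.linalg available):
>
>     from pyspark.ml.linalg import VectorUDT
>
>     # The UDT serializes vectors as a plain struct, which is why a
>     # generic converter like the Arrow writer does not recognize it
>     # as a vector type:
>     print(VectorUDT().sqlType().simpleString())
>     # struct<type:tinyint,size:int,indices:array<int>,values:array<double>>
>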
> I wonder if Bryan and Holden have some more thoughts on that?
>
> Li
>
> On Mon, Apr 9, 2018 at 5:22 PM, Leif Walsh <leif.wa...@gmail.com> wrote:
>
> > Hi all,
> >
> > I’ve been doing some work lately with Spark’s ML interfaces, which
> > include sparse and dense Vector and Matrix types, backed on the Scala
> > side by Breeze. Using these interfaces, you can construct DataFrames
> > whose column types are vectors and matrices, and though the API isn’t
> > terribly rich, it is possible to run Python UDFs over such a DataFrame
> > and get numpy ndarrays out of each row. However, if you’re using
> > Spark’s Arrow serialization between the executor and Python workers,
> > you get this UnsupportedOperationException:
> > https://github.com/apache/spark/blob/252468a744b95082400ba9e8b2e3b3d9d50ab7fa/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala#L71
> >
> > I think it would be useful for Arrow to support something like a
> > column of tensors, and I’d like to see if anyone else here is
> > interested in such a thing. If so, I’d like to propose adding it to
> > the spec and getting it implemented in at least Java and C++/Python.
> >
> > Some initial, mildly scattered thoughts:
> >
> > 1. You can certainly represent these today as List<Double> and
> > List<List<Double>>, but you then need to do some copying to get them
> > back into numpy ndarrays (sketched after this list).
> >
> > 2. In some cases it might be useful to know that a column contains
> > 3x3x4 tensors, for example, and not just that there are three
> > dimensions, as you’d get with List<List<List<Double>>>. This could
> > constrain what operations are meaningful (for example, in Spark you
> > could imagine type checking that verifies dimension alignment for
> > matrix multiplication).
> >
> > 3. You could approximate that with a FixedSizeList and metadata about
> > the tensor shape (also sketched after this list).
> >
> > 4. But I kind of feel like this is generally useful enough that it’s
> > worth having one implementation of it (well, one for each runtime) in
> > Arrow.
> >
> > 5. Or, maybe everyone here thinks Spark should just do this with
> > metadata?
> >
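> > To make points 1 and 3 concrete, a rough pyarrow sketch (assuming a
> > pyarrow build that implements FixedSizeList; the "shape" metadata key
> > is hypothetical, not anything in the spec):
> >
> >     import numpy as np
> >     import pyarrow as pa
> >
> >     # Point 1: a column of 2x2 matrices as List<List<Double>>;
> >     # getting an ndarray back out means a per-row copy.
> >     nested = pa.array([[[1.0, 2.0], [3.0, 4.0]]],
> >                       type=pa.list_(pa.list_(pa.float64())))
> >     mat = np.array(nested[0].as_py())  # copies into a (2, 2) ndarray
> >
> >     # Point 3: approximate a fixed 2x2 shape with FixedSizeList plus
> >     # field-level metadata carrying the intended shape.
> >     flat = pa.array([[1.0, 2.0, 3.0, 4.0]],
> >                     type=pa.list_(pa.float64(), 4))
> >     field = pa.field("matrix", flat.type, metadata={b"shape": b"2,2"})
> >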
> > Curious to hear what you all think.
> >
> > Thanks,
> > Leif
> >
> > --
> > Cheers,
> > Leif
> >
>
--
Cheers,
Leif
