Hi all,

I’ve been doing some work lately with Spark’s ML interfaces, which include
sparse and dense Vector and Matrix types, backed on the Scala side by
Breeze. Using these interfaces, you can construct DataFrames whose column
types are vectors and matrices, and though the API isn’t terribly rich, it
is possible to run Python UDFs over such a DataFrame and get numpy ndarrays
out of each row. However, if you’re using Spark’s Arrow serialization
between the executor and Python workers, you get this
UnsupportedOperationException:
https://github.com/apache/spark/blob/252468a744b95082400ba9e8b2e3b3d9d50ab7fa/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala#L71
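For concreteness, here's roughly the shape of the code that hits it, assuming
a Spark 2.3/2.4-style pandas_udf (the UDF body and the name first_element are
just placeholders for illustration):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [(Vectors.dense([1.0, 2.0, 3.0]),)], ["features"])

    # A plain Python UDF over "features" works (each row arrives as a
    # DenseVector, easily turned into a numpy ndarray), but a pandas_udf
    # moves the column over Arrow, and the JVM-side ArrowWriter has no
    # case for the vector UDT, hence the UnsupportedOperationException.
    @pandas_udf("double", PandasUDFType.SCALAR)
    def first_element(v):
        return v.apply(lambda x: x[0])

    df.select(first_element("features")).show()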

I think it would be useful for Arrow to support something like a column of
tensors, and I’d like to see if anyone else here is interested in such a
thing.  If so, I’d like to propose adding it to the spec and getting it
implemented in at least Java and C++/Python.

Some initial mildly-scattered thoughts:

1. You can certainly represent these today as List<Double> and
List<List<Double>>, but you then need to do some copying to get them back
into numpy ndarrays (sketched below, after these points).

2. In some cases it might be useful to know that a column contains 3x3x4
tensors, for example, and not just that there are three dimensions as you’d
get with List<List<List<Double>>>.  This could constrain what operations
are meaningful (for example, in Spark you could imagine type checking that
verifies dimension alignment for matrix multiplication).

3. You could approximate that with a FixedSizeList and metadata about the
tensor shape (also sketched below).

4. But I kind of feel like this is generally useful enough that it’s worth
having one implementation of it (well, one for each runtime) in Arrow.

5. Or, maybe everyone here thinks Spark should just do this with metadata?
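
To make points 1 and 3 a bit more concrete, here's a rough pyarrow sketch,
assuming a reasonably recent pyarrow; the "shape" metadata key is just
something I made up for illustration, not an established convention:

    import numpy as np
    import pyarrow as pa

    # Point 1: a column of 2x2 matrices as List<List<Double>>. Getting
    # ndarrays back out goes through Python lists, so there's at least
    # one copy per value.
    nested = pa.array([[[1.0, 2.0], [3.0, 4.0]],
                       [[5.0, 6.0], [7.0, 8.0]]])
    matrices = [np.array(m) for m in nested.to_pylist()]

    # Point 3: approximate a tensor column with FixedSizeList plus
    # metadata, storing the values flattened and the shape on the field.
    values = pa.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
    flat = pa.FixedSizeListArray.from_arrays(values, 4)
    field = pa.field("matrix", flat.type, metadata={"shape": "2,2"})

    # A consumer that knows the convention can reshape without the list
    # hop, but the type itself doesn't say these are 2x2 matrices.
    shape = tuple(int(d) for d in field.metadata[b"shape"].split(b","))
    matrices2 = flat.flatten().to_numpy().reshape((-1,) + shape)

The metadata route works, but every producer and consumer has to agree on the
key and parse the shape themselves, which is part of why a first-class type
feels more generally useful to me.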

Curious to hear what you all think.

Thanks,
Leif
