Hi all,

I’ve been doing some work lately with Spark’s ML interfaces, which include sparse and dense Vector and Matrix types, backed on the Scala side by Breeze. Using these interfaces, you can construct DataFrames whose column types are vectors and matrices, and though the API isn’t terribly rich, it is possible to run Python UDFs over such a DataFrame and get numpy ndarrays out of each row. However, if you’re using Spark’s Arrow serialization between the executor and Python workers, you get this UnsupportedOperationException: https://github.com/apache/spark/blob/252468a744b95082400ba9e8b2e3b3d9d50ab7fa/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala#L71
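
For concreteness, this is roughly the kind of thing that trips it (an untested sketch; assumes Spark 2.3+ with spark.sql.execution.arrow.enabled=true):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import DoubleType
    from pyspark.ml.linalg import Vectors

    spark = (SparkSession.builder
             .config("spark.sql.execution.arrow.enabled", "true")
             .getOrCreate())

    # A DataFrame with a single VectorUDT column.
    df = spark.createDataFrame(
        [(Vectors.dense([1.0, 2.0, 3.0]),),
         (Vectors.dense([4.0, 5.0, 6.0]),)],
        ["features"])

    # A plain Python UDF can unpack each row into a numpy-backed vector,
    # but a pandas_udf pushes the column through Spark's Arrow
    # serialization path, which is where the linked exception comes from.
    @pandas_udf(DoubleType())
    def first_element(v):
        return v.apply(lambda x: float(x[0]))

    df.select(first_element("features")).show()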
I think it would be useful for Arrow to support something like a column of tensors, and I’d like to see if anyone else here is interested in such a thing. If so, I’d like to propose adding it to the spec and getting it implemented in at least Java and C++/Python. Some initial, mildly scattered thoughts:

1. You can certainly represent these today as List<Double> and List<List<Double>>, but you then need to do some copying to get them back into numpy ndarrays.

2. In some cases it might be useful to know that a column contains, say, 3x3x4 tensors, and not just that there are three dimensions as you’d get with List<List<List<Double>>>. This could constrain which operations are meaningful (for example, in Spark you could imagine type checking that verifies dimension alignment for matrix multiplication).

3. You could approximate that with a FixedSizeList plus metadata describing the tensor shape (a rough sketch is in the P.S. below).

4. But I kind of feel this is generally useful enough that it’s worth having one implementation of it (well, one for each runtime) in Arrow.

5. Or maybe everyone here thinks Spark should just do this with metadata?

Curious to hear what you all think.

Thanks,
Leif
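
P.S. A rough sketch of the FixedSizeList-plus-metadata approximation from point 3, in Python/pyarrow. This is untested and assumes a pyarrow build where pa.list_(value_type, list_size) yields a FixedSizeListType and where field-level metadata round-trips; the "tensor_shape" key is just something I made up, not an existing Arrow convention.

    import numpy as np
    import pyarrow as pa

    shape = (3, 3)
    n = int(np.prod(shape))

    # Four 3x3 matrices, each stored as a flattened row of 9 doubles.
    matrices = [np.full(shape, float(i)) for i in range(4)]
    values = pa.array([m.ravel().tolist() for m in matrices],
                      type=pa.list_(pa.float64(), n))  # fixed_size_list<double>[9]

    # Record the logical shape as field metadata so a reader can
    # reconstruct ndarrays without guessing the dimensions.
    field = pa.field("matrix", values.type,
                     metadata={"tensor_shape": "3,3"})
    table = pa.Table.from_arrays([values], schema=pa.schema([field]))

    # Getting ndarrays back still means a per-row copy and reshape,
    # which is the overhead point 1 complains about.
    restored = [np.asarray(row, dtype=np.float64).reshape(shape)
                for row in table.column("matrix").to_pylist()]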