Tensor column types in arrow

2018-04-09 Thread Leif Walsh
Hi all, I’ve been doing some work lately with Spark’s ML interfaces, which include sparse and dense Vector and Matrix types, backed on the Scala side by Breeze. Using these interfaces, you can construct DataFrames whose column types are vectors and matrices, and though the API isn’t terribly rich,

Re: Tensor column types in arrow

2018-04-09 Thread Li Jin
As far as I know, there is an implementation of tensor type in C++/Python already. Should we just finalize the spec and add implementation to Java? On the Spark side, it's probably more complicated as Vector and Matrix are not "first class" types in Spark SQL. Spark ML implements them as UDT (user

Re: Tensor column types in arrow

2018-04-09 Thread Leif Walsh
The tensor type in the C++ API is a stand-alone object AFAICT; Phillip and I were unable to construct an Arrow column of them. I agree that it’s a good starting point. One interpretation of what I’m suggesting is that we take it as the reference implementation, add it to the spec, and write the jav

Re: Tensor column types in arrow

2018-04-09 Thread Wes McKinney
> As far as I know, there is an implementation of tensor type in C++/Python
> already. Should we just finalize the spec and add implementation to Java?

There is nothing specified yet as far as a *column* of ndarrays/tensors. We defined Tensor metadata for the purposes of IPC/serialization but mad

Re: Tensor column types in arrow

2018-04-09 Thread Leif Walsh
My gut feeling is that such a column type should specify both the shape and primitive type of all values in the column. I can’t think of a common use case that requires differently shaped tensors in a single column. Can anyone here come up with such a use case? If not, I can try to draft a propos

Re: Tensor column types in arrow

2018-04-10 Thread Li Jin
What do people think: should "shape" be included as an optional part of the schema metadata or as a required part of the schema itself? I feel that making it required might be too restrictive for interop with other systems.

On Mon, Apr 9, 2018 at 9:13 PM, Leif Walsh wrote:
> My gut feeling is that

Re: Tensor column types in arrow

2018-04-10 Thread Wes McKinney
The simplest thing would be to have a "tensor" or "ndarray" type where each cell has the same shape. This would amount to adding the current "Tensor" Flatbuffers table to the Type union in

https://github.com/apache/arrow/blob/master/format/Schema.fbs#L194

The benefit of having each cell having th
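
To make the "each cell has the same shape" idea concrete, here is a minimal pure-Python sketch. It does not use Arrow's API: `TensorColumnType`, `pack_column`, and `read_cell` are hypothetical names, and the row-major, little-endian layout is an assumption made only for illustration. Under those assumptions every cell occupies the same number of bytes, so cell i can be located by offset arithmetic alone.

```python
from dataclasses import dataclass
from math import prod
import struct


# Hypothetical descriptor, not part of Arrow: one shape and one
# primitive type shared by every cell in the column.
@dataclass(frozen=True)
class TensorColumnType:
    shape: tuple      # e.g. (2, 3)
    fmt: str = "d"    # struct format code; "d" is float64

    @property
    def cell_count(self) -> int:
        # Number of scalar elements in one cell.
        return prod(self.shape)

    @property
    def cell_bytes(self) -> int:
        # Byte width of one cell, identical for all cells.
        return self.cell_count * struct.calcsize("<" + self.fmt)

    def offset(self, i: int) -> int:
        # Fixed-size cells: cell i starts at i * cell_bytes.
        return i * self.cell_bytes


def pack_column(ctype, cells):
    """Pack flat (row-major) cells into one contiguous buffer."""
    buf = bytearray()
    for values in cells:
        if len(values) != ctype.cell_count:
            raise ValueError("cell does not match the column shape")
        buf += struct.pack(f"<{ctype.cell_count}{ctype.fmt}", *values)
    return bytes(buf)


def read_cell(ctype, buf, i):
    """Read cell i as a flat tuple without scanning earlier cells."""
    return struct.unpack_from(
        f"<{ctype.cell_count}{ctype.fmt}", buf, ctype.offset(i)
    )
```

A column of 2x3 float64 tensors would then use 48 bytes per cell, and `read_cell(ctype, buf, 1)` would read the second tensor directly at byte offset 48. A column that allowed differently shaped cells would need per-cell shape or offset information instead.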

Re: Tensor column types in arrow

2018-04-10 Thread Leif Walsh
Thanks, I’ll create a JIRA and a Google Doc. I agree those are the main questions to iron out. If there’s a desire to avoid scope creeping this in before 1.0, I think in parallel I’ll start a conversation with the Spark community about using the existing FixedSizeBinary type plus some custom metadat
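
As a rough illustration of the interim approach mentioned above (fixed-width binary cells plus custom metadata), the following pure-Python sketch shows how a producer and a consumer could agree on tensor shape and element type through string key/value metadata. Everything here is an assumption: the keys `tensor.shape` and `tensor.dtype`, the helper names, and the little-endian row-major encoding are hypothetical and are not defined by Arrow or Spark. The field is modeled as a plain dict rather than a real Arrow field.

```python
import json
import struct
from math import prod

# Hypothetical metadata keys; nothing in Arrow or Spark defines them.
SHAPE_KEY = "tensor.shape"
DTYPE_KEY = "tensor.dtype"
STRUCT_CODES = {"float32": "f", "float64": "d", "int32": "i", "int64": "q"}


def make_field(shape, dtype):
    """Describe a fixed-width binary field plus string key/value metadata."""
    width = prod(shape) * struct.calcsize("<" + STRUCT_CODES[dtype])
    metadata = {SHAPE_KEY: json.dumps(list(shape)), DTYPE_KEY: dtype}
    return {"byte_width": width, "metadata": metadata}


def encode(field, flat_values):
    """Serialize one tensor (flat, row-major) into exactly byte_width bytes."""
    code = STRUCT_CODES[field["metadata"][DTYPE_KEY]]
    data = struct.pack(f"<{len(flat_values)}{code}", *flat_values)
    if len(data) != field["byte_width"]:
        raise ValueError("tensor does not fit the fixed byte width")
    return data


def decode(field, data):
    """Rebuild (shape, flat values) using only the field metadata."""
    shape = tuple(json.loads(field["metadata"][SHAPE_KEY]))
    code = STRUCT_CODES[field["metadata"][DTYPE_KEY]]
    return shape, struct.unpack(f"<{prod(shape)}{code}", data)
```

For example, `make_field((2, 2), "float32")` describes 16-byte cells, and a consumer that understands the two metadata keys can recover the shape and values from each cell. A consumer that ignores the metadata still sees valid fixed-width binary data, which is the appeal of this approach as a stopgap.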