Hello,

I've (nearly) finished implementing a DataSource V2 (DSV2) reader for
particle physics data stored in the ROOT (https://root.cern.ch/) file
format. You can think of these ROOT files as roughly Parquet-like:
column-wise and nested (i.e. a column can be of type "float[]", meaning
each row in the column is a variable-length array of floats). The
overwhelming majority of our columns are these variable-length arrays,
since they represent physical quantities whose multiplicity varies from
one particle collision to the next*.
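For concreteness, the Spark-side type of such a column would look along
these lines (illustrative only; "electron_momentum" is the example from the
footnote below):

    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    public class ElectronSchemaExample {
      public static void main(String[] args) {
        // Each row of "electron_momentum" is a variable-length float array,
        // one entry per electron produced in that collision.
        StructType schema = new StructType()
            .add("electron_momentum",
                 DataTypes.createArrayType(DataTypes.FloatType));
        System.out.println(schema.treeString());
      }
    }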

Exposing these columns via the "SupportsScanColumnarBatch" interface has
raised a question about the DSV2 API. I know the interface is currently
marked Evolving, and I'm not sure this is the appropriate place to ask
about it (I presume JIRA would work as well, but I had trouble finding
exactly where the right place to raise it would be).

There is no provision in the org.apache.spark.sql.vectorized.ColumnVector
interface to return multiple rows of arrays at once (i.e. there is no
"getArrays" analogue to "getArray"). A big use case for us is piping these
data through UDFs, so it would be nice to get the data from the file into
a UDF batch without converting to an intermediate row-wise representation.
Looking at ColumnarArray, however, it seems that instead of storing a
single offset and length, it could be extended to store arrays of offsets
and lengths, one pair per row. The public interface could stay the same:
add a second constructor that accepts these arrays, and keep the existing
constructor as the degenerate case of length-1 arrays (see the sketch
below).
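To make that concrete, here is a rough sketch of the shape I have in mind.
This is not a worked-out patch; the class and method names are just
illustrative:

    import org.apache.spark.sql.vectorized.ColumnVector;

    // Sketch of a ColumnarArray-like wrapper that carries one
    // (offset, length) pair per row instead of a single pair.
    public final class MultiRowColumnarArray {
      private final ColumnVector data; // flat child vector with all elements
      private final int[] offsets;     // offsets[i] = start of row i in data
      private final int[] lengths;     // lengths[i] = element count of row i

      // Proposed second constructor: one (offset, length) pair per row.
      public MultiRowColumnarArray(ColumnVector data,
                                   int[] offsets, int[] lengths) {
        this.data = data;
        this.offsets = offsets;
        this.lengths = lengths;
      }

      // The existing single-array constructor becomes a degenerate case.
      public MultiRowColumnarArray(ColumnVector data, int offset, int length) {
        this(data, new int[] {offset}, new int[] {length});
      }

      public int numRows() {
        return offsets.length;
      }

      public int numElements(int rowId) {
        return lengths[rowId];
      }

      // Element access mirrors ColumnarArray.getFloat(ordinal), but per row.
      public float getFloat(int rowId, int ordinal) {
        return data.getFloat(offsets[rowId] + ordinal);
      }
    }

The idea is that a reader could fill the offsets/lengths arrays in one pass
and hand back a whole batch, rather than constructing one ColumnarArray per
row on the way into a UDF.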


* e.g. the "electron_momentum" column has a different number of entries in
each row, one for each electron produced in that collision.
