Hello, I've (nearly) implemented a DSV2 reader to read particle physics data stored in the ROOT (https://root.cern.ch/) file format. You can think of these ROOT files as roughly Parquet-like: columnar and nested (i.e. a column can be of type "float[]", meaning each row of the column is a variable-length array of floats). The overwhelming majority of our columns are these variable-length arrays, since they represent physical quantities that vary widely with each particle collision*.

* e.g. an "electron_momentum" column has a different number of entries in each row, one for each electron produced in that collision.
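For concreteness, here is roughly how such a column looks as a Spark schema. This is only an illustrative sketch: the "event_id" field and the class wrapper are hypothetical, not what our reader actually exposes.

    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    public class SchemaSketch {
      public static void main(String[] args) {
        // One row per collision event; "electron_momentum" holds a
        // variable-length float array, one entry per electron in the event.
        StructType schema = new StructType()
            .add("event_id", DataTypes.LongType)
            .add("electron_momentum",
                 DataTypes.createArrayType(DataTypes.FloatType));
        System.out.println(schema.treeString());
      }
    }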
Exposing these columns via the "SupportsScanColumnarBatch" interface has raised a question about the DSV2 API. I know the interface is currently marked Evolving, but I don't know if this is the appropriate place to ask about it (I presume JIRA would also work, but I had trouble finding exactly where the best place to join the discussion is).

There is no provision in the org.apache.spark.sql.vectorized.ColumnVector interface for returning multiple rows of arrays at once (i.e. no "getArrays" analogue to "getArray"). A big use case for us is piping these data through UDFs, so it would be nice to get the data from the file into a UDF batch without converting through an intermediate row-wise representation.

Looking into ColumnarArray, however, it seems that instead of storing a single offset and length, it could be extended to store arrays of offsets and lengths. The public interface could remain the same: add a second constructor that accepts these arrays, and keep the existing constructor as the degenerate case of length-1 arrays.
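To make the idea concrete, here is a rough sketch of what I mean. This is hypothetical code, not the actual Spark class; I've given it a different name to avoid confusion with the real org.apache.spark.sql.vectorized.ColumnarArray.

    import org.apache.spark.sql.vectorized.ColumnVector;

    // Sketch of a ColumnarArray-like wrapper that can represent the
    // jagged arrays of many rows at once instead of just one row.
    public final class BatchedColumnarArray {
      private final ColumnVector data;
      private final int[] offsets;  // start of each row's array in `data`
      private final int[] lengths;  // element count of each row's array

      // Existing single-row behaviour as the degenerate length-1 case.
      public BatchedColumnarArray(ColumnVector data, int offset, int length) {
        this(data, new int[]{offset}, new int[]{length});
      }

      // Proposed multi-row constructor: row i's array spans
      // data[offsets[i] .. offsets[i] + lengths[i]).
      public BatchedColumnarArray(ColumnVector data, int[] offsets, int[] lengths) {
        this.data = data;
        this.offsets = offsets;
        this.lengths = lengths;
      }

      public int numRows() { return offsets.length; }

      public int numElements(int rowId) { return lengths[rowId]; }

      // Per-row element access, e.g. the float elements of row `rowId`.
      public float getFloat(int rowId, int ordinal) {
        return data.getFloat(offsets[rowId] + ordinal);
      }
    }

With something like this, a "getArrays"-style method on ColumnVector could hand a whole batch of jagged rows to a UDF in one object, while the existing single-row path simply becomes the offsets.length == 1 case.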