Hi all,

For some time I've been developing a Spark DSv2 plugin, "Laurelin" ( https://github.com/spark-root/laurelin ), to read the ROOT file format (https://root.cern), which is used in high-energy physics. I recently presented this work at a conference ( https://indico.jlab.org/event/459/contributions/11603/ ).
All of that is to say:

A) Is there a reason that the builtin data sources (e.g. Parquet) can't consume the external APIs? It's hard to write a plugin that has to use a specific API when you're competing with another source that gets access to the internals directly.

B) What is the Spark-approved API to code against for writes? There is a mess of *ColumnWriter classes in the Java namespace, and since there is no documentation, it's unclear which one the core prefers (maybe ArrowWriterColumnVector?). We can provide zero-copy writes if the API allows for it.

C) Putting aside everything above, is there a way to hint to the writer the number of rows it should expect? Any smart writer will use off-heap memory to write to disk/memory, so the current API, which pushes rows in one at a time, doesn't do the trick; you don't want to keep reallocating buffers constantly.

D) What is Spark's plan for Arrow-based columnar data representations? I see that there are a lot of external efforts whose only option is to inject themselves into the CLASSPATH. The regular DSv2 API is already crippled for reads, and for writes it's even worse. Is there a commitment from the Spark core to bring the API to parity, or is it instead just a YMMV commitment?

Thanks!
Andrew

--
It's dark in this basement.
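P.S. To make the buffer-reallocation point in C) concrete, here is a rough sketch in plain Java. The names here (HintedColumnWriter, expectedRows) are hypothetical, not part of any current Spark interface; the point is only that a row-count hint lets a writer size its off-heap buffer once, while the row-at-a-time API forces it to grow (and copy) the buffer repeatedly.

```java
import java.nio.ByteBuffer;

// Hypothetical writer for a single column of longs. Illustrative only.
class HintedColumnWriter {
    private ByteBuffer buf;
    private int reallocations = 0;

    // With a hint, the off-heap buffer is allocated exactly once.
    HintedColumnWriter(int expectedRows) {
        buf = ByteBuffer.allocateDirect(expectedRows * Long.BYTES);
    }

    // Without a hint, start small and double on demand: each doubling
    // allocates a new direct buffer and copies everything written so far.
    HintedColumnWriter() {
        buf = ByteBuffer.allocateDirect(8 * Long.BYTES);
    }

    void write(long value) {
        if (buf.remaining() < Long.BYTES) {
            ByteBuffer bigger = ByteBuffer.allocateDirect(buf.capacity() * 2);
            buf.flip();
            bigger.put(buf);
            buf = bigger;
            reallocations++;
        }
        buf.putLong(value);
    }

    int reallocations() { return reallocations; }
}

public class HintDemo {
    public static void main(String[] args) {
        HintedColumnWriter hinted = new HintedColumnWriter(100_000);
        HintedColumnWriter unhinted = new HintedColumnWriter();
        for (long i = 0; i < 100_000; i++) {
            hinted.write(i);
            unhinted.write(i);
        }
        // hinted never reallocates; unhinted doubles 14 times
        // (8 longs -> 131,072 longs) to hold 100,000 rows.
        System.out.println("hinted reallocations:   " + hinted.reallocations());
        System.out.println("unhinted reallocations: " + unhinted.reallocations());
    }
}
```

That's the whole ask in C): one integer up front, and the unhinted path's reallocation-and-copy churn disappears.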