Hello Spark devs,

Could anyone help me with this?
Thanks,
Andrew

On Wed, May 31, 2023 at 20:57 Andrew Melo <andrew.m...@gmail.com> wrote:

> Hi all,
>
> I've been developing for some time a Spark DSv2 plugin, "Laurelin"
> (https://github.com/spark-root/laurelin), to read the ROOT
> (https://root.cern) file format, which is used in high energy physics.
> I recently presented this work at a conference
> (https://indico.jlab.org/event/459/contributions/11603/).
>
> All of that to say:
>
> A) Is there a reason the built-in data sources (e.g. Parquet) can't
> consume the external APIs? It's hard to write a plugin against a
> restricted API when you're competing with sources that get direct
> access to the internals.
>
> B) What is the Spark-approved API to code against for writes? There is
> a mess of *ColumnWriter classes in the Java namespace, and with no
> documentation it's unclear which one the core prefers (maybe
> ArrowWriterColumnVector?). We can provide zero-copy writes if the API
> allows for it (see the first sketch below the quote).
>
> C) Putting all of that aside, is there a way to hint to the downstream
> writer how many rows it should expect? Any smart writer will stage its
> output in off-heap memory before hitting disk, so the current API,
> which pushes rows in one at a time, doesn't do the trick: you end up
> reallocating buffers constantly (second sketch below).
>
> D) What is Spark's plan for Arrow-based columnar data representations?
> I see a lot of external efforts whose only option is to inject
> themselves into the CLASSPATH. The regular DSv2 API is already
> crippled for reads, and for writes it's even worse. Is there a
> commitment from the Spark core to bring the public API to parity with
> the internal one, or is it just a YMMV situation (third sketch below)?
>
> Thanks!
> Andrew
>
> --
> It's dark in this basement.
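
To make (B) concrete, here is roughly the zero-copy hand-off I have in
mind, sketched against the public DSv2 read path. ArrowColumnVector,
ColumnarBatch, and PartitionReader are real public Spark classes;
RootColumnarReader and readNextBasket() are made-up placeholders for
Laurelin's decode step:

    import java.io.IOException;
    import org.apache.arrow.vector.VectorSchemaRoot;
    import org.apache.spark.sql.connector.read.PartitionReader;
    import org.apache.spark.sql.vectorized.ArrowColumnVector;
    import org.apache.spark.sql.vectorized.ColumnVector;
    import org.apache.spark.sql.vectorized.ColumnarBatch;

    class RootColumnarReader implements PartitionReader<ColumnarBatch> {
        private ColumnarBatch current;

        @Override
        public boolean next() throws IOException {
            // Hypothetical: decode the next ROOT basket into Arrow vectors.
            VectorSchemaRoot root = readNextBasket();
            if (root == null) {
                return false;
            }
            // Wrap each Arrow vector for Spark; no bytes are copied here.
            ColumnVector[] cols = root.getFieldVectors().stream()
                .map(ArrowColumnVector::new)
                .toArray(ColumnVector[]::new);
            current = new ColumnarBatch(cols, root.getRowCount());
            return true;
        }

        @Override
        public ColumnarBatch get() {
            return current;
        }

        @Override
        public void close() {
            // Release the Arrow allocator/buffers here.
        }

        private VectorSchemaRoot readNextBasket() throws IOException {
            return null; // stand-in for the format-specific decode step
        }
    }

This works today on the read side precisely because ArrowColumnVector is
public; the question in (B) is whether anything equivalent exists, or is
blessed, for writes.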
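
And here's the reallocation dance that (C) describes. Because DataWriter
hands us rows one at a time with no row-count hint, an off-heap writer
has to guess a capacity and grow it on overflow. This is only a sketch:
encodeRow() is a stand-in for the format-specific serialization, and the
commit/abort bodies are elided:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import org.apache.spark.sql.catalyst.InternalRow;
    import org.apache.spark.sql.connector.write.DataWriter;
    import org.apache.spark.sql.connector.write.WriterCommitMessage;

    class BufferingRootWriter implements DataWriter<InternalRow> {
        // No row-count hint, so start with a 1 MiB guess and hope.
        private ByteBuffer buf = ByteBuffer.allocateDirect(1 << 20);

        @Override
        public void write(InternalRow row) throws IOException {
            byte[] encoded = encodeRow(row); // hypothetical serializer
            while (buf.remaining() < encoded.length) {
                // The part a row-count hint would let us avoid:
                // allocate a bigger buffer and copy everything over.
                ByteBuffer bigger = ByteBuffer.allocateDirect(buf.capacity() * 2);
                buf.flip();
                bigger.put(buf);
                buf = bigger;
            }
            buf.put(encoded);
        }

        @Override
        public WriterCommitMessage commit() throws IOException {
            // Flush buf to storage and return a commit message (elided).
            return null;
        }

        @Override
        public void abort() throws IOException {
            // Drop any partially written output (elided).
        }

        @Override
        public void close() {
        }

        private byte[] encodeRow(InternalRow row) {
            return new byte[0]; // stand-in
        }
    }

Even one up-front "expect roughly N rows" call on the writer would let
us size that buffer once instead of copying it log(N) times.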
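
Finally, for (D): the one public hook an external source has today for
the columnar path is PartitionReaderFactory. A factory that routes
everything through createColumnarReader() would look something like
this, reusing the RootColumnarReader sketch above:

    import org.apache.spark.sql.catalyst.InternalRow;
    import org.apache.spark.sql.connector.read.InputPartition;
    import org.apache.spark.sql.connector.read.PartitionReader;
    import org.apache.spark.sql.connector.read.PartitionReaderFactory;
    import org.apache.spark.sql.vectorized.ColumnarBatch;

    class RootReaderFactory implements PartitionReaderFactory {
        @Override
        public boolean supportColumnarReads(InputPartition partition) {
            return true; // tells Spark to call createColumnarReader()
        }

        @Override
        public PartitionReader<ColumnarBatch> createColumnarReader(InputPartition partition) {
            return new RootColumnarReader();
        }

        @Override
        public PartitionReader<InternalRow> createReader(InputPartition partition) {
            // Row-based fallback; not used when columnar reads are supported.
            throw new UnsupportedOperationException("columnar only");
        }
    }

As far as I can tell there is no columnar analogue of
createColumnarReader() on the write side, which is exactly the parity
gap (D) is asking about.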