On Thu, 1 Jun 2023 at 00:58, Andrew Melo <andrew.m...@gmail.com> wrote:
> Hi all
>
> I've been developing for some time a Spark DSv2 plugin "Laurelin"
> (https://github.com/spark-root/laurelin) to read the ROOT
> (https://root.cern) file format (which is used in high energy physics).
> I've recently presented my work in a conference
> (https://indico.jlab.org/event/459/contributions/11603/).
>

Nice paper, given the esoteric nature of HEP file formats.

> All of that to say,
>
> A) Is there any reason that the built-in (e.g. Parquet) data sources can't
> consume the external APIs? It's hard to write a plugin that has to use a
> specific API when you're competing with another source that gets access to
> the internals directly.
>
> B) What is the Spark-approved API to code against for writes? There is a
> mess of *ColumnWriter classes in the Java namespace, and since there is no
> documentation, it's unclear which is preferred by the core (maybe
> ArrowWriterColumnVector?). We can give a zero-copy write if the API
> describes it.
>

There's a dangerous tendency for things that libraries need to be tagged
private[spark], normally worked around by people putting their code into
org.apache.spark packages. Really, everyone who does that should try to get
a longer-term fix in, as well as that quick-and-effective workaround.
Knowing where the problems lie would be a good first step. The Spark
sub-modules are probably a place to get insight into where those low-level
internal operations are considered important, although many uses may be for
historic "we wrote it that way a long time ago" reasons.

> C) Putting aside everything above, is there a way to hint to
> downstream users the number of rows expected to be written? Any smart
> writer will use off-heap memory to write to disk/memory, so the current API
> that shoves rows in doesn't do the trick. You don't want to keep
> reallocating buffers constantly.
>
> D) What is Spark's plan to use Arrow-based columnar data representations?
> I see that there are a lot of external efforts whose only option is to
> inject themselves into the CLASSPATH. The regular DSv2 API is already
> crippled for reads, and for writes it's even worse. Is there a commitment
> from the Spark core to bring the API to parity? Or is it instead just a
> YMMV commitment?
>

No idea, I'm afraid. I do think Arrow makes a good format for processing,
and it'd be interesting to see how well it actually works as a wire format
to replace other things (e.g. Hive's protocol), especially on RDMA networks
and the like. I'm not up to date with ongoing work there; if anyone has
pointers, that'd be interesting.

> Thanks!
> Andrew
>
> --
> It's dark in this basement.