Hi all

For some time I've been developing a Spark DSv2 plugin, "Laurelin" (
https://github.com/spark-root/laurelin
), to read the ROOT (https://root.cern) file format, which is used in high
energy physics. I recently presented this work at a conference (
https://indico.jlab.org/event/459/contributions/11603/).

All of that is to say:

A) Is there a reason that the builtin (e.g. Parquet) data sources can't
consume the external APIs? It's hard to write a plugin against a specific
public API when you're competing with other sources that get access to the
internals directly.

B) What is the Spark-approved API to code against for writes? There is a
mess of *ColumnWriter classes in the Java namespace, and since there is no
documentation, it's unclear which one the core prefers (maybe
ArrowWriterColumnVector?). We can provide a zero-copy write if the API
allows for it.
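To make (B) concrete, here is a minimal sketch (Java, Spark 3.x + Arrow) of
the zero-copy path that already exists on the read side: ArrowColumnVector
wraps an Arrow buffer in place and ColumnarBatch hands it to Spark without
copying. What I'm asking about is a documented, symmetric path on the write
side. The Spark/Arrow class names are real; everything else (the column
name, the driver code) is just illustration.

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.BigIntVector;
import org.apache.spark.sql.vectorized.ArrowColumnVector;
import org.apache.spark.sql.vectorized.ColumnVector;
import org.apache.spark.sql.vectorized.ColumnarBatch;

public class ZeroCopyReadSketch {
    // Wrap an Arrow vector for Spark without copying the underlying buffer.
    public static ColumnarBatch wrap(BigIntVector eventIds) {
        ColumnVector[] cols = { new ArrowColumnVector(eventIds) };
        ColumnarBatch batch = new ColumnarBatch(cols);
        batch.setNumRows(eventIds.getValueCount());
        return batch;
    }

    public static void main(String[] args) {
        try (RootAllocator allocator = new RootAllocator();
             BigIntVector eventIds = new BigIntVector("eventId", allocator)) {
            eventIds.allocateNew(3);
            for (int i = 0; i < 3; i++) {
                eventIds.setSafe(i, 1000L + i);
            }
            eventIds.setValueCount(3);
            System.out.println("rows: " + wrap(eventIds).numRows());
        }
    }
}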

C) Putting everything above aside, is there a way to hint to downstream
writers how many rows to expect? Any smart writer will use off-heap memory
to stage data before writing it to disk/memory, so the current API, which
shoves rows in one at a time, doesn't do the trick: you don't want to keep
reallocating buffers.
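To illustrate (C), a purely hypothetical sketch: nothing like
expectedRowCount() exists in the current API, it's just the shape of hint I
mean. A writer that is told the row count up front can size its off-heap
(here, Arrow) buffers once instead of growing them as rows trickle in.

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.Float8Vector;

public class PreSizedWriterSketch implements AutoCloseable {
    private final RootAllocator allocator = new RootAllocator();
    private final Float8Vector energies = new Float8Vector("energy", allocator);
    private int row = 0;

    // Hypothetical hint from Spark: the number of rows this task will write.
    public void expectedRowCount(long rows) {
        energies.allocateNew((int) rows);   // one allocation, sized up front
    }

    // Stand-in for DataWriter#write(InternalRow): append a single value.
    public void write(double energy) {
        energies.setSafe(row++, energy);
    }

    // Stand-in for DataWriter#commit(): seal the buffer before flushing it.
    public void finish() {
        energies.setValueCount(row);
        // ...hand the off-heap buffer to the file writer here...
    }

    @Override
    public void close() {
        energies.close();
        allocator.close();
    }
}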

D) What is Spark's plan for Arrow-based columnar data representations? I
see that there are a lot of external efforts whose only option is to inject
themselves into the CLASSPATH. The regular DSv2 API is already crippled for
reads, and for writes it's even worse. Is there a commitment from Spark
core to bring the API to parity, or is it just a YMMV commitment?

Thanks!
Andrew

-- 
It's dark in this basement.
