Re: Spark writing API

Steve Loughran Mon, 07 Aug 2023 05:28:07 -0700

On Thu, 1 Jun 2023 at 00:58, Andrew Melo <[email protected]> wrote:


> Hi all
>
> I've been developing for some time a Spark DSv2 plugin "Laurelin" (
> https://github.com/spark-root/laurelin
> ) to read the ROOT (https://root.cern) file format (which is used in high
> energy physics). I've recently presented my work in a conference (
> https://indico.jlab.org/event/459/contributions/11603/).
>
>
nice paper given the esoteric nature of HEP file formats.

All of that to say,
>
> A) is there no reason that the builtin (eg parquet) data sources can't
> consume the external APIs? It's hard to write a plugin that has to use a
> specific API when you're competing with another source who gets access to
> the internals directly.
>
> B) What is the Spark-approved API to code against for to write? There is a
> mess of *ColumnWriter classes in the Java namespace, and while there is no
> documentation, it's unclear which is preferred by the core (maybe
> ArrowWriterColumnVector?). We can give a zero copy write if the API
> describes it
>

There's a dangerous tendency for things that libraries need to be tagged
private [spark], normally worked around by people putting their code into
org.apache.spark packages. Really everyone who does that should try to get
a longer term fix in, as well as that quick-and-effective workaround.
Knowing where problems lie would be a good first step. spark sub-modules
are probably a place to get insight into where those low-level internal
operations are considered important, although many uses may be for historic
"we wrote it that way a long time ago" reasons


>
> C) Putting aside everything above, is there a way to hint to the
> downstream users on the number of rows expected to write? Any smart writer
> will use off-heap memory to write to disk/memory, so the current API that
> shoves rows in doesn't do the trick. You don't want to keep reallocating
> buffers constantly
>
> D) what is sparks plan to use arrow-based columnar data representations? I
> see that there a lot of external efforts whose only option is to inject
> themselves in the CLASSPATH. The regular DSv2 api is already crippled for
> reads and for writes it's even worse. Is there a commitment from the spark
> core to bring the API to parity? Or is instead is it just a YMMV commitment
>

No idea, I'm afraid. I do think arrow makes a good format for processing,
and it'd be interesting to see how well it actually works as a wire format
to replace other things (e.g hive's protocol), especially on RDMA networks
and the like. I'm not up to date with ongoing work there -if anyone has
pointers that'd be interesting.

>
> Thanks!
> Andrew
>
>
>
>
>
> --
> It's dark in this basement.
>

Re: Spark writing API

Reply via email to