Hello,

During testing of our DSv2 implementation (on 2.4.3, FWIW), it appears that
our DataSourceReader is being instantiated multiple times for the same
dataframe. For example, the following snippet

        Dataset<Row> df = spark
                .read()
                .format("edu.vanderbilt.accre.laurelin.Root")
                .option("tree",  "Events")
                .load("testdata/pristine/2018nanoaod1june2019.root");

constructs edu.vanderbilt.accre.laurelin.Root twice and then calls
createReader once. (As an aside, the codegen time seems like a lot for 1000
columns? "CodeGenerator: Code generated in 8162.847517 ms")

but then each subsequent operation on that dataframe (e.g. df.count())
calls createReader again, instead of reusing the existing
DataSourceReader. Here's roughly how I'm counting the instantiations;
a sketch is below.
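
For reference, this is roughly what our source class looks like with the
logging that shows the counts; it's a minimal sketch against the Spark 2.4
DSv2 interfaces, and RootDataSourceReader is a stand-in for our real reader:

        import org.apache.spark.sql.sources.v2.DataSourceOptions;
        import org.apache.spark.sql.sources.v2.ReadSupport;
        import org.apache.spark.sql.sources.v2.reader.DataSourceReader;

        // ReadSupport extends DataSourceV2, so this is what .format() resolves to
        public class Root implements ReadSupport {
            public Root() {
                // observed: fires twice for a single load()
                System.err.println("Root constructed");
            }

            @Override
            public DataSourceReader createReader(DataSourceOptions options) {
                // observed: fires once per action (count(), etc.)
                System.err.println("createReader called");
                // RootDataSourceReader is a stand-in for our real (expensive) reader
                return new RootDataSourceReader(options);
            }
        }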

Is that the expected behavior? Because of the file format, it's quite
expensive to deserialize all the various metadata, so I was holding the
deserialized version in the DataSourceReader; but if Spark is repeatedly
constructing new readers, that doesn't help. If this is the expected
behavior, how should I handle this as a consumer of the API?
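
The best workaround I've come up with is memoizing the deserialized
metadata JVM-wide in the source class itself, keyed by the load path, so
each freshly-constructed reader can reuse it. A rough sketch, where
FileMetadata and RootDataSourceReader stand in for our real types:

        import java.util.concurrent.ConcurrentHashMap;

        import org.apache.spark.sql.sources.v2.DataSourceOptions;
        import org.apache.spark.sql.sources.v2.ReadSupport;
        import org.apache.spark.sql.sources.v2.reader.DataSourceReader;

        public class Root implements ReadSupport {
            // JVM-wide cache of the expensive deserialized metadata, keyed by path
            private static final ConcurrentHashMap<String, FileMetadata> CACHE =
                    new ConcurrentHashMap<>();

            @Override
            public DataSourceReader createReader(DataSourceOptions options) {
                String path = options.get("path").orElseThrow(
                        () -> new IllegalArgumentException("path is required"));
                // deserialize once per path; later createReader calls hit the cache
                FileMetadata meta = CACHE.computeIfAbsent(path, FileMetadata::deserialize);
                return new RootDataSourceReader(meta, options);
            }
        }

Is something like that the intended pattern, or is there a better hook?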

Thanks!
Andrew
