Hello,

During testing of our DSv2 implementation (on 2.4.3, FWIW), it appears that
our DataSourceReader is being instantiated multiple times for the same
dataframe. For example, the following snippet

        Dataset<Row> df = spark
                .read()
                .format("edu.vanderbilt.accre.laurelin.Root")
                .option("tree",  "Events")
                .load("testdata/pristine/2018nanoaod1june2019.root");

constructs edu.vanderbilt.accre.laurelin.Root twice and then calls
createReader once. (As an aside, the codegen time seems like a lot for 1000
columns? "CodeGenerator: Code generated in 8162.847517 ms")

but then each subsequent operation on that dataframe (e.g. df.count())
calls createReader again, instead of reusing the existing
DataSourceReader. Here's roughly how I'm counting the instantiations;
a sketch is below.
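
For reference, this is roughly what our source class looks like with the
logging that shows the counts; it's a minimal sketch against the Spark 2.4
DSv2 interfaces, and RootDataSourceReader is a stand-in for our real reader:

        import org.apache.spark.sql.sources.v2.DataSourceOptions;
        import org.apache.spark.sql.sources.v2.ReadSupport;
        import org.apache.spark.sql.sources.v2.reader.DataSourceReader;

        // ReadSupport extends DataSourceV2, so this is what .format() resolves to
        public class Root implements ReadSupport {
            public Root() {
                // observed: fires twice for a single load()
                System.err.println("Root constructed");
            }

            @Override
            public DataSourceReader createReader(DataSourceOptions options) {
                // observed: fires once per action (count(), etc.)
                System.err.println("createReader called");
                // RootDataSourceReader is a stand-in for our real (expensive) reader
                return new RootDataSourceReader(options);
            }
        }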

Is that the expected behavior? Because of the file format, it's quite
expensive to deserialize all the various metadata, so I was holding the
deserialized version in the DataSourceReader; but if Spark is repeatedly
constructing new readers, that doesn't help. If this is the expected
behavior, how should I handle this as a consumer of the API?
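
The best workaround I've come up with is memoizing the deserialized
metadata JVM-wide in the source class itself, keyed by the load path, so
each freshly-constructed reader can reuse it. A rough sketch, where
FileMetadata and RootDataSourceReader stand in for our real types:

        import java.util.concurrent.ConcurrentHashMap;

        import org.apache.spark.sql.sources.v2.DataSourceOptions;
        import org.apache.spark.sql.sources.v2.ReadSupport;
        import org.apache.spark.sql.sources.v2.reader.DataSourceReader;

        public class Root implements ReadSupport {
            // JVM-wide cache of the expensive deserialized metadata, keyed by path
            private static final ConcurrentHashMap<String, FileMetadata> CACHE =
                    new ConcurrentHashMap<>();

            @Override
            public DataSourceReader createReader(DataSourceOptions options) {
                String path = options.get("path").orElseThrow(
                        () -> new IllegalArgumentException("path is required"));
                // deserialize once per path; later createReader calls hit the cache
                FileMetadata meta = CACHE.computeIfAbsent(path, FileMetadata::deserialize);
                return new RootDataSourceReader(meta, options);
            }
        }

Is something like that the intended pattern, or is there a better hook?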

Thanks!
Andrew
