Hi Ryan,

Thanks for the pointers.
On Thu, Nov 7, 2019 at 8:13 AM Ryan Blue <rb...@netflix.com> wrote:

> Hi Andrew,
>
> This is expected behavior for DSv2 in 2.4. A separate reader is configured
> for each operation because the configuration will change. A count, for
> example, doesn't need to project any columns, but a count distinct will.
> Similarly, if your read has different filters we need to apply those to a
> separate reader for each scan.

Ah, I had presumed the interaction was slightly different: that a single
reader was configured and (e.g.) pruneSchema was called repeatedly to change
the desired output schema. I guess for 2.4 it's best for me to cache/memoize
the metadata for the paths/files so it isn't repeatedly recalculated (a rough
sketch of what I have in mind is at the bottom of this mail).

> The newer API that we are releasing in Spark 3.0 addresses the concern
> that each reader is independent by using Catalog and Table interfaces. In
> the new version, Spark will load a table by name from a persistent catalog
> (loaded once) and use the table to create a reader for each operation. That
> way, you can load common metadata in the table, cache the table, and pass
> its info to readers as they are created.

That's good to know; I'll search around JIRA for docs describing that
functionality.

Thanks again,
Andrew

> rb
>
> On Tue, Nov 5, 2019 at 4:58 PM Andrew Melo <andrew.m...@gmail.com> wrote:
>
>> Hello,
>>
>> During testing of our DSv2 implementation (on 2.4.3 FWIW), it appears
>> that our DataSourceReader is being instantiated multiple times for the same
>> dataframe. For example, the following snippet
>>
>>     Dataset<Row> df = spark
>>         .read()
>>         .format("edu.vanderbilt.accre.laurelin.Root")
>>         .option("tree", "Events")
>>         .load("testdata/pristine/2018nanoaod1june2019.root");
>>
>> constructs edu.vanderbilt.accre.laurelin.Root twice and then calls
>> createReader once (as an aside, this seems like a lot for 1000 columns?
>> "CodeGenerator: Code generated in 8162.847517 ms"),
>>
>> but then running operations on that dataframe (e.g. df.count()) calls
>> createReader for each call, instead of holding onto the existing
>> DataSourceReader.
>>
>> Is that the expected behavior? Because of the file format, it's quite
>> expensive to deserialize all the various metadata, so I was holding the
>> deserialized version in the DataSourceReader, but if Spark is repeatedly
>> constructing new ones, then that doesn't help. If this is the expected
>> behavior, how should I handle this as a consumer of the API?
>>
>> Thanks!
>> Andrew
>
> --
> Ryan Blue
> Software Engineer
> Netflix
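
P.S. Re: caching for 2.4, here is the rough sketch I mentioned above. It is
only an illustration, not the actual Laurelin code: FileMetadata and
parseMetadata() are hypothetical stand-ins for whatever is expensive to
deserialize in our format. The idea is just a process-wide memo keyed by file
path, so that each DataSourceReader Spark builds per operation can reuse
metadata parsed by an earlier one.

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    // Hypothetical sketch, not the real implementation.
    public class MetadataCache {
        // Keyed by file path and static, so it is shared across the
        // DataSourceReader instances that Spark creates for each operation
        // (count(), show(), ...) in the same JVM.
        private static final ConcurrentMap<String, FileMetadata> CACHE =
                new ConcurrentHashMap<>();

        public static FileMetadata getOrParse(String path) {
            // Parse the metadata only the first time a path is requested;
            // subsequent readers get the cached value.
            return CACHE.computeIfAbsent(path, MetadataCache::parseMetadata);
        }

        private static FileMetadata parseMetadata(String path) {
            // Placeholder for the expensive deserialization step.
            return new FileMetadata(path);
        }

        // Placeholder for whatever per-file metadata is costly to rebuild.
        public static class FileMetadata {
            public final String path;
            FileMetadata(String path) { this.path = path; }
        }
    }

A driver-side map like this only helps while the readers are created in the
same JVM; if that turns out not to hold, something like a bounded cache
(size/TTL) or persisting the parsed metadata alongside the files would be the
fallback.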