Hello,

On Tue, Apr 7, 2020 at 23:16 Wenchen Fan <cloud0...@gmail.com> wrote:
> Are you going to provide a single artifact for Spark 2.4 and 3.0? I'm not
> sure this is possible as the DS V2 API is very different in 3.0, e.g. there
> is no `DataSourceV2` anymore, and you should implement `TableProvider` (if
> you don't have database/table).

Correct, I've got a single jar for both Spark 2.4 and 3.0, with a top-level
Root_v24 (implements DataSourceV2) and Root_v30 (implements TableProvider).
I can load this jar in both pyspark 2.4 and 3.0 and it works well -- as long
as I remove the registration from META-INF and pass in the full class name
to the DataFrameReader.

Thanks
Andrew

> On Wed, Apr 8, 2020 at 6:58 AM Andrew Melo <andrew.m...@gmail.com> wrote:
>
>> Hi Ryan,
>>
>> On Tue, Apr 7, 2020 at 5:21 PM Ryan Blue <rb...@netflix.com> wrote:
>> >
>> > Hi Andrew,
>> >
>> > With DataSourceV2, I recommend plugging in a catalog instead of using
>> > DataSource. As you've noticed, the way that you plug in data sources
>> > isn't very flexible. That's one of the reasons why we changed the
>> > plugin system and made it possible to use named catalogs that load
>> > implementations based on configuration properties.
>> >
>> > I think it's fine to consider how to patch the registration trait, but
>> > I really don't recommend continuing to identify table implementations
>> > directly by name.
>>
>> Can you be a bit more concrete with what you mean by plugging a
>> catalog instead of a DataSource? We have been using
>> sc.read.format("root").load([list of paths]) which works well. Since
>> we don't have a database or tables, I don't fully understand what's
>> different between the two interfaces that would make us prefer one or
>> another.
>>
>> That being said, WRT the registration trait, if I'm not misreading
>> createTable() and friends, the "source" parameter is resolved the same
>> way as DataFrameReader.format(), so a solution that helps out
>> registration should help both interfaces.
>>
>> Thanks again,
>> Andrew
>>
>> >
>> > On Tue, Apr 7, 2020 at 12:26 PM Andrew Melo <andrew.m...@gmail.com> wrote:
>> >>
>> >> Hi all,
>> >>
>> >> I posted an improvement ticket in JIRA and Hyukjin Kwon requested I
>> >> send an email to the dev list for discussion.
>> >>
>> >> As the DSv2 API evolves, some breaking changes are occasionally made
>> >> to the API. It's possible to split a plugin into a "common" part and
>> >> multiple version-specific parts and this works OK to have a single
>> >> artifact for users, as long as they write out the fully qualified
>> >> classname as the DataFrame format(). The one part that can't be
>> >> currently worked around is the DataSourceRegister trait. Since classes
>> >> which implement DataSourceRegister must also implement DataSourceV2
>> >> (and its mixins), changes to those interfaces cause the ServiceLoader
>> >> to fail when it attempts to load the "wrong" DataSourceV2 class.
>> >> (There's also an additional prohibition against multiple
>> >> implementations having the same ShortName in
>> >> org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource.)
>> >> This means users will need to update their notebooks/code/tutorials if
>> >> they run at a different site whose cluster is a different version.
>> >>
>> >> To solve this, I proposed in SPARK-31363 a new trait that would
>> >> function the same as the existing DataSourceRegister trait, but adds
>> >> an additional method:
>> >>
>> >> public Class<? extends DataSourceV2> getImplementation();
>> >>
>> >> ...which will allow DSv2 plugins to dynamically choose the appropriate
>> >> DataSourceV2 class based on the runtime environment. This would let us
>> >> release a single artifact for different Spark versions and users could
>> >> use the same artifactID & format regardless of where they were
>> >> executing their code. If there were no services registered with this
>> >> new trait, the functionality would remain the same as before.
>> >>
>> >> I think this functionality will be useful as DSv2 continues to evolve,
>> >> please let me know your thoughts.
>> >>
>> >> Thanks
>> >> Andrew
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>
>> >
>> >
>> > --
>> > Ryan Blue
>> > Software Engineer
>> > Netflix
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
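
For reference, the getImplementation() idea quoted above could be sketched
roughly as follows. This is only an illustration in plain Java with no Spark
dependency: the VersionedDataSourceRegister interface, the RootRegister class,
and the implementationFor() method are hypothetical names made up for this
sketch, not part of Spark or of SPARK-31363 as filed. It also assumes the
version string starts with numeric "major.minor" (e.g. "2.4.5"), and it
returns a class name rather than a Class object purely to keep the example
free of Spark types.

```java
import java.util.Map;
import java.util.TreeMap;

/**
 * Hypothetical stand-in for the proposed registration trait.
 * Not a Spark interface; the names are illustrative only.
 */
interface VersionedDataSourceRegister {
    /** Short alias users would pass to format(), e.g. "root". */
    String shortName();

    /** Name of the implementation class to load for the given Spark version. */
    String implementationFor(String sparkVersion);
}

/** Picks Root_v24 or Root_v30 (names from the thread) by Spark version. */
final class RootRegister implements VersionedDataSourceRegister {
    // Minimum supported version, encoded as major * 100 + minor,
    // mapped to the class that targets that API generation.
    private final TreeMap<Integer, String> byVersion = new TreeMap<>();

    RootRegister() {
        byVersion.put(204, "Root_v24"); // implements DataSourceV2 (Spark 2.4)
        byVersion.put(300, "Root_v30"); // implements TableProvider (Spark 3.0)
    }

    @Override
    public String shortName() {
        return "root";
    }

    @Override
    public String implementationFor(String sparkVersion) {
        // Assumption: the first two dot-separated parts are plain integers.
        String[] parts = sparkVersion.split("\\.");
        int key = Integer.parseInt(parts[0]) * 100 + Integer.parseInt(parts[1]);

        // Use the newest implementation whose minimum version is <= key.
        Map.Entry<Integer, String> entry = byVersion.floorEntry(key);
        if (entry == null) {
            throw new IllegalArgumentException(
                "Unsupported Spark version: " + sparkVersion);
        }
        return entry.getValue();
    }
}
```

With this sketch, new RootRegister().implementationFor("2.4.5") selects
Root_v24 and implementationFor("3.0.0") selects Root_v30, while the short
name stays "root" in both cases. How Spark itself would obtain the version
and load the selected class is left out, since that depends on how the
proposal would be wired into lookupDataSource.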