Hi Andrew,

With DataSourceV2, I recommend plugging in a catalog instead of using
DataSource. As you've noticed, the way that you plug in data sources isn't
very flexible. That's one of the reasons why we changed the plugin system
and made it possible to use named catalogs that load implementations based
on configuration properties.
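
For example (the catalog name and implementation class below are
hypothetical), a v2 catalog is selected by configuration rather than by
ServiceLoader, so the same properties keep working no matter which
implementation the jar provides for a given Spark version:

  spark.sql.catalog.my_catalog=com.example.MyCatalogImpl
  spark.sql.catalog.my_catalog.some-option=some-value

Tables are then referenced as my_catalog.db.table, and Spark
instantiates the configured class at runtime.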

I think it's fine to consider how to patch the registration trait, but I
really don't recommend continuing to identify table implementations
directly by name.

On Tue, Apr 7, 2020 at 12:26 PM Andrew Melo <andrew.m...@gmail.com> wrote:

> Hi all,
>
> I posted an improvement ticket in JIRA and Hyukjin Kwon requested I
> send an email to the dev list for discussion.
>
> As the DSv2 API evolves, some breaking changes are occasionally made
> to the API. It's possible to split a plugin into a "common" part and
> multiple version-specific parts, and this works OK for shipping a
> single artifact to users, as long as they write out the fully
> qualified classname as the DataFrame format(). The one part that
> currently can't be worked around is the DataSourceRegister trait.
> Since classes which implement DataSourceRegister must also implement
> DataSourceV2 (and its mixins), changes to those interfaces cause the
> ServiceLoader to fail when it attempts to load the "wrong"
> DataSourceV2 class. (There's also an additional prohibition against
> multiple implementations having the same shortName in
> org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource.)
> This means users will need to update their notebooks/code/tutorials
> if they run at a different site whose cluster is a different version.
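>
> As an illustration (class and short names hypothetical), registration
> today goes through a ServiceLoader provider file, which means the
> registered class itself must be loadable:
>
>   // META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
>   com.example.MySourceV2
>
>   public class MySourceV2 implements DataSourceV2, DataSourceRegister {
>     @Override
>     public String shortName() { return "mysource"; }
>   }
>
> If MySourceV2 was compiled against a different version of the
> DataSourceV2 interfaces, the ServiceLoader can fail while scanning
> providers, even if the user never asks for that source.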
>
> To solve this, I proposed in SPARK-31363 a new trait that would
> function the same as the existing DataSourceRegister trait, but adds
> an additional method:
>
> public Class<? extends DataSourceV2> getImplementation();
>
> ...which will allow DSv2 plugins to dynamically choose the appropriate
> DataSourceV2 class based on the runtime environment. This would let us
> release a single artifact for different Spark versions, and users
> could use the same artifactID & format regardless of where they were
> executing their code. If no services were registered with this new
> trait, the functionality would remain the same as before.
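>
> A sketch of how a plugin might use this (the trait name, class names,
> and the version-check helper below are all hypothetical):
>
>   public class MySourceRegistrar implements DataSourceRegisterV2 {
>     @Override
>     public String shortName() { return "mysource"; }
>
>     @Override
>     public Class<? extends DataSourceV2> getImplementation() {
>       // Choose the version-specific implementation at runtime;
>       // sparkVersion() stands in for however the plugin detects the
>       // running Spark version.
>       return sparkVersion().startsWith("2.4")
>           ? MySourceForSpark24.class
>           : MySourceForSpark30.class;
>     }
>   }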
>
> I think this functionality will be useful as DSv2 continues to
> evolve. Please let me know your thoughts.
>
> Thanks
> Andrew
>

-- 
Ryan Blue
Software Engineer
Netflix
