DSv2 & DataSourceRegister

Andrew Melo Tue, 07 Apr 2020 12:27:10 -0700

Hi all,

I posted an improvement ticket in JIRA and Hyukjin Kwon requested I
send an email to the dev list for discussion.


As the DSv2 API evolves, some breaking changes are occasionally made
to the API. It's possible to split a plugin into a "common" part and
multiple version-specific parts and this works OK to have a single
artifact for users, as long as they write out the fully qualified
classname as the DataFrame format(). The one part that can't be
currently worked around is the DataSourceRegister trait. Since classes
which implement DataSourceRegister must also implement DataSourceV2
(and its mixins), changes to those interfaces cause the ServiceLoader
to fail when it attempts to load the "wrong" DataSourceV2 class.
(there's also an additional prohibition against multiple
implementations having the same ShortName in
org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource).
This means users will need to update their notebooks/code/tutorials if
they run @ a different site whose cluster is a different version.

To solve this, I proposed in SPARK-31363 a new trait who would
function the same as the existing DataSourceRegister trait, but adds
an additional method:

public Class<? implements DataSourceV2> getImplementation();

...which will allow DSv2 plugins to dynamically choose the appropriate
DataSourceV2 class based on the runtime environment. This would let us
release a single artifact for different Spark versions and users could
use the same artifactID & format regardless of where they were
executing their code. If there was no services registered with this
new trait, the functionality would remain the same as before.

I think this functionality will be useful as DSv2 continues to evolve,
please let me know your thoughts.

Thanks
Andrew

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

DSv2 & DataSourceRegister

Reply via email to