Hello

On Tue, Apr 7, 2020 at 23:16 Wenchen Fan <cloud0...@gmail.com> wrote:

> Are you going to provide a single artifact for Spark 2.4 and 3.0? I'm not
> sure this is possible, as the DS V2 API is very different in 3.0: e.g.,
> there is no `DataSourceV2` anymore, and you should implement
> `TableProvider` (if you don't have a database/table).
>

Correct, I've got a single jar for both Spark 2.4 and 3.0, with top-level
classes Root_v24 (implementing DataSourceV2) and Root_v30 (implementing
TableProvider). I can load this jar in both PySpark 2.4 and 3.0 and it
works well -- as long as I remove the registration from META-INF and pass
the fully qualified class name to the DataFrameReader.
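
Roughly, the layout is the following (a minimal sketch: the package name
is made up, method bodies are elided with ???, and in practice each class
lives in a source tree compiled against the matching Spark version):

  package org.example.root  // hypothetical

  // Entry point for Spark 2.4, compiled against spark-sql 2.4.x
  import org.apache.spark.sql.sources.v2.DataSourceV2

  class Root_v24 extends DataSourceV2 {
    // 2.4 mixins (ReadSupport, etc.) elided
  }

  // Entry point for Spark 3.0, compiled against spark-sql 3.0.x
  import org.apache.spark.sql.connector.catalog.{Table, TableProvider}
  import org.apache.spark.sql.connector.expressions.Transform
  import org.apache.spark.sql.types.StructType
  import org.apache.spark.sql.util.CaseInsensitiveStringMap

  class Root_v30 extends TableProvider {
    override def inferSchema(options: CaseInsensitiveStringMap): StructType = ???
    override def getTable(
        schema: StructType,
        partitioning: Array[Transform],
        properties: java.util.Map[String, String]): Table = ???
  }

Users then load by full class name, e.g.
spark.read.format("org.example.root.Root_v30").load(path), or the _v24
class on a 2.4 cluster.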

Thanks
Andrew


> On Wed, Apr 8, 2020 at 6:58 AM Andrew Melo <andrew.m...@gmail.com> wrote:
>
>> Hi Ryan,
>>
>> On Tue, Apr 7, 2020 at 5:21 PM Ryan Blue <rb...@netflix.com> wrote:
>> >
>> > Hi Andrew,
>> >
>> > With DataSourceV2, I recommend plugging in a catalog instead of using
>> > DataSource. As you've noticed, the way that you plug in data sources
>> > isn't very flexible. That's one of the reasons why we changed the
>> > plugin system and made it possible to use named catalogs that load
>> > implementations based on configuration properties.
>> >
>> > I think it's fine to consider how to patch the registration trait,
>> > but I really don't recommend continuing to identify table
>> > implementations directly by name.
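
For concreteness, in 3.0 a catalog implementation is wired up purely by
configuration, along these lines (the class name is hypothetical):

  spark.sql.catalog.root=com.example.RootCatalog
  spark.sql.catalog.root.some-option=value

after which tables can be addressed as spark.table("root.ns.events") or
as root.ns.events in SQL, rather than through format(). A minimal
skeleton of the plugin side:

  import org.apache.spark.sql.connector.catalog.CatalogPlugin
  import org.apache.spark.sql.util.CaseInsensitiveStringMap

  // A real catalog would also implement TableCatalog (loadTable,
  // createTable, ...) so that identifiers actually resolve to tables
  class RootCatalog extends CatalogPlugin {
    private var catalogName: String = _

    // Called by Spark with the name from the config key and the
    // spark.sql.catalog.root.* options
    override def initialize(name: String, options: CaseInsensitiveStringMap): Unit =
      catalogName = name

    override def name(): String = catalogName
  }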
>>
>> Can you be a bit more concrete about what you mean by plugging in a
>> catalog instead of a DataSource? We have been using
>> spark.read.format("root").load([list of paths]), which works well.
>> Since we don't have a database or tables, I don't fully understand
>> what's different between the two interfaces that would make us prefer
>> one or the other.
>>
>> That being said, regarding the registration trait: if I'm not
>> misreading createTable() and friends, the "source" parameter is
>> resolved the same way as DataFrameReader.format(), so a solution that
>> fixes registration should help both interfaces.
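
That is, both of these resolve "root" through the same lookup (a small
sketch, with placeholder path and table names):

  // DataFrameReader path
  val df = spark.read.format("root").load("/some/path")

  // Catalog path: the "source" parameter goes through the same resolution
  spark.catalog.createTable("some_table", "root")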
>>
>> Thanks again,
>> Andrew
>>
>> >
>> > On Tue, Apr 7, 2020 at 12:26 PM Andrew Melo <andrew.m...@gmail.com>
>> > wrote:
>> >>
>> >> Hi all,
>> >>
>> >> I posted an improvement ticket in JIRA and Hyukjin Kwon requested I
>> >> send an email to the dev list for discussion.
>> >>
>> >> As the DSv2 API evolves, some breaking changes are occasionally made
>> >> to the API. It's possible to split a plugin into a "common" part and
>> >> multiple version-specific parts, and this works OK for shipping a
>> >> single artifact to users, as long as they write out the fully
>> >> qualified classname as the DataFrame format(). The one part that
>> >> currently can't be worked around is the DataSourceRegister trait.
>> >> Since classes which implement DataSourceRegister must also implement
>> >> DataSourceV2 (and its mixins), changes to those interfaces cause the
>> >> ServiceLoader to fail when it attempts to load the "wrong"
>> >> DataSourceV2 class. (There's also an additional prohibition against
>> >> multiple implementations sharing the same shortName in
>> >> org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource.)
>> >> This means users will need to update their notebooks/code/tutorials
>> >> if they run at a different site whose cluster is a different version.
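
For context, "registration" here is the standard Java ServiceLoader
mechanism: the jar ships a services file

  META-INF/services/org.apache.spark.sql.sources.DataSourceRegister

naming the implementation class, which (pre-3.0) must itself be the data
source, roughly like this sketch:

  import org.apache.spark.sql.sources.DataSourceRegister
  import org.apache.spark.sql.sources.v2.DataSourceV2

  // The registered class doubles as the DataSourceV2 implementation,
  // which is exactly the coupling described above
  class Root_v24 extends DataSourceV2 with DataSourceRegister {
    override def shortName(): String = "root"
  }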
>> >>
>> >> To solve this, I proposed in SPARK-31363 a new trait that would
>> >> function the same as the existing DataSourceRegister trait, but adds
>> >> an additional method:
>> >>
>> >> public Class<? extends DataSourceV2> getImplementation();
>> >>
>> >> ...which will allow DSv2 plugins to dynamically choose the
>> >> appropriate DataSourceV2 class based on the runtime environment. This
>> >> would let us release a single artifact for different Spark versions,
>> >> and users could use the same artifactID & format regardless of where
>> >> they were executing their code. If no services were registered with
>> >> this new trait, the functionality would remain the same as before.
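
A sketch of how a plugin might use this; the trait name below is invented
for illustration, and the exact return type is part of the open design
question, since 3.0 no longer has a DataSourceV2 supertype to bound it:

  import org.apache.spark.SPARK_VERSION

  // Hypothetical shape of the trait proposed in SPARK-31363
  trait DelegatingDataSourceRegister {
    def shortName(): String
    def getImplementation(): Class[_]
  }

  // Root_v24 / Root_v30 are the per-version entry points sketched earlier
  class RootRegister extends DelegatingDataSourceRegister {
    override def shortName(): String = "root"

    // Pick the entry point matching the running Spark version
    override def getImplementation(): Class[_] =
      if (SPARK_VERSION.startsWith("2.")) classOf[Root_v24]
      else classOf[Root_v30]
  }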
>> >>
>> >> I think this functionality will be useful as DSv2 continues to
>> >> evolve; please let me know your thoughts.
>> >>
>> >> Thanks
>> >> Andrew
>> >>
>> >>
>> >
>> >
>> > --
>> > Ryan Blue
>> > Software Engineer
>> > Netflix
>>
>>
>>
