Re: DSv2 & DataSourceRegister

Andrew Melo Thu, 16 Apr 2020 10:01:13 -0700

Hi again,

Does anyone have thoughts on either the idea or the implementation?


Thanks,
Andrew

On Thu, Apr 9, 2020 at 11:32 PM Andrew Melo <andrew.m...@gmail.com> wrote:
>
> Hi all,
>
> I've opened a WIP PR here: https://github.com/apache/spark/pull/28159
> I'm a novice at Scala, so I'm sure the code isn't idiomatic, but it
> behaves the way I'd expect. I've added unit tests to the PR,
> but if you would like to verify the intended functionality, I've
> uploaded a fat jar with my datasource to
> http://mirror.accre.vanderbilt.edu/spark/laurelin-both.jar and an
> example input file to
> https://github.com/spark-root/laurelin/raw/master/testdata/stdvector.root.
> The following in spark-shell successfully chooses the proper plugin
> implementation based on the Spark version:
>
> spark.read.format("root").option("tree","tvec").load("stdvector.root")
>
> Additionally, I did a very rough POC for Spark 2.4, which you can find at
> https://github.com/PerilousApricot/spark/tree/feature/registerv2-24
> The same jar and input file work there as well.
>
> Thanks again,
> Andrew
>
> On Wed, Apr 8, 2020 at 10:27 AM Andrew Melo <andrew.m...@gmail.com> wrote:
> >
> > On Wed, Apr 8, 2020 at 8:35 AM Wenchen Fan <cloud0...@gmail.com> wrote:
> > >
> > > It would be good to support your use case, but I'm not sure how to 
> > > accomplish that. Can you open a PR so that we can discuss it in detail? 
> > > How can `public Class<? extends DataSourceV2> getImplementation();` be
> > > possible in 3.0 as there is no `DataSourceV2`?
> >
> > You're right, that was a typo. Since the whole point is to separate
> > the (stable) registration interface from the (evolving) DSv2 API, it
> > defeats the purpose to then directly reference the DSv2 API within the
> > registration interface.
> >
> > I'll put together a PR.
> >
> > Thanks again,
> > Andrew
> >
> > >
> > > On Wed, Apr 8, 2020 at 1:12 PM Andrew Melo <andrew.m...@gmail.com> wrote:
> > >>
> > >> Hello
> > >>
> > >> On Tue, Apr 7, 2020 at 23:16 Wenchen Fan <cloud0...@gmail.com> wrote:
> > >>>
> > >>> Are you going to provide a single artifact for Spark 2.4 and 3.0? I'm 
> > >>> not sure this is possible as the DS V2 API is very different in 3.0, 
> > >>> e.g. there is no `DataSourceV2` anymore, and you should implement 
> > >>> `TableProvider` (if you don't have database/table).
> > >>
> > >>
> > >> Correct, I've got a single jar for both Spark 2.4 and 3.0, with a
> > >> top-level Root_v24 (implements DataSourceV2) and Root_v30 (implements
> > >> TableProvider). I can load this jar in both pyspark 2.4 and 3.0 and it
> > >> works well -- as long as I remove the registration from META-INF and
> > >> pass in the full class name to the DataFrameReader.
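
For context, this is roughly how a caller can pick the right fully
qualified class name today, without the META-INF registration. It is a
minimal sketch under assumptions: `ApiProbe` and the `com.example.*` class
names are hypothetical (they are not the names used in the laurelin jar),
and the two marker classes are where I understand `TableProvider` and
`DataSourceV2` to live in Spark 3.0 and 2.4 respectively.

```java
// Hypothetical helper: decides which entry point to use by probing the
// classpath for a marker class from each DSv2 API generation.
final class ApiProbe {
    private ApiProbe() {}

    // True if the loader can find the named class. The class is not initialized.
    static boolean isPresent(String className, ClassLoader loader) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Returns the (hypothetical) fully qualified class name to pass to format().
    static String chooseEntryPoint(ClassLoader loader) {
        if (isPresent("org.apache.spark.sql.connector.catalog.TableProvider", loader)) {
            return "com.example.Root_v30"; // assumed name, Spark 3.0 API
        }
        if (isPresent("org.apache.spark.sql.sources.v2.DataSourceV2", loader)) {
            return "com.example.Root_v24"; // assumed name, Spark 2.4 API
        }
        throw new IllegalStateException("no supported DSv2 API found on the classpath");
    }

    public static void main(String[] args) {
        ClassLoader loader = ApiProbe.class.getClassLoader();
        // Sanity check of the probe itself, using a class that always exists.
        System.out.println(isPresent("java.lang.String", loader));
    }
}
```

The returned name would then be given to the DataFrameReader in place of
the short name, which is the extra step the proposal is meant to remove.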
> > >>
> > >> Thanks
> > >> Andrew
> > >>
> > >>>
> > >>> On Wed, Apr 8, 2020 at 6:58 AM Andrew Melo <andrew.m...@gmail.com> wrote:
> > >>>>
> > >>>> Hi Ryan,
> > >>>>
> > >>>> On Tue, Apr 7, 2020 at 5:21 PM Ryan Blue <rb...@netflix.com> wrote:
> > >>>> >
> > >>>> > Hi Andrew,
> > >>>> >
> > >>>> > With DataSourceV2, I recommend plugging in a catalog instead of 
> > >>>> > using DataSource. As you've noticed, the way that you plug in data 
> > >>>> > sources isn't very flexible. That's one of the reasons why we 
> > >>>> > changed the plugin system and made it possible to use named catalogs 
> > >>>> > that load implementations based on configuration properties.
> > >>>> >
> > >>>> > I think it's fine to consider how to patch the registration trait, 
> > >>>> > but I really don't recommend continuing to identify table 
> > >>>> > implementations directly by name.
> > >>>>
> > >>>> Can you be a bit more concrete about what you mean by plugging in a
> > >>>> catalog instead of a DataSource? We have been using
> > >>>> sc.read.format("root").load([list of paths]) which works well. Since
> > >>>> we don't have a database or tables, I don't fully understand what's
> > >>>> different between the two interfaces that would make us prefer one or
> > >>>> another.
> > >>>>
> > >>>> That being said, WRT the registration trait, if I'm not misreading
> > >>>> createTable() and friends, the "source" parameter is resolved the same
> > >>>> way as DataFrameReader.format(), so a solution that helps out
> > >>>> registration should help both interfaces.
> > >>>>
> > >>>> Thanks again,
> > >>>> Andrew
> > >>>>
> > >>>> >
> > >>>> > On Tue, Apr 7, 2020 at 12:26 PM Andrew Melo <andrew.m...@gmail.com> wrote:
> > >>>> >>
> > >>>> >> Hi all,
> > >>>> >>
> > >>>> >> I posted an improvement ticket in JIRA and Hyukjin Kwon requested I
> > >>>> >> send an email to the dev list for discussion.
> > >>>> >>
> > >>>> >> As the DSv2 API evolves, breaking changes are occasionally made to it.
> > >>>> >> It's possible to split a plugin into a "common" part and multiple
> > >>>> >> version-specific parts, and this works OK for shipping a single
> > >>>> >> artifact to users, as long as they write out the fully qualified
> > >>>> >> class name as the DataFrameReader format(). The one part that can't
> > >>>> >> currently be worked around is the DataSourceRegister trait. Since
> > >>>> >> classes which implement DataSourceRegister must also implement
> > >>>> >> DataSourceV2 (and its mixins), changes to those interfaces cause the
> > >>>> >> ServiceLoader to fail when it attempts to load the "wrong"
> > >>>> >> DataSourceV2 class. (There's also an additional prohibition against
> > >>>> >> multiple implementations having the same shortName in
> > >>>> >> org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource.)
> > >>>> >> This means users will need to update their notebooks/code/tutorials
> > >>>> >> if they run at a different site whose cluster runs a different
> > >>>> >> Spark version.
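
The short-name resolution described above can be modelled in a few lines.
This is a deliberately simplified sketch, not Spark's lookupDataSource:
`ShortNameRegister` and `ShortNameLookup` are hypothetical names, and the
real method does more work (it actually loads the classes, for one). The
sketch only shows why two registrations with the same short name cannot
coexist, even if each one targets a different Spark version.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a registration entry discovered via ServiceLoader.
interface ShortNameRegister {
    String shortName();
}

final class ShortNameLookup {
    private ShortNameLookup() {}

    // Maps the string given to format() to the name of the class to use.
    static String resolve(String format, List<ShortNameRegister> registered) {
        List<ShortNameRegister> matches = new ArrayList<>();
        for (ShortNameRegister r : registered) {
            if (r.shortName().equalsIgnoreCase(format)) {
                matches.add(r);
            }
        }
        if (matches.isEmpty()) {
            // No alias matched: treat the string as a fully qualified class name.
            return format;
        }
        if (matches.size() == 1) {
            return matches.get(0).getClass().getName();
        }
        // Two plugins (or two versions of one plugin) claimed the same alias.
        throw new IllegalStateException("Multiple sources found for " + format);
    }
}
```

With this model, registering both a 2.4 and a 3.0 implementation under
"root" fails at lookup time, before either one is ever used.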
> > >>>> >>
> > >>>> >> To solve this, I proposed in SPARK-31363 a new trait that would
> > >>>> >> function the same as the existing DataSourceRegister trait, but add
> > >>>> >> an additional method:
> > >>>> >>
> > >>>> >> public Class<? extends DataSourceV2> getImplementation();
> > >>>> >>
> > >>>> >> ...which would allow DSv2 plugins to dynamically choose the
> > >>>> >> appropriate DataSourceV2 class based on the runtime environment.
> > >>>> >> This would let us release a single artifact for different Spark
> > >>>> >> versions, and users could use the same artifactID & format
> > >>>> >> regardless of where they were executing their code. If no services
> > >>>> >> were registered with this new trait, the functionality would remain
> > >>>> >> the same as before.
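
To make the proposal concrete, here is a minimal sketch of what such a
registration trait might look like, written in Java and kept free of any
Spark dependency. Everything named here is an assumption for illustration:
`VersionedRegister`, `RootRegister`, and the empty `Root_v24`/`Root_v30`
placeholders are not the code in the PR, and the Spark version is passed in
as a string only so the sketch stays self-contained. The return type is
`Class<?>` so that the registration interface never references a DSv2 type.

```java
// Hypothetical registration interface. It names no DSv2 type, so it can
// stay stable while the DSv2 API changes between Spark releases.
interface VersionedRegister {
    String shortName();

    // The class Spark should instantiate for this short name.
    Class<?> getImplementation();
}

// Placeholders for the version-specific entry points mentioned in the thread.
class Root_v24 {}  // in a real plugin: implements DataSourceV2 (Spark 2.4)
class Root_v30 {}  // in a real plugin: implements TableProvider (Spark 3.0)

class RootRegister implements VersionedRegister {
    private final String sparkVersion;

    // Assumption: a class loaded through ServiceLoader needs a no-arg
    // constructor and would read the version from the running Spark instead.
    RootRegister(String sparkVersion) {
        this.sparkVersion = sparkVersion;
    }

    @Override
    public String shortName() {
        return "root";
    }

    @Override
    public Class<?> getImplementation() {
        // Only the major version is used to choose between the two APIs.
        int dot = sparkVersion.indexOf('.');
        String majorPart = dot < 0 ? sparkVersion : sparkVersion.substring(0, dot);
        int major = Integer.parseInt(majorPart);
        return major >= 3 ? Root_v30.class : Root_v24.class;
    }
}

class RegistrationSketch {
    public static void main(String[] args) {
        for (String version : new String[] {"2.4.5", "3.0.0"}) {
            VersionedRegister register = new RootRegister(version);
            System.out.println(version + " -> "
                + register.getImplementation().getSimpleName());
        }
    }
}
```

Only the small register class has to load on every Spark version. The
version-specific class is touched only after the choice has been made.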
> > >>>> >>
> > >>>> >> I think this functionality will be useful as DSv2 continues to
> > >>>> >> evolve. Please let me know your thoughts.
> > >>>> >>
> > >>>> >> Thanks
> > >>>> >> Andrew
> > >>>> >>
> > >>>> >> ---------------------------------------------------------------------
> > >>>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > >>>> >>
> > >>>> >
> > >>>> >
> > >>>> > --
> > >>>> > Ryan Blue
> > >>>> > Software Engineer
> > >>>> > Netflix
> > >>>>
