Hi all,

I've opened a WIP PR here: https://github.com/apache/spark/pull/28159
I'm a novice at Scala, so I'm sure the code isn't idiomatic, but
functionally it behaves as I'd expect. I've added unit tests to the PR,
but if you would like to verify the intended functionality, I've
uploaded a fat jar with my datasource to
http://mirror.accre.vanderbilt.edu/spark/laurelin-both.jar and an
example input file to
https://github.com/spark-root/laurelin/raw/master/testdata/stdvector.root.
Running the following in spark-shell successfully chooses the proper
plugin implementation based on the Spark version:

spark.read.format("root").option("tree","tvec").load("stdvector.root")
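
For reference, the dispatch inside the jar works conceptually like the
sketch below; it keys off the running Spark version and hands back the
matching top-level class:

    // Conceptual sketch only -- names are illustrative, not
    // necessarily what the PR uses. Root_v24/Root_v30 stand in for
    // the version-specific entry points described in the quoted
    // thread below (DataSourceV2 on 2.4, TableProvider on 3.0).
    class Root_v24
    class Root_v30

    class RootRegister {
      def shortName(): String = "root"

      // org.apache.spark.SPARK_VERSION is the running Spark's version
      // string; dispatching on it lets a single artifact serve both.
      def getImplementation(): Class[_] =
        if (org.apache.spark.SPARK_VERSION.startsWith("2.4")) classOf[Root_v24]
        else classOf[Root_v30]
    }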

Additionally, I did a very rough POC for Spark 2.4, which you can find
at https://github.com/PerilousApricot/spark/tree/feature/registerv2-24.
The same jar/input file works there as well.

Thanks again,
Andrew

On Wed, Apr 8, 2020 at 10:27 AM Andrew Melo <andrew.m...@gmail.com> wrote:
>
> On Wed, Apr 8, 2020 at 8:35 AM Wenchen Fan <cloud0...@gmail.com> wrote:
> >
> > It would be good to support your use case, but I'm not sure how to 
> > accomplish that. Can you open a PR so that we can discuss it in detail? How 
> > can `public Class<? extends DataSourceV2> getImplementation();` be 
> > possible in 3.0 as there is no `DataSourceV2`?
>
> You're right, that was a typo. Since the whole point is to separate
> the (stable) registration interface from the (evolving) DSv2 API, it
> defeats the purpose to then directly reference the DSv2 API within the
> registration interface.
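>
> Concretely, I'm imagining something like this instead (just a sketch;
> the trait name is illustrative and not final):
>
>     trait DataSourceRegistration {
>       def shortName(): String
>
>       // Deliberately a raw Class: the stable registration interface
>       // never references the evolving DSv2 API directly.
>       def getImplementation(): Class[_]
>     }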
>
> I'll put together a PR.
>
> Thanks again,
> Andrew
>
> >
> > On Wed, Apr 8, 2020 at 1:12 PM Andrew Melo <andrew.m...@gmail.com> wrote:
> >>
> >> Hello
> >>
> >> On Tue, Apr 7, 2020 at 23:16 Wenchen Fan <cloud0...@gmail.com> wrote:
> >>>
> >>> Are you going to provide a single artifact for Spark 2.4 and 3.0? I'm not 
> >>> sure this is possible as the DS V2 API is very different in 3.0, e.g. 
> >>> there is no `DataSourceV2` anymore, and you should implement 
> >>> `TableProvider` (if you don't have a database/table).
> >>
> >>
> >> Correct, I've got a single jar for both Spark 2.4 and 3.0, with a top-level 
> >> Root_v24 (implements DataSourceV2) and Root_v30 (implements 
> >> TableProvider). I can load this jar in both PySpark 2.4 and 3.0 and it 
> >> works well -- as long as I remove the registration from META-INF and pass 
> >> in the full class name to the DataFrameReader.
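> >>
> >> For example, loading by the fully qualified name looks like this
> >> (the package/class name here is illustrative):
> >>
> >>     spark.read.format("edu.vanderbilt.accre.laurelin.Root_v30")
> >>       .option("tree", "tvec")
> >>       .load("stdvector.root")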
> >>
> >> Thanks
> >> Andrew
> >>
> >>>
> >>> On Wed, Apr 8, 2020 at 6:58 AM Andrew Melo <andrew.m...@gmail.com> wrote:
> >>>>
> >>>> Hi Ryan,
> >>>>
> >>>> On Tue, Apr 7, 2020 at 5:21 PM Ryan Blue <rb...@netflix.com> wrote:
> >>>> >
> >>>> > Hi Andrew,
> >>>> >
> >>>> > With DataSourceV2, I recommend plugging in a catalog instead of using 
> >>>> > DataSource. As you've noticed, the way that you plug in data sources 
> >>>> > isn't very flexible. That's one of the reasons why we changed the 
> >>>> > plugin system and made it possible to use named catalogs that load 
> >>>> > implementations based on configuration properties.
> >>>> >
> >>>> > I think it's fine to consider how to patch the registration trait, but 
> >>>> > I really don't recommend continuing to identify table implementations 
> >>>> > directly by name.
> >>>>
> >>>> Can you be a bit more concrete about what you mean by plugging in a
> >>>> catalog instead of a DataSource? We have been using
> >>>> spark.read.format("root").load([list of paths]), which works well. Since
> >>>> we don't have a database or tables, I don't fully understand what's
> >>>> different between the two interfaces that would make us prefer one over
> >>>> the other.
> >>>>
> >>>> That being said, WRT the registration trait, if I'm not misreading
> >>>> createTable() and friends, the "source" parameter is resolved the same
> >>>> way as DataFrameReader.format(), so a solution that helps out
> >>>> registration should help both interfaces.
> >>>>
> >>>> Thanks again,
> >>>> Andrew
> >>>>
> >>>> >
> >>>> > On Tue, Apr 7, 2020 at 12:26 PM Andrew Melo <andrew.m...@gmail.com> 
> >>>> > wrote:
> >>>> >>
> >>>> >> Hi all,
> >>>> >>
> >>>> >> I posted an improvement ticket in JIRA and Hyukjin Kwon requested I
> >>>> >> send an email to the dev list for discussion.
> >>>> >>
> >>>> >> As the DSv2 API evolves, some breaking changes are occasionally made
> >>>> >> to the API. It's possible to split a plugin into a "common" part and
> >>>> >> multiple version-specific parts, and this works OK for shipping a
> >>>> >> single artifact to users, as long as they write out the fully
> >>>> >> qualified class name as the DataFrame format(). The one part that
> >>>> >> currently can't be worked around is the DataSourceRegister trait.
> >>>> >> Since classes which implement DataSourceRegister must also implement
> >>>> >> DataSourceV2 (and its mixins), changes to those interfaces cause the
> >>>> >> ServiceLoader to fail when it attempts to load the "wrong"
> >>>> >> DataSourceV2 class. (There's also an additional prohibition against
> >>>> >> multiple implementations having the same shortName in
> >>>> >> org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource.)
> >>>> >> This means users will need to update their notebooks/code/tutorials
> >>>> >> if they run at a different site whose cluster runs a different
> >>>> >> version.
> >>>> >>
> >>>> >> To solve this, I proposed in SPARK-31363 a new trait that would
> >>>> >> function the same as the existing DataSourceRegister trait, but adds
> >>>> >> an additional method:
> >>>> >>
> >>>> >> public Class<? extends DataSourceV2> getImplementation();
> >>>> >>
> >>>> >> ...which will allow DSv2 plugins to dynamically choose the appropriate
> >>>> >> DataSourceV2 class based on the runtime environment. This would let us
> >>>> >> release a single artifact for different Spark versions, and users could
> >>>> >> use the same artifactID & format regardless of where they were
> >>>> >> executing their code. If no services were registered with this new
> >>>> >> trait, the functionality would remain the same as before.
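> >>>> >>
> >>>> >> Spelled out, the trait would look roughly like this (a sketch;
> >>>> >> the name is illustrative), with the imports it assumes from
> >>>> >> Spark 2.4:
> >>>> >>
> >>>> >>     import org.apache.spark.sql.sources.DataSourceRegister
> >>>> >>     import org.apache.spark.sql.sources.v2.DataSourceV2
> >>>> >>
> >>>> >>     trait DataSourceRegisterV2 extends DataSourceRegister {
> >>>> >>       // Resolved at runtime, so a single artifact can hand back
> >>>> >>       // the right implementation for the running Spark version.
> >>>> >>       def getImplementation(): Class[_ <: DataSourceV2]
> >>>> >>     }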
> >>>> >>
> >>>> >> I think this functionality will be useful as DSv2 continues to
> >>>> >> evolve; please let me know your thoughts.
> >>>> >>
> >>>> >> Thanks
> >>>> >> Andrew
> >>>> >>
> >>>> >
> >>>> >
> >>>> > --
> >>>> > Ryan Blue
> >>>> > Software Engineer
> >>>> > Netflix
> >>>>

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
