Hi again,

Does anyone have thoughts on either the idea or the implementation?

Thanks,
Andrew

On Thu, Apr 9, 2020 at 11:32 PM Andrew Melo <andrew.m...@gmail.com> wrote:
>
> Hi all,
>
> I've opened a WIP PR here https://github.com/apache/spark/pull/28159
> I'm a novice at Scala, so I'm sure the code isn't idiomatic, but it
> behaves functionally how I'd expect. I've added unit tests to the PR,
> but if you would like to verify the intended functionality, I've
> uploaded a fat jar with my datasource to
> http://mirror.accre.vanderbilt.edu/spark/laurelin-both.jar and an
> example input file to
> https://github.com/spark-root/laurelin/raw/master/testdata/stdvector.root.
> The following in spark-shell successfully chooses the proper plugin
> implementation based on the spark version:
>
> spark.read.format("root").option("tree","tvec").load("stdvector.root")
>
> Additionally, I did a very rough POC for spark2.4, which you can find
> at https://github.com/PerilousApricot/spark/tree/feature/registerv2-24
> . The same jar/inputfile works there as well.
>
> Thanks again,
> Andrew
>
> On Wed, Apr 8, 2020 at 10:27 AM Andrew Melo <andrew.m...@gmail.com> wrote:
> >
> > On Wed, Apr 8, 2020 at 8:35 AM Wenchen Fan <cloud0...@gmail.com> wrote:
> > >
> > > It would be good to support your use case, but I'm not sure how to
> > > accomplish that. Can you open a PR so that we can discuss it in detail?
> > > How can `public Class<? implements DataSourceV2> getImplementation();` be
> > > possible in 3.0 as there is no `DataSourceV2`?
> >
> > You're right, that was a typo. Since the whole point is to separate
> > the (stable) registration interface from the (evolving) DSv2 API, it
> > defeats the purpose to then directly reference the DSv2 API within the
> > registration interface.
> >
> > I'll put together a PR.
> >
> > Thanks again,
> > Andrew
> >
> > >
> > > On Wed, Apr 8, 2020 at 1:12 PM Andrew Melo <andrew.m...@gmail.com> wrote:
> > >>
> > >> Hello
> > >>
> > >> On Tue, Apr 7, 2020 at 23:16 Wenchen Fan <cloud0...@gmail.com> wrote:
> > >>>
> > >>> Are you going to provide a single artifact for Spark 2.4 and 3.0? I'm
> > >>> not sure this is possible as the DS V2 API is very different in 3.0,
> > >>> e.g. there is no `DataSourceV2` anymore, and you should implement
> > >>> `TableProvider` (if you don't have database/table).
> > >>
> > >>
> > >> Correct, I've got a single jar for both Spark 2.4 and 3.0, with a
> > >> toplevel Root_v24 (implements DataSourceV2) and Root_v30 (implements
> > >> TableProvider). I can load this jar in both pyspark 2.4 and 3.0 and it
> > >> works well -- as long as I remove the registration from META-INF and
> > >> pass in the full class name to the DataFrameReader.
> > >>
> > >> Thanks
> > >> Andrew
> > >>
> > >>>
> > >>> On Wed, Apr 8, 2020 at 6:58 AM Andrew Melo <andrew.m...@gmail.com>
> > >>> wrote:
> > >>>>
> > >>>> Hi Ryan,
> > >>>>
> > >>>> On Tue, Apr 7, 2020 at 5:21 PM Ryan Blue <rb...@netflix.com> wrote:
> > >>>> >
> > >>>> > Hi Andrew,
> > >>>> >
> > >>>> > With DataSourceV2, I recommend plugging in a catalog instead of
> > >>>> > using DataSource. As you've noticed, the way that you plug in data
> > >>>> > sources isn't very flexible. That's one of the reasons why we
> > >>>> > changed the plugin system and made it possible to use named catalogs
> > >>>> > that load implementations based on configuration properties.
> > >>>> >
> > >>>> > I think it's fine to consider how to patch the registration trait,
> > >>>> > but I really don't recommend continuing to identify table
> > >>>> > implementations directly by name.
> > >>>>
> > >>>> Can you be a bit more concrete with what you mean by plugging a
> > >>>> catalog instead of a DataSource? We have been using
> > >>>> sc.read.format("root").load([list of paths]) which works well.
> > >>>> Since we don't have a database or tables, I don't fully understand
> > >>>> what's different between the two interfaces that would make us
> > >>>> prefer one or another.
> > >>>>
> > >>>> That being said, WRT the registration trait, if I'm not misreading
> > >>>> createTable() and friends, the "source" parameter is resolved the same
> > >>>> way as DataFrameReader.format(), so a solution that helps out
> > >>>> registration should help both interfaces.
> > >>>>
> > >>>> Thanks again,
> > >>>> Andrew
> > >>>>
> > >>>> >
> > >>>> > On Tue, Apr 7, 2020 at 12:26 PM Andrew Melo <andrew.m...@gmail.com>
> > >>>> > wrote:
> > >>>> >>
> > >>>> >> Hi all,
> > >>>> >>
> > >>>> >> I posted an improvement ticket in JIRA and Hyukjin Kwon requested I
> > >>>> >> send an email to the dev list for discussion.
> > >>>> >>
> > >>>> >> As the DSv2 API evolves, some breaking changes are occasionally made
> > >>>> >> to the API. It's possible to split a plugin into a "common" part and
> > >>>> >> multiple version-specific parts and this works OK to have a single
> > >>>> >> artifact for users, as long as they write out the fully qualified
> > >>>> >> classname as the DataFrame format(). The one part that can't be
> > >>>> >> currently worked around is the DataSourceRegister trait. Since
> > >>>> >> classes which implement DataSourceRegister must also implement
> > >>>> >> DataSourceV2 (and its mixins), changes to those interfaces cause
> > >>>> >> the ServiceLoader to fail when it attempts to load the "wrong"
> > >>>> >> DataSourceV2 class. (There's also an additional prohibition against
> > >>>> >> multiple implementations having the same ShortName in
> > >>>> >> org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource.)
> > >>>> >> This means users will need to update their notebooks/code/tutorials
> > >>>> >> if they run at a different site whose cluster is a different version.
> > >>>> >>
> > >>>> >> To solve this, I proposed in SPARK-31363 a new trait that would
> > >>>> >> function the same as the existing DataSourceRegister trait, but adds
> > >>>> >> an additional method:
> > >>>> >>
> > >>>> >> public Class<? implements DataSourceV2> getImplementation();
> > >>>> >>
> > >>>> >> ...which will allow DSv2 plugins to dynamically choose the
> > >>>> >> appropriate DataSourceV2 class based on the runtime environment.
> > >>>> >> This would let us release a single artifact for different Spark
> > >>>> >> versions and users could use the same artifactID & format
> > >>>> >> regardless of where they were executing their code. If there were
> > >>>> >> no services registered with this new trait, the functionality would
> > >>>> >> remain the same as before.
> > >>>> >>
> > >>>> >> I think this functionality will be useful as DSv2 continues to
> > >>>> >> evolve. Please let me know your thoughts.
> > >>>> >>
> > >>>> >> Thanks
> > >>>> >> Andrew
> > >>>> >>
> > >>>> >> ---------------------------------------------------------------------
> > >>>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > >>>> >>
> > >>>> >
> > >>>> >
> > >>>> > --
> > >>>> > Ryan Blue
> > >>>> > Software Engineer
> > >>>> > Netflix
> > >>>>
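
For anyone skimming the thread, the registration idea discussed above can be sketched roughly as follows. This is a minimal illustration in plain Java with no Spark dependency, and it is not the code in the PR. The names VersionedRegister, RootRegister, implementationFor and RegisterDemo are hypothetical, and returning the implementation as a class-name string is an assumption made here so that the registration type never references the DSv2 API. Only Root_v24, Root_v30 and the short name "root" come from the thread.

```java
// Hypothetical sketch; none of these types are Spark APIs.
interface VersionedRegister {
    /** Short name users pass to format(), e.g. "root". */
    String shortName();

    /** Name of the class to load for the given Spark version string. */
    String implementationFor(String sparkVersion);
}

final class RootRegister implements VersionedRegister {
    @Override
    public String shortName() {
        return "root";
    }

    @Override
    public String implementationFor(String sparkVersion) {
        // Compare only the major version. Mapping 2.x to the
        // DataSourceV2-based class and 3.x (and anything later) to the
        // TableProvider-based class is an assumption of this sketch.
        int major = Integer.parseInt(sparkVersion.split("\\.")[0]);
        return major < 3 ? "Root_v24" : "Root_v30";
    }
}

final class RegisterDemo {
    public static void main(String[] args) {
        VersionedRegister reg = new RootRegister();
        for (String v : new String[] {"2.4.5", "3.0.0"}) {
            System.out.println(
                reg.shortName() + " on Spark " + v + " -> " + reg.implementationFor(v));
        }
    }
}
```

In a real plugin the returned name would presumably be loaded reflectively (for example with Class.forName), so the class compiled against the other Spark version is never touched by the ServiceLoader.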