Re: DataSourceV2 community sync #3

2018-12-03 Thread Ryan Blue
incorporate the “options” specified for the > data source into the catalog too? > > That may be helpful in some situations (e.g. the JDBC connect string being > available from the catalog). > > *From: *Xiao Li > *Date: *Monday, December 3, 2018 at 10:44 AM > *To: *"Thakrar, Jay

Re: DataSourceV2 community sync #3

2018-12-03 Thread Ryan Blue
d the Spark catalog be the common denominator of the other > catalogs (least featured) or a super-feature catalog? > > > > *From: *Xiao Li > *Date: *Saturday, December 1, 2018 at 10:49 PM > *To: *Ryan Blue > *Cc: *"u...@spark.apache.org" > *Subject: *Re: DataSourceV2 co

Re: DataSourceV2 community sync #3

2018-12-03 Thread Ryan Blue
"spark_catalog1.db3.tab2". > The catalog will be used for registering all the external data sources, > various Spark UDFs and so on. > > At the same time, we should NOT mix the table-level data sources with > catalog support. That means, "Cassandra1.db1.tab1", "Kaf

Re: DataSourceV2 community sync #3

2018-12-01 Thread Ryan Blue
the data source API V2 and catalog APIs are two separate projects. > Hopefully, you understand my concern. If we really want to mix them > together, I want to read the design of your multi-catalog support and > understand more details. > > Thanks, > > Xiao > > > > > R

Re: Public v2 interface location

2018-12-01 Thread Ryan Blue
Hi, Ryan Blue. > > I don't think it would be a good idea to add the sql-api module. > I prefer to add sql-api to sql/core. The sql is just another representation > of dataset, thus there is no need to add new module to do this. Besides, it > would be easier to add sql-api in core. >

Public v2 interface location

2018-11-30 Thread Ryan Blue
we started: we can either expose the v2 API from the catalyst package, or we can keep the v2 API, logical plans, and rules in core instead of catalyst. Anyone want to weigh in with a preference for how to move forward? rb -- Ryan Blue Software Engineer Netflix

Re: DataSourceV2 community sync #3

2018-11-29 Thread Ryan Blue
unction, database > and column is resolved? Do we have nickname, mapping, wrapper? > > Or I might miss the design docs you send? Could you post the doc? > > Thanks, > > Xiao > > > > > On Thu, Nov 29, 2018 at 3:06 PM, Ryan Blue wrote: > >> Xiao, >> >> Please

Re: DataSourceV2 community sync #3

2018-11-29 Thread Ryan Blue
t stage, but we need to know how the new > proposal works. For example, how to plug in a new Hive metastore? How to > plug in a Glue? How do users implement a new external catalog without > adding any new data sources? Without knowing more details, it is hard to > say whether this TableCatalog can

Re: DataSourceV2 community sync #3

2018-11-29 Thread Ryan Blue
that TableCatalog is compatible with future decisions and the best path forward is to build incrementally. An exhaustive design process blocks progress on v2. On Mon, Nov 26, 2018 at 2:54 PM Ryan Blue wrote: > Hi everyone, > > I just sent out an invite for the next DSv2 commu

DataSourceV2 community sync #3

2018-11-26 Thread Ryan Blue
and I’ll add you. Thanks! rb -- Ryan Blue Software Engineer Netflix

Re: Make Scala 2.12 as default Scala version in Spark 3.0

2018-11-20 Thread Ryan Blue
a while something would silently > > > break, because PR builds only check the default. And the jenkins > > > builds, which are less monitored, would stay broken for a while. > > > > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > -- Ryan Blue Software Engineer Netflix

Re: DataSourceV2 sync tomorrow

2018-11-15 Thread Ryan Blue
a couple of tests, it looks like live streams only work within an organization. In the future, I won’t add a live stream since no one but people from Netflix can join. Last, here are the notes: *Attendees* Ryan Blue - Netflix John Zhuge - Netflix Yuanjian Li - Baidu - Interested in Catalog API Felix

Re: DataSourceV2 sync tomorrow

2018-11-14 Thread Ryan Blue
the meet up. I'll also plan on joining earlier than I did last time, in case we the meet/hangout needs to be up for people to view the live stream. rb On Tue, Nov 13, 2018 at 4:00 PM Ryan Blue wrote: > Hi everyone, > I just wanted to send out a reminder that there’s a DSv2 sync tomorrow a

Re: DataSourceV2 sync tomorrow

2018-11-14 Thread Ryan Blue
ies of a micro-batch) and may be then the >> 'latest' offset is not needed at all. >> >> - Arun >> >> >> On Tue, 13 Nov 2018 at 16:01, Ryan Blue >> wrote: >> >>> Hi everyone, >>> I just wanted to send out a reminder that there’

DataSourceV2 sync tomorrow

2018-11-13 Thread Ryan Blue
tPartition[] parts = stream.planInputPartitions(start) // returns when needsReconfiguration is true or all tasks finish runTasks(parts, factory, end) // the stream's current offset has been updated at the last epoch } -- Ryan Blue Software Engineer Netflix
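The snippet above sketches a micro-batch driver loop. A minimal, self-contained restatement of that loop follows; the trait and method names are illustrative stand-ins, not the actual DSv2 interfaces.

trait Offset
trait InputPartition
trait ReaderFactory

trait MicroBatchStream {
  def currentOffset(): Offset                         // updated as each epoch commits
  def planInputPartitions(start: Offset): Array[InputPartition]
  def createReaderFactory(): ReaderFactory
  def needsReconfiguration(): Boolean                 // e.g. the source topology changed
}

object MicroBatchDriver {
  // Simplified driver loop: plan from the current offset, run the tasks, and
  // replan whenever the stream reports that it needs reconfiguration.
  def runMicroBatches(
      stream: MicroBatchStream,
      runTasks: (Array[InputPartition], ReaderFactory) => Unit, // returns when reconfiguration is needed or all tasks finish
      continue: () => Boolean): Unit = {
    while (continue()) {
      val parts = stream.planInputPartitions(stream.currentOffset())
      runTasks(parts, stream.createReaderFactory())
    }
  }
}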

Re: DataSourceV2 capability API

2018-11-09 Thread Ryan Blue
Another solution to the decimal case is using the capability API: use a capability to signal that the table knows about `supports-decimal`. So before the decimal support check, it would check `table.isSupported("type-capabilities")`. On Fri, Nov 9, 2018 at 12:45 PM Ryan B
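A minimal sketch of the capability check described above, assuming a Table interface with a string-based isSupported method; the trait and capability names are placeholders, not the final API.

trait Table {
  def name(): String
  def isSupported(capability: String): Boolean
}

object CapabilityChecks {
  // Only apply the fine-grained decimal check to tables that declare they
  // report type capabilities at all; older sources simply skip the check.
  def checkDecimalSupport(table: Table): Unit = {
    if (table.isSupported("type-capabilities") && !table.isSupported("supports-decimal")) {
      throw new UnsupportedOperationException(
        s"Table ${table.name()} does not support the decimal type")
    }
  }
}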

Re: DataSourceV2 capability API

2018-11-09 Thread Ryan Blue
to > throw exceptions when they don't support a specific operation. > > > On Fri, Nov 9, 2018 at 11:54 AM Ryan Blue wrote: > >> Do you have an example in mind where we might add a capability and break >> old versions of data sources? >> >> These are really for

Re: DataSourceV2 capability API

2018-11-09 Thread Ryan Blue
pporting that property, and thus throwing an > exception. > > > On Fri, Nov 9, 2018 at 9:11 AM Ryan Blue wrote: > >> I'd have two places. First, a class that defines properties supported and >> identified by Spark, like the SQLConf definitions. Second, in documentation >>

Re: Behavior of SaveMode.Append when table is not present

2018-11-09 Thread Ryan Blue
ting data* > > However it does not specify behavior when the table does not exist. > Does that throw exception or create the table or a NO-OP? > > Thanks, > Shubham > -- Ryan Blue Software Engineer Netflix

Re: DataSourceV2 capability API

2018-11-09 Thread Ryan Blue
ned? > > > -- > *From:* Ryan Blue > *Sent:* Thursday, November 8, 2018 2:09 PM > *To:* Reynold Xin > *Cc:* Spark Dev List > *Subject:* Re: DataSourceV2 capability API > > > Yes, we currently use traits that have methods. Something like

Re: DataSourceV2 capability API

2018-11-08 Thread Ryan Blue
ll evolve (e.g. how many different > capabilities there will be). > > > On Thu, Nov 8, 2018 at 12:50 PM Ryan Blue > wrote: > >> Hi everyone, >> >> I’d like to propose an addition to DataSourceV2 tables, a capability API. >> This API would allow Spark t

DataSourceV2 capability API

2018-11-08 Thread Ryan Blue
le. To fix this problem, I would use a table capability, like read-missing-columns-as-null. Any comments on this approach? rb -- Ryan Blue Software Engineer Netflix

Re: Test and support only LTS JDK release?

2018-11-06 Thread Ryan Blue
this in Spark community. >> >> Thanks, >> >> DB Tsai | Siri Open Source Technologies [not a contribution] |  >> Apple, Inc >> >> > -- > Robert Stupp > @snazy > > -- Ryan Blue Software Engineer Netflix

Re: Make Scala 2.12 as default Scala version in Spark 3.0

2018-11-06 Thread Ryan Blue
hnologies [not a contribution] |  > Apple, Inc > > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > > > -- Ryan Blue Software Engineer Netflix

Re: DataSourceV2 hangouts sync

2018-11-01 Thread Ryan Blue
Thanks to everyone that attended the sync! We had some good discussions. Here are my notes for anyone that missed it or couldn’t join the live stream. If anyone wants to add to this, please send additional thoughts or corrections. *Attendees:* - Ryan Blue - Netflix - Using v2 to integrate

Re: [VOTE] SPARK 2.4.0 (RC5)

2018-10-30 Thread Ryan Blue
>> >> What should happen to JIRA tickets still targeting 2.4.0? The current list of open tickets targeted at 2.4.0 can be found at: https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" = 2.4.0. Committers should look at those and triage. Extremely important bug fixes, documentation, and API tweaks that impact compatibility should be worked on immediately. Everything else please retarget to an appropriate release. [...] -- Ryan Blue Software Engineer Netflix

Re: DataSourceV2 hangouts sync

2018-10-29 Thread Ryan Blue
end up with so many people that we can't actually get the discussion going. Here's a link to the stream: https://stream.meet.google.com/stream/6be59d80-04c7-44dc-9042-4f3b597fc8ba Thanks! rb On Thu, Oct 25, 2018 at 1:09 PM Ryan Blue wrote: > Hi everyone, > > There's been some great d

Re: DataSourceV2 hangouts sync

2018-10-26 Thread Ryan Blue
; > I didn't know I live in the same timezone with you Wenchen :D. > Monday or Wednesday at 5PM PDT sounds good to me too FWIW. > > On Fri, Oct 26, 2018 at 8:29 AM, Ryan Blue wrote: > >> Good point. How about Monday or Wednesday at 5PM PDT then? >> >> Everyone, please repl

Re: DataSourceV2 hangouts sync

2018-10-25 Thread Ryan Blue
day at my side, it will be great if we can > pick a day from Monday to Thursday. > > On Fri, Oct 26, 2018 at 8:08 AM Ryan Blue wrote: > >> Since not many people have replied with a time window, how about we aim >> for 5PM PDT? That should work for Wenchen and most peo

Re: DataSourceV2 hangouts sync

2018-10-25 Thread Ryan Blue
eting is definitely helpful to discuss, move certain effort >>>>> forward and keep people on the same page. Glad to see this kind of working >>>>> group happening. >>>>> >>>>> On Thu, Oct 25, 2018 at 5:58 PM John Zhuge wrote:

DataSourceV2 hangouts sync

2018-10-25 Thread Ryan Blue
-- Ryan Blue Software Engineer Netflix

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Ryan Blue
though Apache Spark provides the binary distributions, it would be > great if this succeeds out of the box. > > > > Bests, > > Dongjoon. > > > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > -- Ryan Blue Software Engineer Netflix

Re: DataSourceV2 APIs creating multiple instances of DataSourceReader and hence not preserving the state

2018-10-19 Thread Ryan Blue
elson, Assaf >>> wrote: >>> >>> Could you add a fuller code example? I tried to reproduce it in my >>> environment and I am getting just one instance of the reader… >>> >>> >>> >>> Thanks, >>> >>> Assaf >

Re: Data source V2 in spark 2.4.0

2018-10-04 Thread Ryan Blue
inded message, I will probably have more as I continue > to explore this. > > Thanks, >Assaf. > > > > > > -- > Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > -- Ryan Blue Software Engineer Netflix

Spark SQL parser and DDL

2018-10-04 Thread Ryan Blue
that converts from the parsed SQL plan to CatalogTable-based v1 plans. It is also cleaner to have the logic for converting to CatalogTable in DataSourceAnalysis instead of in the parser itself. Are there objections to this approach for integrating v2 plans? -- Ryan Blue Software Engineer Netflix

Re: [DISCUSS] Syntax for table DDL

2018-10-04 Thread Ryan Blue
add Hive compatible syntax later. > > On Tue, Oct 2, 2018 at 11:50 PM Ryan Blue > wrote: > >> I'd say that it was important to be compatible with Hive in the past, but >> that's becoming less important over time. Spark is well established with >> Hadoop users and I think the f

Re: [DISCUSS] Syntax for table DDL

2018-10-02 Thread Ryan Blue
. > > I am personally following this PR with a lot of interest, thanks for all > the work along this direction. > > Best regards, > Alessandro > > On Mon, 1 Oct 2018 at 20:21, Ryan Blue wrote: > >> What do you mean by consistent with the syntax in SqlBase.g4? These >>

Re: [DISCUSS] Syntax for table DDL

2018-10-01 Thread Ryan Blue
lowing the Hive DDL syntax: > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/Partition/Column > > On Fri, Sep 28, 2018 at 3:47 PM, Ryan Blue wrote: > >> Hi everyone, >> >> I’m currently working on new table DDL statements for v2 tables. F

Re: Data source V2 in spark 2.4.0

2018-10-01 Thread Ryan Blue
---- > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > -- Ryan Blue Software Engineer Netflix

[DISCUSS] Syntax for table DDL

2018-09-28 Thread Ryan Blue
if you have suggestions based on a different SQL engine or want this syntax to be different for another reason. Thanks! rb -- Ryan Blue Software Engineer Netflix

Re: [Discuss] Datasource v2 support for Kerberos

2018-09-25 Thread Ryan Blue
ttp://apache-spark-developers-list.1001551.n3.nabble.com/ > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > -- Ryan Blue Software Engineer Netflix

Re: [Discuss] Datasource v2 support for Kerberos

2018-09-24 Thread Ryan Blue
ion > items as well. > > > > > > -- > Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ > > ----- > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > -- Ryan Blue Software Engineer Netflix

Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-20 Thread Ryan Blue
> Hi, Ryan. > > Could you share the result on 2.3.1 since this is 2.3.2 RC? That would be > helpful to narrow down the scope. > > Bests, > Dongjoon. > > On Thu, Sep 20, 2018 at 11:56 Ryan Blue wrote: > >> -0 >> >> My DataSourceV2 implementation for Iceberg is f

Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-20 Thread Ryan Blue
> If you are a Spark user, you can help us test this release by taking an existing Spark workload and running on this release candidate, then reporting any regressions. [...] The current list of open tickets targeted at 2.3.2 can be found at: https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" = 2.3.2. Committers should look at those and triage. [...] -- Ryan Blue Software Engineer Netflix

Re: [Discuss] Datasource v2 support for manipulating partitions

2018-09-19 Thread Ryan Blue
partition loading in Hive and Oracle. > > > > So in short, I agree that partition management should be an optional > interface. > > > > *From: *Ryan Blue > *Reply-To: *"rb...@netflix.com" > *Date: *Wednesday, September 19, 2018 at 2:58 PM >

Re: data source api v2 refactoring

2018-09-19 Thread Ryan Blue
> > I ask because those are the most widely used data sources and have a lot > of effort and thinking behind them, and if they have ported over to V2, > then they can serve as excellent production examples of V2 API. > > > > Thanks, > > Jayesh > > > > *F

Re: Datasource v2 Select Into support

2018-09-19 Thread Ryan Blue
er >> +- Project [Mort AS Mort#7, 1000 AS 1000#8] >>+- OneRowRelation >> >> My DefaultSource V2 implementation extends DataSourceV2 with ReadSupport >> with ReadSupportWithSchema with WriteSupport >> >> I'm wondering if there is something I'm not implementing, or if there is >> a bug in my implementation or it's an issue with Spark? >> >> Any pointers would be great, >> >> Ross >> > -- Ryan Blue Software Engineer Netflix

Re: [Discuss] Datasource v2 support for manipulating partitions

2018-09-19 Thread Ryan Blue
m generically in the API, > allowing pass-through commands to manipulate them, or by some other > means. > > Regards, > Dale. > > > > > -- > Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ > > > > -- Ryan Blue Software Engineer Netflix

Re: [Discuss] Datasource v2 support for Kerberos

2018-09-19 Thread Ryan Blue
amespace that needs it. >> >> If the data source requires TLS support then we also need to support >> passing >> all the configuration values under "spark.ssl.*" >> >> What do people think? Placeholder Issue has been added at SPARK-25329. >> >> >> >> -- >> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ >> >> - >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >> >> -- Ryan Blue Software Engineer Netflix

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Ryan Blue
quot;commit" would be invoked at >>>>>> each >>>>>> of the individual tasks). Right now the responsibility of the final >>>>>> "commit" is with the driver and it may not always be possible for the >>>>>> driver to take over the transact

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Ryan Blue
;>>>> receives "prepared" from all the tasks, a "commit" would be invoked at >>>>>> each >>>>>> of the individual tasks). Right now the responsibility of the final >>>>>> "commit" is with the driver and it may not always

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Ryan Blue
and is >> discoverable - thereby breaking the documented contract. >> >> I was wondering how other databases systems plan to implement this API >> and meet the contract as per the Javadoc? >> >> Many thanks >> >> Ross >> > -- Ryan Blue Software Engineer Netflix

Re: Branch 2.4 is cut

2018-09-09 Thread Ryan Blue
gt; > > -- > Shane Knapp > UC Berkeley EECS Research / RISELab Staff Technical Lead > https://rise.cs.berkeley.edu > -- Ryan Blue Software Engineer Netflix

Re: data source api v2 refactoring

2018-09-07 Thread Ryan Blue
8 at 3:02 PM Hyukjin Kwon wrote: > >> BTW, do we hold Datasource V2 related PRs for now until we finish this >> refactoring just for clarification? >> >> On Fri, Sep 7, 2018 at 12:52 AM, Ryan Blue wrote: >> >>> Wenchen, >>> >>> I'm not really su

Re: time for Apache Spark 3.0?

2018-09-06 Thread Ryan Blue
gt; state. > > > On Thu, Sep 6, 2018 at 9:49 AM Ryan Blue > wrote: > >> It would be great to get more features out incrementally. For >> experimental features, do we have more relaxed constraints? >> >> On Thu, Sep 6, 2018 at 9:47 AM Reynold Xin wrote: >

Re: data source api v2 refactoring

2018-09-06 Thread Ryan Blue
> trait Table { > LogicalWrite newAppendWrite(); > > LogicalWrite newDeleteWrite(deleteExprs); > } > > > It looks to me that the API is simpler without WriteConfig, what do you > think? > > Thanks, > Wenchen > > On Wed, Sep 5, 2018 at 4:24 AM Ryan Blue > wrote:
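The quoted sketch above, restated as a compilable Scala stub for readability; Expression and LogicalWrite are placeholder types standing in for whatever the refactored API settles on.

trait Expression
trait LogicalWrite

trait Table {
  def newAppendWrite(): LogicalWrite
  def newDeleteWrite(deleteExprs: Array[Expression]): LogicalWrite
}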

Re: time for Apache Spark 3.0?

2018-09-06 Thread Ryan Blue
also jettison a lot of >> older dependencies, code, fix some long standing issues, etc. >> >> (BTW Scala 2.12 support, mentioned in the OP, will go in for 2.4) >> >> On Thu, Sep 6, 2018 at 9:10 AM Ryan Blue >> wrote: >> >>> My concern is that the v2 data

Re: time for Apache Spark 3.0?

2018-09-06 Thread Ryan Blue
a major version update to get it? > > I generally support moving on to 3.x so we can also jettison a lot of > older dependencies, code, fix some long standing issues, etc. > > (BTW Scala 2.12 support, mentioned in the OP, will go in for 2.4) > > On Thu, Sep 6, 2018 at 9:10 AM

Re: time for Apache Spark 3.0?

2018-09-06 Thread Ryan Blue
dates for >>>>>> consideration): >>>>>> >>> >>>>>> >>> 1. Support Scala 2.12. >>>>>> >>> >>>>>> >>> 2. Remove interfaces, configs, and modules (e.g. Bagel) >>>>>> deprecated in Spark 2.x. >>>>>> >>> >>>>>> >>> 3. Shade all dependencies. >>>>>> >>> >>>>>> >>> 4. Change the reserved keywords in Spark SQL to be more ANSI-SQL >>>>>> compliant, to prevent users from shooting themselves in the foot, e.g. >>>>>> “SELECT 2 SECOND” -- is “SECOND” an interval unit or an alias? To make it >>>>>> less painful for users to upgrade here, I’d suggest creating a flag for >>>>>> backward compatibility mode. >>>>>> >>> >>>>>> >>> 5. Similar to 4, make our type coercion rule in DataFrame/SQL >>>>>> more standard compliant, and have a flag for backward compatibility. >>>>>> >>> >>>>>> >>> 6. Miscellaneous other small changes documented in JIRA already >>>>>> (e.g. “JavaPairRDD flatMapValues requires function returning Iterable, >>>>>> not >>>>>> Iterator”, “Prevent column name duplication in temporary view”). >>>>>> >>> >>>>>> >>> >>>>>> >>> Now the reality of a major version bump is that the world often >>>>>> thinks in terms of what exciting features are coming. I do think there >>>>>> are >>>>>> a number of major changes happening already that can be part of the 3.0 >>>>>> release, if they make it in: >>>>>> >>> >>>>>> >>> 1. Scala 2.12 support (listing it twice) >>>>>> >>> 2. Continuous Processing non-experimental >>>>>> >>> 3. Kubernetes support non-experimental >>>>>> >>> 4. A more flushed out version of data source API v2 (I don’t >>>>>> think it is realistic to stabilize that in one release) >>>>>> >>> 5. Hadoop 3.0 support >>>>>> >>> 6. ... >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> Similar to the 2.0 discussion, this thread should focus on the >>>>>> framework and whether it’d make sense to create Spark 3.0 as the next >>>>>> release, rather than the individual feature requests. Those are important >>>>>> but are best done in their own separate threads. [...] -- Ryan Blue Software Engineer Netflix

Re: [DISCUSS] SPIP: APIs for Table Metadata Operations

2018-09-04 Thread Ryan Blue
1 gives Spark the opportunity > to enforce column references are valid (but not the actual function names), > whereas option 2 would be up to the data sources to validate. > > > > On Wed, Aug 15, 2018 at 2:27 PM Ryan Blue wrote: > >> I think I found a good solution to th

Fwd: data source api v2 refactoring

2018-09-04 Thread Ryan Blue
Latest from Wenchen in case it was dropped. -- Forwarded message - From: Wenchen Fan Date: Mon, Sep 3, 2018 at 6:16 AM Subject: Re: data source api v2 refactoring To: Cc: Ryan Blue , Reynold Xin , < dev@spark.apache.org> Hi Mridul, I'm not sure what's going on, my

Re: data source api v2 refactoring

2018-09-01 Thread Ryan Blue
with ScanConfig. > For streaming source, stream is the one to take care of the pushdown > result. For batch source, it's the scan. > > It's a little tricky because stream is an abstraction for streaming source > only. Better ideas are welcome! > > On Sat, Sep 1, 2018 at 7:26 AM Ry

Re: [discuss] replacing SPIP template with Heilmeier's Catechism?

2018-08-31 Thread Ryan Blue
th the > above? > > At a high level, I think the Heilmeier's Catechism emphasizes less about > the "how", and more the "why" and "what", which is what I'd argue SPIPs > should be about. The hows should be left in design docs for larger projects. > > > -- Ryan Blue Software Engineer Netflix

Re: data source api v2 refactoring

2018-08-31 Thread Ryan Blue
the above: >> >> 1. Creates an explicit Table abstraction, and an explicit Scan >> abstraction. >> >> 2. Have an explicit Stream level and makes it clear pushdowns and options >> are handled there, rather than at the individual scan (ReadSupport) level. >> Data source implementations don't need to worry about pushdowns or options >> changing mid-stream. For batch, those happen when the scan object is >> created. >> >> >> >> This email is just a high level sketch. I've asked Wenchen to prototype >> this, to see if it is actually feasible and the degree of hacks it removes, >> or creates. >> >> >> -- Ryan Blue Software Engineer Netflix

Re: [DISCUSS] USING syntax for Datasource V2

2018-08-21 Thread Ryan Blue
entation > > Thanks for your time, > Russ > > On Mon, Aug 20, 2018 at 11:33 AM Ryan Blue > wrote: > >> Thanks for posting this discussion to the dev list, it would be great to >> hear what everyone thinks about the idea that USING should be a >> catalog-specific

Re: [DISCUSS] USING syntax for Datasource V2

2018-08-20 Thread Ryan Blue
e should be supported anyway, I was > thinking we could just orthogonally proceed. If you guys think other issues > should be resolved first, I think we (at least I will) should take a look > for the set of catalog APIs. > > -- Ryan Blue Software Engineer Netflix

Re: [DISCUSS] SPIP: APIs for Table Metadata Operations

2018-08-15 Thread Ryan Blue
I think I found a good solution to the problem of using Expression in the TableCatalog API and in the DeleteSupport API. For DeleteSupport, there is already a stable and public subset of Expression named Filter that can be used to pass filters. The reason why DeleteSupport would use Expression is
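A minimal sketch of that idea, assuming a delete-support interface that accepts the existing public Filter classes rather than Catalyst Expression (trait name and signature are illustrative; assumes spark-sql on the classpath):

import org.apache.spark.sql.sources.Filter

trait DeleteSupport {
  // Delete every row matching all of the given filters; Filter is the stable,
  // public subset of Expression already used for v1 pushdown (EqualTo, In, ...).
  def deleteWhere(filters: Array[Filter]): Unit
}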

Re: [DISCUSS] SPIP: APIs for Table Metadata Operations

2018-08-15 Thread Ryan Blue
PI, similar to what we did for > dsv1. > > If we are depending on Expressions on the more common APIs in dsv2 > already, we should revisit that. > > > > > On Mon, Aug 13, 2018 at 1:59 PM Ryan Blue wrote: > >> Reynold, did you get a chance to look at my response about

Re: [DISCUSS] SPIP: APIs for Table Metadata Operations

2018-08-13 Thread Ryan Blue
. Anyone else want to raise an issue with the proposal, or is it about time to bring up a vote thread? rb On Thu, Jul 26, 2018 at 5:00 PM Ryan Blue wrote: > I don’t think that we want to block this work until we have a public and > stable Expression. Like our decision to expose Internal

Re: [DISCUSS] Multiple catalog support

2018-07-31 Thread Ryan Blue
rce can provide catalog > functionalities. > > Under the hood, I feel this proposal is very similar to my second > proposal, except that a catalog implementation must provide a default data > source/storage, and different rule for looking up tables. > > > On Sun, Jul 29,

Re: [DISCUSS] Multiple catalog support

2018-07-29 Thread Ryan Blue
Wenchen, what I'm suggesting is a bit of both of your proposals. I think that USING should be optional like your first option. USING (or format(...) in the DF side) should configure the source or implementation, while the catalog should be part of the table identifier. They serve two different

Re: [DISCUSS] SPIP: APIs for Table Metadata Operations

2018-07-26 Thread Ryan Blue
ersions. > > > On Tue, Jul 24, 2018 at 9:26 AM Ryan Blue > wrote: > >> The recently adopted SPIP to standardize logical plans requires a way for >> to plug in providers for table metadata operations, so that the new plans >> can create and drop tables. I proposed an A

Re: [DISCUSS] Multiple catalog support

2018-07-25 Thread Ryan Blue
Quick update: I've updated my PR to add the table catalog API to implement this proposal. Here's the PR: https://github.com/apache/spark/pull/21306 On Mon, Jul 23, 2018 at 5:01 PM Ryan Blue wrote: > Lately, I’ve been working on implementing the new SQL logical plans. I’m > currently b

[DISCUSS] SPIP: APIs for Table Metadata Operations

2018-07-24 Thread Ryan Blue
IP is for the APIs and does not cover how multiple catalogs would be exposed. I started a separate discussion thread on how to access multiple catalogs and maintain compatibility with Spark’s current behavior (how to get the catalog instance in the above example). Please use this thread to discuss the proposed APIs. Thanks, everyone! rb ​ -- Ryan Blue Software Engineer Netflix
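For readers skimming the thread, a rough sketch of the kind of create/alter/drop table operations such a metadata API covers; all names and signatures below are illustrative placeholders, not the proposal itself, and a plain string stands in for whatever identifier type the SPIP uses.

import org.apache.spark.sql.types.StructType

trait CatalogTableHandle {
  def name(): String
  def schema(): StructType
}

trait TableMetadataCatalog {
  def loadTable(ident: String): CatalogTableHandle
  def createTable(ident: String, schema: StructType, properties: Map[String, String]): CatalogTableHandle
  def alterTable(ident: String, propertyChanges: Map[String, String]): CatalogTableHandle
  def dropTable(ident: String): Boolean
}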

[DISCUSS] Multiple catalog support

2018-07-23 Thread Ryan Blue
continue to use the property to determine the table’s data source or format implementation. Other table catalog implementations would be free to interpret the format string as they choose or to use it to choose a data source implementation as in the default catalog. rb ​ -- Ryan Blue Software Engineer Netflix

[RESULT] [VOTE] SPIP: Standardize SQL logical plans

2018-07-20 Thread Ryan Blue
This vote passes with 4 binding +1s and 9 community +1s. Thanks for taking the time to vote, everyone! Binding votes: Wenchen Fan Xiao Li Reynold Xin Felix Cheung Non-binding votes: Ryan Blue John Zhuge Takeshi Yamamuro Marco Gaido Russel Spitzer Alessandro Solimando Henry Robinson Dongjoon

Re: [VOTE] SPARK 2.3.2 (RC3)

2018-07-18 Thread Ryan Blue
>>>>>> Note. RC2 was cancelled because of one blocking issue SPARK-24781 during release preparation. [...] The current list of open tickets targeted at 2.3.2 can be found at: https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" = 2.3.2. Committers should look at those and triage. [...] -- Ryan Blue Software Engineer Netflix

Re: [VOTE] SPIP: Standardize SQL logical plans

2018-07-17 Thread Ryan Blue
+1 (not binding) On Tue, Jul 17, 2018 at 10:59 AM Ryan Blue wrote: > Hi everyone, > > From discussion on the proposal doc and the discussion thread, I think we > have consensus around the plan to standardize logical write operations for > DataSourceV2. I would like

Re: [DISCUSS] SPIP: Standardize SQL logical plans

2018-07-17 Thread Ryan Blue
ple can > jump > > in during the development. I'm interested in the new API and like to > work on > > it after the vote passes. > > > > Thanks, > > Wenchen > > > > On Fri, Jul 13, 2018 at 7:25 AM Ryan Blue wrote: > >> > >> Thanks! I'm a

[VOTE] SPIP: Standardize SQL logical plans

2018-07-17 Thread Ryan Blue
rk should adopt the SPIP [-1]: Spark should not adopt the SPIP because . . . Thanks for voting, everyone! -- Ryan Blue

Re: [DISCUSS] SPIP: Standardize SQL logical plans

2018-07-12 Thread Ryan Blue
hanks, > Wenchen > > On Fri, Apr 20, 2018 at 5:01 AM Ryan Blue > wrote: > >> Hi everyone, >> >> A few weeks ago, I wrote up a proposal to standardize SQL logical plans >> <https://docs.google.com/document/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit?

Re: Time for 2.3.2?

2018-06-28 Thread Ryan Blue
>>>>>> stream-stream >>>>>> > join. Users can hit this bug if one of the join side is partitioned >>>>>> by a >>>>>> > subset of the join keys. > >>>>>> > SPARK-24552: Task at

Re: Very slow complex type column reads from parquet

2018-06-18 Thread Ryan Blue
level of parallelism (searching for a given object id when sorted > by time needs to scan all/more the groups for larger times). > One question here - is Parquet reader reading & decoding the projection > columns even if the predicate columns should filter the record out? > > Unfortunatel

Re: Very slow complex type column reads from parquet

2018-06-12 Thread Ryan Blue
e know if there is anybody currently working on it > or maybe you have it in a roadmap for the future? > Or maybe you could give me some suggestions how to avoid / resolve this > problem? I’m using Spark 2.2.1. > > Best regards, > Jakub Wozniak > > > > >

Re: eager execution and debuggability

2018-05-21 Thread Ryan Blue
nodes and see how many jobs > was triggered and number of tasks and their duration. Now it's hard to > debug it, especially for newbies. > > Pozdrawiam / Best regards, > Tomek Gawęda > > > On 2018-05-10 18:31, Ryan Blue wrote: > > > it would be fantastic if

Re: Time for 2.3.1?

2018-05-11 Thread Ryan Blue
(by > >> replying here or updating the bug in Jira), otherwise I'm volunteering > >> to prepare the first RC soon-ish (around the weekend). > >> > >> Thanks! > >> > >> > >> -- > >> Marcelo > >> > >> -----

Re: Time for 2.3.1?

2018-05-10 Thread Ryan Blue
e weekend). > > Thanks! > > > -- > Marcelo > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > -- Ryan Blue Software Engineer Netflix

Re: eager execution and debuggability

2018-05-10 Thread Ryan Blue
end that triggered the >>> error. >>> >>> I don’t know how feasible this is, but addressing it would directly >>> solve the issue of linking failures to the responsible transformation, as >>> opposed to leaving the user to break up a chain of trans

Re: [DISCUSS] Spark SQL internal data: InternalRow or UnsafeRow?

2018-05-08 Thread Ryan Blue
ent types of rows, so we forced > the conversion at input. > > Can't your "wish" be satisfied by having the public API producing the > internals of UnsafeRow (without actually exposing UnsafeRow)? > > > On Tue, May 8, 2018 at 4:16 PM Ryan Blue <rb...@netflix.com>

Re: [DISCUSS] Spark SQL internal data: InternalRow or UnsafeRow?

2018-05-08 Thread Ryan Blue
, Reynold Xin <r...@databricks.com> wrote: > What the internal operators do are strictly internal. To take one step > back, is the goal to design an API so the consumers of the API can directly > produces what Spark expects internally, to cut down perf cost? > > > On Tue, May 8, 2

Re: eager execution and debuggability

2018-05-08 Thread Ryan Blue
I've opened SPARK-24215 to track this. On Tue, May 8, 2018 at 3:58 PM, Reynold Xin <r...@databricks.com> wrote: > Yup. Sounds great. This is something simple Spark can do and provide huge > value to the end users. > > > On Tue, May 8, 2018 at 3:53 PM Ryan Blue <

Re: eager execution and debuggability

2018-05-08 Thread Ryan Blue
are tools/ways to force the >>> execution, helping in the debugging phase. So they can achieve without a >>> big effort the same result, but with a big difference: they are aware of >>> what is really happening, which may help them later. >>> >>> Thanks

[DISCUSS] Spark SQL internal data: InternalRow or UnsafeRow?

2018-05-08 Thread Ryan Blue
nalRow, then there is an easy performance win that also simplifies the v2 data source API. rb ​ -- Ryan Blue Software Engineer Netflix
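For context on the InternalRow-vs-UnsafeRow question, a minimal example of a source-side row built as a generic InternalRow, leaving any UnsafeRow conversion to Spark. These are internal Spark classes; this is a sketch only, not part of any proposed API.

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
import org.apache.spark.unsafe.types.UTF8String

object InternalRowExample {
  // InternalRow stores strings as UTF8String; a generic row like this is cheap
  // to build and does not require the source to produce UnsafeRow itself.
  def rowFor(id: Long, name: String): InternalRow =
    new GenericInternalRow(Array[Any](id, UTF8String.fromString(name)))
}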

Re: eager execution and debuggability

2018-05-08 Thread Ryan Blue
also struggle in similar >> ways as these students. While eager execution is really not practical in >> big data, in learning environments or in development against small, sampled >> datasets it can be pretty helpful. >> >> >> >> >> >> >> >> >> >> > -- Ryan Blue Software Engineer Netflix

Re: org.apache.spark.shuffle.FetchFailedException: Too large frame:

2018-05-03 Thread Ryan Blue
or shouldn't > come. Let me know if this understanding is correct > > On Tue, May 1, 2018 at 9:37 PM, Ryan Blue <rb...@netflix.com> wrote: > >> This is usually caused by skew. Sometimes you can work around it by >> increasing the number of partitions like you tri

Re: Datasource API V2 and checkpointing

2018-05-01 Thread Ryan Blue
re to Spark users. So a design that we can't migrate > file sources to without a side channel would be worrying; won't we end up > regressing to the same situation? > > On Mon, Apr 30, 2018 at 11:59 AM, Ryan Blue <rb...@netflix.com> wrote: > >> Should we really plan the API fo

Re: org.apache.spark.shuffle.FetchFailedException: Too large frame:

2018-05-01 Thread Ryan Blue
kFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:419) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:349) > > -- Ryan Blue Software Engineer Netflix

Re: Datasource API V2 and checkpointing

2018-04-30 Thread Ryan Blue
has ever read. In order > to parse this efficiently, the stream connector needs detailed control over > how it's stored; the current implementation even has complex > compactification and retention logic. > > > On Mon, Apr 30, 2018 at 10:48 AM, Ryan Blue <rb...@netflix.com> wrot

Re: Datasource API V2 and checkpointing

2018-04-30 Thread Ryan Blue
>> >> After enabling checkpointing, I do see a folder being created under the >> checkpoint folder, but there's nothing else in there. >> >> >> >> Same question for write-ahead and recovery? >> >> And on a restart from a failed streaming session - who should set the >> offsets? >> >> The driver/Spark or the datasource? >> >> >> >> Any pointers to design docs would also be greatly appreciated. >> >> >> >> Thanks, >> >> Jayesh >> >> >> > > -- Ryan Blue Software Engineer Netflix

Re: Correlated subqueries in the DataFrame API

2018-04-19 Thread Ryan Blue
ache.org/jira/browse/SPARK-18455>, but it's not clear > to me whether they are "design-appropriate" for the DataFrame API. > > Are correlated subqueries a thing we can expect to have in the DataFrame > API? > > Nick > > -- Ryan Blue Software Engineer Netflix
