Datasource V2 support in Spark 3.x

2020-02-28 Thread Mihir Sahu
Hi Team, Before developing a new datasource for Spark 3.x, I wanted to know: should it be built using Datasource V2 or Datasource V1 (via Relation), or is there another plan? When I tried to build a datasource using V2 for Spark 3.0, I could not find the associated classes and they seem to be

Re: [DISCUSS] Remove multiple workers on the same host support from Standalone backend

2020-02-28 Thread Sean Owen
I'll admit, I didn't know you could deploy multiple workers per machine. I agree, I don't see the use case for it. Multiple executors, yes of course. And I guess you could imagine multiple distinct Spark clusters running a worker on one machine. I don't have an informed opinion therefore, but

[DISCUSS] Remove multiple workers on the same host support from Standalone backend

2020-02-28 Thread Xingbo Jiang
Hi all, Based on my experience, there is no scenario that necessarily requires deploying multiple Workers on the same node with the Standalone backend. A worker should book all the resources reserved for Spark on the host where it is launched, then it can allocate those resources to one or more executors

Re: [Proposal] Modification to Spark's Semantic Versioning Policy

2020-02-28 Thread Sean Owen
On Fri, Feb 28, 2020 at 12:03 PM Holden Karau wrote: >> 1. Could you estimate how many revert commits are required in >> `branch-3.0` for the new rubric? Fair question about what actual change this implies for 3.0. So far it seems like some targeted, quite reasonable reverts. I don't think

Re: [Proposal] Modification to Spark's Semantic Versioning Policy

2020-02-28 Thread Holden Karau
On Fri, Feb 28, 2020 at 9:48 AM Dongjoon Hyun wrote: > Hi, Matei and Michael. > > I'm also a big supporter of policy-based project management. > > Before going further, > > 1. Could you estimate how many revert commits are required in > `branch-3.0` for the new rubric? > 2. Are you going to

Re: [Proposal] Modification to Spark's Semantic Versioning Policy

2020-02-28 Thread Dongjoon Hyun
Hi, Matei and Michael. I'm also a big supporter of policy-based project management. Before going further, 1. Could you estimate how many revert commits are required in `branch-3.0` for the new rubric? 2. Are you going to revert all removed test cases for the deprecated ones? 3. Does it

Re: GitHub action permissions

2020-02-28 Thread Tom Graves
No, I couldn't see that button. It looks like the process of syncing in gitbox didn't finish with my accounts. I finished that and it's working now. Thanks, Tom On Friday, February 28, 2020, 09:39:12 AM CST, Dongjoon Hyun wrote: Hi, Thomas. If you log in with a GitHub account registered

Re: GitHub action permissions

2020-02-28 Thread Dongjoon Hyun
Hi, Thomas. If you log in with a GitHub account registered as an Apache project member, that will be enough. On some Apache Spark PRs, can you see the 'Squash and merge' button? Bests, Dongjoon On Fri, Feb 28, 2020 at 07:15 Thomas graves wrote: > Does anyone know how the GitHub action permissions

GitHub action permissions

2020-02-28 Thread Thomas graves
Does anyone know how the GitHub action permissions are set up? I see a lot of random failures and want to be able to rerun them, but I don't seem to have a "rerun" button like some folks do. Thanks, Tom

Keytab, Proxy User & Principal

2020-02-28 Thread Lars Francke
Hi, I understand that we forbid specifying "principal" & "proxy user" at the same time because the current logic would just stage the keytab and the proxy user could then use that to gain full access circumventing any security. But we have a use-case for Livy where a different semantic would be

Re: dropDuplicates and watermark in structured streaming

2020-02-28 Thread Tathagata Das
Why do you have two watermarks? Once you apply the watermark to a column (i.e., "time"), it can be used in all later operations as long as the column is preserved. So the above code should be equivalent to df.withWatermark("time","window size").dropDuplicates("id").groupBy(window("time","window
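The single-watermark chain described in the reply above might look roughly like the following in PySpark. This is a minimal sketch, not the code from the thread: the function name `dedupe_then_count`, the `window_size` parameter, and the final `count()` aggregation are assumptions made for illustration, since the preview is cut off before the aggregation is shown. The column names "id" and "time" are the ones mentioned in the reply.

```python
def dedupe_then_count(events, window_size):
    """Deduplicate by "id", then count rows per event-time window.

    `events` is assumed to be a DataFrame with an "id" column and a
    timestamp column named "time". `window_size` is a duration string
    such as "10 minutes". Both names are illustrative assumptions.
    """
    # Imported here so that defining the function does not itself need Spark.
    from pyspark.sql.functions import window

    return (
        events
        # Declare the watermark once, on the event-time column.
        .withWatermark("time", window_size)
        # "time" is still present after deduplication, so (per the reply)
        # the watermark declared above remains usable by the next step.
        .dropDuplicates(["id"])
        .groupBy(window("time", window_size))
        # Placeholder aggregation: the original one is not shown in the preview.
        .count()
    )
```

On a streaming DataFrame this could be called as `dedupe_then_count(stream_df, "10 minutes")`. The point of the sketch is only that `withWatermark` appears once, before both the deduplication and the windowed aggregation.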