Re: [DISCUSS] Adaptive execution in Spark SQL

2018-07-30 Thread Wenchen Fan
Hi Carson and Yuanjian,

Thanks for contributing to this project and sharing the production use
cases! I believe the adaptive execution will be a very important feature of
Spark SQL and will definitely benefit a lot of users.

I went through the design docs and the high-level design totally makes
sense to me. Since the code freeze of Spark 2.4 is close, I'm afraid we may
not have enough time to review the code and merge it. How about we target
this feature for Spark 3.0?

Besides, it would be great if we can have some real benchmark numbers for
it.

Thanks,
Wenchen

On Tue, Jul 31, 2018 at 2:26 PM Yuanjian Li  wrote:

> Thanks Carson, great note!
> Baidu has actually ported this patch to our internal fork. I collected
> some use cases and the performance improvements we saw during Baidu's
> internal use of this patch, summarized in the following 3 scenarios:
> 1. SortMergeJoin to BroadcastJoin
> Transforming a SortMergeJoin into a BroadcastJoin deep in the plan tree
> brings us a 50% to 200% boost in query performance. This strategy always
> hits BI scenarios such as joining several tables with filters in
> subqueries.
> 2. Long-running applications or using Spark as a service
> Here, a long-running application means one whose duration is close to
> 1 hour. Using Spark as a service means using spark-shell and repeatedly
> submitting SQL, or using a service built on Spark such as Zeppelin, Livy,
> or our internal SQL service Baidu BigSQL. In such scenarios all Spark jobs
> share the same partition number, so enabling AE and adding configs for the
> expected task info (data size, row number, min/max partition number,
> etc.) brings us a 50%-100% performance boost.
> 3. GraphFrame jobs
> The last scenario is an application that uses GraphFrame. In this case the
> user has a 2-dimensional graph with 1 billion edges and runs the connected
> components algorithm in GraphFrame. With AE enabled, the duration of the
> app drops from 58 min to 32 min, almost a 100% performance boost.
>
> The detailed screenshots and configs are in the PDF attached to JIRA
> SPARK-23128.
>
> Thanks,
> Yuanjian Li
>
> On Sat, Jul 28, 2018 at 12:49 AM, Wang, Carson wrote:
>
>> Dear all,
>>
>>
>>
>> The initial support for adaptive execution [SPARK-9850] in Spark SQL has
>> been there since Spark 1.6, but there have been no further updates since
>> then. One of the key features of adaptive execution is determining the
>> number of reducers automatically at runtime. This feature is required by
>> many Spark users, especially infrastructure teams in many companies, as
>> there are thousands of queries running on the cluster where the shuffle
>> partition number may not be set properly for every query. The same shuffle
>> partition number also doesn’t work well for all stages in a query because
>> each stage has a different input data size. Other features of adaptive
>> execution include optimizing the join strategy at runtime and handling
>> skewed joins automatically, which have not been implemented in Spark.
>>
>>
>>
>> In the current implementation, an ExchangeCoordinator is used to
>> determine the number of post-shuffle partitions for a stage. However, the
>> ExchangeCoordinator is added when an Exchange is being added, so it
>> lacks a global picture of all shuffle dependencies of a post-shuffle
>> stage. For example, for a join of 3 tables in a single stage, the same
>> ExchangeCoordinator should be used by all three Exchanges, but currently
>> two separate ExchangeCoordinators will be added. It also adds additional
>> Exchanges in some cases. So I think it is time to rethink how to better
>> support adaptive execution in Spark SQL. I have proposed a new approach in
>> SPARK-23128, with a document describing the idea. How to change a sort
>> merge join to a broadcast hash join at runtime is described in a separate
>> doc.
>>
>>
>>
>>
>> The docs have been there for a while, and I also have an implementation
>> based on Spark 2.3 available at
>> https://github.com/Intel-bigdata/spark-adaptive. The code is split into
>> 7 PRs labeled AE2.3-0x if you look at the pull requests. I asked many
>> partners, including Baidu, Alibaba, and JD.com, to evaluate the patch and
>> received very good feedback. Baidu also shared their results on the JIRA.
>> We also finished a 100 TB TPC-DS benchmark earlier using the patch, which
>> passed all queries with good performance improvement.
>>
>>
>>
>> I’d like to call for a review of the docs and even the code, and we can
>> discuss further in this thread. Thanks very much!
>>
>>
>>
>> Thanks,
>>
>> Carson
>>
>>
>>
>


Re: Data source V2

2018-07-30 Thread Wenchen Fan
Hi assaf,

Thanks for trying data source v2! Data source v2 is still evolving (we
marked all the data source v2 interfaces as @Evolving), and we've already
made a lot of API changes in this release (some renaming, switching to
InternalRow, etc.). So I'd not encourage people to use data source v2 in
long-term production until we mark data source v2 as stable (or
experimental at least). SPARK-24882 is also an API change, and I'd say
people should implement data sources after it gets merged or rejected.

About metrics, it should be easy to add a mixin interface to report metrics.
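
To illustrate what a mixin interface for metrics could look like, here is a minimal sketch in Python. It is not the data source v2 API: the names `ReportsMetrics`, `CountingReader`, `collect_metrics` and the `rowsRead` key are all hypothetical, chosen only to show the opt-in pattern.

```python
class ReportsMetrics:
    """Hypothetical mixin: a reader opts in by implementing current_metrics()."""

    def current_metrics(self):
        raise NotImplementedError


class CountingReader(ReportsMetrics):
    """Toy reader that counts the rows it hands out."""

    def __init__(self, rows):
        self._rows = iter(rows)
        self._rows_read = 0

    def __iter__(self):
        return self

    def __next__(self):
        row = next(self._rows)  # StopIteration ends the iteration as usual
        self._rows_read += 1
        return row

    def current_metrics(self):
        return {"rowsRead": self._rows_read}


def collect_metrics(reader):
    """Engine side: only readers that mix in ReportsMetrics are polled."""
    if isinstance(reader, ReportsMetrics):
        return reader.current_metrics()
    return {}


reader = CountingReader(["a", "b", "c"])
rows = list(reader)
print(collect_metrics(reader))
```

A reader that does not mix in the interface simply reports nothing, so existing sources would keep working unchanged.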

Thanks,
Wenchen

On Tue, Jul 31, 2018 at 2:07 PM assaf.mendelson 
wrote:

> Hi all,
> I am currently in the middle of developing a new data source (for an
> internal tool) using data source V2.
> I noticed that SPARK-24882 is planned for 2.4 and includes interface
> changes.
>
> I was wondering if those are planned in addition to the current interfaces
> or are aimed to replace them (specifically the most basic reading, as this
> is what I am using).
>
> As a side note, I was wondering if there is any means to expose metrics
> from the data source, e.g. I would like to expose a metric of the number of
> rows read to the application (currently I am adding a per-partition index
> column and doing a custom idempotent accumulator which collects the maximum
> index for each partition).
>
> Thanks,
> Assaf.
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [DISCUSS] Adaptive execution in Spark SQL

2018-07-30 Thread Yuanjian Li
Thanks Carson, great note!
Baidu has actually ported this patch to our internal fork. I collected some
use cases and the performance improvements we saw during Baidu's internal
use of this patch, summarized in the following 3 scenarios:
1. SortMergeJoin to BroadcastJoin
Transforming a SortMergeJoin into a BroadcastJoin deep in the plan tree
brings us a 50% to 200% boost in query performance. This strategy always
hits BI scenarios such as joining several tables with filters in
subqueries.
2. Long-running applications or using Spark as a service
Here, a long-running application means one whose duration is close to
1 hour. Using Spark as a service means using spark-shell and repeatedly
submitting SQL, or using a service built on Spark such as Zeppelin, Livy, or
our internal SQL service Baidu BigSQL. In such scenarios all Spark jobs
share the same partition number, so enabling AE and adding configs for the
expected task info (data size, row number, min/max partition number,
etc.) brings us a 50%-100% performance boost.
3. GraphFrame jobs
The last scenario is an application that uses GraphFrame. In this case the
user has a 2-dimensional graph with 1 billion edges and runs the connected
components algorithm in GraphFrame. With AE enabled, the duration of the
app drops from 58 min to 32 min, almost a 100% performance boost.
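
To make scenario 1 concrete, here is a minimal Python sketch of the decision involved. It is not the patch's code. The idea is that once a shuffle stage has finished, its real output size is known, so a planned sort merge join can be swapped for a broadcast join if one side turns out to be small. The function name and the 10 MB threshold are assumptions for illustration (10 MB matches the default of `spark.sql.autoBroadcastJoinThreshold`, but the patch may use its own setting).

```python
# Hypothetical sketch: re-pick the join strategy from sizes measured at runtime.
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024  # assumed threshold (10 MB)


def choose_join_strategy(left_bytes, right_bytes,
                         threshold=BROADCAST_THRESHOLD_BYTES):
    """Return (strategy, side_to_broadcast) based on measured input sizes."""
    smaller_side = "left" if left_bytes <= right_bytes else "right"
    if min(left_bytes, right_bytes) <= threshold:
        return ("BroadcastHashJoin", smaller_side)
    return ("SortMergeJoin", None)


# A filter in a subquery shrank the right side to 2 MB at runtime.
print(choose_join_strategy(5 * 1024**3, 2 * 1024**2))
```

This is why the gain shows up mostly in BI-style queries: filters make the actual size of a join input much smaller than the size estimated at planning time.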

The detailed screenshots and configs are in the PDF attached to JIRA
SPARK-23128.

Thanks,
Yuanjian Li

On Sat, Jul 28, 2018 at 12:49 AM, Wang, Carson wrote:

> Dear all,
>
>
>
> The initial support for adaptive execution [SPARK-9850] in Spark SQL has
> been there since Spark 1.6, but there have been no further updates since
> then. One of the key features of adaptive execution is determining the
> number of reducers automatically at runtime. This feature is required by
> many Spark users, especially infrastructure teams in many companies, as
> there are thousands of queries running on the cluster where the shuffle
> partition number may not be set properly for every query. The same shuffle
> partition number also doesn’t work well for all stages in a query because
> each stage has a different input data size. Other features of adaptive
> execution include optimizing the join strategy at runtime and handling
> skewed joins automatically, which have not been implemented in Spark.
>
>
>
> In the current implementation, an ExchangeCoordinator is used to
> determine the number of post-shuffle partitions for a stage. However, the
> ExchangeCoordinator is added when an Exchange is being added, so it
> lacks a global picture of all shuffle dependencies of a post-shuffle
> stage. For example, for a join of 3 tables in a single stage, the same
> ExchangeCoordinator should be used by all three Exchanges, but currently
> two separate ExchangeCoordinators will be added. It also adds additional
> Exchanges in some cases. So I think it is time to rethink how to better
> support adaptive execution in Spark SQL. I have proposed a new approach in
> SPARK-23128, with a document describing the idea. How to change a sort
> merge join to a broadcast hash join at runtime is described in a separate
> doc.
>
>
>
>
> The docs have been there for a while, and I also have an implementation
> based on Spark 2.3 available at
> https://github.com/Intel-bigdata/spark-adaptive. The code is split into 7
> PRs labeled AE2.3-0x if you look at the pull requests. I asked many
> partners, including Baidu, Alibaba, and JD.com, to evaluate the patch and
> received very good feedback. Baidu also shared their results on the JIRA.
> We also finished a 100 TB TPC-DS benchmark earlier using the patch, which
> passed all queries with good performance improvement.
>
>
>
> I’d like to call for a review of the docs and even the code, and we can
> discuss further in this thread. Thanks very much!
>
>
>
> Thanks,
>
> Carson
>
>
>
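
Carson's note above describes picking the number of reducers at runtime. As a rough illustration, the Python sketch below packs adjacent shuffle partitions into post-shuffle partitions of about a target size, and sums sizes across all shuffle inputs of a stage first, which is the "global picture" the note says is missing. This is an assumption-laden sketch, not the code from SPARK-9850 or SPARK-23128; the function names and the greedy packing rule are illustrative.

```python
def coalesce_partitions(partition_bytes, target_bytes):
    """Greedily pack adjacent shuffle partitions into groups of ~target_bytes.

    Returns the start index of each post-shuffle partition.
    """
    starts = []
    current = 0
    for i, size in enumerate(partition_bytes):
        # Open a new group at the very start, or when adding this
        # partition would push the current group over the target.
        if not starts or current + size > target_bytes:
            starts.append(i)
            current = size
        else:
            current += size
    return starts


def coalesce_for_stage(per_exchange_sizes, target_bytes):
    """Use the combined size of every shuffle input feeding one stage."""
    combined = [sum(sizes) for sizes in zip(*per_exchange_sizes)]
    return coalesce_partitions(combined, target_bytes)


# Six pre-shuffle partitions collapse into two reducers.
print(coalesce_partitions([10, 20, 70, 5, 5, 90], 100))
# Three join inputs in one stage are sized together, not one by one.
print(coalesce_for_stage([[10, 20, 30], [40, 10, 30], [10, 10, 40]], 100))
```

The number of reducers is then simply the length of the returned list, so it can differ per stage and per query without anyone tuning the shuffle partition number by hand.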


Re: Review notification bot

2018-07-30 Thread Holden Karau
Activeness is something that also came up in the Beam project POC I'm doing
for the same bot (I filtered it down to contributors active in the last year
only).
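
A minimal sketch of that kind of activeness filter, in Python. It is not the bot's actual code: the function name, the input shape (author mapped to date of last commit) and the 365-day default are assumptions for illustration.

```python
from datetime import date, timedelta


def active_authors(last_commit_by_author, today, max_age_days=365):
    """Keep only authors whose most recent commit is within max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(
        author
        for author, last_commit in last_commit_by_author.items()
        if last_commit >= cutoff
    )


history = {
    "alice": date(2018, 6, 1),
    "bob": date(2016, 1, 15),   # inactive for more than a year
    "carol": date(2017, 8, 1),
}
print(active_authors(history, today=date(2018, 7, 30)))
```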

On Mon, Jul 30, 2018 at 11:08 PM, Jungtaek Lim  wrote:

> Sorry to chime in, just my 2 cents on this since it looks like an
> interesting topic.
>
> Just to share my habit as a contributor (to various projects): I
> don't use "git history" or "git blame" to find the authors of a file and
> ping them for review. I just ping active committers who recently merged
> pull requests (as well as active contributors) for the specific component,
> assuming committers don't merge patches blindly and so have an overall
> understanding of the component's codebase. I guess it is not necessary for
> an individual committer to cover the whole codebase of a component, but
> ideally the active committers for a component should together be able to
> cover it.
>
> From a contributor's point of view, the main concern is who can be the
> "merger" for my patch. Hundreds of comments from contributors would make
> the code better, but they don't lead to an actual change unless at least
> one committer who can be a merger jumps into the PR and reviews.
>
> I love the concept of leading existing contributors to review the codebase
> they know about. One thing which may also be worth considering is that, in
> open source projects, it is very common for individuals to (implicitly or
> explicitly) stop contributing to the project for various reasons, so taking
> activeness (or date of last commit) into account would be ideal.
>
> I admit the above might be ideal rather than realistic, but I'm just
> thinking out loud to make the review notification bot more useful for
> contributors and less annoying for others.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> On Tue, Jul 31, 2018 at 2:46 PM, Holden Karau wrote:
>
>> On Mon, Jul 30, 2018 at 10:22 PM, Reynold Xin 
>> wrote:
>>
>>> I like the idea of this bot, but I'm somewhat annoyed by it. I have
>>> touched a lot of files and wrote a lot of the original code. Everyday I
>>> wake up I get a lot of emails from this bot.
>>>
>> We could blacklist the existing PMC (or add a rate limit)?
>>
>>>
>>> Also if we are going to use this, can we rename the bot to something
>>> like spark-bot, rather than holden's personal bot?
>>>
>> I originally did that, but GitHub told me I could only have one personal
>> and one bot account. If someone else registered the spark-mention-bot I'd
>> be happy to switch it to that.
>>
>>>
>>> On Mon, Jul 30, 2018 at 10:18 PM Hyukjin Kwon 
>>> wrote:
>>>
 > That being said the folks being pinged are not just committers.

 I doubt it because only pinged ones I see are all committers and that's
 why I assumed the pinging is based on who committed the PR (which implies
 committer only).
 Do you maybe have some examples where non-committers were pinged? Looks
 at least, (almost?) all of them are committers and something needs to be
 fixed even if so.

 I recently argued about pinging things before - sounds it matters if it
 annoys. Since pinging is completely optional and cc'ing someone else might
 need other contexts not
 only assuming from the blame and who committed this, I am actually not
 super happy with that pinging for now. I was slightly supportive for this
 idea but now I actually slightly
 became negative on this after observing how it goes in practice.

 I wonder how other people think on this.



 On Tue, Jul 31, 2018 at 12:33 PM, Holden Karau wrote:

> So CODEOWNERS is limited to committers by GitHub. We can definitely
> modify the config file though and I'm happy to write some custom logic if
> it helps support our needs. We can also just turn it off if it's too noisy
> for folks in general.
>
> That being said the folks being pinged are not just committers. The
> hope is to get more code authors who aren't committers involved in the
> reviews and then eventually become committers.
>
> On Mon, Jul 30, 2018, 9:09 PM Hyukjin Kwon 
> wrote:
>
>> *reviewers: I mean people who committed the PR given my observation.
>>
>> On Tue, Jul 31, 2018 at 11:50 AM, Hyukjin Kwon wrote:
>>
>>> I was wondering if we can leave the configuration open and accept
>>> some custom configurations, IMHO, because I saw some people less related
>>> or less active are consistently pinged. Just started to get worried if
>>> they get annoyed by this.
>>> Also, some people could be interested in few specific areas. They
>>> should get pinged too.
>>> Also, assuming from people pinged, seems they are reviewers (which
>>> basically means committers I guess). Was wondering if there's a big
>>> difference between codeowners and bots.
>>>
>>>
>>>
>>> On Tue, Jul 31, 2018 at 11:38 AM, Holden Karau wrote:
>>>
 The configuration file is optional, is there something you want to
 try

Re: Review notification bot

2018-07-30 Thread Jungtaek Lim
Sorry to chime in, just my 2 cents on this since it looks like an interesting
topic.

Just to share my habit as a contributor (to various projects): I
don't use "git history" or "git blame" to find the authors of a file and ping
them for review. I just ping active committers who recently merged pull
requests (as well as active contributors) for the specific component, assuming
committers don't merge patches blindly and so have an overall
understanding of the component's codebase. I guess it is not necessary for
an individual committer to cover the whole codebase of a component, but
ideally the active committers for a component should together be able to
cover it.
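
The habit described above, pinging committers who recently merged pull requests for a component, could be sketched in Python as follows. The function name, the tuple layout of the merge log and the `top_n` cut-off are hypothetical, only there to show the selection rule.

```python
from collections import Counter
from datetime import date


def recent_mergers(merges, component, since, top_n=3):
    """Rank committers by how many PRs they merged for a component recently.

    merges: iterable of (committer, component, merge_date) tuples.
    """
    counts = Counter(
        committer
        for committer, merged_component, merged_on in merges
        if merged_component == component and merged_on >= since
    )
    return [committer for committer, _ in counts.most_common(top_n)]


merge_log = [
    ("ann", "sql", date(2018, 7, 1)),
    ("ann", "sql", date(2018, 7, 10)),
    ("ben", "sql", date(2018, 6, 20)),
    ("cat", "core", date(2018, 7, 5)),
    ("ben", "sql", date(2017, 1, 1)),  # too old to count
]
print(recent_mergers(merge_log, "sql", since=date(2018, 1, 1)))
```

Unlike a blame-based lookup, this only ever suggests people who are able to merge the patch.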

From a contributor's point of view, the main concern is who can be the
"merger" for my patch. Hundreds of comments from contributors would make the
code better, but they don't lead to an actual change unless at least one
committer who can be a merger jumps into the PR and reviews.

I love the concept of leading existing contributors to review the codebase
they know about. One thing which may also be worth considering is that, in
open source projects, it is very common for individuals to (implicitly or
explicitly) stop contributing to the project for various reasons, so taking
activeness (or date of last commit) into account would be ideal.

I admit the above might be ideal rather than realistic, but I'm just thinking
out loud to make the review notification bot more useful for contributors and
less annoying for others.

Thanks,
Jungtaek Lim (HeartSaVioR)

On Tue, Jul 31, 2018 at 2:46 PM, Holden Karau wrote:

> On Mon, Jul 30, 2018 at 10:22 PM, Reynold Xin  wrote:
>
>> I like the idea of this bot, but I'm somewhat annoyed by it. I have
>> touched a lot of files and wrote a lot of the original code. Everyday I
>> wake up I get a lot of emails from this bot.
>>
> We could blacklist the existing PMC (or add a rate limit)?
>
>>
>> Also if we are going to use this, can we rename the bot to something like
>> spark-bot, rather than holden's personal bot?
>>
> I originally did that, but GitHub told me I could only have one personal
> and one bot account. If someone else registered the spark-mention-bot I'd
> be happy to switch it to that.
>
>>
>> On Mon, Jul 30, 2018 at 10:18 PM Hyukjin Kwon 
>> wrote:
>>
>>> > That being said the folks being pinged are not just committers.
>>>
>>> I doubt it because only pinged ones I see are all committers and that's
>>> why I assumed the pinging is based on who committed the PR (which implies
>>> committer only).
>>> Do you maybe have some examples where non-committers were pinged? Looks
>>> at least, (almost?) all of them are committers and something needs to be
>>> fixed even if so.
>>>
>>> I recently argued about pinging things before - sounds it matters if it
>>> annoys. Since pinging is completely optional and cc'ing someone else might
>>> need other contexts not
>>> only assuming from the blame and who committed this, I am actually not
>>> super happy with that pinging for now. I was slightly supportive for this
>>> idea but now I actually slightly
>>> became negative on this after observing how it goes in practice.
>>>
>>> I wonder how other people think on this.
>>>
>>>
>>>
>>> On Tue, Jul 31, 2018 at 12:33 PM, Holden Karau wrote:
>>>
 So CODEOWNERS is limited to committers by GitHub. We can definitely
 modify the config file though and I'm happy to write some custom logic if
 it helps support our needs. We can also just turn it off if it's too noisy
 for folks in general.

 That being said the folks being pinged are not just committers. The
 hope is to get more code authors who aren't committers involved in the
 reviews and then eventually become committers.

 On Mon, Jul 30, 2018, 9:09 PM Hyukjin Kwon  wrote:

> *reviewers: I mean people who committed the PR given my observation.
>
> On Tue, Jul 31, 2018 at 11:50 AM, Hyukjin Kwon wrote:
>
>> I was wondering if we can leave the configuration open and accept
>> some custom configurations, IMHO, because I saw some people less related
>> or less active are consistently pinged. Just started to get worried if
>> they get annoyed by this.
>> Also, some people could be interested in few specific areas. They
>> should get pinged too.
>> Also, assuming from people pinged, seems they are reviewers (which
>> basically means committers I guess). Was wondering if there's a big
>> difference between codeowners and bots.
>>
>>
>>
>> On Tue, Jul 31, 2018 at 11:38 AM, Holden Karau wrote:
>>
>>> The configuration file is optional, is there something you want to
>>> try and change?
>>>
>>> On Mon, Jul 30, 2018 at 7:30 PM Hyukjin Kwon 
>>> wrote:
>>>
 I see. Thanks. I was wondering if I can see the configuration file
 since that looks needed (
 https://github.com/holdenk/mention-bot#configuration) but I
 couldn't find (sorry if it's just something I simply missed).


Data source V2

2018-07-30 Thread assaf.mendelson
Hi all,
I am currently in the middle of developing a new data source (for an
internal tool) using data source V2.
I noticed that SPARK-24882 is planned for 2.4 and
includes interface changes.

I was wondering if those are planned in addition to the current interfaces
or are aimed to replace them (specifically the most basic reading as this is
what I am using).

As a side note, I was wondering if there is any means to expose metrics from
the data source, e.g. I would like to expose a metric of the number of rows
read to the application (currently I am adding a per partition index column
and doing a custom idempotent accumulator which collects the maximum index
for each partition). 
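
The workaround described above can be sketched in plain Python. This is not Spark's accumulator API: the class and method names are hypothetical, and the sketch assumes the per-partition index starts at 0. The point it shows is why taking a maximum is idempotent: if a task is retried and the same rows are seen twice, the result does not change.

```python
class MaxIndexPerPartition:
    """Idempotent accumulator: keeps the largest row index seen per partition."""

    def __init__(self):
        self.max_index = {}  # partition id -> highest row index observed

    def add(self, partition_id, row_index):
        previous = self.max_index.get(partition_id, -1)
        self.max_index[partition_id] = max(previous, row_index)

    def merge(self, other):
        # Merging is also a max, so merging the same update twice is harmless.
        for partition_id, row_index in other.max_index.items():
            self.add(partition_id, row_index)

    def rows_read(self):
        # Indices are assumed to start at 0 within each partition.
        return sum(index + 1 for index in self.max_index.values())


acc = MaxIndexPerPartition()
for attempt in range(2):          # partition 0 is read twice (task retry)
    for row_index in range(3):
        acc.add(0, row_index)
for row_index in range(2):        # partition 1 is read once
    acc.add(1, row_index)
print(acc.rows_read())
```

A plain counter would report 8 rows here because of the retry, while the max-based version reports the 5 rows that actually exist.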

Thanks,
Assaf.




Re: Review notification bot

2018-07-30 Thread Holden Karau
Another thing we could try (if folks would be down for it) is to have it
not actually ping, but instead suggest potential usernames to ping to the
user (e.g. say "suggested reviewers you _may wish to ping_" and then list
them)?
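
A minimal sketch of the suggest-instead-of-ping idea, in Python. The function name and the exact wording of the comment are hypothetical. The only fact relied on is that GitHub notifies a user when a comment contains their name with a leading "@", so the names are written without it.

```python
def suggestion_comment(usernames):
    """Build a PR comment that lists reviewers without notifying them."""
    if not usernames:
        return ""
    # No leading "@" on the names, so GitHub does not treat them as mentions.
    names = ", ".join("`{}`".format(name) for name in usernames)
    return "Suggested reviewers you may wish to ping: " + names


print(suggestion_comment(["alice", "bob"]))
```

The PR author then decides whom to mention, which keeps the context-dependent judgment with a person rather than with the bot.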

On Mon, Jul 30, 2018 at 10:45 PM, Holden Karau  wrote:

>
> On Mon, Jul 30, 2018 at 10:22 PM, Reynold Xin  wrote:
>
>> I like the idea of this bot, but I'm somewhat annoyed by it. I have
>> touched a lot of files and wrote a lot of the original code. Everyday I
>> wake up I get a lot of emails from this bot.
>>
> We could blacklist the existing PMC (or add a rate limit)?
>
>>
>> Also if we are going to use this, can we rename the bot to something like
>> spark-bot, rather than holden's personal bot?
>>
> I originally did that, but GitHub told me I could only have one personal
> and one bot account. If someone else registered the spark-mention-bot I'd
> be happy to switch it to that.
>
>>
>> On Mon, Jul 30, 2018 at 10:18 PM Hyukjin Kwon 
>> wrote:
>>
>>> > That being said the folks being pinged are not just committers.
>>>
>>> I doubt it because only pinged ones I see are all committers and that's
>>> why I assumed the pinging is based on who committed the PR (which implies
>>> committer only).
>>> Do you maybe have some examples where non-committers were pinged? Looks
>>> at least, (almost?) all of them are committers and something needs to be
>>> fixed even if so.
>>>
>>> I recently argued about pinging things before - sounds it matters if it
>>> annoys. Since pinging is completely optional and cc'ing someone else might
>>> need other contexts not
>>> only assuming from the blame and who committed this, I am actually not
>>> super happy with that pinging for now. I was slightly supportive for this
>>> idea but now I actually slightly
>>> became negative on this after observing how it goes in practice.
>>>
>>> I wonder how other people think on this.
>>>
>>>
>>>
>>> On Tue, Jul 31, 2018 at 12:33 PM, Holden Karau wrote:
>>>
 So CODEOWNERS is limited to committers by GitHub. We can definitely
 modify the config file though and I'm happy to write some custom logic if
 it helps support our needs. We can also just turn it off if it's too noisy
 for folks in general.

 That being said the folks being pinged are not just committers. The
 hope is to get more code authors who aren't committers involved in the
 reviews and then eventually become committers.

 On Mon, Jul 30, 2018, 9:09 PM Hyukjin Kwon  wrote:

> *reviewers: I mean people who committed the PR given my observation.
>
> On Tue, Jul 31, 2018 at 11:50 AM, Hyukjin Kwon wrote:
>
>> I was wondering if we can leave the configuration open and accept
>> some custom configurations, IMHO, because I saw some people less related
>> or less active are consistently pinged. Just started to get worried if
>> they get annoyed by this.
>> Also, some people could be interested in few specific areas. They
>> should get pinged too.
>> Also, assuming from people pinged, seems they are reviewers (which
>> basically means committers I guess). Was wondering if there's a big
>> difference between codeowners and bots.
>>
>>
>>
>> On Tue, Jul 31, 2018 at 11:38 AM, Holden Karau wrote:
>>
>>> The configuration file is optional, is there something you want to
>>> try and change?
>>>
>>> On Mon, Jul 30, 2018 at 7:30 PM Hyukjin Kwon 
>>> wrote:
>>>
 I see. Thanks. I was wondering if I can see the configuration file
 since that looks needed (https://github.com/holdenk/mention-bot#configuration)
 but I couldn't find it (sorry if it's just something I simply missed).

 On Tue, Jul 31, 2018 at 1:48 AM, Holden Karau wrote:

> So the one that is running is the fork in my own repo (set up
> for K8s deployment) - http://github.com/holdenk/mention-bot
>
> On Mon, Jul 30, 2018 at 3:15 AM Hyukjin Kwon 
> wrote:
>
>> Holden, so, is it a fork of https://github.com/facebookarchive/mention-bot?
>> Would you mind if I ask where I can see the configurations for it?
>>
>>
>> On Mon, Jul 23, 2018 at 10:16 AM, Holden Karau wrote:
>>
>>> Yeah so the issue with codeowners is it will only assign to
>>> committers on the repo (the Beam project found this out the practical
>>> application way).
>>>
>>> I have a fork of mention bot running and it seems we can add it
>>> (need an infra ticket), but one of the things the Beam folks asked was
>>> to not ping code authors who haven’t committed in the past year, which
>>> I need to do a bit of poking on to make happen.
>>>
>>> On Sun, Jul 22, 2018 at 7:04 PM Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>

Re: Review notification bot

2018-07-30 Thread Holden Karau
On Mon, Jul 30, 2018 at 10:22 PM, Reynold Xin  wrote:

> I like the idea of this bot, but I'm somewhat annoyed by it. I have
> touched a lot of files and wrote a lot of the original code. Everyday I
> wake up I get a lot of emails from this bot.
>
We could blacklist the existing PMC (or add a rate limit)?

>
> Also if we are going to use this, can we rename the bot to something like
> spark-bot, rather than holden's personal bot?
>
I originally did that, but GitHub told me I could only have one personal
and one bot account. If someone else registered the spark-mention-bot I'd
be happy to switch it to that.
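
The two options floated above, skipping certain users entirely or capping how often anyone is pinged, could be combined as in this Python sketch. It is not mention-bot configuration or code; the class name, the per-day window and the in-memory counter are assumptions for illustration.

```python
from collections import defaultdict


class PingLimiter:
    """Hypothetical gate applied before the bot mentions a user."""

    def __init__(self, max_per_day, blocklist=()):
        self.max_per_day = max_per_day
        self.blocklist = set(blocklist)
        self._sent = defaultdict(int)  # (user, day) -> pings already sent

    def allow(self, user, day):
        """Return True and record the ping if the user may be pinged today."""
        if user in self.blocklist:
            return False
        if self._sent[(user, day)] >= self.max_per_day:
            return False
        self._sent[(user, day)] += 1
        return True


limiter = PingLimiter(max_per_day=2, blocklist={"pmc_member"})
results = [limiter.allow("dev", "2018-07-30") for _ in range(3)]
print(results)
```

A rate limit keeps prolific original authors reachable for the occasional PR, while a block list removes them from the bot's candidates altogether.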

>
> On Mon, Jul 30, 2018 at 10:18 PM Hyukjin Kwon  wrote:
>
>> > That being said the folks being pinged are not just committers.
>>
>> I doubt it because only pinged ones I see are all committers and that's
>> why I assumed the pinging is based on who committed the PR (which implies
>> committer only).
>> Do you maybe have some examples where non-committers were pinged? Looks
>> at least, (almost?) all of them are committers and something needs to be
>> fixed even if so.
>>
>> I recently argued about pinging things before - sounds it matters if it
>> annoys. Since pinging is completely optional and cc'ing someone else might
>> need other contexts not
>> only assuming from the blame and who committed this, I am actually not
>> super happy with that pinging for now. I was slightly supportive for this
>> idea but now I actually slightly
>> became negative on this after observing how it goes in practice.
>>
>> I wonder how other people think on this.
>>
>>
>>
>> On Tue, Jul 31, 2018 at 12:33 PM, Holden Karau wrote:
>>
>>> So CODEOWNERS is limited to committers by GitHub. We can definitely
>>> modify the config file though and I'm happy to write some custom logic if
>>> it helps support our needs. We can also just turn it off if it's too noisy
>>> for folks in general.
>>>
>>> That being said the folks being pinged are not just committers. The hope
>>> is to get more code authors who aren't committers involved in the reviews
>>> and then eventually become committers.
>>>
>>> On Mon, Jul 30, 2018, 9:09 PM Hyukjin Kwon  wrote:
>>>
 *reviewers: I mean people who committed the PR given my observation.

 On Tue, Jul 31, 2018 at 11:50 AM, Hyukjin Kwon wrote:

> I was wondering if we can leave the configuration open and accept some
> custom configurations, IMHO, because I saw some people less related or
> less active are consistently pinged. Just started to get worried if they
> get annoyed by this.
> Also, some people could be interested in few specific areas. They
> should get pinged too.
> Also, assuming from people pinged, seems they are reviewers (which
> basically means committers I guess). Was wondering if there's a big
> difference between codeowners and bots.
>
>
>
> On Tue, Jul 31, 2018 at 11:38 AM, Holden Karau wrote:
>
>> The configuration file is optional, is there something you want to try
>> and change?
>>
>> On Mon, Jul 30, 2018 at 7:30 PM Hyukjin Kwon 
>> wrote:
>>
>>> I see. Thanks. I was wondering if I can see the configuration file
>>> since that looks needed (https://github.com/holdenk/mention-bot#configuration)
>>> but I couldn't find it (sorry if it's just something I simply missed).
>>>
>>> On Tue, Jul 31, 2018 at 1:48 AM, Holden Karau wrote:
>>>
 So the one that is running is the fork in my own repo (set up
 for K8s deployment) - http://github.com/holdenk/mention-bot

 On Mon, Jul 30, 2018 at 3:15 AM Hyukjin Kwon 
 wrote:

> Holden, so, is it a fork of https://github.com/facebookarchive/mention-bot?
> Would you mind if I ask where I can see the configurations for it?
>
>
> On Mon, Jul 23, 2018 at 10:16 AM, Holden Karau wrote:
>
>> Yeah so the issue with codeowners is it will only assign to
>> committers on the repo (the Beam project found this out the practical
>> application way).
>>
>> I have a fork of mention bot running and it seems we can add it
>> (need an infra ticket), but one of the things the Beam folks asked was to
>> not ping code authors who haven’t committed in the past year, which I need
>> to do a bit of poking on to make happen.
>>
>> On Sun, Jul 22, 2018 at 7:04 PM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> On this topic, I just stumbled on a GitHub feature called
>>> CODEOWNERS.
>>> It lets you specify owners of specific areas of the repository using the
>>> same syntax that .gitignore uses. Here is CPython's CODEOWNERS
>>> file
>>> 
>

Re: Review notification bot

2018-07-30 Thread Reynold Xin
I like the idea of this bot, but I'm somewhat annoyed by it. I have touched
a lot of files and wrote a lot of the original code. Every day when I wake
up I get a lot of emails from this bot.

Also if we are going to use this, can we rename the bot to something like
spark-bot, rather than holden's personal bot?

On Mon, Jul 30, 2018 at 10:18 PM Hyukjin Kwon  wrote:

> > That being said the folks being pinged are not just committers.
>
> I doubt it, because the only people I see pinged are all committers, and
> that's why I assumed the pinging is based on who committed the PR (which
> implies committers only).
> Do you maybe have some examples where non-committers were pinged? It looks
> like (almost?) all of them are committers at least, and something needs to
> be fixed even if so.
>
> I recently argued about pinging before: it sounds like it matters if it
> annoys people. Since pinging is completely optional, and cc'ing someone
> might need more context than what can be assumed from the blame and who
> committed the change, I am actually not super happy with that pinging for
> now. I was slightly supportive of this idea, but I have actually become
> slightly negative on it after observing how it goes in practice.
>
> I wonder what other people think about this.
>
>
>
> On Tue, Jul 31, 2018 at 12:33 PM, Holden Karau wrote:
>
>> So CODEOWNERS is limited to committers by GitHub. We can definitely
>> modify the config file though and I'm happy to write some custom logic if
>> it helps support our needs. We can also just turn it off if it's too noisy
>> for folks in general.
>>
>> That being said the folks being pinged are not just committers. The hope
>> is to get more code authors who aren't committers involved in the reviews
>> and then eventually become committers.
>>
>> On Mon, Jul 30, 2018, 9:09 PM Hyukjin Kwon  wrote:
>>
>>> *reviewers: I mean people who committed the PR given my observation.
>>>
>>> On Tue, Jul 31, 2018 at 11:50 AM, Hyukjin Kwon wrote:
>>>
 I was wondering if we can leave the configuration open and accept some
 custom configurations, IMHO, because I saw some people less related or less
 active are consistently pinged. Just started to get worried if they get
 annoyed by this.
 Also, some people could be interested in few specific areas. They
 should get pinged too.
 Also, assuming from people pinged, seems they are reviewers (which
 basically means committers I guess). Was wondering if there's a big
 difference between codeowners and bots.



 On Tue, Jul 31, 2018 at 11:38 AM Holden Karau wrote:

> The configuration file is optional, is there something you want to try
> and change?
>
> On Mon, Jul 30, 2018 at 7:30 PM Hyukjin Kwon 
> wrote:
>
>> I see. Thanks. I was wondering if I can see the configuration file
>> since that looks needed (
>> https://github.com/holdenk/mention-bot#configuration) but I couldn't
>> find (sorry if it's just something I simply missed).
>>
>> On Tue, Jul 31, 2018 at 1:48 AM Holden Karau wrote:
>>
>>> So the one that is running is the fork in my own repo (set up
>>> for K8s deployment) - http://github.com/holdenk/mention-bot
>>>
>>> On Mon, Jul 30, 2018 at 3:15 AM Hyukjin Kwon 
>>> wrote:
>>>
 Holden, so, is it a fork in
 https://github.com/facebookarchive/mention-bot? Would you mind if
 I ask where I can see the configurations for it?


 On Mon, Jul 23, 2018 at 10:16 AM Holden Karau wrote:

> Yeah so the issue with codeowners is it will only assign to
> committers on the repo (the Beam project found this out the practical
> application way).
>
> I have a fork of mention bot running and it seems we can add it
> (need an infra ticket), but one of the things the Beam folks asked 
> was to
> not ping code authors who haven’t committed in the past year which I 
> need
> to do a bit of poking on to make happen.
>
> On Sun, Jul 22, 2018 at 7:04 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> On this topic, I just stumbled on a GitHub feature called
>> CODEOWNERS .
>> It lets you specify owners of specific areas of the repository using 
>> the
>> same syntax that .gitignore uses. Here is CPython's CODEOWNERS
>> file
>> 
>> for reference.
>>
>> Dunno if that would complement mention-bot (which Facebook is
>> apparently no longer maintaining
>> ), or if
>> we can even use it given the ASF setup on GitHub. But I thought it 
>> would be
>> worth mentioning nonetheless.
>>
>> On Sat, Jul 14, 2018 at 11:17 AM Hol

Re: Review notification bot

2018-07-30 Thread Hyukjin Kwon
> That being said the folks being pinged are not just committers.

I doubt it, because the only people I see being pinged are committers, which
is why I assumed the pinging is based on who committed the PR (which implies
committers only).
Do you have some examples where non-committers were pinged? It looks like
(almost?) all of them are committers, so something needs to be fixed even if
a few non-committers are pinged.

I have argued about pinging before, and it seems to matter when it annoys
people. Pinging is completely optional, and cc'ing someone may need more
context than can be inferred from the blame and from who committed the
change, so I am not very happy with the pinging for now. I was slightly
supportive of this idea, but I have become slightly negative after
observing how it goes in practice.

I wonder what other people think about this.



On Tue, Jul 31, 2018 at 12:33 PM Holden Karau wrote:

> So CODEOWNERS is limited to committers by GitHub. We can definitely modify
> the config file though and I'm happy to write some custom logic if it helps
> support our needs. We can also just turn it off if it's too noisy for
> folks in general.
>
> That being said the folks being pinged are not just committers. The hope
> is to get more code authors who aren't committers involved in the reviews
> and then eventually become committers.
>
> On Mon, Jul 30, 2018, 9:09 PM Hyukjin Kwon  wrote:
>
>> *reviewers: I mean people who committed the PR given my observation.
>>
>> On Tue, Jul 31, 2018 at 11:50 AM Hyukjin Kwon wrote:
>>
>>> I was wondering if we can leave the configuration open and accept some
>>> custom configurations, IMHO, because I saw some people less related or less
>>> active are consistently pinged. Just started to get worried if they get
>>> annoyed by this.
>>> Also, some people could be interested in few specific areas. They should
>>> get pinged too.
>>> Also, assuming from people pinged, seems they are reviewers (which
>>> basically means committers I guess). Was wondering if there's a big
>>> difference between codeowners and bots.
>>>
>>>
>>>
>>> On Tue, Jul 31, 2018 at 11:38 AM Holden Karau wrote:
>>>
 The configuration file is optional, is there something you want to try
 and change?

 On Mon, Jul 30, 2018 at 7:30 PM Hyukjin Kwon 
 wrote:

> I see. Thanks. I was wondering if I can see the configuration file
> since that looks needed (
> https://github.com/holdenk/mention-bot#configuration) but I couldn't
> find (sorry if it's just something I simply missed).
>
> On Tue, Jul 31, 2018 at 1:48 AM Holden Karau wrote:
>
>> So the one that is running is the fork in my own repo (set up for
>> K8s deployment) - http://github.com/holdenk/mention-bot
>>
>> On Mon, Jul 30, 2018 at 3:15 AM Hyukjin Kwon 
>> wrote:
>>
>>> Holden, so, is it a fork in
>>> https://github.com/facebookarchive/mention-bot? Would you mind if I
>>> ask where I can see the configurations for it?
>>>
>>>
>>> On Mon, Jul 23, 2018 at 10:16 AM Holden Karau wrote:
>>>
 Yeah so the issue with codeowners is it will only assign to
 committers on the repo (the Beam project found this out the practical
 application way).

 I have a fork of mention bot running and it seems we can add it
 (need an infra ticket), but one of the things the Beam folks asked was 
 to
 not ping code authors who haven’t committed in the past year which I 
 need
 to do a bit of poking on to make happen.

 On Sun, Jul 22, 2018 at 7:04 PM Nicholas Chammas <
 nicholas.cham...@gmail.com> wrote:

> On this topic, I just stumbled on a GitHub feature called
> CODEOWNERS .
> It lets you specify owners of specific areas of the repository using 
> the
> same syntax that .gitignore uses. Here is CPython's CODEOWNERS
> file
> 
> for reference.
>
> Dunno if that would complement mention-bot (which Facebook is
> apparently no longer maintaining
> ), or if
> we can even use it given the ASF setup on GitHub. But I thought it 
> would be
> worth mentioning nonetheless.
>
> On Sat, Jul 14, 2018 at 11:17 AM Holden Karau <
> hol...@pigscanfly.ca> wrote:
>
>> Hearing no objections (and in a shout out to @ Nicholas Chammas
>> who initially suggested mention-bot back in 2016) I've set up a copy 
>> of
>> mention bot and run it against my own repo (looks like
>> https://github.com/holdenk/spark-testing-base/pull/253 ).
>>
>> If no one objects I’ll ask infra to turn this on for Spark on a
>> trial basis and we can revisit it b

Re: Review notification bot

2018-07-30 Thread Holden Karau
So CODEOWNERS is limited to committers by GitHub. We can definitely modify
the config file though and I'm happy to write some custom logic if it helps
support our needs. We can also just turn it off if it's too noisy for
folks in general.

That being said the folks being pinged are not just committers. The hope is
to get more code authors who aren't committers involved in the reviews and
then eventually become committers.
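
As a rough illustration of the kind of custom logic discussed in this thread
(ranking by blame, skipping code authors who have not committed in the past
year, and letting people opt out), here is a minimal sketch. It is not
mention-bot's actual implementation or API: the function name, the
`last_commit` mapping, `opted_out`, and `max_idle_days` are all hypothetical,
and the blame data is assumed to have been collected already.

```python
from collections import Counter
from datetime import date, timedelta


def suggest_reviewers(blame_authors, last_commit, pr_author, today,
                      opted_out=frozenset(), max_idle_days=365, limit=3):
    """Rank authors by how many of the PR's touched lines they last modified.

    blame_authors: one author name per touched line (e.g. parsed from git blame).
    last_commit:   hypothetical mapping of author -> date of their latest commit.
    """
    cutoff = today - timedelta(days=max_idle_days)
    ranked = []
    # most_common() yields authors from most to fewest blamed lines.
    for author, _lines in Counter(blame_authors).most_common():
        if author == pr_author or author in opted_out:
            continue  # never ping the PR author or people who opted out
        if last_commit.get(author, date.min) < cutoff:
            continue  # skip authors with no commit inside the activity window
        ranked.append(author)
    return ranked[:limit]


# Made-up data: bob has been inactive for over a year, dave opened the PR.
blame = ["alice"] * 5 + ["bob"] * 3 + ["carol"] * 2 + ["dave"]
activity = {
    "alice": date(2018, 7, 1),
    "bob": date(2016, 1, 1),
    "carol": date(2018, 6, 1),
    "dave": date(2018, 7, 20),
}
print(suggest_reviewers(blame, activity, pr_author="dave",
                        today=date(2018, 7, 30)))
```

Passing a name in `opted_out` is one possible way to give people who find the
pings too noisy a way to silence them without turning the bot off for everyone.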

On Mon, Jul 30, 2018, 9:09 PM Hyukjin Kwon  wrote:

> *reviewers: I mean people who committed the PR given my observation.
>
> On Tue, Jul 31, 2018 at 11:50 AM Hyukjin Kwon wrote:
>
>> I was wondering if we can leave the configuration open and accept some
>> custom configurations, IMHO, because I saw some people less related or less
>> active are consistently pinged. Just started to get worried if they get
>> annoyed by this.
>> Also, some people could be interested in few specific areas. They should
>> get pinged too.
>> Also, assuming from people pinged, seems they are reviewers (which
>> basically means committers I guess). Was wondering if there's a big
>> difference between codeowners and bots.
>>
>>
>>
>> On Tue, Jul 31, 2018 at 11:38 AM Holden Karau wrote:
>>
>>> The configuration file is optional, is there something you want to try
>>> and change?
>>>
>>> On Mon, Jul 30, 2018 at 7:30 PM Hyukjin Kwon 
>>> wrote:
>>>
 I see. Thanks. I was wondering if I can see the configuration file
 since that looks needed (
 https://github.com/holdenk/mention-bot#configuration) but I couldn't
 find (sorry if it's just something I simply missed).

 On Tue, Jul 31, 2018 at 1:48 AM Holden Karau wrote:

> So the one that is running is the fork in my own repo (set up for
> K8s deployment) - http://github.com/holdenk/mention-bot
>
> On Mon, Jul 30, 2018 at 3:15 AM Hyukjin Kwon 
> wrote:
>
>> Holden, so, is it a fork in
>> https://github.com/facebookarchive/mention-bot? Would you mind if I
>> ask where I can see the configurations for it?
>>
>>
>> On Mon, Jul 23, 2018 at 10:16 AM Holden Karau wrote:
>>
>>> Yeah so the issue with codeowners is it will only assign to
>>> committers on the repo (the Beam project found this out the practical
>>> application way).
>>>
>>> I have a fork of mention bot running and it seems we can add it
>>> (need an infra ticket), but one of the things the Beam folks asked was 
>>> to
>>> not ping code authors who haven’t committed in the past year which I 
>>> need
>>> to do a bit of poking on to make happen.
>>>
>>> On Sun, Jul 22, 2018 at 7:04 PM Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 On this topic, I just stumbled on a GitHub feature called
 CODEOWNERS .
 It lets you specify owners of specific areas of the repository using 
 the
 same syntax that .gitignore uses. Here is CPython's CODEOWNERS file
 
 for reference.

 Dunno if that would complement mention-bot (which Facebook is
 apparently no longer maintaining
 ), or if we
 can even use it given the ASF setup on GitHub. But I thought it would 
 be
 worth mentioning nonetheless.

 On Sat, Jul 14, 2018 at 11:17 AM Holden Karau 
 wrote:

> Hearing no objections (and in a shout out to @ Nicholas Chammas
> who initially suggested mention-bot back in 2016) I've set up a copy 
> of
> mention bot and run it against my own repo (looks like
> https://github.com/holdenk/spark-testing-base/pull/253 ).
>
> If no one objects I’ll ask infra to turn this on for Spark on a
> trial basis and we can revisit it based on how folks interact with
> it.
>
> On Wed, Jun 6, 2018 at 12:24 PM, Holden Karau <
> hol...@pigscanfly.ca> wrote:
>
>> So there are a few bots along this line in OSS. If no one objects
>> I’ll take a look and find one which matches our use case and try it 
>> out.
>>
>> On Wed, Jun 6, 2018 at 10:33 AM Sean Owen 
>> wrote:
>>
>>> Certainly I will frequently dig through 'git blame' to figure
>>> out who might be the right reviewer. Maybe that's automatable -- 
>>> ping the
>>> person who last touched the most lines touched by the PR? There 
>>> might be
>>> some false positives there. And I suppose the downside is being 
>>> pinged
>>> forever for some change that just isn't well considered or one of 
>>> those
>>> accidental 100K-line PRs. So maybe some way to decline or silence is
>>> important, or maybe just ping once and leave it. 

Re: Review notification bot

2018-07-30 Thread Hyukjin Kwon
*reviewers: I mean people who committed the PR given my observation.

On Tue, Jul 31, 2018 at 11:50 AM Hyukjin Kwon wrote:

> I was wondering if we can leave the configuration open and accept some
> custom configurations, IMHO, because I saw some people less related or less
> active are consistently pinged. Just started to get worried if they get
> annoyed by this.
> Also, some people could be interested in few specific areas. They should
> get pinged too.
> Also, assuming from people pinged, seems they are reviewers (which
> basically means committers I guess). Was wondering if there's a big
> difference between codeowners and bots.
>
>
>
> On Tue, Jul 31, 2018 at 11:38 AM Holden Karau wrote:
>
>> The configuration file is optional, is there something you want to try and
>> change?
>>
>> On Mon, Jul 30, 2018 at 7:30 PM Hyukjin Kwon  wrote:
>>
>>> I see. Thanks. I was wondering if I can see the configuration file since
>>> that looks needed (https://github.com/holdenk/mention-bot#configuration)
>>> but I couldn't find (sorry if it's just something I simply missed).
>>>
>>> On Tue, Jul 31, 2018 at 1:48 AM Holden Karau wrote:
>>>
 So the one that is running is the fork in my own repo (set up for
 K8s deployment) - http://github.com/holdenk/mention-bot

 On Mon, Jul 30, 2018 at 3:15 AM Hyukjin Kwon 
 wrote:

> Holden, so, is it a fork in
> https://github.com/facebookarchive/mention-bot? Would you mind if I
> ask where I can see the configurations for it?
>
>
> On Mon, Jul 23, 2018 at 10:16 AM Holden Karau wrote:
>
>> Yeah so the issue with codeowners is it will only assign to
>> committers on the repo (the Beam project found this out the practical
>> application way).
>>
>> I have a fork of mention bot running and it seems we can add it (need
>> an infra ticket), but one of the things the Beam folks asked was to not
>> ping code authors who haven’t committed in the past year which I need to 
>> do
>> a bit of poking on to make happen.
>>
>> On Sun, Jul 22, 2018 at 7:04 PM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> On this topic, I just stumbled on a GitHub feature called CODEOWNERS
>>> . It lets you
>>> specify owners of specific areas of the repository using the same syntax
>>> that .gitignore uses. Here is CPython's CODEOWNERS file
>>> 
>>> for reference.
>>>
>>> Dunno if that would complement mention-bot (which Facebook is
>>> apparently no longer maintaining
>>> ), or if we
>>> can even use it given the ASF setup on GitHub. But I thought it would be
>>> worth mentioning nonetheless.
>>>
>>> On Sat, Jul 14, 2018 at 11:17 AM Holden Karau 
>>> wrote:
>>>
 Hearing no objections (and in a shout out to @ Nicholas Chammas who
 initially suggested mention-bot back in 2016) I've set up a copy of 
 mention
 bot and run it against my own repo (looks like
 https://github.com/holdenk/spark-testing-base/pull/253 ).

 If no one objects I’ll ask infra to turn this on for Spark on a
 trial basis and we can revisit it based on how folks interact with it.

 On Wed, Jun 6, 2018 at 12:24 PM, Holden Karau >>> > wrote:

> So there are a few bots along this line in OSS. If no one objects
> I’ll take a look and find one which matches our use case and try it 
> out.
>
> On Wed, Jun 6, 2018 at 10:33 AM Sean Owen 
> wrote:
>
>> Certainly I will frequently dig through 'git blame' to figure out
>> who might be the right reviewer. Maybe that's automatable -- ping the
>> person who last touched the most lines touched by the PR? There 
>> might be
>> some false positives there. And I suppose the downside is being 
>> pinged
>> forever for some change that just isn't well considered or one of 
>> those
>> accidental 100K-line PRs. So maybe some way to decline or silence is
>> important, or maybe just ping once and leave it. Sure, a bot that 
>> just adds
>> a "Would @foo like to review?" comment on Github? Sure seems worth 
>> trying
>> if someone is willing to do the work to cook up the bot.
>>
>> On Wed, Jun 6, 2018 at 12:22 PM Holden Karau <
>> hol...@pigscanfly.ca> wrote:
>>
>>> Hi friends,
>>>
>>> Was chatting with some folks at the summit and I was wondering
>>> how people would feel about adding a review bot to ping folks. We 
>>> already
>>> have the review dashboard but I was thinking we could ping folks 
>>> who were
>>> the original authors of

Re: Review notification bot

2018-07-30 Thread Hyukjin Kwon
I was wondering if we can leave the configuration open and accept some
custom configurations, IMHO, because I saw that some people who are less
related or less active are consistently pinged. I started to worry that they
may get annoyed by this.
Also, some people could be interested in a few specific areas. They should
get pinged too.
Also, judging from the people pinged, they seem to be reviewers (which
basically means committers, I guess). I was wondering if there is a big
difference between CODEOWNERS and the bot.
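
To make the "specific areas" idea concrete, here is a small sketch of an
opt-in mapping from users to path patterns, resolved against the files a PR
changes. Everything in it is an assumption for illustration: the user names,
the `AREA_SUBSCRIPTIONS` table, and the function name are made up, and the
`fnmatch` globs only approximate the .gitignore-style syntax that CODEOWNERS
uses.

```python
from fnmatch import fnmatch

# Hypothetical opt-in table: user -> path patterns they want to be pinged for.
AREA_SUBSCRIPTIONS = {
    "sql-fan": ["sql/*"],
    "py-fan": ["python/*", "docs/*"],
}


def subscribers_for(changed_files, subscriptions=AREA_SUBSCRIPTIONS):
    """Return, sorted, the users with a pattern matching any changed file."""
    return sorted(
        user
        for user, patterns in subscriptions.items()
        # fnmatch's "*" also matches "/", so "sql/*" covers nested paths.
        if any(fnmatch(path, pattern)
               for path in changed_files
               for pattern in patterns)
    )


print(subscribers_for(["sql/core/src/Foo.scala", "python/pyspark/sql.py"]))
```

Unlike blame-based pinging, a table like this only notifies people who asked
for it, which may address the worry about less active contributors being
pinged repeatedly.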



On Tue, Jul 31, 2018 at 11:38 AM Holden Karau wrote:

> The configuration file is optional, is there something you want to try and
> change?
>
> On Mon, Jul 30, 2018 at 7:30 PM Hyukjin Kwon  wrote:
>
>> I see. Thanks. I was wondering if I can see the configuration file since
>> that looks needed (https://github.com/holdenk/mention-bot#configuration)
>> but I couldn't find (sorry if it's just something I simply missed).
>>
>> On Tue, Jul 31, 2018 at 1:48 AM Holden Karau wrote:
>>
>>> So the one that is running is the fork in my own repo (set up for
>>> K8s deployment) - http://github.com/holdenk/mention-bot
>>>
>>> On Mon, Jul 30, 2018 at 3:15 AM Hyukjin Kwon 
>>> wrote:
>>>
 Holden, so, is it a fork in
 https://github.com/facebookarchive/mention-bot? Would you mind if I
 ask where I can see the configurations for it?


 On Mon, Jul 23, 2018 at 10:16 AM Holden Karau wrote:

> Yeah so the issue with codeowners is it will only assign to committers
> on the repo (the Beam project found this out the practical application 
> way).
>
> I have a fork of mention bot running and it seems we can add it (need
> an infra ticket), but one of the things the Beam folks asked was to not
> ping code authors who haven’t committed in the past year which I need to 
> do
> a bit of poking on to make happen.
>
> On Sun, Jul 22, 2018 at 7:04 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> On this topic, I just stumbled on a GitHub feature called CODEOWNERS
>> . It lets you
>> specify owners of specific areas of the repository using the same syntax
>> that .gitignore uses. Here is CPython's CODEOWNERS file
>> 
>> for reference.
>>
>> Dunno if that would complement mention-bot (which Facebook is
>> apparently no longer maintaining
>> ), or if we
>> can even use it given the ASF setup on GitHub. But I thought it would be
>> worth mentioning nonetheless.
>>
>> On Sat, Jul 14, 2018 at 11:17 AM Holden Karau 
>> wrote:
>>
>>> Hearing no objections (and in a shout out to @ Nicholas Chammas who
>>> initially suggested mention-bot back in 2016) I've set up a copy of 
>>> mention
>>> bot and run it against my own repo (looks like
>>> https://github.com/holdenk/spark-testing-base/pull/253 ).
>>>
>>> If no one objects I’ll ask infra to turn this on for Spark on a
>>> trial basis and we can revisit it based on how folks interact with it.
>>>
>>> On Wed, Jun 6, 2018 at 12:24 PM, Holden Karau 
>>> wrote:
>>>
 So there are a few bots along this line in OSS. If no one objects
 I’ll take a look and find one which matches our use case and try it 
 out.

 On Wed, Jun 6, 2018 at 10:33 AM Sean Owen  wrote:

> Certainly I will frequently dig through 'git blame' to figure out
> who might be the right reviewer. Maybe that's automatable -- ping the
> person who last touched the most lines touched by the PR? There might 
> be
> some false positives there. And I suppose the downside is being pinged
> forever for some change that just isn't well considered or one of 
> those
> accidental 100K-line PRs. So maybe some way to decline or silence is
> important, or maybe just ping once and leave it. Sure, a bot that 
> just adds
> a "Would @foo like to review?" comment on Github? Sure seems worth 
> trying
> if someone is willing to do the work to cook up the bot.
>
> On Wed, Jun 6, 2018 at 12:22 PM Holden Karau 
> wrote:
>
>> Hi friends,
>>
>> Was chatting with some folks at the summit and I was wondering
>> how people would feel about adding a review bot to ping folks. We 
>> already
>> have the review dashboard but I was thinking we could ping folks who 
>> were
>> the original authors of the code being changed who might not be in
>> the
>> habit of looking at the review dashboard.
>>
>> Cheers,
>>
>> Holden :)
>> --
>> Twitter: https://twitter.com/holdenkarau
>>

Re: Review notification bot

2018-07-30 Thread Holden Karau
The configuration file is optional. Is there something you want to try and
change?

On Mon, Jul 30, 2018 at 7:30 PM Hyukjin Kwon  wrote:

> I see. Thanks. I was wondering if I can see the configuration file since
> that looks needed (https://github.com/holdenk/mention-bot#configuration)
> but I couldn't find (sorry if it's just something I simply missed).
>
> On Tue, Jul 31, 2018 at 1:48 AM Holden Karau wrote:
>
>> So the one that is running is the fork in my own repo (set up for K8s
>> deployment) - http://github.com/holdenk/mention-bot
>>
>> On Mon, Jul 30, 2018 at 3:15 AM Hyukjin Kwon  wrote:
>>
>>> Holden, so, is it a fork in
>>> https://github.com/facebookarchive/mention-bot? Would you mind if I ask
>>> where I can see the configurations for it?
>>>
>>>
>>> On Mon, Jul 23, 2018 at 10:16 AM Holden Karau wrote:
>>>
 Yeah so the issue with codeowners is it will only assign to committers
 on the repo (the Beam project found this out the practical application 
 way).

 I have a fork of mention bot running and it seems we can add it (need
 an infra ticket), but one of the things the Beam folks asked was to not
 ping code authors who haven’t committed in the past year which I need to do
 a bit of poking on to make happen.

 On Sun, Jul 22, 2018 at 7:04 PM Nicholas Chammas <
 nicholas.cham...@gmail.com> wrote:

> On this topic, I just stumbled on a GitHub feature called CODEOWNERS
> . It lets you
> specify owners of specific areas of the repository using the same syntax
> that .gitignore uses. Here is CPython's CODEOWNERS file
> 
> for reference.
>
> Dunno if that would complement mention-bot (which Facebook is
> apparently no longer maintaining
> ), or if we
> can even use it given the ASF setup on GitHub. But I thought it would be
> worth mentioning nonetheless.
>
> On Sat, Jul 14, 2018 at 11:17 AM Holden Karau 
> wrote:
>
>> Hearing no objections (and in a shout out to @ Nicholas Chammas who
>> initially suggested mention-bot back in 2016) I've set up a copy of 
>> mention
>> bot and run it against my own repo (looks like
>> https://github.com/holdenk/spark-testing-base/pull/253 ).
>>
>> If no one objects I’ll ask infra to turn this on for Spark on a trial
>> basis and we can revisit it based on how folks interact with it.
>>
>> On Wed, Jun 6, 2018 at 12:24 PM, Holden Karau 
>> wrote:
>>
>>> So there are a few bots along this line in OSS. If no one objects
>>> I’ll take a look and find one which matches our use case and try it out.
>>>
>>> On Wed, Jun 6, 2018 at 10:33 AM Sean Owen  wrote:
>>>
 Certainly I will frequently dig through 'git blame' to figure out
 who might be the right reviewer. Maybe that's automatable -- ping the
 person who last touched the most lines touched by the PR? There might 
 be
 some false positives there. And I suppose the downside is being pinged
 forever for some change that just isn't well considered or one of those
 accidental 100K-line PRs. So maybe some way to decline or silence is
 important, or maybe just ping once and leave it. Sure, a bot that just 
 adds
 a "Would @foo like to review?" comment on Github? Sure seems worth 
 trying
 if someone is willing to do the work to cook up the bot.

 On Wed, Jun 6, 2018 at 12:22 PM Holden Karau 
 wrote:

> Hi friends,
>
> Was chatting with some folks at the summit and I was wondering how
> people would feel about adding a review bot to ping folks. We already 
> have
> the review dashboard but I was thinking we could ping folks who were 
> the
> original authors of the code being changed who might not be in the
> habit
> of looking at the review dashboard.
>
> Cheers,
>
> Holden :)
> --
> Twitter: https://twitter.com/holdenkarau
>


Re: Review notification bot

2018-07-30 Thread Hyukjin Kwon
I see. Thanks. I was wondering if I can see the configuration file since
that looks needed (https://github.com/holdenk/mention-bot#configuration)
but I couldn't find it (sorry if it's just something I simply missed).

On Tue, Jul 31, 2018 at 1:48 AM Holden Karau wrote:

> So the one that is running is the fork in my own repo (set up for K8s
> deployment) - http://github.com/holdenk/mention-bot
>
> On Mon, Jul 30, 2018 at 3:15 AM Hyukjin Kwon  wrote:
>
>> Holden, so, is it a fork in
>> https://github.com/facebookarchive/mention-bot? Would you mind if I ask
>> where I can see the configurations for it?
>>
>>
>> On Mon, Jul 23, 2018 at 10:16 AM Holden Karau wrote:
>>
>>> Yeah so the issue with codeowners is it will only assign to committers
>>> on the repo (the Beam project found this out the practical application way).
>>>
>>> I have a fork of mention bot running and it seems we can add it (need an
>>> infra ticket), but one of the things the Beam folks asked was to not ping
>>> code authors who haven’t committed in the past year which I need to do a
>>> bit of poking on to make happen.
>>>
>>> On Sun, Jul 22, 2018 at 7:04 PM Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 On this topic, I just stumbled on a GitHub feature called CODEOWNERS
 . It lets you
 specify owners of specific areas of the repository using the same syntax
 that .gitignore uses. Here is CPython's CODEOWNERS file
  for
 reference.

 Dunno if that would complement mention-bot (which Facebook is
 apparently no longer maintaining
 ), or if we can
 even use it given the ASF setup on GitHub. But I thought it would be worth
 mentioning nonetheless.

 On Sat, Jul 14, 2018 at 11:17 AM Holden Karau 
 wrote:

> Hearing no objections (and in a shout out to @ Nicholas Chammas who
> initially suggested mention-bot back in 2016) I've set up a copy of 
> mention
> bot and run it against my own repo (looks like
> https://github.com/holdenk/spark-testing-base/pull/253 ).
>
> If no one objects I’ll ask infra to turn this on for Spark on a trial
> basis and we can revisit it based on how folks interact with it.
>
> On Wed, Jun 6, 2018 at 12:24 PM, Holden Karau 
> wrote:
>
>> So there are a few bots along this line in OSS. If no one objects
>> I’ll take a look and find one which matches our use case and try it out.
>>
>> On Wed, Jun 6, 2018 at 10:33 AM Sean Owen  wrote:
>>
>>> Certainly I will frequently dig through 'git blame' to figure out
>>> who might be the right reviewer. Maybe that's automatable -- ping the
>>> person who last touched the most lines touched by the PR? There might be
>>> some false positives there. And I suppose the downside is being pinged
>>> forever for some change that just isn't well considered or one of those
>>> accidental 100K-line PRs. So maybe some way to decline or silence is
>>> important, or maybe just ping once and leave it. Sure, a bot that just 
>>> adds
>>> a "Would @foo like to review?" comment on Github? Sure seems worth 
>>> trying
>>> if someone is willing to do the work to cook up the bot.
>>>
>>> On Wed, Jun 6, 2018 at 12:22 PM Holden Karau 
>>> wrote:
>>>
 Hi friends,

 Was chatting with some folks at the summit and I was wondering how
 people would feel about adding a review bot to ping folks. We already 
 have
 the review dashboard but I was thinking we could ping folks who were 
 the
 original authors of the code being changed who might not be in the
 habit
 of looking at the review dashboard.

 Cheers,

 Holden :)
 --
 Twitter: https://twitter.com/holdenkarau



Re: [Spark SQL] Future of CalendarInterval

2018-07-30 Thread Hyukjin Kwon
FYI, org.apache.spark.unsafe.types.CalendarInterval is undocumented in both
scaladoc and javadoc (as is the entire unsafe module),
but org.apache.spark.sql.types.CalendarIntervalType is exposed (
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.types.CalendarIntervalType
)

+1 for starting the discussion after 2.4.0. As I said in the PR, I would
suggest deferring this.

On Sun, Jul 29, 2018 at 6:58 PM Daniel Mateus Pires wrote:

> Sounds good! @Xiao
>
> @Reynold AFAIK the only data type that is valid to cast to Calendar
> Interval is VARCHAR
>
> here is Postgres:
>
> postgres=# select CAST(CAST(interval '1 hour' AS varchar) AS interval);
>  interval
> ----------
>  01:00:00
> (1 row)
>
> (snippet comes from the JIRA)
>
> Thanks,
>
> Daniel
>
>
> On 27 July 2018 at 20:38, Xiao Li  wrote:
>
>> The code freeze of the upcoming release Spark 2.4 is very close. How
>> about revisiting this and explicitly defining the support scope
>> of CalendarIntervalType in the next release (Spark 3.0)?
>>
>> Thanks,
>>
>> Xiao
>>
>>
>> 2018-07-27 10:45 GMT-07:00 Reynold Xin :
>>
>>> CalendarInterval is definitely externally visible.
>>>
>>> E.g. sql("select interval 1 day").dtypes would return "Array[(String,
>>> String)] = Array((interval 1 days,CalendarIntervalType))"
>>>
>>> However, I'm not sure what it means to support casting. What are the
>>> semantics for casting from any other data type to calendar interval? I can
>>> see string casting and casting from itself, but not any other data types.
>>>
>>>
>>>
>>>
>>> On Fri, Jul 27, 2018 at 10:34 AM Daniel Mateus Pires 
>>> wrote:
>>>
 Hi Sparkers! (maybe Sparkles ?)

 I just wanted to bring up the apparently ?controversial? Calendar
 Interval topic.

 I worked on: https://issues.apache.org/jira/browse/SPARK-24702,
 https://github.com/apache/spark/pull/21706

 The user was reporting an unexpected behaviour where he/she wasn’t able
 to cast to a Calendar Interval type.

 In the current version of Spark the following code works:

 scala> spark.sql("SELECT 'interval 1 hour' as a").select(col("a").cast("calendarinterval")).show()
 +----------------+
 |               a|
 +----------------+
 |interval 1 hours|
 +----------------+


 While the following doesn’t:
 spark.sql("SELECT CALENDARINTERVAL('interval 1 hour') as a").show()


 Since the DataFrame API equivalent of the SQL worked, I thought adding
 it would be an easy decision to make (to make it consistent)

 However, I got push-back on the PR on the basis that “*we do not plan
 to expose Calendar Interval as a public type*”
 Should there be a consensus on either cleaning up the public DataFrame
 API out of CalendarIntervalType OR making it consistent with the SQL ?

 --
 Best regards,
 Daniel Mateus Pires
 Data Engineer @ Hudson's Bay Company



Re: code freeze and branch cut for Apache Spark 2.4

2018-07-30 Thread Wenchen Fan
I went through the open JIRA tickets and here is a list that we should
consider for Spark 2.4:

*High Priority*:
SPARK-24374: Support Barrier Execution Mode in Apache Spark
This one is critical to the Spark ecosystem for deep learning. Only a few
pieces of work remain, and I think we should have it in Spark 2.4.

*Medium Priority*:
SPARK-23899: Built-in SQL Function Improvement
We've already added a lot of built-in functions in this release, but there
are a few useful higher-order functions in progress, like `array_except`,
`transform`, etc. It would be great if we could get them into Spark 2.4.

SPARK-14220: Build and test Spark against Scala 2.12
Very close to finishing; it would be great to have it in Spark 2.4.

SPARK-4502: Spark SQL reads unnecessary nested fields from Parquet
This one has been open for years (thanks for your patience, Michael!), and is
also close to finishing. Great to have it in 2.4.

SPARK-24882: data source v2 API improvement
This is to improve the data source v2 API based on what we learned during
this release. From the migration of existing sources and the design of new
features, we found some problems in the API and want to address them. I
believe this should be the last significant API change to data source
v2, so it would be great to have in Spark 2.4. I'll send a [DISCUSS] email
about it later.

SPARK-24252: Add catalog support in Data Source V2
This is a very important feature for data source v2, and is currently being
discussed on the dev list.

SPARK-24768: Have a built-in AVRO data source implementation
Most of it is done, but date/timestamp support is still missing. Great to
have in 2.4.

SPARK-23243: Shuffle+Repartition on an RDD could lead to incorrect answers
This is a long-standing correctness bug; a fix would be great to have in 2.4.

There are some other important features like the adaptive execution,
streaming SQL, etc., not in the list, since I think we are not able to
finish them before 2.4.

Feel free to add more things if you think they are important to Spark 2.4
by replying to this email.

Thanks,
Wenchen

On Mon, Jul 30, 2018 at 11:00 PM Sean Owen  wrote:

> In theory releases happen on a time-based cadence, so it's pretty much
> wrap up what's ready by the code freeze and ship it. In practice, the
> cadence slips frequently, and it's very much a negotiation about what
> features should push the code freeze out a few weeks every time. So, kind
> of a hybrid approach here that works OK.
>
> Certainly speak up if you think there's something that really needs to get
> into 2.4. This is that discuss thread.
>
> (BTW I updated the page you mention just yesterday, to reflect the plan
> suggested in this thread.)
>
> On Mon, Jul 30, 2018 at 9:51 AM Tom Graves 
> wrote:
>
>> Shouldn't this be a discuss thread?
>>
>> I'm also happy to see more release managers and agree the time is getting
>> close, but we should see what features are in progress and see how close
>> things are and propose a date based on that.  Cutting a branch too soon just
>> creates more work for committers to push to more branches.
>>
>>  http://spark.apache.org/versioning-policy.html mentioned the code
>> freeze and release branch cut mid-august.
>>
>>
>> Tom
>>
>>


[build system] two workers will be reimaged w/ubuntu tomorrow

2018-07-30 Thread shane knapp
my testing is going really well, and i think we're --->this<--- close to
porting all of the spark builds to ubuntu!

TL;DR:   i am NOT planning on moving all builds to ubuntu until after
august 8th.  i WOULD like to move the PRB to ubuntu before then.

anyways:

once these two smoke test builds pass, i'll be somewhere @ ~99% certainty
that we can just move all of the spark builds (except docs) en masse!

https://amplab.cs.berkeley.edu/jenkins/job/ubuntuSparkPRB/59/
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/887/

we're starting to run into build node availability bottlenecks for some of
our lab builds, so we will be reinstalling the following two workers
w/ubuntu tomorrow:
amp-jenkins-worker-07
amp-jenkins-worker-08

this will give us some breathing room, and we shouldn't have any noticeable
impact on the spark builds.

the biggest worry is the pull request builder...  i'll update that job and
let a couple of PRBs run through before moving it back to the centos hosts.

if the regular PRB builds pass happily, this will let us move it from
centos to ubuntu *before* 2.4 is cut, and gives us two things:

1) unblock https://github.com/apache/spark/pull/21584
2) stop needing 2 builds for pull requests (one for regular tests on
centos, one to test against minikube on ubuntu).

questions/comments/concerns?

shane
-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: Review notification bot

2018-07-30 Thread Holden Karau
So the one that is running is the fork in my own repo (set up for K8s
deployment) - http://github.com/holdenk/mention-bot
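
For reference, the heuristic discussed in this thread (ping whoever last
touched the most lines changed by a PR, but skip authors with no commits in
the past year) could be sketched roughly as below. This is not how
mention-bot is implemented; the function name, the `blame` and `last_commit`
inputs, and the 365-day default are all hypothetical and chosen only for
illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def suggest_reviewer(blame, touched_lines, last_commit, pr_author, now,
                     max_idle_days=365):
    """Pick the author who last touched the most lines changed by a PR.

    blame:         dict mapping a line id to the author who last changed it
                   (hypothetical stand-in for parsed 'git blame' output)
    touched_lines: list of line ids modified by the PR
    last_commit:   dict mapping an author to the datetime of their last commit
    """
    # Count how many of the touched lines each author last modified.
    counts = Counter(blame[line] for line in touched_lines if line in blame)
    cutoff = now - timedelta(days=max_idle_days)

    # Walk candidates from most to fewest lines and return the first eligible one.
    for author, _ in counts.most_common():
        if author == pr_author:
            continue  # never ping the person who opened the PR
        seen = last_commit.get(author)
        if seen is None or seen < cutoff:
            continue  # skip authors who have been inactive for too long
        return author
    return None  # nobody suitable; the bot stays silent


if __name__ == "__main__":
    blame = {1: "alice", 2: "alice", 3: "bob"}
    last_commit = {"alice": datetime(2018, 6, 1), "bob": datetime(2018, 7, 1)}
    print(suggest_reviewer(blame, [1, 2, 3], last_commit, "dave",
                           datetime(2018, 7, 30)))
```

Returning `None` rather than falling back to an arbitrary committer keeps the
bot quiet when no recent author qualifies, which is in the spirit of the "ping
once and leave it" concern raised earlier in the thread.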

On Mon, Jul 30, 2018 at 3:15 AM Hyukjin Kwon  wrote:

> Holden, so, is it a fork of https://github.com/facebookarchive/mention-bot?
> Would you mind if I ask where I can see the configurations for it?
>
>
> On Mon, Jul 23, 2018 at 10:16 AM, Holden Karau wrote:
>
>> Yeah so the issue with codeowners is it will only assign to committers on
>> the repo (the Beam project found this out the practical application way).
>>
>> I have a fork of mention bot running and it seems we can add it (need an
>> infra ticket), but one of the things the Beam folks asked was to not ping
>> code authors who haven’t committed in the past year which I need to do a
>> bit of poking on to make happen.
>>
>> On Sun, Jul 22, 2018 at 7:04 PM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> On this topic, I just stumbled on a GitHub feature called CODEOWNERS.
>>> It lets you specify owners of specific areas of the repository using the
>>> same syntax that .gitignore uses. Here is CPython's CODEOWNERS file for
>>> reference.
>>>
>>> Dunno if that would complement mention-bot (which Facebook is apparently no
>>> longer maintaining), or if we can even use it given the ASF setup on GitHub.
>>> But I thought it would be worth mentioning nonetheless.
>>>
>>> On Sat, Jul 14, 2018 at 11:17 AM Holden Karau 
>>> wrote:
>>>
 Hearing no objections (and in a shout out to @ Nicholas Chammas who
 initially suggested mention-bot back in 2016) I've set up a copy of mention
 bot and run it against my own repo (looks like
 https://github.com/holdenk/spark-testing-base/pull/253 ).

 If no one objects I’ll ask infra to turn this on for Spark on a trial
 basis and we can revisit it based on how folks interact with it.

 On Wed, Jun 6, 2018 at 12:24 PM, Holden Karau 
 wrote:

> So there are a few bots along this line in OSS. If no one objects I’ll
> take a look and find one which matches our use case and try it out.
>
> On Wed, Jun 6, 2018 at 10:33 AM Sean Owen  wrote:
>
>> Certainly I will frequently dig through 'git blame' to figure out who
>> might be the right reviewer. Maybe that's automatable -- ping the person
>> who last touched the most lines touched by the PR? There might be some
>> false positives there. And I suppose the downside is being pinged forever
>> for some change that just isn't well considered or one of those accidental
>> 100K-line PRs. So maybe some way to decline or silence is important, or
>> maybe just ping once and leave it. Sure, a bot that just adds a "Would @foo
>> like to review?" comment on Github? Sure seems worth trying if someone is
>> willing to do the work to cook up the bot.
>>
>> On Wed, Jun 6, 2018 at 12:22 PM Holden Karau 
>> wrote:
>>
>>> Hi friends,
>>>
>>> Was chatting with some folks at the summit and I was wondering how
>>> people would feel about adding a review bot to ping folks. We already have
>>> the review dashboard but I was thinking we could ping folks who were the
>>> original authors of the code being changed who might not be in the habit
>>> of looking at the review dashboard.
>>>
>>> Cheers,
>>>
>>> Holden :)
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>>


Re: Why percentile and distinct are not done in one job?

2018-07-30 Thread Reynold Xin
Which API are you talking about?

On Mon, Jul 30, 2018 at 7:03 AM 吴晓菊  wrote:

> I noticed that in column analyzing, 2 jobs will run separately to
> calculate percentiles and then distinct. Why not combine into one job since
> HyperLogLog also supports merge?
>
> Chrysan Wu
> Phone:+86 17717640807
>
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-07-30 Thread Sean Owen
In theory releases happen on a time-based cadence, so it's pretty much wrap
up what's ready by the code freeze and ship it. In practice, the cadence
slips frequently, and it's very much a negotiation about what features
should push the code freeze out a few weeks every time. So, kind of a
hybrid approach here that works OK.

Certainly speak up if you think there's something that really needs to get
into 2.4. This is that discuss thread.

(BTW I updated the page you mention just yesterday, to reflect the plan
suggested in this thread.)

On Mon, Jul 30, 2018 at 9:51 AM Tom Graves 
wrote:

> Shouldn't this be a discuss thread?
>
> I'm also happy to see more release managers and agree the time is getting
> close, but we should see what features are in progress and see how close
> things are and propose a date based on that.  Cutting a branch too soon just
> creates more work for committers to push to more branches.
>
>  http://spark.apache.org/versioning-policy.html mentioned the code freeze
> and release branch cut mid-august.
>
>
> Tom
>
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-07-30 Thread Tom Graves
 Shouldn't this be a discuss thread?  
I'm also happy to see more release managers and agree the time is getting 
close, but we should see what features are in progress and see how close things 
are and propose a date based on that.  Cutting a branch too soon just creates
more work for committers to push to more branches. 
 http://spark.apache.org/versioning-policy.html mentioned the code freeze and 
release branch cut mid-august.

Tom
On Friday, July 6, 2018, 11:47:35 AM CDT, Reynold Xin  
wrote:  
 
 FYI 6 mo is coming up soon since the last release. We will cut the branch and 
code freeze on Aug 1st in order to get 2.4 out on time.
  

Why percentile and distinct are not done in one job?

2018-07-30 Thread 吴晓菊
I noticed that in column analyzing, 2 jobs will run separately to calculate
percentiles and then distinct. Why not combine into one job since
HyperLogLog also supports merge?
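
To illustrate the point about mergeability: if both statistics are kept as
mergeable per-partition state, a single pass over the data can feed both, and
partial results can be combined afterwards. Below is a minimal sketch in plain
Python, not Spark's implementation. The class name `ColumnStats`, the
1024-register HyperLogLog, and the exact value list standing in for a real
quantile summary are all assumptions made for illustration.

```python
import hashlib
import math

P = 10        # number of index bits (assumed for this sketch)
M = 1 << P    # 1024 HyperLogLog registers


def _hash64(value):
    """Deterministic 64-bit hash of a value's repr."""
    digest = hashlib.sha1(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ColumnStats:
    """One-pass state tracking distinct count and percentiles together."""

    def __init__(self):
        self.registers = [0] * M   # HyperLogLog registers
        self.values = []           # stand-in for a mergeable quantile summary

    def add(self, value):
        h = _hash64(value)
        idx = h >> (64 - P)                      # top P bits select a register
        rest = h & ((1 << (64 - P)) - 1)         # remaining 64 - P bits
        rank = (64 - P) - rest.bit_length() + 1  # leading zeros + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank
        self.values.append(value)

    def merge(self, other):
        # HyperLogLog merges by taking the element-wise maximum of registers.
        self.registers = [max(a, b)
                          for a, b in zip(self.registers, other.registers)]
        self.values.extend(other.values)
        return self

    def distinct_estimate(self):
        alpha = 0.7213 / (1 + 1.079 / M)
        raw = alpha * M * M / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * M and zeros:
            return M * math.log(M / zeros)  # small-range (linear counting) fix
        return raw

    def percentile(self, q):
        ordered = sorted(self.values)
        return ordered[min(len(ordered) - 1, int(q * len(ordered)))]


if __name__ == "__main__":
    data = [i % 500 for i in range(2000)]
    left, right = ColumnStats(), ColumnStats()
    for v in data[:1000]:
        left.add(v)
    for v in data[1000:]:
        right.add(v)
    merged = left.merge(right)
    print(merged.percentile(0.5), round(merged.distinct_estimate()))
```

Whether the two jobs can actually be fused in Spark depends on which API
triggers them, which is not stated here.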

Chrysan Wu
Phone:+86 17717640807


Re: [VOTE] SPARK 2.3.2 (RC3)

2018-07-30 Thread Wenchen Fan
Another two correctness bug fixes were merged to 2.3 today:
https://issues.apache.org/jira/browse/SPARK-24934
https://issues.apache.org/jira/browse/SPARK-24957

On Mon, Jul 30, 2018 at 1:19 PM Xiao Li  wrote:

> Sounds good to me. Thanks! Today, we merged another correctness fix
> https://github.com/apache/spark/pull/21772.
>
> Xiao
>
> 2018-07-29 18:31 GMT-07:00 Saisai Shao :
>
>> Sure, I will cut the next RC. I'm still waiting for a CVE fix; if it can
>> be done in the next two days, I will include that one as well.
>>
>> On Sat, Jul 28, 2018 at 12:05 AM, Xiao Li wrote:
>>
>>> The following blocker/important fixes have been merged to Spark 2.3
>>> branch:
>>>
>>> https://issues.apache.org/jira/browse/SPARK-24927
>>> https://issues.apache.org/jira/browse/SPARK-24867
>>> https://issues.apache.org/jira/browse/SPARK-24891
>>>
>>> *Saisai*, could you start the next RC?
>>>
>>> Thanks,
>>>
>>> Xiao
>>>
>>>
>>> 2018-07-20 14:21 GMT-07:00 Tom Graves :
>>>
 fyi, I merged in a couple of JIRAs that were critical (and I thought would
 be good to include in the next release); if we spin another RC they will get
 included. We should update the JIRAs SPARK-24755 and SPARK-24677.
 If anyone disagrees we could back those out, but I think they would be good
 to include.

 Tom

 On Thursday, July 19, 2018, 8:13:23 PM CDT, Saisai Shao <
 sai.sai.s...@gmail.com> wrote:


 Sure, I can wait for this and create another RC then.

 Thanks,
 Saisai

 On Fri, Jul 20, 2018 at 9:11 AM, Xiao Li wrote:

 Yes. https://issues.apache.org/jira/browse/SPARK-24867 is the one I
 created. The PR has been created. Since this is not rare, let us merge it
 to 2.3.2?

 Reynold's PR is to get rid of AnalysisBarrier. That is better than the
 multiple patches we added for AnalysisBarrier after the 2.3.0 release. We can
 target it to 2.4.

 Thanks,

 Xiao

 2018-07-19 17:48 GMT-07:00 Saisai Shao :

 I see, thanks Reynold.

 On Fri, Jul 20, 2018 at 8:46 AM, Reynold Xin wrote:

 Looking at the list of pull requests it looks like this is the ticket:
 https://issues.apache.org/jira/browse/SPARK-24867



 On Thu, Jul 19, 2018 at 5:25 PM Reynold Xin 
 wrote:

 I don't think my ticket should block this release. It's a big general
 refactoring.

 Xiao do you have a ticket for the bug you found?


 On Thu, Jul 19, 2018 at 5:24 PM Saisai Shao 
 wrote:

 Hi Xiao,

 Are you referring to this JIRA (
 https://issues.apache.org/jira/browse/SPARK-24865)?

 On Fri, Jul 20, 2018 at 2:41 AM, Xiao Li wrote:

 dfWithUDF.cache()
 dfWithUDF.write.saveAsTable("t")
 dfWithUDF.write.saveAsTable("t1")


 Cached data is not being used. It causes a big performance regression.




 2018-07-19 11:32 GMT-07:00 Sean Owen :

 What regression are you referring to here? A -1 vote really needs a
 rationale.

 On Thu, Jul 19, 2018 at 1:27 PM Xiao Li  wrote:

 I would first vote -1.

 I might have found another regression caused by the analysis barrier. Will
 keep you posted.






Re: Review notification bot

2018-07-30 Thread Hyukjin Kwon
Holden, so, is it a fork of https://github.com/facebookarchive/mention-bot?
Would you mind if I ask where I can see the configurations for it?


On Mon, Jul 23, 2018 at 10:16 AM, Holden Karau wrote:

> Yeah so the issue with codeowners is it will only assign to committers on
> the repo (the Beam project found this out the practical application way).
>
> I have a fork of mention bot running and it seems we can add it (need an
> infra ticket), but one of the things the Beam folks asked was to not ping
> code authors who haven’t committed in the past year which I need to do a
> bit of poking on to make happen.
>
> On Sun, Jul 22, 2018 at 7:04 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> On this topic, I just stumbled on a GitHub feature called CODEOWNERS.
>> It lets you specify owners of specific areas of the repository using the
>> same syntax that .gitignore uses. Here is CPython's CODEOWNERS file for
>> reference.
>>
>> Dunno if that would complement mention-bot (which Facebook is apparently no
>> longer maintaining), or if we can even use it given the ASF setup on GitHub.
>> But I thought it would be worth mentioning nonetheless.
>>
>> On Sat, Jul 14, 2018 at 11:17 AM Holden Karau 
>> wrote:
>>
>>> Hearing no objections (and in a shout out to @ Nicholas Chammas who
>>> initially suggested mention-bot back in 2016) I've set up a copy of mention
>>> bot and run it against my own repo (looks like
>>> https://github.com/holdenk/spark-testing-base/pull/253 ).
>>>
>>> If no one objects I’ll ask infra to turn this on for Spark on a trial
>>> basis and we can revisit it based on how folks interact with it.
>>>
>>> On Wed, Jun 6, 2018 at 12:24 PM, Holden Karau 
>>> wrote:
>>>
 So there are a few bots along this line in OSS. If no one objects I’ll
 take a look and find one which matches our use case and try it out.

 On Wed, Jun 6, 2018 at 10:33 AM Sean Owen  wrote:

> Certainly I will frequently dig through 'git blame' to figure out who
> might be the right reviewer. Maybe that's automatable -- ping the person
> who last touched the most lines touched by the PR? There might be some
> false positives there. And I suppose the downside is being pinged forever
> for some change that just isn't well considered or one of those accidental
> 100K-line PRs. So maybe some way to decline or silence is important, or
> maybe just ping once and leave it. Sure, a bot that just adds a "Would @foo
> like to review?" comment on Github? Sure seems worth trying if someone is
> willing to do the work to cook up the bot.
>
> On Wed, Jun 6, 2018 at 12:22 PM Holden Karau 
> wrote:
>
>> Hi friends,
>>
>> Was chatting with some folks at the summit and I was wondering how
>> people would feel about adding a review bot to ping folks. We already have
>> the review dashboard but I was thinking we could ping folks who were the
>> original authors of the code being changed who might not be in the habit
>> of looking at the review dashboard.
>>
>> Cheers,
>>
>> Holden :)
>> --
>> Twitter: https://twitter.com/holdenkarau
>>