[ANNOUNCE] Announcing Apache Spark 2.2.3

2019-01-14 Thread Dongjoon Hyun
We are happy to announce the availability of Spark 2.2.3!

Apache Spark 2.2.3 is a maintenance release, based on the branch-2.2
maintenance branch of Spark. We strongly recommend all 2.2.x users to
upgrade to this stable release.

To download Spark 2.2.3, head over to the download page:
http://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-2-2-3.html

We would like to acknowledge all community members for contributing to
this release. This release would not have been possible without you.

Bests,
Dongjoon.


Re: Ask for reviewing on Structured Streaming PRs

2019-01-14 Thread Jungtaek Lim
Yes, I understand what Reynold stated (as Michael Armbrust stated earlier),
and I agree it's a great thing that improvements to CORE/SQL also benefit
SS.

I'm just concerned that both SQL and SS are being impacted by DSv2, but
things are going differently for the two. SQL is still seeing active
contributions that are not related to DSv2, while SS doesn't seem to.
I wish we had a small time slot to keep SS active (I'm not expecting the same
level as SQL, but reviews in time, before the authors of the PRs leave).

On Tue, Jan 15, 2019 at 11:00 AM, JackyLee wrote:

> Agree with rxin. Maybe we should consider these PRs, especially the
> large ones, after the DataSource V2 API is ready.


Re: Ask for reviewing on Structured Streaming PRs

2019-01-14 Thread JackyLee
Agree with rxin. Maybe we should consider these PRs, especially the
large ones, after the DataSource V2 API is ready.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Ask for reviewing on Structured Streaming PRs

2019-01-14 Thread Nicholas Chammas
OK, good to know, and that all makes sense. Thanks for clearing up my
concern.

One of the great things about Spark is, as you pointed out, that improvements
to core components benefit multiple features at once.

On Mon, Jan 14, 2019 at 8:36 PM Reynold Xin  wrote:

> BTW the largest change to SS right now is probably the entire data source
> API v2 effort, which aims to unify streaming and batch from data source
> perspective, and provide a reliable, expressive source/sink API.
>
>
> On Mon, Jan 14, 2019 at 5:34 PM, Reynold Xin  wrote:
>
>> There are a few things to keep in mind:
>>
>> 1. Structured Streaming isn't an independent project. It actually (by
>> design) depends on all the rest of Spark SQL, and virtually all
>> improvements to Spark SQL benefit Structured Streaming.
>>
>> 2. The project as far as I can tell is relatively mature for core ETL and
>> incremental processing purpose. I interact with a lot of users using it
>> everyday. We can always expand the use cases and add more, but that also
>> adds maintenance burden. In any case, it'd be good to get some activity
>> here.
>>
>>
>>
>>
>> On Mon, Jan 14, 2019 at 5:11 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> As an observer, this thread is interesting and concerning. Is there an
>>> emerging consensus that Structured Streaming is somehow not relevant
>>> anymore? Or is it just that folks consider it "complete enough"?
>>>
>>> Structured Streaming was billed as the replacement to DStreams. If
>>> committers, generally speaking, have lost interest in Structured Streaming,
>>> does that mean the Apache Spark project is somehow no longer aiming to
>>> provide a "first-class" solution to the problem of stream processing?
>>>
>>> On Mon, Jan 14, 2019 at 3:43 PM Jungtaek Lim  wrote:
>>>
>>>> Cody, I guess I already addressed your comments in the PR (#22138). The
>>>> approach was changed to address your concern, and after that Gabor helped
>>>> to review the PR. Please take a look again when you have time to get into.
>>>>
>>>> On Tue, Jan 15, 2019 at 1:01 AM, Cody Koeninger wrote:
>>>>
>>>>> I feel like I've already said my piece on
>>>>> https://github.com/apache/spark/pull/22138 let me know if you have
>>>>> more questions.
>>>>>
>>>>> As for SS in general, I don't have a production SS deployment, so I'm
>>>>> less comfortable with reviewing large changes to it.  But if no other
>>>>> committers are working on it...
>>>>>
>>>>> On Sun, Jan 13, 2019 at 5:19 PM Sean Owen wrote:
>>>>> >
>>>>> > Yes you're preaching to the choir here. SS does seem somewhat
>>>>> > abandoned by those that have worked on it. I have also been at times
>>>>> > frustrated that some areas fall into this pattern.
>>>>> >
>>>>> > There isn't a way to make people work on it, and I personally am not
>>>>> > interested in it nor have a background in SS.
>>>>> >
>>>>> > I did leave some comments on your PR and will see if we can get
>>>>> > comfortable with merging it, as I presume you are pretty knowledgeable
>>>>> > about the change.
>>>>> >
>>>>> > On Sun, Jan 13, 2019 at 4:55 PM Jungtaek Lim wrote:
>>>>> > >
>>>>> > > Sean, this is actually a fail-back on pinging committers. I know who
>>>>> > > can review and merge in SS area, and pinged to them, didn't work. Even
>>>>> > > there's a PR which approach was encouraged by committer and reviewed the
>>>>> > > first phase, and no review.
>>>>> > >
>>>>> > > That's not the first time I have faced the situation, and I used the
>>>>> > > fail-back approach at that time. (You can see there was no response
>>>>> > > even in the mail thread.) Not sure which approach worked.
>>>>> > >
>>>>> > > https://lists.apache.org/thread.html/c61f32249949b1ff1b265c1a7148c2ea7eda08891e3016fb24008561@%3Cdev.spark.apache.org%3E
>>>>> > >
>>>>> > > I've observed that only (critical) bugfixes are being reviewed and
>>>>> > > merged in time for SS area. For other stuffs like new features and
>>>>> > > improvements, both discussions and PRs were pretty less popular from
>>>>> > > committers: though there was even participation/approve from non-committer
>>>>> > > community. I don't think SS is the thing to be turned into maintenance.
>>>>> > >
>>>>> > > I guess PMC members should try to resolve such situation, as it
>>>>> > > will (slowly and quietly) make some issues like contributors leaving,
>>>>> > > module stopped growing up, etc.. The problem will grow up like a snowball:
>>>>> > > getting bigger and bigger. I don't mind if there's no interest on both
>>>>> > > contributors and committers for such module, but SS is not. Maybe either
>>>>> > > other committers who weren't familiar with should try to get familiar and
>>>>> > > cover the area, or the area needs more committers.
>>>>> > >
>>>>> > > -Jungtaek Lim (HeartSaVioR)
>>>>> > >
>>>>> > > On Sun, Jan 13, 2019 at 11:37 PM, Sean Owen wrote:
>>>>> > >>
>>>>> > >> Jungtaek, the best strategy is to find who wrote the code you are
>>>>> > >> modifying (use Github history or git blame) and ping them directly on
>>>>> > >> the PR. I don't

Re: Ask for reviewing on Structured Streaming PRs

2019-01-14 Thread Reynold Xin
BTW the largest change to SS right now is probably the entire data source API
v2 effort, which aims to unify streaming and batch from a data source
perspective, and provide a reliable, expressive source/sink API.

On Mon, Jan 14, 2019 at 5:34 PM, Reynold Xin < r...@databricks.com > wrote:

> There are a few things to keep in mind:
>
> 1. Structured Streaming isn't an independent project. It actually (by
> design) depends on all the rest of Spark SQL, and virtually all
> improvements to Spark SQL benefit Structured Streaming.
>
> 2. The project as far as I can tell is relatively mature for core ETL and
> incremental processing purpose. I interact with a lot of users using it
> everyday. We can always expand the use cases and add more, but that also
> adds maintenance burden. In any case, it'd be good to get some activity
> here.
>
> On Mon, Jan 14, 2019 at 5:11 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
> 
>> As an observer, this thread is interesting and concerning. Is there an
>> emerging consensus that Structured Streaming is somehow not relevant
>> anymore? Or is it just that folks consider it "complete enough"?
>> 
>> 
>> Structured Streaming was billed as the replacement to DStreams. If
>> committers, generally speaking, have lost interest in Structured
>> Streaming, does that mean the Apache Spark project is somehow no longer
>> aiming to provide a "first-class" solution to the problem of stream
>> processing?
>> 
>> On Mon, Jan 14, 2019 at 3:43 PM Jungtaek Lim <kabh...@gmail.com> wrote:
>> 
>> 
>>> Cody, I guess I already addressed your comments in the PR (#22138). The
>>> approach was changed to address your concern, and after that Gabor helped
>>> to review the PR. Please take a look again when you have time to get into.
>>> 
>>> 
>>> 
>>> On Tue, Jan 15, 2019 at 1:01 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>> 
>>> 
>>>> I feel like I've already said my piece on
>>>> https://github.com/apache/spark/pull/22138 let me know if you have
>>>> more questions.
>>>>
>>>> As for SS in general, I don't have a production SS deployment, so I'm
>>>> less comfortable with reviewing large changes to it.  But if no other
>>>> committers are working on it...
>>>>
>>>> On Sun, Jan 13, 2019 at 5:19 PM Sean Owen <sro...@gmail.com> wrote:
>>>> >
>>>> > Yes you're preaching to the choir here. SS does seem somewhat
>>>> > abandoned by those that have worked on it. I have also been at times
>>>> > frustrated that some areas fall into this pattern.
>>>> >
>>>> > There isn't a way to make people work on it, and I personally am not
>>>> > interested in it nor have a background in SS.
>>>> >
>>>> > I did leave some comments on your PR and will see if we can get
>>>> > comfortable with merging it, as I presume you are pretty knowledgeable
>>>> > about the change.
>>>> >
>>>> > On Sun, Jan 13, 2019 at 4:55 PM Jungtaek Lim <kabh...@gmail.com> wrote:
>>>> > >
>>>> > > Sean, this is actually a fail-back on pinging committers. I know who
>>>> > > can review and merge in SS area, and pinged to them, didn't work. Even
>>>> > > there's a PR which approach was encouraged by committer and reviewed the
>>>> > > first phase, and no review.
>>>> > >
>>>> > > That's not the first time I have faced the situation, and I used the
>>>> > > fail-back approach at that time. (You can see there was no response even
>>>> > > in the mail thread.) Not sure which approach worked.
>>>> > >
>>>> > > https://lists.apache.org/thread.html/c61f32249949b1ff1b265c1a7148c2ea7eda08891e3016fb24008561@%3Cdev.spark.apache.org%3E
>>>> > >
>>>> > > I've observed that only (critical) bugfixes are being reviewed and
>>>> > > merged in time for SS area. For other stuffs like new features and
>>>> > > improvements, both discussions and PRs were pretty less popular from
>>>> > > committers: though there was even participation/approve from non-committer
>>>> > > community. I don't think SS is the thing to be turned into maintenance.
>>>> > >
>>>> > > I guess PMC members should try to resolve such situation, as it will
>>>> > > (slowly and quietly) make some issues like contributors leaving, module
>>>> > > stopped growing up, etc.. The problem will grow up like a snowball:
>>>> > > getting bigger and bigger. I don't mind if there's no interest on both
>>>> > > contributors and committers for such module, but SS is not. Maybe either
>>>> > > other committers who weren't familiar with should try to get familiar and
>>>> > > cover the area, or the area needs more committers.
>>>> > >
>>>> > > -Jungtaek Lim (HeartSaVioR)
>>>> > >
>>>> > > On Sun, Jan 13, 2019 at 11:37 PM, Sean Owen < srowen@ gmail. com (
 

Re: Ask for reviewing on Structured Streaming PRs

2019-01-14 Thread Reynold Xin
There are a few things to keep in mind:

1. Structured Streaming isn't an independent project. It actually (by design) 
depends on all the rest of Spark SQL, and virtually all improvements to Spark 
SQL benefit Structured Streaming.

2. The project, as far as I can tell, is relatively mature for core ETL and
incremental processing purposes. I interact with a lot of users using it
every day. We can always expand the use cases and add more, but that also adds
maintenance burden. In any case, it'd be good to get some activity here.

On Mon, Jan 14, 2019 at 5:11 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> 
> As an observer, this thread is interesting and concerning. Is there an
> emerging consensus that Structured Streaming is somehow not relevant
> anymore? Or is it just that folks consider it "complete enough"?
> 
> 
> Structured Streaming was billed as the replacement to DStreams. If
> committers, generally speaking, have lost interest in Structured
> Streaming, does that mean the Apache Spark project is somehow no longer
> aiming to provide a "first-class" solution to the problem of stream
> processing?
> 
> On Mon, Jan 14, 2019 at 3:43 PM Jungtaek Lim <kabh...@gmail.com> wrote:
> 
> 
>> Cody, I guess I already addressed your comments in the PR (#22138). The
>> approach was changed to address your concern, and after that Gabor helped
>> to review the PR. Please take a look again when you have time to get into.
>> 
>> 
>> 
>> On Tue, Jan 15, 2019 at 1:01 AM, Cody Koeninger <c...@koeninger.org> wrote:
>> 
>> 
>>> I feel like I've already said my piece on
>>> https://github.com/apache/spark/pull/22138 let me know if you have
>>> more questions.
>>> 
>>> As for SS in general, I don't have a production SS deployment, so I'm
>>> less comfortable with reviewing large changes to it.  But if no other
>>> committers are working on it...
>>> 
>>> On Sun, Jan 13, 2019 at 5:19 PM Sean Owen <sro...@gmail.com> wrote:
>>> >
>>> > Yes you're preaching to the choir here. SS does seem somewhat
>>> > abandoned by those that have worked on it. I have also been at times
>>> > frustrated that some areas fall into this pattern.
>>> >
>>> > There isn't a way to make people work on it, and I personally am not
>>> > interested in it nor have a background in SS.
>>> >
>>> > I did leave some comments on your PR and will see if we can get
>>> > comfortable with merging it, as I presume you are pretty knowledgeable
>>> > about the change.
>>> >
>>> > On Sun, Jan 13, 2019 at 4:55 PM Jungtaek Lim <kabh...@gmail.com> wrote:
>>> > >
>>> > > Sean, this is actually a fail-back on pinging committers. I know who
>>> can review and merge in SS area, and pinged to them, didn't work. Even
>>> there's a PR which approach was encouraged by committer and reviewed the
>>> first phase, and no review.
>>> > >
>>> > > That's not the first time I have faced the situation, and I used the
>>> fail-back approach at that time. (You can see there was no response even
>>> in the mail thread.) Not sure which approach worked.
>>> > > https://lists.apache.org/thread.html/c61f32249949b1ff1b265c1a7148c2ea7eda08891e3016fb24008561@%3Cdev.spark.apache.org%3E
>>> > >
>>> > > I've observed that only (critical) bugfixes are being reviewed and
>>> merged in time for SS area. For other stuffs like new features and
>>> improvements, both discussions and PRs were pretty less popular from
>>> committers: though there was even participation/approve from non-committer
>>> community. I don't think SS is the thing to be turned into maintenance.
>>> > >
>>> > > I guess PMC members should try to resolve such situation, as it will
>>> (slowly and quietly) make some issues like contributors leaving, module
>>> stopped growing up, etc.. The problem will grow up like a snowball:
>>> getting bigger and bigger. I don't mind if there's no interest on both
>>> contributors and committers for such module, but SS is not. Maybe either
>>> other committers who weren't familiar with should try to get familiar and
>>> cover the area, or the area needs more committers.
>>> > >
>>> > > -Jungtaek Lim (HeartSaVioR)
>>> > >
>>> > > On Sun, Jan 13, 2019 at 11:37 PM, Sean Owen <sro...@gmail.com> wrote:
>>> > >>
>>> > >> Jungtaek, the best strategy is to find who wrote the code you are
>>> > >> modifying (use Github history or git blame) and ping them directly on
>>> > >> the PR. I don't know this code well myself.
>>> > >> It also helps if you can address why the functionality is important,
>>> > >> and describe compatibility implications.
>>> > >>
>>> > >> Most PRs are not merged, note. Not commenting on this particular one,
>>> 
>>> > >> 

Re: Ask for reviewing on Structured Streaming PRs

2019-01-14 Thread Nicholas Chammas
As an observer, this thread is interesting and concerning. Is there an
emerging consensus that Structured Streaming is somehow not relevant
anymore? Or is it just that folks consider it "complete enough"?

Structured Streaming was billed as the replacement to DStreams. If
committers, generally speaking, have lost interest in Structured Streaming,
does that mean the Apache Spark project is somehow no longer aiming to
provide a "first-class" solution to the problem of stream processing?

On Mon, Jan 14, 2019 at 3:43 PM Jungtaek Lim  wrote:

> Cody, I guess I already addressed your comments in the PR (#22138). The
> approach was changed to address your concern, and after that Gabor helped
> to review the PR. Please take a look again when you have time to get into.
>
> On Tue, Jan 15, 2019 at 1:01 AM, Cody Koeninger wrote:
>
>> I feel like I've already said my piece on
>> https://github.com/apache/spark/pull/22138 let me know if you have
>> more questions.
>>
>> As for SS in general, I don't have a production SS deployment, so I'm
>> less comfortable with reviewing large changes to it.  But if no other
>> committers are working on it...
>>
>> On Sun, Jan 13, 2019 at 5:19 PM Sean Owen  wrote:
>> >
>> > Yes you're preaching to the choir here. SS does seem somewhat
>> > abandoned by those that have worked on it. I have also been at times
>> > frustrated that some areas fall into this pattern.
>> >
>> > There isn't a way to make people work on it, and I personally am not
>> > interested in it nor have a background in SS.
>> >
>> > I did leave some comments on your PR and will see if we can get
>> > comfortable with merging it, as I presume you are pretty knowledgeable
>> > about the change.
>> >
>> > On Sun, Jan 13, 2019 at 4:55 PM Jungtaek Lim  wrote:
>> > >
>> > > Sean, this is actually a fail-back on pinging committers. I know who
>> can review and merge in SS area, and pinged to them, didn't work. Even
>> there's a PR which approach was encouraged by committer and reviewed the
>> first phase, and no review.
>> > >
>> > > That's not the first time I have faced the situation, and I used the
>> fail-back approach at that time. (You can see there was no response even in
>> the mail thread.) Not sure which approach worked.
>> > >
>> https://lists.apache.org/thread.html/c61f32249949b1ff1b265c1a7148c2ea7eda08891e3016fb24008561@%3Cdev.spark.apache.org%3E
>> > >
>> > > I've observed that only (critical) bugfixes are being reviewed and
>> merged in time for SS area. For other stuffs like new features and
>> improvements, both discussions and PRs were pretty less popular from
>> committers: though there was even participation/approve from non-committer
>> community. I don't think SS is the thing to be turned into maintenance.
>> > >
>> > > I guess PMC members should try to resolve such situation, as it will
>> (slowly and quietly) make some issues like contributors leaving, module
>> stopped growing up, etc.. The problem will grow up like a snowball: getting
>> bigger and bigger. I don't mind if there's no interest on both contributors
>> and committers for such module, but SS is not. Maybe either other
>> committers who weren't familiar with should try to get familiar and cover
>> the area, or the area needs more committers.
>> > >
>> > > -Jungtaek Lim (HeartSaVioR)
>> > >
>> > > On Sun, Jan 13, 2019 at 11:37 PM, Sean Owen wrote:
>> > >>
>> > >> Jungtaek, the best strategy is to find who wrote the code you are
>> > >> modifying (use Github history or git blame) and ping them directly on
>> > >> the PR. I don't know this code well myself.
>> > >> It also helps if you can address why the functionality is important,
>> > >> and describe compatibility implications.
>> > >>
>> > >> Most PRs are not merged, note. Not commenting on this particular one,
>> > >> but it's not a 'bug' if it's not being merged.
>> > >>
>> > >> On Sun, Jan 13, 2019 at 12:29 AM Jungtaek Lim 
>> wrote:
>> > >> >
>> > >> > I'm sorry but let me remind this, as non-SS PRs are being reviewed
>> accordingly, whereas many of SS PRs (regardless of who create) are still
>> not reviewed and merged in time.
>> > >> >
>> > >> > On Thu, Jan 3, 2019 at 7:57 AM, Jungtaek Lim wrote:
>> > >> >>
>> > >> >> Spark devs, happy new year!
>> > >> >>
>> > >> >> I would like to remind this kindly, since there was actually no
>> review after initiating the thread.
>> > >> >>
>> > >> >> Thanks,
>> > >> >> Jungtaek Lim (HeartSaVioR)
>> > >> >>
>> > >> >> On Wed, Dec 12, 2018 at 11:12 PM, Vaclav Kosar wrote:
>> > >> >>>
>> > >> >>> I am also waiting for any finalization of my PR [3]. It seems
>> > >> >>> that SS PRs are not being reviewed much these days.
>> > >> >>>
>> > >> >>> [3] https://github.com/apache/spark/pull/21919
>> > >> >>>
>> > >> >>>
>> > >> >>> On 12. 12. 18 14:37, Dongjin Lee wrote:
>> > >> >>>
>> > >> >>> If it is possible, could you review my PR on Kafka's header
>> functionality[^1] also? It was added in Kafka 0.11.0.0 but still not
>> supported in Spark.
>> > >> >>>
>> > >> >>> Thanks,
>> > 

[DISCUSS] SPIP SPARK-26257

2019-01-14 Thread tcondie
Dear Spark Community,

 

I have posted a SPIP to JIRA:
https://issues.apache.org/jira/browse/SPARK-26257

 

I look forward to your feedback on the JIRA ticket.

 

Best regards,

Tyson



Re: Ask for reviewing on Structured Streaming PRs

2019-01-14 Thread Jungtaek Lim
Cody, I guess I already addressed your comments in the PR (#22138). The
approach was changed to address your concern, and after that Gabor helped
to review the PR. Please take a look again when you have time to look into it.

On Tue, Jan 15, 2019 at 1:01 AM, Cody Koeninger wrote:

> I feel like I've already said my piece on
> https://github.com/apache/spark/pull/22138 let me know if you have
> more questions.
>
> As for SS in general, I don't have a production SS deployment, so I'm
> less comfortable with reviewing large changes to it.  But if no other
> committers are working on it...
>
> On Sun, Jan 13, 2019 at 5:19 PM Sean Owen  wrote:
> >
> > Yes you're preaching to the choir here. SS does seem somewhat
> > abandoned by those that have worked on it. I have also been at times
> > frustrated that some areas fall into this pattern.
> >
> > There isn't a way to make people work on it, and I personally am not
> > interested in it nor have a background in SS.
> >
> > I did leave some comments on your PR and will see if we can get
> > comfortable with merging it, as I presume you are pretty knowledgeable
> > about the change.
> >
> > On Sun, Jan 13, 2019 at 4:55 PM Jungtaek Lim  wrote:
> > >
> > > Sean, this is actually a fail-back on pinging committers. I know who
> can review and merge in SS area, and pinged to them, didn't work. Even
> there's a PR which approach was encouraged by committer and reviewed the
> first phase, and no review.
> > >
> > > That's not the first time I have faced the situation, and I used the
> fail-back approach at that time. (You can see there was no response even in
> the mail thread.) Not sure which approach worked.
> > >
> https://lists.apache.org/thread.html/c61f32249949b1ff1b265c1a7148c2ea7eda08891e3016fb24008561@%3Cdev.spark.apache.org%3E
> > >
> > > I've observed that only (critical) bugfixes are being reviewed and
> merged in time for SS area. For other stuffs like new features and
> improvements, both discussions and PRs were pretty less popular from
> committers: though there was even participation/approve from non-committer
> community. I don't think SS is the thing to be turned into maintenance.
> > >
> > > I guess PMC members should try to resolve such situation, as it will
> (slowly and quietly) make some issues like contributors leaving, module
> stopped growing up, etc.. The problem will grow up like a snowball: getting
> bigger and bigger. I don't mind if there's no interest on both contributors
> and committers for such module, but SS is not. Maybe either other
> committers who weren't familiar with should try to get familiar and cover
> the area, or the area needs more committers.
> > >
> > > -Jungtaek Lim (HeartSaVioR)
> > >
> > > On Sun, Jan 13, 2019 at 11:37 PM, Sean Owen wrote:
> > >>
> > >> Jungtaek, the best strategy is to find who wrote the code you are
> > >> modifying (use Github history or git blame) and ping them directly on
> > >> the PR. I don't know this code well myself.
> > >> It also helps if you can address why the functionality is important,
> > >> and describe compatibility implications.
> > >>
> > >> Most PRs are not merged, note. Not commenting on this particular one,
> > >> but it's not a 'bug' if it's not being merged.
> > >>
> > >> On Sun, Jan 13, 2019 at 12:29 AM Jungtaek Lim 
> wrote:
> > >> >
> > >> > I'm sorry but let me remind this, as non-SS PRs are being reviewed
> accordingly, whereas many of SS PRs (regardless of who create) are still
> not reviewed and merged in time.
> > >> >
> > >> > On Thu, Jan 3, 2019 at 7:57 AM, Jungtaek Lim wrote:
> > >> >>
> > >> >> Spark devs, happy new year!
> > >> >>
> > >> >> I would like to remind this kindly, since there was actually no
> review after initiating the thread.
> > >> >>
> > >> >> Thanks,
> > >> >> Jungtaek Lim (HeartSaVioR)
> > >> >>
> > >> >> On Wed, Dec 12, 2018 at 11:12 PM, Vaclav Kosar wrote:
> > >> >>>
> > >> >>> I am also waiting for any finalization of my PR [3]. It seems that
> > >> >>> SS PRs are not being reviewed much these days.
> > >> >>>
> > >> >>> [3] https://github.com/apache/spark/pull/21919
> > >> >>>
> > >> >>>
> > >> >>> On 12. 12. 18 14:37, Dongjin Lee wrote:
> > >> >>>
> > >> >>> If it is possible, could you review my PR on Kafka's header
> functionality[^1] also? It was added in Kafka 0.11.0.0 but still not
> supported in Spark.
> > >> >>>
> > >> >>> Thanks,
> > >> >>> Dongjin
> > >> >>>
> > >> >>> [^1]: https://github.com/apache/spark/pull/22282
> > >> >>> [^2]: https://issues.apache.org/jira/browse/KAFKA-4208
> > >> >>>
> > >> >>> On Wed, Dec 12, 2018 at 6:43 PM Jungtaek Lim 
> wrote:
> > >> 
> > >>  Hi devs,
> > >> 
> > >>  Would I kindly ask for reviewing on PRs for Structured
> Streaming? I have 5 open pull requests on SS side [1] (earliest PR was
> opened around 4 months so far), and there looks like couple of PR for
> others [2] which looks good to be reviewed, too.
> > >> 
> > >>  Thanks in advance,
> > >>  Jungtaek Lim (HeartSaVioR)
> > >> 
> > >>  1.
> 

Re: Ask for reviewing on Structured Streaming PRs

2019-01-14 Thread Jungtaek Lim
Sad to hear that. While I understand such a thing can happen in any
project, it feels like a bad sign to me that a non-experimental major
feature with no alternative is losing interest.

I also fully agree that there isn't a way to make people work on it (I have
encountered similar situations in most of the projects in which I have been
involved as a committer or PMC member), but things might get better depending
on how we deal with the situation, given that there are some people (not only
me) who would like to work on SS and are feeling stuck.

I really appreciate your help in trying to review PRs in an area you're not
comfortable with. I understand that's not easy. Thanks for doing that!

On Mon, Jan 14, 2019 at 8:19 AM, Sean Owen wrote:

> Yes you're preaching to the choir here. SS does seem somewhat
> abandoned by those that have worked on it. I have also been at times
> frustrated that some areas fall into this pattern.
>
> There isn't a way to make people work on it, and I personally am not
> interested in it nor have a background in SS.
>
> I did leave some comments on your PR and will see if we can get
> comfortable with merging it, as I presume you are pretty knowledgeable
> about the change.
>
> On Sun, Jan 13, 2019 at 4:55 PM Jungtaek Lim  wrote:
> >
> > Sean, this is actually a fail-back on pinging committers. I know who can
> review and merge in SS area, and pinged to them, didn't work. Even there's
> a PR which approach was encouraged by committer and reviewed the first
> phase, and no review.
> >
> > That's not the first time I have faced the situation, and I used the
> fail-back approach at that time. (You can see there was no response even in
> the mail thread.) Not sure which approach worked.
> >
> https://lists.apache.org/thread.html/c61f32249949b1ff1b265c1a7148c2ea7eda08891e3016fb24008561@%3Cdev.spark.apache.org%3E
> >
> > I've observed that only (critical) bugfixes are being reviewed and
> merged in time for SS area. For other stuffs like new features and
> improvements, both discussions and PRs were pretty less popular from
> committers: though there was even participation/approve from non-committer
> community. I don't think SS is the thing to be turned into maintenance.
> >
> > I guess PMC members should try to resolve such situation, as it will
> (slowly and quietly) make some issues like contributors leaving, module
> stopped growing up, etc.. The problem will grow up like a snowball: getting
> bigger and bigger. I don't mind if there's no interest on both contributors
> and committers for such module, but SS is not. Maybe either other
> committers who weren't familiar with should try to get familiar and cover
> the area, or the area needs more committers.
> >
> > -Jungtaek Lim (HeartSaVioR)
> >
> > On Sun, Jan 13, 2019 at 11:37 PM, Sean Owen wrote:
> >>
> >> Jungtaek, the best strategy is to find who wrote the code you are
> >> modifying (use Github history or git blame) and ping them directly on
> >> the PR. I don't know this code well myself.
> >> It also helps if you can address why the functionality is important,
> >> and describe compatibility implications.
> >>
> >> Most PRs are not merged, note. Not commenting on this particular one,
> >> but it's not a 'bug' if it's not being merged.
> >>
> >> On Sun, Jan 13, 2019 at 12:29 AM Jungtaek Lim 
> wrote:
> >> >
> >> > I'm sorry but let me remind this, as non-SS PRs are being reviewed
> accordingly, whereas many of SS PRs (regardless of who create) are still
> not reviewed and merged in time.
> >> >
> >> > On Thu, Jan 3, 2019 at 7:57 AM, Jungtaek Lim wrote:
> >> >>
> >> >> Spark devs, happy new year!
> >> >>
> >> >> I would like to remind this kindly, since there was actually no
> review after initiating the thread.
> >> >>
> >> >> Thanks,
> >> >> Jungtaek Lim (HeartSaVioR)
> >> >>
> >> >> On Wed, Dec 12, 2018 at 11:12 PM, Vaclav Kosar wrote:
> >> >>>
> >> >>> I am also waiting for any finalization of my PR [3]. It seems that
> >> >>> SS PRs are not being reviewed much these days.
> >> >>>
> >> >>> [3] https://github.com/apache/spark/pull/21919
> >> >>>
> >> >>>
> >> >>> On 12. 12. 18 14:37, Dongjin Lee wrote:
> >> >>>
> >> >>> If it is possible, could you review my PR on Kafka's header
> functionality[^1] also? It was added in Kafka 0.11.0.0 but still not
> supported in Spark.
> >> >>>
> >> >>> Thanks,
> >> >>> Dongjin
> >> >>>
> >> >>> [^1]: https://github.com/apache/spark/pull/22282
> >> >>> [^2]: https://issues.apache.org/jira/browse/KAFKA-4208
> >> >>>
> >> >>> On Wed, Dec 12, 2018 at 6:43 PM Jungtaek Lim 
> wrote:
> >> >>>>
> >> >>>> Hi devs,
> >> >>>>
> >> >>>> May I kindly ask for reviews of the open Structured Streaming PRs?
> I have 5 open pull requests on the SS side [1] (the earliest was opened around 4
> months ago), and there look to be a couple of PRs from others [2] that
> look ready for review, too.
> >> >>>>
> >> >>>> Thanks in advance,
> >> >>>> Jungtaek Lim (HeartSaVioR)
> >> >>>>
> >> >>>> 1.
> 

Re: [build system] jenkins mildly wedged, needs a quick restart

2019-01-14 Thread shane knapp
alright, everything seems to be working as expected!  :)

On Mon, Jan 14, 2019 at 11:07 AM shane knapp  wrote:

> we're back up and building...  things still seem a little flaky so i'll be
> investigating a little bit deeper into what's going on.
>
> On Mon, Jan 14, 2019 at 10:55 AM shane knapp  wrote:
>
>> this will kill a bunch of PRB builds, so i'll go and retrigger them once
>> jenkins is back up.
>>
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [build system] jenkins mildly wedged, needs a quick restart

2019-01-14 Thread shane knapp
we're back up and building...  things still seem a little flaky so i'll be
investigating a little bit deeper into what's going on.

On Mon, Jan 14, 2019 at 10:55 AM shane knapp  wrote:

> this will kill a bunch of PRB builds, so i'll go and retrigger them once
> jenkins is back up.
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


[build system] jenkins mildly wedged, needs a quick restart

2019-01-14 Thread shane knapp
this will kill a bunch of PRB builds, so i'll go and retrigger them once
jenkins is back up.

-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [mllib] Document frequency

2019-01-14 Thread Jatin Puri
Thanks. Created: https://issues.apache.org/jira/browse/SPARK-26616

On Mon, Jan 14, 2019 at 9:19 PM Sean Owen  wrote:

> Yes that seems OK to me.
>
> On Mon, Jan 14, 2019 at 9:40 AM Jatin Puri  wrote:
> >
> > Thanks for the response. So do I go ahead and create a jira ticket?
> > Can then send a pull request for the same with the changes.
> >
> > On Mon, Jan 14, 2019 at 8:18 PM Sean Owen  wrote:
> >>
> >> I think that's reasonable. The caller probably has the number of docs
> >> already but sure, it's one long and is already computed. This would
> >> have to be added to Pyspark too.
> >>
> >> On Mon, Jan 14, 2019 at 7:56 AM Jatin Puri  wrote:
> >> >
> >> > Hello.
> >> >
> >> > As part of `org.apache.spark.ml.feature.IDFModel`, I think it is a
> good idea to also expose:
> >> >
> >> > 1. Document frequency vector
> >> > 2. Number of documents
> >> >
> >> > We get the above for free currently and they just need to be exposed
> as public val.
> >> >
> >> > This avoids re-implementation for someone who needs to compute
> DocumentFrequency of terms. Currently if someone needs df, then one would
> need to reverse compute it based on the idf values obtained.
> >> >
> >> > AFAIK, we don't explicitly provide such functionality in mllib. And
> we don't need a separate class if we can expose it in `IDFModel`
> itself.
> >> >
> >> > Does it sound alright?
> >> >
> >> > Regards,
> >> > Jatin
> >> >
> >
> >
> >
> > --
> > Jatin Puri
> > http://jatinpuri.com
> >
>


-- 
Jatin Puri
http://jatinpuri.com 


Re: Ask for reviewing on Structured Streaming PRs

2019-01-14 Thread Cody Koeninger
I feel like I've already said my piece on
https://github.com/apache/spark/pull/22138; let me know if you have
more questions.

As for SS in general, I don't have a production SS deployment, so I'm
less comfortable with reviewing large changes to it.  But if no other
committers are working on it...

On Sun, Jan 13, 2019 at 5:19 PM Sean Owen  wrote:
>
> Yes you're preaching to the choir here. SS does seem somewhat
> abandoned by those that have worked on it. I have also been at times
> frustrated that some areas fall into this pattern.
>
> There isn't a way to make people work on it, and I personally am not
> interested in it nor have a background in SS.
>
> I did leave some comments on your PR and will see if we can get
> comfortable with merging it, as I presume you are pretty knowledgeable
> about the change.
>
> On Sun, Jan 13, 2019 at 4:55 PM Jungtaek Lim  wrote:
> >
> > Sean, this thread is actually a fallback after pinging committers. I know who can
> > review and merge in the SS area and pinged them, but it didn't work. There's even
> > a PR whose approach was encouraged by a committer, who reviewed the first
> > phase, and it has had no review since.
> >
> > That's not the first time I have faced this situation, and I used the same
> > fallback approach then. (You can see there was no response even in
> > that mail thread.) I'm not sure which approach worked.
> > https://lists.apache.org/thread.html/c61f32249949b1ff1b265c1a7148c2ea7eda08891e3016fb24008561@%3Cdev.spark.apache.org%3E
> >
> > I've observed that only (critical) bugfixes are being reviewed and merged
> > in time in the SS area. Other work, such as new features and improvements,
> > has drawn much less attention from committers in both discussions and PRs, even
> > when non-committers in the community participated or approved. I don't
> > think SS should be put into maintenance mode.
> >
> > I think PMC members should try to resolve this situation, as it will
> > (slowly and quietly) cause problems such as contributors leaving and the module
> > ceasing to grow. The problem will snowball, getting
> > bigger and bigger. I wouldn't mind if neither contributors
> > nor committers were interested in the module, but that is not the case for SS. Either
> > committers who aren't familiar with the area should get familiar with it and cover
> > it, or the area needs more committers.
> >
> > -Jungtaek Lim (HeartSaVioR)
> >
> > On Sun, Jan 13, 2019 at 11:37 PM Sean Owen wrote:
> >>
> >> Jungtaek, the best strategy is to find who wrote the code you are
> >> modifying (use Github history or git blame) and ping them directly on
> >> the PR. I don't know this code well myself.
> >> It also helps if you can address why the functionality is important,
> >> and describe compatibility implications.
> >>
> >> Most PRs are not merged, note. Not commenting on this particular one,
> >> but it's not a 'bug' if it's not being merged.
> >>
> >> On Sun, Jan 13, 2019 at 12:29 AM Jungtaek Lim  wrote:
> >> >
> >> > I'm sorry, but let me bump this again: non-SS PRs are being reviewed
> >> > as expected, whereas many SS PRs (regardless of who created them) are still
> >> > not being reviewed and merged in time.
> >> >
> >> > On Thu, Jan 3, 2019 at 7:57 AM Jungtaek Lim wrote:
> >> >>
> >> >> Spark devs, happy new year!
> >> >>
> >> >> I would like to kindly bump this, since there has been no review
> >> >> since I started the thread.
> >> >>
> >> >> Thanks,
> >> >> Jungtaek Lim (HeartSaVioR)
> >> >>
> >> >> On Wed, Dec 12, 2018 at 11:12 PM Vaclav Kosar wrote:
> >> >>>
> >> >>> I am also waiting for my PR [3] to be finalized. It seems that SS
> >> >>> PRs are not being reviewed much these days.
> >> >>>
> >> >>> [3] https://github.com/apache/spark/pull/21919
> >> >>>
> >> >>>
> >> >>> On 12. 12. 18 14:37, Dongjin Lee wrote:
> >> >>>
> >> >>> If possible, could you also review my PR on Kafka's header
> >> >>> functionality[^1]? It was added in Kafka 0.11.0.0 but is still not
> >> >>> supported in Spark.
> >> >>>
> >> >>> Thanks,
> >> >>> Dongjin
> >> >>>
> >> >>> [^1]: https://github.com/apache/spark/pull/22282
> >> >>> [^2]: https://issues.apache.org/jira/browse/KAFKA-4208
> >> >>>
> >> >>> On Wed, Dec 12, 2018 at 6:43 PM Jungtaek Lim  wrote:
> >> >>>>
> >> >>>> Hi devs,
> >> >>>>
> >> >>>> May I kindly ask for reviews of the open Structured Streaming PRs? I
> >> >>>> have 5 open pull requests on the SS side [1] (the earliest was opened
> >> >>>> around 4 months ago), and there look to be a couple of PRs from others
> >> >>>> [2] that look ready for review, too.
> >> >>>>
> >> >>>> Thanks in advance,
> >> >>>> Jungtaek Lim (HeartSaVioR)
> >> >>>>
> >> >>>> 1.
> >> >>>> https://github.com/apache/spark/pulls?utf8=%E2%9C%93=is%3Aopen+is%3Apr+author%3AHeartSaVioR+%5BSS%5D
> >> >>>> 2.
> >> >>>> https://github.com/apache/spark/pulls?utf8=%E2%9C%93=is%3Aopen+is%3Apr+%5BSS%5D+
> >> >>>>
> >> >>>
> >> >>>
> >> >>> --
> >> >>> Dongjin Lee
> >> >>>
> >> >>> A hitchhiker in the mathematical world.

Re: [mllib] Document frequency

2019-01-14 Thread Sean Owen
Yes that seems OK to me.

On Mon, Jan 14, 2019 at 9:40 AM Jatin Puri  wrote:
>
> Thanks for the response. So do I go ahead and create a jira ticket?
> Can then send a pull request for the same with the changes.
>
> On Mon, Jan 14, 2019 at 8:18 PM Sean Owen  wrote:
>>
>> I think that's reasonable. The caller probably has the number of docs
>> already but sure, it's one long and is already computed. This would
>> have to be added to Pyspark too.
>>
>> On Mon, Jan 14, 2019 at 7:56 AM Jatin Puri  wrote:
>> >
>> > Hello.
>> >
>> > As part of `org.apache.spark.ml.feature.IDFModel`, I think it is a good 
>> > idea to also expose:
>> >
>> > 1. Document frequency vector
>> > 2. Number of documents
>> >
>> > We get the above for free currently and they just need to be exposed as 
>> > public val.
>> >
>> > This avoids re-implementation for someone who needs to compute 
>> > DocumentFrequency of terms. Currently if someone needs df, then one would 
>> > need to reverse compute it based on the idf values obtained.
>> >
>> > AFAIK, we don't explicitly provide such functionality in mllib. And we
>> > don't need a separate class if we can expose it in `IDFModel`
>> > itself.
>> >
>> > Does it sound alright?
>> >
>> > Regards,
>> > Jatin
>> >
>
>
>
> --
> Jatin Puri
> http://jatinpuri.com
>




Re: [mllib] Document frequency

2019-01-14 Thread Jatin Puri
Thanks for the response. So do I go ahead and create a jira ticket?
Can then send a pull request for the same with the changes.

On Mon, Jan 14, 2019 at 8:18 PM Sean Owen  wrote:

> I think that's reasonable. The caller probably has the number of docs
> already but sure, it's one long and is already computed. This would
> have to be added to Pyspark too.
>
> On Mon, Jan 14, 2019 at 7:56 AM Jatin Puri  wrote:
> >
> > Hello.
> >
> > As part of `org.apache.spark.ml.feature.IDFModel`, I think it is a good
> idea to also expose:
> >
> > 1. Document frequency vector
> > 2. Number of documents
> >
> > We get the above for free currently and they just need to be exposed as
> public val.
> >
> > This avoids re-implementation for someone who needs to compute
> DocumentFrequency of terms. Currently if someone needs df, then one would
> need to reverse compute it based on the idf values obtained.
> >
> > AFAIK, we don't explicitly provide such functionality in mllib. And we
> don't need a separate class if we can expose it in `IDFModel`
> itself.
> >
> > Does it sound alright?
> >
> > Regards,
> > Jatin
> >
>


-- 
Jatin Puri
http://jatinpuri.com 


Re: [mllib] Document frequency

2019-01-14 Thread Sean Owen
I think that's reasonable. The caller probably has the number of docs
already but sure, it's one long and is already computed. This would
have to be added to Pyspark too.

On Mon, Jan 14, 2019 at 7:56 AM Jatin Puri  wrote:
>
> Hello.
>
> As part of `org.apache.spark.ml.feature.IDFModel`, I think it is a good idea 
> to also expose:
>
> 1. Document frequency vector
> 2. Number of documents
>
> We get the above for free currently and they just need to be exposed as 
> public val.
>
> This avoids re-implementation for someone who needs to compute 
> DocumentFrequency of terms. Currently if someone needs df, then one would 
> need to reverse compute it based on the idf values obtained.
>
> AFAIK, we don't explicitly provide such functionality in mllib. And we don't
> need a separate class if we can expose it in `IDFModel` itself.
>
> Does it sound alright?
>
> Regards,
> Jatin
>




[mllib] Document frequency

2019-01-14 Thread Jatin Puri
Hello.

As part of `org.apache.spark.ml.feature.IDFModel`, I think it is a good
idea to also expose:

1. Document frequency vector
2. Number of documents

We get the above for free currently and they just need to be exposed as
public val.
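
To illustrate why these come for free, here is a minimal sketch in plain
Python (not the Spark implementation). It assumes the smoothed formula from
the MLlib documentation, idf = log((m + 1) / (df + 1)), where m is the number
of documents. The function name `fit_idf` and the dense list-of-lists input
are hypothetical, chosen only for illustration. Fitting has to build the
document frequency vector and the document count before it can produce any
idf value:

```python
import math
from typing import List, Tuple


def fit_idf(docs: List[List[float]]) -> Tuple[List[float], List[int], int]:
    """Return (idf, doc_freq, num_docs) for dense term-frequency vectors.

    Hypothetical helper: each inner list holds the term counts of one document.
    """
    num_docs = len(docs)
    vocab_size = len(docs[0]) if docs else 0

    # Document frequency: in how many documents does each term occur?
    doc_freq = [0] * vocab_size
    for term_counts in docs:
        for term, count in enumerate(term_counts):
            if count > 0:
                doc_freq[term] += 1

    # Smoothed idf, assuming idf = log((m + 1) / (df + 1)).
    idf = [math.log((num_docs + 1) / (df + 1)) for df in doc_freq]
    return idf, doc_freq, num_docs
```

In this sketch `doc_freq` and `num_docs` are plain intermediates of the idf
computation, so returning them costs nothing extra.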

This avoids re-implementation for someone who needs to compute
DocumentFrequency of terms. Currently if someone needs df, then one would
need to reverse compute it based on the idf values obtained.
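
As a rough sketch of that reverse computation, in plain Python and again
assuming the documented formula idf = log((m + 1) / (df + 1)): solving for df
gives df = (m + 1) / exp(idf) - 1. The function name `doc_freq_from_idf` is
hypothetical, and the caller must already know the number of documents m,
which is the other value not exposed today.

```python
import math
from typing import List


def doc_freq_from_idf(idf: List[float], num_docs: int) -> List[int]:
    """Invert idf = log((m + 1) / (df + 1)) to recover df for each term.

    Hypothetical helper; round() absorbs floating-point error from exp/log.
    """
    return [round((num_docs + 1) / math.exp(value) - 1) for value in idf]
```

One caveat, as far as I understand it: if a minimum document frequency filter
was applied during fitting, filtered terms get an idf of 0, and the inversion
would report df = m for them instead of their true count. That is one more
reason to expose the vector directly.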

AFAIK, we don't explicitly provide such functionality in mllib. And we
don't need a separate class if we can expose it in `IDFModel`
itself.

Does it sound alright?

Regards,
Jatin