Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-26 Thread Matei Zaharia
To add to this, we can add a stable interface at any time if the original one was 
marked as unstable; we wouldn’t have to wait until 4.0. For example, we had a lot 
of APIs that were experimental in 2.0 and were then stabilized in later 2.x 
releases.
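The pattern described here can be sketched as follows (illustrative code with hypothetical names - ScanV1, ScanV2, and the @Unstable marker are not Spark's actual annotations or interfaces): an interface shipped as unstable can gain a stable counterpart in a minor release, leaving existing callers untouched.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.util.Arrays;
import java.util.List;

// Marker conveying "no compatibility guarantee", in the spirit of Spark's
// @Experimental/@Unstable annotations (the name here is illustrative).
@Retention(RetentionPolicy.RUNTIME)
@interface Unstable {}

// The 2.0-era API: usable, but callers accept that it may change.
@Unstable
interface ScanV1 {
    String[] readAll();
}

// Added later in a minor release: a stable interface alongside the unstable
// one. ScanV1 callers keep compiling; new code targets ScanV2.
interface ScanV2 {
    List<String> read();
}

// One implementation can serve both interfaces during the transition.
class InMemoryScan implements ScanV1, ScanV2 {
    public String[] readAll() { return new String[] {"a", "b"}; }
    public List<String> read() { return Arrays.asList(readAll()); }
}
```

Because the stable interface is purely additive, no existing source breaks, which is why this does not have to wait for a major release.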

Matei

> On Feb 26, 2019, at 5:12 PM, Reynold Xin  wrote:
> 
> We will have to fix that before we declare DSv2 stable, because 
> InternalRow is not a stable API. We don’t necessarily need to do it in 3.0. 
> 

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-26 Thread Reynold Xin
We will have to fix that before we declare DSv2 stable, because
InternalRow is not a stable API. We don’t necessarily need to do it in 3.0.
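The concern can be sketched as follows (illustrative code; InternalishRow, StableRow, and RowFacade are hypothetical names, not a proposal for the actual API): InternalRow exposes Spark's internal data encodings, so a stable replacement would be a public facade that decodes on access, letting the internal layout change freely between releases.

```java
import java.nio.charset.StandardCharsets;

// Engine-internal row: values kept in internal encodings (e.g. strings as
// UTF-8 byte arrays). Its layout may change between releases, which is what
// makes it unsafe as a public API.
class InternalishRow {
    final byte[][] utf8Strings;
    InternalishRow(byte[][] utf8Strings) { this.utf8Strings = utf8Strings; }
}

// Stable public facade: decodes internal encodings on access. Data sources
// written against this interface survive changes to InternalishRow.
interface StableRow {
    String getString(int ordinal);
}

class RowFacade implements StableRow {
    private final InternalishRow internal;
    RowFacade(InternalishRow internal) { this.internal = internal; }
    public String getString(int ordinal) {
        return new String(internal.utf8Strings[ordinal], StandardCharsets.UTF_8);
    }
}
```

The cost is that every internal operator would have to produce or adapt to the new representation, which is why the thread treats a replacement as a large internal change.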

On Tue, Feb 26, 2019 at 5:10 PM Matt Cheah  wrote:

> Will that then require an API break down the line? Do we save that for
> Spark 4?

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-26 Thread Matt Cheah
Will that then require an API break down the line? Do we save that for Spark 4?

 

-Matt Cheah

 



Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-26 Thread Ryan Blue
That's a good question.

While I'd love to have a solution for that, I don't think it is a good idea
to delay DSv2 until we have one. That is going to require a lot of internal
changes and I don't see how we could make the release date if we are
including an InternalRow replacement.

On Tue, Feb 26, 2019 at 4:41 PM Matt Cheah  wrote:

> Reynold made a note earlier about a proper Row API that isn’t InternalRow
> – is that still on the table?
>


-- 
Ryan Blue
Software Engineer
Netflix


Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-26 Thread Ryan Blue
Thanks for bumping this, Matt. I think we can have the discussion here to
clarify exactly what we’re committing to and then have a vote thread once
we’re agreed.

Getting back to the DSv2 discussion, I think we have a good handle on what
would be added:

   - Plugin system for catalogs
   - TableCatalog interface (I’ll start a vote thread for this SPIP shortly)
   - TableCatalog implementation backed by SessionCatalog that can load v2
   tables
   - Resolution rule to load v2 tables using the new catalog
   - CTAS logical and physical plan nodes
   - Conversions from SQL parsed logical plans to v2 logical plans

Initially, this will always use the v2 catalog backed by SessionCatalog to
avoid dependence on the multi-catalog work. All of those are already
implemented and working, so I think it is reasonable that we can get them
in.
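The first three items above can be sketched together (hypothetical names and signatures only; the real interfaces are defined by the TableCatalog SPIP under vote): a TableCatalog-style interface, an implementation served from a session catalog's table registry, and a name-keyed plugin registry.

```java
import java.util.HashMap;
import java.util.Map;

// A table handle; real tables would also carry schema, partitioning, etc.
interface Table {
    String name();
}

// TableCatalog-style interface: resolve, create, and drop tables by name.
interface TableCatalog {
    Table loadTable(String ident);
    Table createTable(String ident);
    boolean dropTable(String ident);
}

// Stand-in for the SessionCatalog-backed implementation: v2 calls are
// answered from an in-memory "session" table registry.
class SessionBackedCatalog implements TableCatalog {
    private final Map<String, Table> sessionTables = new HashMap<>();
    public Table loadTable(String ident) { return sessionTables.get(ident); }
    public Table createTable(String ident) {
        Table t = () -> ident;  // Table has a single method, so a lambda works
        sessionTables.put(ident, t);
        return t;
    }
    public boolean dropTable(String ident) {
        return sessionTables.remove(ident) != null;
    }
}

// Plugin system: catalogs are registered by name and looked up during
// analysis/resolution.
class Catalogs {
    private static final Map<String, TableCatalog> registry = new HashMap<>();
    static void register(String name, TableCatalog catalog) { registry.put(name, catalog); }
    static TableCatalog get(String name) { return registry.get(name); }
}
```

A resolution rule would then ask the registry for the right catalog and call loadTable to produce a v2 relation.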

Then we can consider a few stretch goals:

   - Get in as much DDL as we can. I think create and drop table should be
   easy.
   - Multi-catalog identifier parsing and multi-catalog support

If we get those last two in, it would be great. We can make the call closer
to release time. Does anyone want to change this set of work?
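For the multi-catalog stretch goal, the identifier-parsing piece amounts to splitting a dotted name such as catalog.db.table and falling back to the default session catalog when the first part is not a registered catalog. A sketch with hypothetical names (real parsing also has to handle quoting with backticks, which this ignores):

```java
import java.util.Arrays;
import java.util.Set;

// Parsed identifier: which catalog owns the table, plus the remaining parts.
class CatalogIdentifier {
    final String catalog;     // resolved catalog name
    final String[] nameParts; // e.g. ["db", "table"]
    CatalogIdentifier(String catalog, String[] nameParts) {
        this.catalog = catalog;
        this.nameParts = nameParts;
    }
}

class IdentifierParser {
    // Catalogs known to the plugin registry; anything else is treated as a
    // namespace under the default catalog.
    private final Set<String> knownCatalogs;
    private final String defaultCatalog;

    IdentifierParser(Set<String> knownCatalogs, String defaultCatalog) {
        this.knownCatalogs = knownCatalogs;
        this.defaultCatalog = defaultCatalog;
    }

    CatalogIdentifier parse(String identifier) {
        String[] parts = identifier.split("\\.");
        // If the first part names a registered catalog, it is the catalog;
        // otherwise the whole identifier resolves in the default catalog.
        if (parts.length > 1 && knownCatalogs.contains(parts[0])) {
            return new CatalogIdentifier(
                parts[0], Arrays.copyOfRange(parts, 1, parts.length));
        }
        return new CatalogIdentifier(defaultCatalog, parts);
    }
}
```

This fallback is what keeps existing two-part identifiers resolving in the session catalog, so current queries are unaffected.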


-- 
Ryan Blue
Software Engineer
Netflix


Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-26 Thread Matt Cheah
Reynold made a note earlier about a proper Row API that isn’t InternalRow – is 
that still on the table?

 

-Matt Cheah

 



Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-26 Thread Matt Cheah
What would then be the next steps we'd take to collectively decide on plans and 
timelines moving forward? Might I suggest scheduling a conference call with 
appropriate PMCs to put our ideas together? Maybe such a discussion can take 
place at next week's meeting? Or do we need to have a separate formalized 
voting thread which is guided by a PMC?

My suggestion is to try to make concrete steps forward and to avoid letting 
this slip through the cracks.

I also think there would be merits to having a project plan and estimates 
around how long each of the features we want to complete is going to take to 
implement and review.

-Matt Cheah

On 2/24/19, 3:05 PM, "Sean Owen"  wrote:

Sure, I don't read anyone making these statements though? Let's assume
good intent, that "foo should happen" as "my opinion as a member of
the community, which is not solely up to me, is that foo should
happen". I understand it's possible for a person to make their opinion
over-weighted; this whole style of decision making assumes good actors
and doesn't optimize against bad ones. Not that it can't happen, just
not seeing it here.

I have never seen any vote on a feature list, by a PMC or otherwise.
We can do that if really needed I guess. But that also isn't the
authoritative process in play here, in contrast.

If there's not a more specific subtext or issue here, which is fine to
say (on private@ if it's sensitive or something), yes, let's move on
in good faith.

On Sun, Feb 24, 2019 at 3:45 PM Mark Hamstra  
wrote:
> There is nothing wrong with individuals advocating for what they think 
should or should not be in Spark 3.0, nor should anyone shy away from 
explaining why they think delaying the release for some reason is or isn't a 
good idea. What is a problem, or is at least something that I have a problem 
with, are declarative, pseudo-authoritative statements that 3.0 (or some other 
release) will or won't contain some feature, API, etc. or that some issue is or 
is not blocker or worth delaying for. When the PMC has not voted on such 
issues, I'm often left thinking, "Wait... what? Who decided that, or where did 
that decision come from?"





Re: Request review for long-standing PRs

2019-02-26 Thread Arun Mahadevan
Yes, I agree that it's a valid concern, one that leads individual contributors
to give up on new ideas or major improvements.

On Tue, 26 Feb 2019 at 15:24, Jungtaek Lim  wrote:

> Adding one more: it implicitly leads individual contributors to give up on
> challenging, major things and to focus only on minor things. That may even
> help the project, but not in the long run. We don't have a roadmap put up on
> the wall that lets the whole community share the load together, so individual
> contributors take on a lot of risk when putting in major effort - it must not
> conflict with what others have been doing privately, and it is accepted only
> after they put numerous efforts into design and build a POC.
>
> On Wed, Feb 27, 2019 at 8:14 AM, Jungtaek Lim wrote:
>
>> Thanks, Sean, as always, for sharing your thoughts quickly!
>>
>> I agree with most of the points, except "they add a lot of code and
>> complexity relative to benefit", since no one can weigh something before at
>> least taking a quick review. IMHO, anyone who thinks so should speak up (I
>> know it's hard, and a chance to be blamed, but better than giving
>> meaningless hope - and yeah, I admit I might be the one doing that in
>> another project) and see how others weigh in, rather than letting the PR be
>> put aside and ignored.
>>
>> I guess my ask is already simple and targeted, since I've only mentioned
>> the SS area - there aren't many committers who can review the SS area. One
>> thing to consider: I have PRs in other areas as well, and I don't have this
>> issue there. My reasons for posting to the dev mailing list instead of
>> periodically pinging in the Github PRs are 1) pinging in a PR just doesn't
>> work, and 2) to let others - hopefully PMC members - see the lack of review
>> activity in the SS area and lead some action.
>>
>> On Wed, Feb 27, 2019 at 7:57 AM, Sean Owen wrote:
>>
>>> Those aren't bad changes, but they add a lot of code and complexity
>>> relative to benefit. I think it's positive that you've gotten people
>>> to spend time reviewing them, quite a lot. I don't know whether they
>>> should be merged. This isn't a 'bug', though; not all changes should be
>>> committed. Simple and targeted is much easier to say yes to, because
>>> here you are implicitly asking a lot of people to assume responsibility
>>> for your change.
>>>
>>> On Tue, Feb 26, 2019 at 4:38 PM Jungtaek Lim  wrote:
>>> >
>>> > Hi devs,
>>> >
>>> > sorry to bring this to the mailing list again, but, you know, pinging in
>>> a Github PR just doesn't work.
>>> >
>>> > I have long-standing PRs (created last year) in the SS area which have
>>> already gotten over 100 comments (so the community and I have already put
>>> lots of effort in) but which, unfortunately, have made no progress toward
>>> being merged for lack of committers' attention.
>>> >
>>> > - SPARK-20568 [1] : Provide option to clean up completed files in
>>> streaming query
>>> > - SPARK-25151 [2] : Apply Apache Commons Pool to KafkaDataConsumer
>>> >
>>> > In my experience on previous PRs (including other areas), merging won't
>>> take more than a month, regardless of the size of the code diff, once
>>> committer(s) focus on the PR and review it.
>>> >
>>> > Thanks,
>>> > Jungtaek Lim (HeartSaVioR)
>>> >
>>> > ps. I can accept that all the committers in the SS area may be busy
>>> (which might simply show that the SS area lacks committers), but I don't
>>> accept that they're all involved in DSv2 and that DSv2 is the first thing
>>> to focus on. I haven't seen them among the participants in the DSv2
>>> discussions, and most PRs in the SS area are orthogonal to DSv2, so I
>>> wonder why we try to couple the SS area to DSv2 and restrict its evolution.
>>> >
>>> > ps2. Some of the above is part of the previous mail thread "Plan on
>>> Structured Streaming in next major/minor release?" [3]
>>> >
>>> > I'd still like to address other items on the list (or new ones), but
>>> without fast feedback that won't be possible. (Maintaining multiple
>>> long-lasting PRs makes contributors very tired, and is sometimes worse
>>> than giving a -1 with a reason for rejection.)
>>> >
>>> > 1. https://github.com/apache/spark/pull/22952
>>> > 2. https://github.com/apache/spark/pull/22138
>>> > 3.
>>> https://lists.apache.org/thread.html/e6c8a530c998c4a2bb12b167f815d3726d155ce722047957e32689df@%3Cdev.spark.apache.org%3E
>>>
>>


Re: Request review for long-standing PRs

2019-02-26 Thread Sean Owen
Mr Torres can you give these a pass please?

On Tue, Feb 26, 2019 at 4:38 PM Jungtaek Lim  wrote:
>
> Hi devs,
>
> sorry to bring this again to mailing list, but you know, ping in Github PR 
> just doesn't work.
>
> I have long-stand (created in last year) PRs on SS area which already got 
> over 100 comments (so community and me already put lots of efforts) but no 
> progress in point of view for being merged unfortunately lack of committers' 
> attention.
>
> - SPARK-20568 [1] : Provide option to clean up completed files in streaming 
> query
> - SPARK-25151 [2] : Apply Apache Commons Pool to KafkaDataConsumer
>
> According to my experiences on previous PRs (including other areas), it won't 
> take more than 1 months regardless of size of code diff to merge once 
> committer(s) gave a focus on PR and reviewed.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> ps. I may agree all committers in SS area could be busy (It might clearly 
> represent SS area lacks committers), but I may not agree they're involved in 
> DSv2 and DSv2 is the first thing to focus. I haven't seen anyone in 
> participants on DSv2 discussions, and most of PRs in SS area is parallel to 
> DSv2 so I'm wondering why we try to couple SS area with DSv2 and restrict its 
> evolution.
>
> ps2. Some of above is the part of previous mail thread regarding "Plan on 
> Structured Streaming in next major/minor release?" [3]
>
> I'm sure I still would like to address other items in the list (or new), but 
> without fast feedback it would not be possible. (Maintaining multiple of 
> long-lasting PRs make contributors very tired, and sometimes worse than 
> giving -1 and providing reason to reject.)
>
> 1. https://github.com/apache/spark/pull/22952
> 2. https://github.com/apache/spark/pull/22138
> 3. 
> https://lists.apache.org/thread.html/e6c8a530c998c4a2bb12b167f815d3726d155ce722047957e32689df@%3Cdev.spark.apache.org%3E

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Request review for long-standing PRs

2019-02-26 Thread Jungtaek Lim
Adding one more point: it implicitly leads individual contributors to give
up on challenging major work and focus only on minor things, which may help
the project in the short term, but not in the long run. We don't have a
roadmap posted where the whole community can share the load together, so
individual contributors take on a lot of risk when putting in major effort -
it shouldn't conflict with what others have been doing privately, and it is
accepted only after numerous efforts on design and a POC.

On Wed, Feb 27, 2019 at 8:14 AM, Jungtaek Lim wrote:

> Thanks Sean, as always, to share your thought quickly!
>
> I agree most of points, except "they add a lot of code and complexity
> relative to benefit", since no one can weigh on something before at least
> taking quick review. IMHO if someone would think so, better to speak (I
> know it's hard and being a chance to be blamed, but better than giving
> meaningless hope, yeah I admit I might be another one to do that in another
> project) and see how others will weigh, rather than let it put aside and
> ignore.
>
> I guess my target is already simple and targeted since I've only mentioned
> SS area - there're not much committers who can review SS area. Thing to
> consider is, I have PRs in other areas as well, and I don't have issue on
> these areas. The reason of posting it to dev mailing list instead of
> periodically ping in Github PR is, 1) ping in PR just doesn't work 2) let
> others - hopefully PMC members - indicate a lack on activity on SS area,
> and lead some action.
>
> On Wed, Feb 27, 2019 at 7:57 AM, Sean Owen wrote:
>
>> Those aren't bad changes, but they add a lot of code and complexity
>> relative to benefit. I think it's positive that you've gotten people
>> to spend time reviewing them, quite a lot. I don't know whether they
>> should be merged. This isn't a 'bug' though; not all changes should be
>> committed. Simple and targeted is much easier to say yes to, because
>> you implicitly here ask a lot of people to assume responsibility for
>> your change.
>>
>> On Tue, Feb 26, 2019 at 4:38 PM Jungtaek Lim  wrote:
>> >
>> > Hi devs,
>> >
>> > sorry to bring this again to mailing list, but you know, ping in Github
>> PR just doesn't work.
>> >
>> > I have long-stand (created in last year) PRs on SS area which already
>> got over 100 comments (so community and me already put lots of efforts) but
>> no progress in point of view for being merged unfortunately lack of
>> committers' attention.
>> >
>> > - SPARK-20568 [1] : Provide option to clean up completed files in
>> streaming query
>> > - SPARK-25151 [2] : Apply Apache Commons Pool to KafkaDataConsumer
>> >
>> > According to my experiences on previous PRs (including other areas), it
>> won't take more than 1 months regardless of size of code diff to merge once
>> committer(s) gave a focus on PR and reviewed.
>> >
>> > Thanks,
>> > Jungtaek Lim (HeartSaVioR)
>> >
>> > ps. I may agree all committers in SS area could be busy (It might
>> clearly represent SS area lacks committers), but I may not agree they're
>> involved in DSv2 and DSv2 is the first thing to focus. I haven't seen
>> anyone in participants on DSv2 discussions, and most of PRs in SS area is
>> parallel to DSv2 so I'm wondering why we try to couple SS area with DSv2
>> and restrict its evolution.
>> >
>> > ps2. Some of above is the part of previous mail thread regarding "Plan
>> on Structured Streaming in next major/minor release?" [3]
>> >
>> > I'm sure I still would like to address other items in the list (or
>> new), but without fast feedback it would not be possible. (Maintaining
>> multiple of long-lasting PRs make contributors very tired, and sometimes
>> worse than giving -1 and providing reason to reject.)
>> >
>> > 1. https://github.com/apache/spark/pull/22952
>> > 2. https://github.com/apache/spark/pull/22138
>> > 3.
>> https://lists.apache.org/thread.html/e6c8a530c998c4a2bb12b167f815d3726d155ce722047957e32689df@%3Cdev.spark.apache.org%3E
>>
>


Re: Request review for long-standing PRs

2019-02-26 Thread Jungtaek Lim
Thanks Sean, as always, for sharing your thoughts quickly!

I agree with most of your points, except "they add a lot of code and
complexity relative to benefit", since no one can weigh something before at
least taking a quick review. IMHO, if someone thinks so, it's better to
speak up (I know it's hard and invites blame, but it's better than giving
meaningless hope - yes, I admit I might be the one doing that in another
project) and see how others weigh in, rather than letting it be put aside
and ignored.

I guess my target is already simple and targeted, since I've only mentioned
the SS area - there aren't many committers who can review the SS area. One
thing to consider: I have PRs in other areas as well, and I don't have this
issue in those areas. The reason for posting to the dev mailing list
instead of periodically pinging in the GitHub PR is: 1) pinging in a PR
just doesn't work, and 2) it lets others - hopefully PMC members - note the
lack of activity in the SS area and lead to some action.

On Wed, Feb 27, 2019 at 7:57 AM, Sean Owen wrote:

> Those aren't bad changes, but they add a lot of code and complexity
> relative to benefit. I think it's positive that you've gotten people
> to spend time reviewing them, quite a lot. I don't know whether they
> should be merged. This isn't a 'bug' though; not all changes should be
> committed. Simple and targeted is much easier to say yes to, because
> you implicitly here ask a lot of people to assume responsibility for
> your change.
>
> On Tue, Feb 26, 2019 at 4:38 PM Jungtaek Lim  wrote:
> >
> > Hi devs,
> >
> > sorry to bring this again to mailing list, but you know, ping in Github
> PR just doesn't work.
> >
> > I have long-stand (created in last year) PRs on SS area which already
> got over 100 comments (so community and me already put lots of efforts) but
> no progress in point of view for being merged unfortunately lack of
> committers' attention.
> >
> > - SPARK-20568 [1] : Provide option to clean up completed files in
> streaming query
> > - SPARK-25151 [2] : Apply Apache Commons Pool to KafkaDataConsumer
> >
> > According to my experiences on previous PRs (including other areas), it
> won't take more than 1 months regardless of size of code diff to merge once
> committer(s) gave a focus on PR and reviewed.
> >
> > Thanks,
> > Jungtaek Lim (HeartSaVioR)
> >
> > ps. I may agree all committers in SS area could be busy (It might
> clearly represent SS area lacks committers), but I may not agree they're
> involved in DSv2 and DSv2 is the first thing to focus. I haven't seen
> anyone in participants on DSv2 discussions, and most of PRs in SS area is
> parallel to DSv2 so I'm wondering why we try to couple SS area with DSv2
> and restrict its evolution.
> >
> > ps2. Some of above is the part of previous mail thread regarding "Plan
> on Structured Streaming in next major/minor release?" [3]
> >
> > I'm sure I still would like to address other items in the list (or new),
> but without fast feedback it would not be possible. (Maintaining multiple
> of long-lasting PRs make contributors very tired, and sometimes worse than
> giving -1 and providing reason to reject.)
> >
> > 1. https://github.com/apache/spark/pull/22952
> > 2. https://github.com/apache/spark/pull/22138
> > 3.
> https://lists.apache.org/thread.html/e6c8a530c998c4a2bb12b167f815d3726d155ce722047957e32689df@%3Cdev.spark.apache.org%3E
>


Re: Request review for long-standing PRs

2019-02-26 Thread Sean Owen
Those aren't bad changes, but they add a lot of code and complexity
relative to the benefit. I think it's positive that you've gotten people
to spend time reviewing them - quite a lot. I don't know whether they
should be merged. This isn't a 'bug', though; not all changes should be
committed. Simple and targeted is much easier to say yes to, because here
you are implicitly asking a lot of people to assume responsibility for
your change.

On Tue, Feb 26, 2019 at 4:38 PM Jungtaek Lim  wrote:
>
> Hi devs,
>
> sorry to bring this again to mailing list, but you know, ping in Github PR 
> just doesn't work.
>
> I have long-stand (created in last year) PRs on SS area which already got 
> over 100 comments (so community and me already put lots of efforts) but no 
> progress in point of view for being merged unfortunately lack of committers' 
> attention.
>
> - SPARK-20568 [1] : Provide option to clean up completed files in streaming 
> query
> - SPARK-25151 [2] : Apply Apache Commons Pool to KafkaDataConsumer
>
> According to my experiences on previous PRs (including other areas), it won't 
> take more than 1 months regardless of size of code diff to merge once 
> committer(s) gave a focus on PR and reviewed.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> ps. I may agree all committers in SS area could be busy (It might clearly 
> represent SS area lacks committers), but I may not agree they're involved in 
> DSv2 and DSv2 is the first thing to focus. I haven't seen anyone in 
> participants on DSv2 discussions, and most of PRs in SS area is parallel to 
> DSv2 so I'm wondering why we try to couple SS area with DSv2 and restrict its 
> evolution.
>
> ps2. Some of above is the part of previous mail thread regarding "Plan on 
> Structured Streaming in next major/minor release?" [3]
>
> I'm sure I still would like to address other items in the list (or new), but 
> without fast feedback it would not be possible. (Maintaining multiple of 
> long-lasting PRs make contributors very tired, and sometimes worse than 
> giving -1 and providing reason to reject.)
>
> 1. https://github.com/apache/spark/pull/22952
> 2. https://github.com/apache/spark/pull/22138
> 3. 
> https://lists.apache.org/thread.html/e6c8a530c998c4a2bb12b167f815d3726d155ce722047957e32689df@%3Cdev.spark.apache.org%3E




Re: SPIP: Accelerator-aware Scheduling

2019-02-26 Thread Xiangrui Meng
In case there are issues visiting Google doc, I attached PDF files to the
JIRA.

On Tue, Feb 26, 2019 at 7:41 AM Xingbo Jiang  wrote:

> Hi all,
>
> I want to send a revised SPIP on implementing Accelerator (GPU)-aware
> Scheduling. It improves Spark by making it aware of GPUs exposed by cluster
> managers, and hence Spark can match GPU resources with user task requests
> properly. If you have scenarios that need to run workloads(DL/ML/Signal
> Processing etc.) on Spark cluster with GPU nodes, please help review and
> check how it fits into your use cases. Your feedback would be greatly
> appreciated!
>
> # Links to SPIP and Product doc:
>
> * Jira issue for the SPIP:
> https://issues.apache.org/jira/browse/SPARK-24615
> * Google Doc:
> https://docs.google.com/document/d/1C4J_BPOcSCJc58HL7JfHtIzHrjU0rLRdQM3y7ejil64/edit?usp=sharing
> * Product Doc:
> https://docs.google.com/document/d/12JjloksHCdslMXhdVZ3xY5l1Nde3HRhIrqvzGnK_bNE/edit?usp=sharing
>
> Thank you!
>
> Xingbo
>


Request review for long-standing PRs

2019-02-26 Thread Jungtaek Lim
Hi devs,

Sorry to bring this to the mailing list again, but, you know, pinging in a
GitHub PR just doesn't work.

I have long-standing PRs (created last year) in the SS area which have
already gotten over 100 comments (so the community and I have already put
in lots of effort), but unfortunately no progress toward being merged, due
to a lack of committers' attention.

- SPARK-20568 [1] : Provide option to clean up completed files in streaming
query
- SPARK-25151 [2] : Apply Apache Commons Pool to KafkaDataConsumer

In my experience with previous PRs (including in other areas), merging
won't take more than one month, regardless of the size of the code diff,
once committer(s) focus on the PR and review it.

Thanks,
Jungtaek Lim (HeartSaVioR)

ps. I can agree that all committers in the SS area may be busy (which might
clearly show that the SS area lacks committers), but I can't agree that
they're all involved in DSv2 and that DSv2 is the first thing to focus on.
I haven't seen any of them among the participants in the DSv2 discussions,
and most PRs in the SS area are independent of DSv2, so I wonder why we try
to couple the SS area with DSv2 and restrict its evolution.

ps2. Some of the above is part of the previous mail thread "Plan on
Structured Streaming in next major/minor release?" [3]

I would still like to address the other items in the list (or new ones),
but without fast feedback that won't be possible. (Maintaining multiple
long-lasting PRs makes contributors very tired, and is sometimes worse than
giving a -1 with a reason for rejection.)

1. https://github.com/apache/spark/pull/22952
2. https://github.com/apache/spark/pull/22138
3.
https://lists.apache.org/thread.html/e6c8a530c998c4a2bb12b167f815d3726d155ce722047957e32689df@%3Cdev.spark.apache.org%3E


Re: [build system] jenkins pull request builds not triggering

2019-02-26 Thread shane knapp
jenkins is churning through a lot of github updates, and i'm finally seeing
the backlog of pull request builds starting.

i'll keep an eye on things over the afternoon.



On Tue, Feb 26, 2019 at 12:26 PM shane knapp  wrote:

> restarted jenkins, staring at logs.  will report back when things look
> good.
>
> On Tue, Feb 26, 2019 at 12:22 PM shane knapp  wrote:
>
>> investigating, and this will most likely require a jenkins restart.
>>
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [SS] Allowing stream Sink metadata as part of checkpoint?

2019-02-26 Thread Jungtaek Lim
I understand the reason for storing information along with the data for
transactional commits, but that mostly makes sense if we store outputs
together with all the necessary checkpoint information in a transactional
manner. Spark doesn't store the query checkpoint along with the outputs.

I feel this comes down to who owns the sink metadata - the query, or the
sink output. If the metadata contains a batch id and is used for
deduplication, it is more tightly coupled to the query, and it would be a
disaster to reuse that metadata in another query, or when the checkpoint
and the sink metadata get out of sync.

One example is SPARK-26411 [1], as I mentioned - if we remove the
checkpoint as suggested and rerun the query against the same sink output,
data loss happens. Deduplication relies on the output of each batch always
being the same, but once we remove the checkpoint, that assumption is
simply broken.

I also see the other case, where the sink output should own the sink
metadata - FileStreamSink - but by its nature that metadata can't be purged
(a query cannot purge it, since it doesn't know which other queries are
also accessing it), and some end users who suffered from growing metadata
reported it in SPARK-24295 [2].

So I'm considering both issues together and looking for a way to deal with
them.

-Jungtaek Lim (HeartSaVioR)

1. https://issues.apache.org/jira/browse/SPARK-26411
2. https://issues.apache.org/jira/browse/SPARK-24295
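As a concrete illustration of the deduplication discussed above - a sink
that commits the last seen batch id atomically with its data, so a replayed
micro-batch is skipped - here is a minimal pure-Python sketch. The names
are hypothetical; this is not Spark's actual Sink API.

```python
class TransactionalSink:
    """Toy sink: commits the batch id together with the data, so a
    replayed batch (same id) can be detected and discarded."""

    def __init__(self):
        self.store = []          # stands in for the external store
        self.last_batch_id = -1  # persisted alongside the data in a real sink

    def add_batch(self, batch_id, rows):
        # After a failure, the engine re-delivers an already-committed batch.
        if batch_id <= self.last_batch_id:
            return False  # duplicate delivery: skip, output stays exactly-once
        # In a real sink, these two writes happen in a single transaction.
        self.store.extend(rows)
        self.last_batch_id = batch_id
        return True
```

Note how removing the checkpoint (the SPARK-26411 scenario) would reset the
batch ids the query produces, so the `batch_id <= last_batch_id` check
wrongly classifies fresh data as a replay - the data-loss hazard described
above.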

On Tue, Feb 26, 2019 at 4:28 PM, Arun Mahadevan wrote:

> Unless its some sink metadata to be maintained by the framework (e.g sink
> state that needs to be passed back to the sink etc), would it make sense
> to keep it under the checkpoint dir ?
>
> Maybe I am missing the motivation of the proposed approach but I guess
> the sink mostly needs to store the last seen batchId to discard duplicate
> data during a batch replay. It would be ideal
> for the sink to store this information in the external store (along with
> the data) for de-duplication to work correctly.
>
> Thanks,
> Arun
>
>
>
> On Mon, 25 Feb 2019 at 22:13, Jungtaek Lim  wrote:
>
>> Hi devs,
>>
>> I was about to give this a try, but since it relates to DSv2, I decided
>> to start a new thread before the actual work. I also don't think this
>> should be part of the DSv2 discussion, since the change would be minor.
>>
>> While dealing with SPARK-24295 [1] and SPARK-26411 [2], I felt the need
>> for sink metadata to participate in the checkpoint directory; but unlike
>> a source, whose metadata directory is provided as a subdirectory of the
>> checkpoint directory, a sink doesn't receive its own metadata directory.
>>
>> For example, FileStreamSink creates its metadata directory in the output
>> directory - which is somewhat intentional, to share it between queries -
>> but sometimes we may want it coupled with the query checkpoint.
>>
>> What do you think about passing a metadata path to the sink (we have
>> only one per query) so that the sink metadata can be coupled with the
>> query checkpoint?
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> 1. https://issues.apache.org/jira/browse/SPARK-24295
>> 2. https://issues.apache.org/jira/browse/SPARK-26411
>>
>>


Re: [build system] jenkins pull request builds not triggering

2019-02-26 Thread shane knapp
restarted jenkins, staring at logs.  will report back when things look good.

On Tue, Feb 26, 2019 at 12:22 PM shane knapp  wrote:

> investigating, and this will most likely require a jenkins restart.
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: PR tests not running?

2019-02-26 Thread shane knapp
yeah, i'm on it.

On Tue, Feb 26, 2019 at 11:39 AM Xiao Li  wrote:

> Thanks for reporting it! It sounds like Shane is working on it. I manually
> triggered the test for the PR https://github.com/apache/spark/pull/23894
> .
>
> Cheers,
>
> Xiao
>
>
> On Tue, Feb 26, 2019 at 11:33 AM, Bruce Robbins wrote:
>
>> Sorry for stating what is likely obvious, but PR tests don't appear to be
>> running. Last one started was around 2AM.
>>
>

-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


[build system] jenkins pull request builds not triggering

2019-02-26 Thread shane knapp
investigating, and this will most likely require a jenkins restart.

-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: PR tests not running?

2019-02-26 Thread Xiao Li
Thanks for reporting it! It sounds like Shane is working on it. I manually
triggered the test for the PR https://github.com/apache/spark/pull/23894 .

Cheers,

Xiao


On Tue, Feb 26, 2019 at 11:33 AM, Bruce Robbins wrote:

> Sorry for stating what is likely obvious, but PR tests don't appear to be
> running. Last one started was around 2AM.
>


PR tests not running?

2019-02-26 Thread Bruce Robbins
Sorry for stating what is likely obvious, but PR tests don't appear to be
running. Last one started was around 2AM.


Re: [VOTE] SPIP: Identifiers for multi-catalog Spark

2019-02-26 Thread Ryan Blue
Hi everyone,

With 12 +1 votes and no +0 or -1 votes, this SPIP passes. Thanks to
everyone that participated in the discussions and voted!

rb

On Thu, Feb 21, 2019 at 12:14 AM Xiao Li  wrote:

> +1 This is in the right direction. The resolution rules and catalog APIs
> need more discussion when we implement it.
>
> In the current stage, we can disallow the runtime creation of the catalog.
> This will complicate the name resolution in a multi-session environment.
> For example, when one user creates a catalog in one session, the other
> users' queries might return different results because the tables are
> differently resolved. We might need to investigate how the other systems
> deal with this.
>
> Cheers,
>
> Xiao
>
> On Wed, Feb 20, 2019 at 12:54 AM, Takeshi Yamamuro wrote:
>
>> +1
>>
>> On Wed, Feb 20, 2019 at 4:59 PM JackyLee  wrote:
>>
>>> +1
>>>
>>>
>>>
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>

-- 
Ryan Blue
Software Engineer
Netflix


SPIP: Accelerator-aware Scheduling

2019-02-26 Thread Xingbo Jiang
Hi all,

I want to send a revised SPIP on implementing Accelerator (GPU)-aware
scheduling. It improves Spark by making it aware of GPUs exposed by cluster
managers, so Spark can match GPU resources with user task requests
properly. If you have scenarios that need to run workloads (DL/ML/signal
processing, etc.) on a Spark cluster with GPU nodes, please help review it
and check how it fits your use cases. Your feedback would be greatly
appreciated!

# Links to SPIP and Product doc:

* Jira issue for the SPIP: https://issues.apache.org/jira/browse/SPARK-24615
* Google Doc:
https://docs.google.com/document/d/1C4J_BPOcSCJc58HL7JfHtIzHrjU0rLRdQM3y7ejil64/edit?usp=sharing
* Product Doc:
https://docs.google.com/document/d/12JjloksHCdslMXhdVZ3xY5l1Nde3HRhIrqvzGnK_bNE/edit?usp=sharing

Thank you!

Xingbo
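To make the "match GPU resources with user task requests" idea concrete,
here is a toy first-fit matcher in pure Python. It sketches only the
scheduling concept; the function and parameter names are invented and are
not the SPIP's proposed API.

```python
def schedule(tasks, executors):
    """Assign GPU-requesting tasks to executors with enough free GPUs.

    tasks: list of (task_id, gpus_needed) tuples.
    executors: dict of executor name -> free GPU count (mutated in place).
    Returns a list of (task_id, executor) assignments; tasks that don't
    fit anywhere are left pending (not assigned).
    """
    assignments = []
    for task_id, needed in tasks:
        for name in sorted(executors):   # deterministic first-fit order
            if executors[name] >= needed:
                executors[name] -= needed  # reserve the GPUs
                assignments.append((task_id, name))
                break
    return assignments
```

For example, with two executors offering 2 and 1 free GPUs, a task needing
2 GPUs lands on the first, a task needing 1 GPU lands on the second, and a
third task waits until GPUs free up - the kind of matching the SPIP asks
cluster managers to expose enough information for.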


Re: Thoughts on dataframe cogroup?

2019-02-26 Thread Li Jin
Thank you both for the reply. Chris and I have very similar use cases for
cogroup.

One of the goals for groupby apply + pandas UDF was to avoid things like
collect list and reshaping data between Spark and Pandas. Cogroup feels
very similar and can be an extension to the groupby apply + pandas UDF
functionality.

I wonder if any PMC/committers have any thoughts/opinions on this?

On Tue, Feb 26, 2019 at 2:17 AM  wrote:

> Just to add to this: I’ve also implemented my own cogroup previously and
> would welcome a cogroup for dataframe.
>
> My specific use case was that I had a large amount of time series data.
> Spark has very limited support for time series (specifically as-of joins),
> but pandas has good support.
>
> My solution was to take my two dataframes and perform a group by and
> collect list on each. The resulting arrays could be passed into a udf where
> they could be marshaled into a couple of pandas dataframes and processed
> using pandas excellent time series functionality.
>
> If cogroup was available natively on dataframes this would have been a bit
> nicer. The ideal would have been some pandas udf version of cogroup that
> gave me a pandas dataframe for each spark dataframe in the cogroup!
>
> Chris
>
> On 26 Feb 2019, at 00:38, Jonathan Winandy 
> wrote:
>
> For info, our team has defined its own cogroup on dataframes in the past
> on different projects, using different methods (rdd[row]-based or
> union-all + collect-list-based).
>
> I might be biased, but I find the approach very useful in projects for
> simplifying and speeding up transformations, and for removing a lot of
> intermediate stages (distinct + join => just cogroup).
>
> Plus, Spark 2.4 introduced a lot of new operators for nested data. That's
> a win!
>
>
> On Thu, 21 Feb 2019, 17:38 Li Jin,  wrote:
>
>> I am wondering do other people have opinion/use case on cogroup?
>>
>> On Wed, Feb 20, 2019 at 5:03 PM Li Jin  wrote:
>>
>>> Alessandro,
>>>
>>> Thanks for the reply. I assume by "equi-join", you mean "equality full
>>> outer join".
>>>
>>> Two issues I see with an equality outer join are:
>>> (1) an equality outer join will give n * m rows for each key (n and m
>>> being the corresponding number of rows in df1 and df2 for each key)
>>> (2) the user needs to do some extra processing to transform the n * m
>>> rows back into the desired shape (two sub-dataframes with n and m rows)
>>>
>>> I think a full outer join is an inefficient way to implement cogroup. If
>>> the end goal is to have two separate dataframes for each key, why join
>>> them first and then un-join them?
>>>
>>>
>>>
>>> On Wed, Feb 20, 2019 at 5:52 AM Alessandro Solimando <
>>> alessandro.solima...@gmail.com> wrote:
>>>
 Hello,
 I fail to see how an equi-join on the key columns is different than the
 cogroup you propose.

 I think the accepted answer can shed some light:

 https://stackoverflow.com/questions/43960583/whats-the-difference-between-join-and-cogroup-in-apache-spark

 Now you apply an udf on each iterable, one per key value (obtained with
 cogroup).

 You can achieve the same by:
 1) join df1 and df2 on the key you want,
 2) apply "groupby" on such key
 3) finally apply a udaf (you can have a look here if you are not
 familiar with them
 https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html),
 that will process each group "in isolation".

 HTH,
 Alessandro

 On Tue, 19 Feb 2019 at 23:30, Li Jin  wrote:

> Hi,
>
> We have been using Pyspark's groupby().apply() quite a bit and it has
> been very helpful in integrating Spark with our existing pandas-heavy
> libraries.
>
> Recently, we have found more and more cases where groupby().apply() is
> not sufficient - In some cases, we want to group two dataframes by the 
> same
> key, and apply a function which takes two pd.DataFrame (also returns a
> pd.DataFrame) for each key. This feels very much like the "cogroup"
> operation in the RDD API.
>
> It would be great to be able to do sth like this: (not actual API,
> just to explain the use case):
>
> @pandas_udf(return_schema, ...)
> def my_udf(pdf1, pdf2):
>     # pdf1 and pdf2 are the subsets of the original dataframes that are
>     # associated with a particular key
>     result = ...  # some code that uses pdf1 and pdf2
>     return result
>
> df3  = cogroup(df1, df2, key='some_key').apply(my_udf)
>
> I have searched around this problem, and some people have suggested
> joining the tables first. However, it's often not the same pattern, and
> it's hard to get it to work using joins.
>
> I wonder what are people's thought on this?
>
> Li
>
>
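The contrast drawn in this thread - per-key group pairs from cogroup versus
the n * m rows of an equality full outer join - can be shown with a small
pure-Python sketch of the semantics. These are toy data structures, not the
proposed Spark API.

```python
from collections import defaultdict

def cogroup(left, right, key):
    """For each key, collect the whole group from each side: one entry per
    key holding (left_rows, right_rows) - the shape a cogroup UDF would
    receive as its two per-key dataframes."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[row[key]][0].append(row)
    for row in right:
        groups[row[key]][1].append(row)
    return dict(groups)

def equi_join(left, right, key):
    """Equality join: every left row pairs with every matching right row,
    so a key with n left rows and m right rows yields n * m output rows."""
    return [(l, r) for l in left for r in right if l[key] == r[key]]
```

With 2 left rows and 3 right rows for key "a", `equi_join` produces 6
flattened pairs that a UDF would have to reshape, while `cogroup` hands the
UDF the original 2-row and 3-row groups directly - the inefficiency the
thread points out about emulating cogroup via a join.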