Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-21 Thread Ryan Blue
In addition to logical plans, we need SQL support. That requires resolving
v2 tables from a catalog and a few other changes like separating v1 plans
from SQL parsing (see the earlier dev list thread). I’d also like to add
DDL operations for v2.

I think it also makes sense to add a new DF write API, as we discussed in
the sync as well. That way, users have an API they can start moving to that
always uses the v2 plans and behavior.
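
For illustration, here is a rough sketch of what a v2-only DataFrame write
API could look like. The method names (writeTo, append, overwrite, create,
partitionedBy) and the mapping to logical plans are assumptions about the
shape of the API under discussion, not a committed design:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("v2-write-sketch").getOrCreate()
import spark.implicits._

val df = spark.table("source_db.events")

// AppendData plan: fails analysis if the target table does not exist.
df.writeTo("catalog.db.events").append()

// OverwriteByExpression plan: replaces only the rows matching the condition.
df.writeTo("catalog.db.events").overwrite($"day" === "2019-02-21")

// CreateTableAsSelect plan: fails if the table already exists.
df.writeTo("catalog.db.events_copy").partitionedBy($"day").create()

The point is that each call maps to exactly one v2 logical plan, instead of
the SaveMode-based behavior of the v1 DataFrameWriter.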

Here are all the commands that we have implemented on top of the proposed
table catalog API; a usage sketch follows the list. We should be able to get
these working in upstream Spark fairly quickly.

   - CREATE TABLE [IF NOT EXISTS] …
   - CREATE TABLE … PARTITIONED BY …
   - CREATE TABLE … AS SELECT …
   - CREATE TABLE LIKE
   - ALTER TABLE …
      - ADD COLUMNS …
      - DROP COLUMNS …
      - ALTER COLUMN … TYPE
      - ALTER COLUMN … COMMENT
      - RENAME COLUMN … TO …
      - SET TBLPROPERTIES …
      - UNSET TBLPROPERTIES …
   - ALTER TABLE … RENAME TO …
   - DROP TABLE [IF EXISTS] …
   - DESCRIBE [FORMATTED|EXTENDED] …
   - SHOW CREATE TABLE …
   - SHOW TBLPROPERTIES
   - ALTER TABLE
   - REFRESH TABLE …
   - INSERT INTO …
   - INSERT OVERWRITE …
   - DELETE FROM … WHERE …
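
As a usage sketch, the commands above could be exercised from Scala through
spark.sql. This assumes multi-part identifiers and a v2 catalog registered
under the name "testcat"; both the catalog name and the table schema are
illustrative, not part of any agreed configuration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("v2-ddl-sketch").getOrCreate()

// Create, evolve, and populate a table through the v2 table catalog.
spark.sql("CREATE TABLE IF NOT EXISTS testcat.db.events (id BIGINT, data STRING, day STRING)")
spark.sql("ALTER TABLE testcat.db.events ADD COLUMNS (region STRING)")
spark.sql("INSERT INTO testcat.db.events VALUES (1, 'a', '2019-02-21', 'us-east')")

// CTAS with partitioning, then inspect and drop the new table.
spark.sql("CREATE TABLE testcat.db.events_by_day PARTITIONED BY (day) AS SELECT * FROM testcat.db.events")
spark.sql("DESCRIBE EXTENDED testcat.db.events").show(truncate = false)
spark.sql("DROP TABLE IF EXISTS testcat.db.events_by_day")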


On Thu, Feb 21, 2019 at 3:57 PM Matt Cheah  wrote:

> To evaluate the amount of work required to get Data Source V2 into Spark
> 3.0, we should have a list of all the specific SPIPs and patches that are
> pending that would constitute a successful and usable revamp of that API.
> Here are the ones I could find and know off the top of my head:
>
>1. Table Catalog API: https://issues.apache.org/jira/browse/SPARK-24252
>   1. In my opinion this is by far the most important API to get in,
>   but it’s also the most important API to give thorough thought and
>   evaluation.
>2. Remaining logical plans for CTAS, RTAS, DROP / DELETE, OVERWRITE:
>https://issues.apache.org/jira/browse/SPARK-24923 +
>https://issues.apache.org/jira/browse/SPARK-24253
>3. Catalogs for other entities, such as functions. Pluggable system
>for loading these.
>4. Multi-Catalog support -
>https://issues.apache.org/jira/browse/SPARK-25006
>5. Migration of existing sources to V2, particularly file sources like
>Parquet and ORC – requires #1 as discussed in yesterday’s meeting
>
>
>
> Can someone add to this list if we’re missing anything? It might also make
> sense to either assign a JIRA label or update the JIRA umbrella issues, if
> any; whatever mechanism works for being able to find all of these
> outstanding issues in one place.
>
>
>
> My understanding is that #1 is the most critical feature we need, and the
> feature that will go a long way towards allowing everything else to fall
> into place. #2 is also critical for external implementations of Data Source
> V2. I think we can afford to defer 3-5 to a future point release. But #1
> and #2 are also the features that have remained open for the longest time
> and we really need to move forward on these. Putting a target release for
> 3.0 will help in that regard.
>
>
>
> -Matt Cheah
>
>
>
> *From: *Ryan Blue 
> *Reply-To: *"rb...@netflix.com" 
> *Date: *Thursday, February 21, 2019 at 2:22 PM
> *To: *Matei Zaharia 
> *Cc: *Spark Dev List 
> *Subject: *Re: [DISCUSS] Spark 3.0 and DataSourceV2
>
>
>
> I'm all for making releases more often if we want. But this work could
> really use a target release to motivate getting it done. If we agree that
> it will block a release, then everyone is motivated to review and get the
> PRs in.
>
>
>
> If this work doesn't make it in the 3.0 release, I'm not confident that it
> will get done. Maybe we can have a release shortly after, but the timeline
> for these features -- that many of us need -- is nearly creeping into
> years. That's when alternatives start looking more likely to deliver. I'd
> rather see this work get in so we don't have to consider those
> alternatives, which is why I think this commitment is a good idea.
>
>
>
> I also would like to see multi-catalog support, but that is more
> reasonable to put off for a follow-up feature release, maybe 3.1.
>
>
>
> On Thu, Feb 21, 2019 at 1:45 PM Matei Zaharia 
> wrote:
>
> How large would the delay be? My 2 cents are that there’s nothing stopping
> us from making feature releases more often if we want to, so we shouldn’t
> see this as an “either delay 3.0 or release in >6 months” decision. If the
> work is likely to get in with a small delay and simplifies our work after
> 3.0 (e.g. we can get rid of older APIs), then the delay may be worth it.
> But if it would be a large delay, we should also weigh it against other
> things that are going to get delayed if 3.0 moves much later.
>
> It might also be better to propose a specific date to delay until, so
> people can still plan around when the release branch will likely be cut.
>
> Matei
>
> > On Feb 21, 2019, at 1:03 PM, Ryan Blue 
> wrote:
> >
> > Hi everyone,
> >
> > In the DSv2 sync last night, we had a discussion about 

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-21 Thread Matt Cheah
To evaluate the amount of work required to get Data Source V2 into Spark 3.0, 
we should have a list of all the specific SPIPs and patches that are pending 
that would constitute a successful and usable revamp of that API. Here are the 
ones I could find and know off the top of my head:
1. Table Catalog API: https://issues.apache.org/jira/browse/SPARK-24252
   In my opinion this is by far the most important API to get in, but it’s also
   the most important API to give thorough thought and evaluation.
2. Remaining logical plans for CTAS, RTAS, DROP / DELETE, OVERWRITE:
   https://issues.apache.org/jira/browse/SPARK-24923 +
   https://issues.apache.org/jira/browse/SPARK-24253
3. Catalogs for other entities, such as functions. Pluggable system for loading
   these.
4. Multi-Catalog support - https://issues.apache.org/jira/browse/SPARK-25006
5. Migration of existing sources to V2, particularly file sources like Parquet and
   ORC – requires #1 as discussed in yesterday’s meeting
 

Can someone add to this list if we’re missing anything? It might also make
sense to either assign a JIRA label or update the JIRA umbrella issues, if any;
whatever mechanism works for being able to find all of these outstanding issues
in one place.

 

My understanding is that #1 is the most critical feature we need, and the 
feature that will go a long way towards allowing everything else to fall into 
place. #2 is also critical for external implementations of Data Source V2. I 
think we can afford to defer 3-5 to a future point release. But #1 and #2 are 
also the features that have remained open for the longest time and we really 
need to move forward on these. Putting a target release for 3.0 will help in 
that regard.

 

-Matt Cheah

 

From: Ryan Blue 
Reply-To: "rb...@netflix.com" 
Date: Thursday, February 21, 2019 at 2:22 PM
To: Matei Zaharia 
Cc: Spark Dev List 
Subject: Re: [DISCUSS] Spark 3.0 and DataSourceV2

 

I'm all for making releases more often if we want. But this work could really 
use a target release to motivate getting it done. If we agree that it will 
block a release, then everyone is motivated to review and get the PRs in. 

 

If this work doesn't make it in the 3.0 release, I'm not confident that it will 
get done. Maybe we can have a release shortly after, but the timeline for these 
features -- that many of us need -- is nearly creeping into years. That's when 
alternatives start looking more likely to deliver. I'd rather see this work get 
in so we don't have to consider those alternatives, which is why I think this 
commitment is a good idea.

 

I also would like to see multi-catalog support, but that is more reasonable to 
put off for a follow-up feature release, maybe 3.1.

 

On Thu, Feb 21, 2019 at 1:45 PM Matei Zaharia  wrote:

How large would the delay be? My 2 cents are that there’s nothing stopping us 
from making feature releases more often if we want to, so we shouldn’t see this 
as an “either delay 3.0 or release in >6 months” decision. If the work is 
likely to get in with a small delay and simplifies our work after 3.0 (e.g. we 
can get rid of older APIs), then the delay may be worth it. But if it would be 
a large delay, we should also weigh it against other things that are going to 
get delayed if 3.0 moves much later.

It might also be better to propose a specific date to delay until, so people 
can still plan around when the release branch will likely be cut.

Matei

> On Feb 21, 2019, at 1:03 PM, Ryan Blue  wrote:
> 
> Hi everyone,
> 
> In the DSv2 sync last night, we had a discussion about roadmap and what the 
> goal should be for getting the main features into Spark. We all agreed that 
> 3.0 should be that goal, even if it means delaying the 3.0 release.
> 
> The possibility of delaying the 3.0 release may be controversial, so I want 
> to bring it up to the dev list to build consensus around it. The rationale 
> for this is partly that much of this work has been outstanding for more than 
> a year now. If it doesn't make it into 3.0, then it would be another 6 months 
> before it would be in a release, and would be nearing 2 years to get the work 
> done.
> 
> Are there any objections to targeting 3.0 for this?
> 
> In addition, much of the planning for multi-catalog support has been done to 
> make v2 possible. Do we also want to include multi-catalog support?
> 
> 
> rb
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix


 

-- 

Ryan Blue 

Software Engineer

Netflix





Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-21 Thread Ryan Blue
I'm all for making releases more often if we want. But this work could
really use a target release to motivate getting it done. If we agree that
it will block a release, then everyone is motivated to review and get the
PRs in.

If this work doesn't make it in the 3.0 release, I'm not confident that it
will get done. Maybe we can have a release shortly after, but the timeline
for these features -- that many of us need -- is nearly creeping into
years. That's when alternatives start looking more likely to deliver. I'd
rather see this work get in so we don't have to consider those
alternatives, which is why I think this commitment is a good idea.

I also would like to see multi-catalog support, but that is more reasonable
to put off for a follow-up feature release, maybe 3.1.

On Thu, Feb 21, 2019 at 1:45 PM Matei Zaharia 
wrote:

> How large would the delay be? My 2 cents are that there’s nothing stopping
> us from making feature releases more often if we want to, so we shouldn’t
> see this as an “either delay 3.0 or release in >6 months” decision. If the
> work is likely to get in with a small delay and simplifies our work after
> 3.0 (e.g. we can get rid of older APIs), then the delay may be worth it.
> But if it would be a large delay, we should also weigh it against other
> things that are going to get delayed if 3.0 moves much later.
>
> It might also be better to propose a specific date to delay until, so
> people can still plan around when the release branch will likely be cut.
>
> Matei
>
> > On Feb 21, 2019, at 1:03 PM, Ryan Blue 
> wrote:
> >
> > Hi everyone,
> >
> > In the DSv2 sync last night, we had a discussion about roadmap and what
> the goal should be for getting the main features into Spark. We all agreed
> that 3.0 should be that goal, even if it means delaying the 3.0 release.
> >
> > The possibility of delaying the 3.0 release may be controversial, so I
> want to bring it up to the dev list to build consensus around it. The
> rationale for this is partly that much of this work has been outstanding
> for more than a year now. If it doesn't make it into 3.0, then it would be
> another 6 months before it would be in a release, and would be nearing 2
> years to get the work done.
> >
> > Are there any objections to targeting 3.0 for this?
> >
> > In addition, much of the planning for multi-catalog support has been
> done to make v2 possible. Do we also want to include multi-catalog support?
> >
> >
> > rb
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
>
>

-- 
Ryan Blue
Software Engineer
Netflix


Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-21 Thread Matei Zaharia
How large would the delay be? My 2 cents are that there’s nothing stopping us 
from making feature releases more often if we want to, so we shouldn’t see this 
as an “either delay 3.0 or release in >6 months” decision. If the work is 
likely to get in with a small delay and simplifies our work after 3.0 (e.g. we 
can get rid of older APIs), then the delay may be worth it. But if it would be 
a large delay, we should also weigh it against other things that are going to 
get delayed if 3.0 moves much later.

It might also be better to propose a specific date to delay until, so people 
can still plan around when the release branch will likely be cut.

Matei

> On Feb 21, 2019, at 1:03 PM, Ryan Blue  wrote:
> 
> Hi everyone,
> 
> In the DSv2 sync last night, we had a discussion about roadmap and what the 
> goal should be for getting the main features into Spark. We all agreed that 
> 3.0 should be that goal, even if it means delaying the 3.0 release.
> 
> The possibility of delaying the 3.0 release may be controversial, so I want 
> to bring it up to the dev list to build consensus around it. The rationale 
> for this is partly that much of this work has been outstanding for more than 
> a year now. If it doesn't make it into 3.0, then it would be another 6 months 
> before it would be in a release, and would be nearing 2 years to get the work 
> done.
> 
> Are there any objections to targeting 3.0 for this?
> 
> In addition, much of the planning for multi-catalog support has been done to 
> make v2 possible. Do we also want to include multi-catalog support?
> 
> 
> rb
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix





[DISCUSS] Spark 3.0 and DataSourceV2

2019-02-21 Thread Ryan Blue
Hi everyone,

In the DSv2 sync last night, we had a discussion about roadmap and what the
goal should be for getting the main features into Spark. We all agreed that
3.0 should be that goal, even if it means delaying the 3.0 release.

The possibility of delaying the 3.0 release may be controversial, so I want
to bring it up to the dev list to build consensus around it. The rationale
for this is partly that much of this work has been outstanding for more
than a year now. If it doesn't make it into 3.0, then it would be another 6
months before it would be in a release, and would be nearing 2 years to get
the work done.

Are there any objections to targeting 3.0 for this?

In addition, much of the planning for multi-catalog support has been done
to make v2 possible. Do we also want to include multi-catalog support?


rb

-- 
Ryan Blue
Software Engineer
Netflix


DataSourceV2 sync notes - 20 Feb 2019

2019-02-21 Thread Ryan Blue
Here are my notes from the DSv2 sync last night. As always, if you have
corrections, please reply with them. And if you’d like to be included on
the invite to participate in the next sync (6 March), send me an email.

Here’s a quick summary of the topics where we had consensus last night:

   - The behavior of v1 sources needs to be documented to come up with a
   migration plan
   - Spark 3.0 should include DSv2, even if it would delay the release
   (pending community discussion and vote)
   - Design for the v2 Catalog plugin system
   - V2 catalog approach of separate TableCatalog, FunctionCatalog, and
   ViewCatalog interfaces
   - Common v2 Table metadata should be schema, partitioning, and a
   string map of properties, leaving out sorting for now. (Ready to vote on
   the metadata SPIP; a sketch follows this list.)
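
A minimal sketch of the separate-interface approach and the table metadata
agreed above. The names mirror the SPARK-24252 proposal but are illustrative
rather than the final API:

import java.util.{Map => JMap}
import org.apache.spark.sql.types.StructType

// Multi-part identifier: a namespace plus a table/function/view name.
case class Identifier(namespace: Seq[String], name: String)

// Table metadata from the consensus above: schema, partitioning, and a
// string map of properties (partitioning simplified to column names here;
// the proposal uses transform expressions).
trait Table {
  def name(): String
  def schema(): StructType
  def partitioning(): Seq[String]
  def properties(): JMap[String, String]
}

// One capability per interface, so a catalog plugin implements only what it
// actually supports.
trait TableCatalog {
  def loadTable(ident: Identifier): Table
  def createTable(
      ident: Identifier,
      schema: StructType,
      partitioning: Seq[String],
      properties: JMap[String, String]): Table
  def dropTable(ident: Identifier): Boolean
}

trait FunctionCatalog { /* function lookup, elided */ }

trait ViewCatalog { /* view lookup, elided */ }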

*Topics*:

   - Issues raised by ORC v2 commit
   - Migration to v2 sources
   - Roadmap and current blockers
   - Catalog plugin system
   - Catalog API separate interfaces approach
   - Catalog API metadata (schema, partitioning, and properties)
   - Public catalog API proposal

*Notes*:

   - Issues raised by ORC v2 commit
  - Ryan: Disabled change to use v2 by default in PR for overwrite
  plans: tests rely on CTAS, which is not implemented in v2.
  - Wenchen: suggested using a StagedTable to work around not having a
  CTAS finished. TableProvider could create a staged table.
  - Ryan: Using StagedTable doesn’t make sense to me. It was intended
  to solve a different problem (atomicity). Adding an interface to create a
  staged table either requires the same metadata as CTAS or
requires a blank
  staged table, which isn’t the same concept: these staged tables would
  behave entirely differently than the ones for atomic operations.
Better to
  spend time getting CTAS done and work through the long-term plan than to
  hack around it.
  - Second issue raised by the ORC work: how to support tables that use
  different validations.
  - Ryan: What Gengliang’s PRs are missing is a clear definition of
  what tables require different validation and what that validation should
  be. In some cases, CTAS is validated against existing data [Ed: this is
  PreprocessTableCreation] and in some cases, Append has no validation
  because the table doesn’t exist. What isn’t clear is when these
validations
  are applied.
  - Ryan: Without knowing exactly how v1 works, we can’t mirror that
  behavior in v2. Building a way to turn off validation is going to be
  needed, but is insufficient without knowing when to apply it.
  - Ryan: We also don’t know if it will make sense to maintain all of
  these rules to mimic v1 behavior. In v1, CTAS and Append can
both write to
  existing tables, but use different rules to validate. What are the
  differences between them? It is unlikely that Spark will support both as
  options, if that is even possible. [Ed: see later discussion on migration
  that continues this.]
  - Gengliang: Using SaveMode is an option.
  - Ryan: Using SaveMode only appears to fix this, but doesn’t actually
  test v2. Using SaveMode appears to work because it disables all
validation
  and uses code from v1 that will “create” tables by writing. But
this isn’t
  helpful for the v2 goal of having defined and reliable behavior.
  - Gengliang: SaveMode is not correctly translated. Append could mean
  AppendData or CTAS.
  - Ryan: This is why we need to focus on finishing the v2 plans: so we
  can correctly translate the SaveMode into the right plan. That depends on
  having a catalog for CTAS and to check the existence of a table.
  - Wenchen: Catalog doesn’t support path tables, so how does this help?
  - Ryan: The multi-catalog identifiers proposal includes a way to pass
  paths as CatalogIdentifiers. [Ed: see PathIdentifier]. This allows a
  catalog implementation to handle path-based tables. The identifier will
  also have a method to test whether the identifier is a path
identifier and
  catalogs are not required to support path identifiers.
   - Migration to v2 sources
  - Hyukjin: Once the ORC upgrade is done how will we move from v1 to
  v2?
  - Ryan: We will need to develop v1 and v2 in parallel. There are many
  code paths in v1 and we don’t know exactly what they do. We first need to
  know what they do and make a migration plan after that.
  - Hyukjin: What if there are many behavior differences? Will this
  require an API to opt in for each one?
  - Ryan: Without knowing how v1 behaves, we can only speculate. But I
  don’t think that we will want to support many of these special
cases. That
  is a lot of work and maintenance.
  - Gengliang: When can we change the default to v2? Until we change
  the default, v2 is not tested. The v2 work is blocked by this.
  - Ryan: v2 work should not be 

Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-02-21 Thread DB Tsai
I am cutting a new rc4 with fix from Felix. Thanks.

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0359BC9965359766

On Thu, Feb 21, 2019 at 8:57 AM Felix Cheung  wrote:
>
> I merged the fix to 2.4.
>
>
> 
> From: Felix Cheung 
> Sent: Wednesday, February 20, 2019 9:34 PM
> To: DB Tsai; Spark dev list
> Cc: Cesar Delgado
> Subject: Re: [VOTE] Release Apache Spark 2.4.1 (RC2)
>
> Could you hold for a bit - I have one more fix to get in
>
>
> 
> From: d_t...@apple.com on behalf of DB Tsai 
> Sent: Wednesday, February 20, 2019 12:25 PM
> To: Spark dev list
> Cc: Cesar Delgado
> Subject: Re: [VOTE] Release Apache Spark 2.4.1 (RC2)
>
> Okay. Let's fail rc2, and I'll prepare rc3 with SPARK-26859.
>
> DB Tsai | Siri Open Source Technologies [not a contribution] |  Apple, Inc
>
> > On Feb 20, 2019, at 12:11 PM, Marcelo Vanzin  
> > wrote:
> >
> > Just wanted to point out that
> > https://issues.apache.org/jira/browse/SPARK-26859 is not in this RC,
> > and is marked as a correctness bug. (The fix is in the 2.4 branch,
> > just not in rc2.)
> >
> > On Wed, Feb 20, 2019 at 12:07 PM DB Tsai  wrote:
> >>
> >> Please vote on releasing the following candidate as Apache Spark version 
> >> 2.4.1.
> >>
> >> The vote is open until Feb 24 PST and passes if a majority +1 PMC votes 
> >> are cast, with
> >> a minimum of 3 +1 votes.
> >>
> >> [ ] +1 Release this package as Apache Spark 2.4.1
> >> [ ] -1 Do not release this package because ...
> >>
> >> To learn more about Apache Spark, please see http://spark.apache.org/
> >>
> >> The tag to be voted on is v2.4.1-rc2 (commit 
> >> 229ad524cfd3f74dd7aa5fc9ba841ae223caa960):
> >> https://github.com/apache/spark/tree/v2.4.1-rc2
> >>
> >> The release files, including signatures, digests, etc. can be found at:
> >> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc2-bin/
> >>
> >> Signatures used for Spark RCs can be found in this file:
> >> https://dist.apache.org/repos/dist/dev/spark/KEYS
> >>
> >> The staging repository for this release can be found at:
> >> https://repository.apache.org/content/repositories/orgapachespark-1299/
> >>
> >> The documentation corresponding to this release can be found at:
> >> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc2-docs/
> >>
> >> The list of bug fixes going into 2.4.1 can be found at the following URL:
> >> https://issues.apache.org/jira/projects/SPARK/versions/2.4.1
> >>
> >> FAQ
> >>
> >> =
> >> How can I help test this release?
> >> =
> >>
> >> If you are a Spark user, you can help us test this release by taking
> >> an existing Spark workload and running on this release candidate, then
> >> reporting any regressions.
> >>
> >> If you're working in PySpark you can set up a virtual env and install
> >> the current RC and see if anything important breaks, in the Java/Scala
> >> you can add the staging repository to your project's resolvers and test
> >> with the RC (make sure to clean up the artifact cache before/after so
> >> you don't end up building with an out-of-date RC going forward).
> >>
> >> ===
> >> What should happen to JIRA tickets still targeting 2.4.1?
> >> ===
> >>
> >> The current list of open tickets targeted at 2.4.1 can be found at:
> >> https://issues.apache.org/jira/projects/SPARK and search for "Target 
> >> Version/s" = 2.4.1
> >>
> >> Committers should look at those and triage. Extremely important bug
> >> fixes, documentation, and API tweaks that impact compatibility should
> >> be worked on immediately. Everything else please retarget to an
> >> appropriate release.
> >>
> >> ==
> >> But my bug isn't fixed?
> >> ==
> >>
> >> In order to make timely releases, we will typically not hold the
> >> release unless the bug in question is a regression from the previous
> >> release. That being said, if there is something which is a regression
> >> that has not been correctly targeted please ping me or a committer to
> >> help target the issue.
> >>
> >>
> >> DB Tsai | Siri Open Source Technologies [not a contribution] |  Apple, Inc
> >>
> >>
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>
> >
> >
> > --
> > Marcelo
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>

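
For the Java/Scala testing step quoted above, a minimal build.sbt sketch that
points the resolvers at the staging repository. The staged artifact version is
assumed to be 2.4.1, and the Scala version 2.11 to match the default build:

// build.sbt -- resolve the Spark 2.4.1 RC artifacts from the staging repository
scalaVersion := "2.11.12"

resolvers += "Apache Spark 2.4.1 RC staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1299/"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.1"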



Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-02-21 Thread Felix Cheung
I merged the fix to 2.4.



From: Felix Cheung 
Sent: Wednesday, February 20, 2019 9:34 PM
To: DB Tsai; Spark dev list
Cc: Cesar Delgado
Subject: Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

Could you hold for a bit - I have one more fix to get in



From: d_t...@apple.com on behalf of DB Tsai 
Sent: Wednesday, February 20, 2019 12:25 PM
To: Spark dev list
Cc: Cesar Delgado
Subject: Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

Okay. Let's fail rc2, and I'll prepare rc3 with SPARK-26859.

DB Tsai | Siri Open Source Technologies [not a contribution] |  Apple, Inc

> On Feb 20, 2019, at 12:11 PM, Marcelo Vanzin  
> wrote:
>
> Just wanted to point out that
> https://issues.apache.org/jira/browse/SPARK-26859 is not in this RC,
> and is marked as a correctness bug. (The fix is in the 2.4 branch,
> just not in rc2.)
>
> On Wed, Feb 20, 2019 at 12:07 PM DB Tsai  wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version 
>> 2.4.1.
>>
>> The vote is open until Feb 24 PST and passes if a majority +1 PMC votes are 
>> cast, with
>> a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 2.4.1
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.4.1-rc2 (commit 
>> 229ad524cfd3f74dd7aa5fc9ba841ae223caa960):
>> https://github.com/apache/spark/tree/v2.4.1-rc2
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc2-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1299/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc2-docs/
>>
>> The list of bug fixes going into 2.4.1 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/2.4.1
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks, in the Java/Scala
>> you can add the staging repository to your project's resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out-of-date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 2.4.1?
>> ===
>>
>> The current list of open tickets targeted at 2.4.1 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target 
>> Version/s" = 2.4.1
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>>
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>>
>> DB Tsai | Siri Open Source Technologies [not a contribution] |  Apple, Inc
>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>





Re: Thoughts on dataframe cogroup?

2019-02-21 Thread Li Jin
I am wondering do other people have opinion/use case on cogroup?

On Wed, Feb 20, 2019 at 5:03 PM Li Jin  wrote:

> Alessandro,
>
> Thanks for the reply. I assume by "equi-join", you mean "equality full
> outer join".
>
> Two issues I see with an equality outer join are:
> (1) an equality outer join will give n * m rows for each key (n and m being the
> corresponding number of rows in df1 and df2 for each key)
> (2) the user needs to do some extra processing to transform the n * m rows back
> to the desired shape (two sub-dataframes with n and m rows)
>
> I think a full outer join is an inefficient way to implement cogroup. If the
> end goal is to have two separate dataframes for each key, why join them
> first and then unjoin them?
>
>
>
> On Wed, Feb 20, 2019 at 5:52 AM Alessandro Solimando <
> alessandro.solima...@gmail.com> wrote:
>
>> Hello,
>> I fail to see how an equi-join on the key columns is different than the
>> cogroup you propose.
>>
>> I think the accepted answer can shed some light:
>>
>> https://stackoverflow.com/questions/43960583/whats-the-difference-between-join-and-cogroup-in-apache-spark
>>
>> Now you apply an udf on each iterable, one per key value (obtained with
>> cogroup).
>>
>> You can achieve the same by:
>> 1) join df1 and df2 on the key you want,
>> 2) apply "groupby" on such key
>> 3) finally apply a udaf (you can have a look here if you are not familiar
>> with them
>> https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html),
>> that will process each group "in isolation".
>>
>> HTH,
>> Alessandro
>>
>> On Tue, 19 Feb 2019 at 23:30, Li Jin  wrote:
>>
>>> Hi,
>>>
>>> We have been using Pyspark's groupby().apply() quite a bit and it has
>>> been very helpful in integrating Spark with our existing pandas-heavy
>>> libraries.
>>>
>>> Recently, we have found more and more cases where groupby().apply() is
>>> not sufficient - In some cases, we want to group two dataframes by the same
>>> key, and apply a function which takes two pd.DataFrame (also returns a
>>> pd.DataFrame) for each key. This feels very much like the "cogroup"
>>> operation in the RDD API.
>>>
>>> It would be great to be able to do sth like this: (not actual API, just
>>> to explain the use case):
>>>
>>> @pandas_udf(return_schema, ...)
> >>> def my_udf(pdf1, pdf2):
> >>>  # pdf1 and pdf2 are the subsets of the original dataframes that are
> >>> associated with a particular key
> >>>  result = ... # some code that uses pdf1 and pdf2
> >>>  return result
> >>>
> >>> df3 = cogroup(df1, df2, key='some_key').apply(my_udf)
>>>
>>> I have searched around the problem and some people have suggested to
>>> join the tables first. However, it's often not the same pattern and hard to
>>> get it to work by using joins.
>>>
>>> I wonder what are people's thought on this?
>>>
>>> Li
>>>
>>>


Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-02-21 Thread Sean Owen
That looks like a change to restore some behavior that was removed in
2.2. It's not directly relevant to a release vote on 2.4.1. See the
existing discussion at
https://github.com/apache/spark/pull/22144#issuecomment-432258536 It
may indeed be a good thing to change but just continue the discussion
as you work on your PR.

On Thu, Feb 21, 2019 at 9:09 AM Parth Gandhi  wrote:
>
> Hello,
> In https://issues.apache.org/jira/browse/SPARK-24935, I am getting
> requests from people who were hoping for the fix to be merged in Spark
> 2.4.1. The concerned PR is here: https://github.com/apache/spark/pull/23778.
> I do not mind if we do not merge it for 2.4.1, and I do not want this to be
> considered a blocker for 2.4.1, but I would appreciate it if somebody
> could have a look and decide. I have not written unit tests for the
> PR yet, as I am relatively new to Catalyst and was hoping for someone to 
> confirm whether the fix is progressing in the right direction before 
> proceeding ahead with the unit tests. Thank you.
>
> Regards,
> Parth Kamlesh Gandhi
>
>
> On Wed, Feb 20, 2019 at 11:34 PM Felix Cheung  
> wrote:
>>
>> Could you hold for a bit - I have one more fix to get in
>>
>>
>> 
>> From: d_t...@apple.com on behalf of DB Tsai 
>> Sent: Wednesday, February 20, 2019 12:25 PM
>> To: Spark dev list
>> Cc: Cesar Delgado
>> Subject: Re: [VOTE] Release Apache Spark 2.4.1 (RC2)
>>
>> Okay. Let's fail rc2, and I'll prepare rc3 with SPARK-26859.
>>
>> DB Tsai | Siri Open Source Technologies [not a contribution] |  Apple, Inc
>>
>> > On Feb 20, 2019, at 12:11 PM, Marcelo Vanzin  
>> > wrote:
>> >
>> > Just wanted to point out that
>> > https://issues.apache.org/jira/browse/SPARK-26859 is not in this RC,
>> > and is marked as a correctness bug. (The fix is in the 2.4 branch,
>> > just not in rc2.)
>> >
>> > On Wed, Feb 20, 2019 at 12:07 PM DB Tsai  wrote:
>> >>
>> >> Please vote on releasing the following candidate as Apache Spark version 
>> >> 2.4.1.
>> >>
>> >> The vote is open until Feb 24 PST and passes if a majority +1 PMC votes 
>> >> are cast, with
>> >> a minimum of 3 +1 votes.
>> >>
>> >> [ ] +1 Release this package as Apache Spark 2.4.1
>> >> [ ] -1 Do not release this package because ...
>> >>
>> >> To learn more about Apache Spark, please see http://spark.apache.org/
>> >>
>> >> The tag to be voted on is v2.4.1-rc2 (commit 
>> >> 229ad524cfd3f74dd7aa5fc9ba841ae223caa960):
>> >> https://github.com/apache/spark/tree/v2.4.1-rc2
>> >>
>> >> The release files, including signatures, digests, etc. can be found at:
>> >> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc2-bin/
>> >>
>> >> Signatures used for Spark RCs can be found in this file:
>> >> https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >>
>> >> The staging repository for this release can be found at:
>> >> https://repository.apache.org/content/repositories/orgapachespark-1299/
>> >>
>> >> The documentation corresponding to this release can be found at:
>> >> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc2-docs/
>> >>
>> >> The list of bug fixes going into 2.4.1 can be found at the following URL:
>> >> https://issues.apache.org/jira/projects/SPARK/versions/2.4.1
>> >>
>> >> FAQ
>> >>
>> >> =
>> >> How can I help test this release?
>> >> =
>> >>
>> >> If you are a Spark user, you can help us test this release by taking
>> >> an existing Spark workload and running on this release candidate, then
>> >> reporting any regressions.
>> >>
>> >> If you're working in PySpark you can set up a virtual env and install
>> >> the current RC and see if anything important breaks, in the Java/Scala
>> >> you can add the staging repository to your project's resolvers and test
>> >> with the RC (make sure to clean up the artifact cache before/after so
>> >> you don't end up building with an out-of-date RC going forward).
>> >>
>> >> ===
>> >> What should happen to JIRA tickets still targeting 2.4.1?
>> >> ===
>> >>
>> >> The current list of open tickets targeted at 2.4.1 can be found at:
>> >> https://issues.apache.org/jira/projects/SPARK and search for "Target 
>> >> Version/s" = 2.4.1
>> >>
>> >> Committers should look at those and triage. Extremely important bug
>> >> fixes, documentation, and API tweaks that impact compatibility should
>> >> be worked on immediately. Everything else please retarget to an
>> >> appropriate release.
>> >>
>> >> ==
>> >> But my bug isn't fixed?
>> >> ==
>> >>
>> >> In order to make timely releases, we will typically not hold the
>> >> release unless the bug in question is a regression from the previous
>> >> release. That being said, if there is something which is a regression
>> >> that has not been correctly targeted please ping me or a committer to
>> >> help target the issue.
>> >>
>> >>
>> >> DB 

Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-02-21 Thread Parth Gandhi
Hello,
In https://issues.apache.org/jira/browse/SPARK-24935, I am getting
requests from people who were hoping for the fix to be merged in
Spark 2.4.1. The concerned PR is here:
https://github.com/apache/spark/pull/23778. I do not mind if we do not
merge it for 2.4.1, and I do not want this to be considered a blocker for
2.4.1, but I would appreciate it if somebody could have a look and decide.
I have not written unit tests for the PR yet, as I am relatively new
to Catalyst and was hoping for someone to confirm whether the fix is
progressing in the right direction before proceeding ahead with the unit
tests. Thank you.

Regards,
Parth Kamlesh Gandhi


On Wed, Feb 20, 2019 at 11:34 PM Felix Cheung 
wrote:

> Could you hold for a bit - I have one more fix to get in
>
>
> --
> *From:* d_t...@apple.com on behalf of DB Tsai 
> *Sent:* Wednesday, February 20, 2019 12:25 PM
> *To:* Spark dev list
> *Cc:* Cesar Delgado
> *Subject:* Re: [VOTE] Release Apache Spark 2.4.1 (RC2)
>
> Okay. Let's fail rc2, and I'll prepare rc3 with SPARK-26859.
>
> DB Tsai | Siri Open Source Technologies [not a contribution] |  Apple,
> Inc
>
> > On Feb 20, 2019, at 12:11 PM, Marcelo Vanzin 
> wrote:
> >
> > Just wanted to point out that
> > https://issues.apache.org/jira/browse/SPARK-26859 is not in this RC,
> > and is marked as a correctness bug. (The fix is in the 2.4 branch,
> > just not in rc2.)
> >
> > On Wed, Feb 20, 2019 at 12:07 PM DB Tsai 
> wrote:
> >>
> >> Please vote on releasing the following candidate as Apache Spark
> version 2.4.1.
> >>
> >> The vote is open until Feb 24 PST and passes if a majority +1 PMC votes
> are cast, with
> >> a minimum of 3 +1 votes.
> >>
> >> [ ] +1 Release this package as Apache Spark 2.4.1
> >> [ ] -1 Do not release this package because ...
> >>
> >> To learn more about Apache Spark, please see http://spark.apache.org/
> >>
> >> The tag to be voted on is v2.4.1-rc2 (commit
> 229ad524cfd3f74dd7aa5fc9ba841ae223caa960):
> >> https://github.com/apache/spark/tree/v2.4.1-rc2
> >>
> >> The release files, including signatures, digests, etc. can be found at:
> >> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc2-bin/
> >>
> >> Signatures used for Spark RCs can be found in this file:
> >> https://dist.apache.org/repos/dist/dev/spark/KEYS
> >>
> >> The staging repository for this release can be found at:
> >> https://repository.apache.org/content/repositories/orgapachespark-1299/
> >>
> >> The documentation corresponding to this release can be found at:
> >> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc2-docs/
> >>
> >> The list of bug fixes going into 2.4.1 can be found at the following
> URL:
> >> https://issues.apache.org/jira/projects/SPARK/versions/2.4.1
> >>
> >> FAQ
> >>
> >> =
> >> How can I help test this release?
> >> =
> >>
> >> If you are a Spark user, you can help us test this release by taking
> >> an existing Spark workload and running on this release candidate, then
> >> reporting any regressions.
> >>
> >> If you're working in PySpark you can set up a virtual env and install
> >> the current RC and see if anything important breaks, in the Java/Scala
> >> you can add the staging repository to your project's resolvers and test
> >> with the RC (make sure to clean up the artifact cache before/after so
> >> you don't end up building with an out-of-date RC going forward).
> >>
> >> ===
> >> What should happen to JIRA tickets still targeting 2.4.1?
> >> ===
> >>
> >> The current list of open tickets targeted at 2.4.1 can be found at:
> >> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.4.1
> >>
> >> Committers should look at those and triage. Extremely important bug
> >> fixes, documentation, and API tweaks that impact compatibility should
> >> be worked on immediately. Everything else please retarget to an
> >> appropriate release.
> >>
> >> ==
> >> But my bug isn't fixed?
> >> ==
> >>
> >> In order to make timely releases, we will typically not hold the
> >> release unless the bug in question is a regression from the previous
> >> release. That being said, if there is something which is a regression
> >> that has not been correctly targeted please ping me or a committer to
> >> help target the issue.
> >>
> >>
> >> DB Tsai | Siri Open Source Technologies [not a contribution] |  Apple,
> Inc
> >>
> >>
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>
> >
> >
> > --
> > Marcelo
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
>
> -
> To unsubscribe e-mail: 

Re: [VOTE] SPIP: Identifiers for multi-catalog Spark

2019-02-21 Thread Xiao Li
+1 This is in the right direction. The resolution rules and catalog APIs
need more discussion when we implement it.

In the current stage, we can disallow the runtime creation of catalogs,
since it would complicate name resolution in a multi-session environment.
For example, when one user creates a catalog in one session, other
users' queries might return different results because the tables are
resolved differently. We might need to investigate how other systems
deal with this.

Cheers,

Xiao

Takeshi Yamamuro wrote on Wednesday, February 20, 2019, at 12:54 AM:

> +1
>
> On Wed, Feb 20, 2019 at 4:59 PM JackyLee  wrote:
>
>> +1
>>
>>
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
> --
> ---
> Takeshi Yamamuro
>
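
A sketch of the static catalog configuration and multi-part identifiers under
discussion, to illustrate Xiao's point about resolving names consistently
across sessions. The property name spark.sql.catalog.<name> and the plugin
class are assumptions based on the SPIP, not settled API:

import org.apache.spark.sql.SparkSession

// Catalogs registered up front via configuration rather than created at
// runtime, so every session resolves the same catalog names the same way.
val spark = SparkSession.builder()
  .appName("multi-catalog-sketch")
  .config("spark.sql.catalog.prod", "com.example.MyTableCatalog")  // hypothetical plugin class
  .config("spark.sql.catalog.test", "com.example.MyTableCatalog")
  .getOrCreate()

// The first identifier part selects the catalog; the rest is the namespace
// and table name.
spark.sql("SELECT * FROM prod.db.events").show()
spark.sql("SELECT * FROM test.db.events").show()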