Re: [DISCUSS] Spark 3.0 and DataSourceV2
In addition to logical plans, we need SQL support. That requires resolving v2 tables from a catalog and a few other changes, like separating v1 plans from SQL parsing (see the earlier dev list thread). I’d also like to add DDL operations for v2. I think it also makes sense to add a new DF write API, as we discussed in the sync as well. That way, users have an API to start moving to that always uses the v2 plans and behavior.

Here are all the commands that we have implemented on top of the proposed table catalog API. We should be able to get these working in upstream Spark fairly quickly.

- CREATE TABLE [IF NOT EXISTS] …
- CREATE TABLE … PARTITIONED BY …
- CREATE TABLE … AS SELECT …
- CREATE TABLE LIKE
- ALTER TABLE …
  - ADD COLUMNS …
  - DROP COLUMNS …
  - ALTER COLUMN … TYPE
  - ALTER COLUMN … COMMENT
  - RENAME COLUMN … TO …
  - SET TBLPROPERTIES …
  - UNSET TBLPROPERTIES …
- ALTER TABLE … RENAME TO …
- DROP TABLE [IF EXISTS] …
- DESCRIBE [FORMATTED|EXTENDED] …
- SHOW CREATE TABLE …
- SHOW TBLPROPERTIES
- ALTER TABLE
- REFRESH TABLE …
- INSERT INTO …
- INSERT OVERWRITE …
- DELETE FROM … WHERE …

On Thu, Feb 21, 2019 at 3:57 PM Matt Cheah wrote:

> To evaluate the amount of work required to get Data Source V2 into Spark
> 3.0, we should have a list of all the specific SPIPs and patches that are
> pending that would constitute a successful and usable revamp of that API.
> Here are the ones I could find and know off the top of my head:
>
> 1. Table Catalog API: https://issues.apache.org/jira/browse/SPARK-24252
>    In my opinion this is by far the most important API to get in, but it’s
>    also the most important API to give thorough thought and evaluation.
> 2. Remaining logical plans for CTAS, RTAS, DROP / DELETE, OVERWRITE:
>    https://issues.apache.org/jira/browse/SPARK-24923 +
>    https://issues.apache.org/jira/browse/SPARK-24253
> 3. Catalogs for other entities, such as functions. Pluggable system for
>    loading these.
> 4. Multi-Catalog support: https://issues.apache.org/jira/browse/SPARK-25006
> 5. Migration of existing sources to V2, particularly file sources like
>    Parquet and ORC – requires #1 as discussed in yesterday’s meeting
>
> Can someone add to this list if we’re missing anything? It might also make
> sense to either assign a JIRA label or to update JIRA umbrella issues if
> any. Whatever mechanism works for being able to find all of these
> outstanding issues in one place.
>
> My understanding is that #1 is the most critical feature we need, and the
> feature that will go a long way towards allowing everything else to fall
> into place. #2 is also critical for external implementations of Data Source
> V2. I think we can afford to defer 3-5 to a future point release. But #1
> and #2 are also the features that have remained open for the longest time,
> and we really need to move forward on these. Putting a target release for
> 3.0 will help in that regard.
>
> -Matt Cheah
>
> *From:* Ryan Blue
> *Reply-To:* "rb...@netflix.com"
> *Date:* Thursday, February 21, 2019 at 2:22 PM
> *To:* Matei Zaharia
> *Cc:* Spark Dev List
> *Subject:* Re: [DISCUSS] Spark 3.0 and DataSourceV2
>
> I'm all for making releases more often if we want. But this work could
> really use a target release to motivate getting it done. If we agree that
> it will block a release, then everyone is motivated to review and get the
> PRs in.
>
> If this work doesn't make it in the 3.0 release, I'm not confident that it
> will get done. Maybe we can have a release shortly after, but the timeline
> for these features -- that many of us need -- is nearly creeping into
> years. That's when alternatives start looking more likely to deliver. I'd
> rather see this work get in so we don't have to consider those
> alternatives, which is why I think this commitment is a good idea.
>
> I also would like to see multi-catalog support, but that is more
> reasonable to put off for a follow-up feature release, maybe 3.1.
>
> On Thu, Feb 21, 2019 at 1:45 PM Matei Zaharia wrote:
>
> How large would the delay be? My 2 cents are that there’s nothing stopping
> us from making feature releases more often if we want to, so we shouldn’t
> see this as an “either delay 3.0 or release in >6 months” decision. If the
> work is likely to get in with a small delay and simplifies our work after
> 3.0 (e.g. we can get rid of older APIs), then the delay may be worth it.
> But if it would be a large delay, we should also weigh it against other
> things that are going to get delayed if 3.0 moves much later.
>
> It might also be better to propose a specific date to delay until, so
> people can still plan around when the release branch will likely be cut.
>
> Matei
>
> > On Feb 21, 2019, at 1:03 PM, Ryan Blue wrote:
> >
> > Hi everyone,
> >
> > In the DSv2 sync last night, we had a discussion about
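[Ed: The DDL commands in the list above all resolve to a small set of catalog operations (create, load, alter, drop). The following is a toy in-memory sketch of that surface in Python, purely for illustration — the actual SPARK-24252 proposal is a Java TableCatalog API, and the class, method names, and signatures here are assumptions, not the proposed interface.]

```python
class NoSuchTableError(Exception):
    """Raised when a table name does not resolve to a table."""


class InMemoryTableCatalog:
    """Toy catalog covering the operations behind the DDL commands above:
    CREATE TABLE, ALTER TABLE ... SET TBLPROPERTIES, DROP TABLE [IF EXISTS]."""

    def __init__(self):
        self._tables = {}

    def create_table(self, name, schema, partitions=(), properties=None):
        # CREATE TABLE ... PARTITIONED BY ...
        if name in self._tables:
            raise ValueError(f"table already exists: {name}")
        self._tables[name] = {
            "schema": dict(schema),            # column name -> type string
            "partitions": tuple(partitions),   # PARTITIONED BY columns
            "properties": dict(properties or {}),
        }
        return self._tables[name]

    def load_table(self, name):
        # Used by DESCRIBE, SHOW TBLPROPERTIES, and existence checks for CTAS
        try:
            return self._tables[name]
        except KeyError:
            raise NoSuchTableError(name) from None

    def set_properties(self, name, **updates):
        # ALTER TABLE ... SET TBLPROPERTIES
        self.load_table(name)["properties"].update(updates)

    def drop_table(self, name, if_exists=False):
        # DROP TABLE [IF EXISTS]
        if name in self._tables:
            del self._tables[name]
            return True
        if if_exists:
            return False
        raise NoSuchTableError(name)


catalog = InMemoryTableCatalog()
catalog.create_table("db.events",
                     schema={"id": "bigint", "ts": "timestamp"},
                     partitions=("ts",))
catalog.set_properties("db.events", owner="etl")
```

The point of the sketch is only that once SQL parsing produces v2 plans, each of the listed commands becomes a call against one pluggable interface like this, rather than against v1's session catalog.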
Re: [DISCUSS] Spark 3.0 and DataSourceV2
To evaluate the amount of work required to get Data Source V2 into Spark 3.0, we should have a list of all the specific SPIPs and patches that are pending that would constitute a successful and usable revamp of that API. Here are the ones I could find and know off the top of my head:

1. Table Catalog API: https://issues.apache.org/jira/browse/SPARK-24252
   In my opinion this is by far the most important API to get in, but it’s also the most important API to give thorough thought and evaluation.
2. Remaining logical plans for CTAS, RTAS, DROP / DELETE, OVERWRITE: https://issues.apache.org/jira/browse/SPARK-24923 + https://issues.apache.org/jira/browse/SPARK-24253
3. Catalogs for other entities, such as functions. Pluggable system for loading these.
4. Multi-Catalog support: https://issues.apache.org/jira/browse/SPARK-25006
5. Migration of existing sources to V2, particularly file sources like Parquet and ORC – requires #1 as discussed in yesterday’s meeting

Can someone add to this list if we’re missing anything? It might also make sense to either assign a JIRA label or to update JIRA umbrella issues if any. Whatever mechanism works for being able to find all of these outstanding issues in one place.

My understanding is that #1 is the most critical feature we need, and the feature that will go a long way towards allowing everything else to fall into place. #2 is also critical for external implementations of Data Source V2. I think we can afford to defer 3-5 to a future point release. But #1 and #2 are also the features that have remained open for the longest time, and we really need to move forward on these. Putting a target release for 3.0 will help in that regard.

-Matt Cheah

*From:* Ryan Blue
*Reply-To:* "rb...@netflix.com"
*Date:* Thursday, February 21, 2019 at 2:22 PM
*To:* Matei Zaharia
*Cc:* Spark Dev List
*Subject:* Re: [DISCUSS] Spark 3.0 and DataSourceV2

I'm all for making releases more often if we want. But this work could really use a target release to motivate getting it done.
If we agree that it will block a release, then everyone is motivated to review and get the PRs in.

If this work doesn't make it in the 3.0 release, I'm not confident that it will get done. Maybe we can have a release shortly after, but the timeline for these features -- that many of us need -- is nearly creeping into years. That's when alternatives start looking more likely to deliver. I'd rather see this work get in so we don't have to consider those alternatives, which is why I think this commitment is a good idea.

I also would like to see multi-catalog support, but that is more reasonable to put off for a follow-up feature release, maybe 3.1.

On Thu, Feb 21, 2019 at 1:45 PM Matei Zaharia wrote:

How large would the delay be? My 2 cents are that there’s nothing stopping us from making feature releases more often if we want to, so we shouldn’t see this as an “either delay 3.0 or release in >6 months” decision. If the work is likely to get in with a small delay and simplifies our work after 3.0 (e.g. we can get rid of older APIs), then the delay may be worth it. But if it would be a large delay, we should also weigh it against other things that are going to get delayed if 3.0 moves much later.

It might also be better to propose a specific date to delay until, so people can still plan around when the release branch will likely be cut.

Matei

> On Feb 21, 2019, at 1:03 PM, Ryan Blue wrote:
>
> Hi everyone,
>
> In the DSv2 sync last night, we had a discussion about roadmap and what the
> goal should be for getting the main features into Spark. We all agreed that
> 3.0 should be that goal, even if it means delaying the 3.0 release.
>
> The possibility of delaying the 3.0 release may be controversial, so I want
> to bring it up to the dev list to build consensus around it. The rationale
> for this is partly that much of this work has been outstanding for more than
> a year now.
> If it doesn't make it into 3.0, then it would be another 6 months
> before it would be in a release, and would be nearing 2 years to get the work
> done.
>
> Are there any objections to targeting 3.0 for this?
>
> In addition, much of the planning for multi-catalog support has been done to
> make v2 possible. Do we also want to include multi-catalog support?
>
> rb
>
> --
> Ryan Blue
> Software Engineer
> Netflix

--
Ryan Blue
Software Engineer
Netflix
Re: [DISCUSS] Spark 3.0 and DataSourceV2
I'm all for making releases more often if we want. But this work could really use a target release to motivate getting it done. If we agree that it will block a release, then everyone is motivated to review and get the PRs in.

If this work doesn't make it in the 3.0 release, I'm not confident that it will get done. Maybe we can have a release shortly after, but the timeline for these features -- that many of us need -- is nearly creeping into years. That's when alternatives start looking more likely to deliver. I'd rather see this work get in so we don't have to consider those alternatives, which is why I think this commitment is a good idea.

I also would like to see multi-catalog support, but that is more reasonable to put off for a follow-up feature release, maybe 3.1.

On Thu, Feb 21, 2019 at 1:45 PM Matei Zaharia wrote:

> How large would the delay be? My 2 cents are that there’s nothing stopping
> us from making feature releases more often if we want to, so we shouldn’t
> see this as an “either delay 3.0 or release in >6 months” decision. If the
> work is likely to get in with a small delay and simplifies our work after
> 3.0 (e.g. we can get rid of older APIs), then the delay may be worth it.
> But if it would be a large delay, we should also weigh it against other
> things that are going to get delayed if 3.0 moves much later.
>
> It might also be better to propose a specific date to delay until, so
> people can still plan around when the release branch will likely be cut.
>
> Matei
>
> > On Feb 21, 2019, at 1:03 PM, Ryan Blue wrote:
> >
> > Hi everyone,
> >
> > In the DSv2 sync last night, we had a discussion about roadmap and what
> > the goal should be for getting the main features into Spark. We all agreed
> > that 3.0 should be that goal, even if it means delaying the 3.0 release.
> >
> > The possibility of delaying the 3.0 release may be controversial, so I
> > want to bring it up to the dev list to build consensus around it. The
> > rationale for this is partly that much of this work has been outstanding
> > for more than a year now. If it doesn't make it into 3.0, then it would be
> > another 6 months before it would be in a release, and would be nearing 2
> > years to get the work done.
> >
> > Are there any objections to targeting 3.0 for this?
> >
> > In addition, much of the planning for multi-catalog support has been
> > done to make v2 possible. Do we also want to include multi-catalog support?
> >
> > rb
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix

--
Ryan Blue
Software Engineer
Netflix
Re: [DISCUSS] Spark 3.0 and DataSourceV2
How large would the delay be? My 2 cents are that there’s nothing stopping us from making feature releases more often if we want to, so we shouldn’t see this as an “either delay 3.0 or release in >6 months” decision. If the work is likely to get in with a small delay and simplifies our work after 3.0 (e.g. we can get rid of older APIs), then the delay may be worth it. But if it would be a large delay, we should also weigh it against other things that are going to get delayed if 3.0 moves much later.

It might also be better to propose a specific date to delay until, so people can still plan around when the release branch will likely be cut.

Matei

> On Feb 21, 2019, at 1:03 PM, Ryan Blue wrote:
>
> Hi everyone,
>
> In the DSv2 sync last night, we had a discussion about roadmap and what the
> goal should be for getting the main features into Spark. We all agreed that
> 3.0 should be that goal, even if it means delaying the 3.0 release.
>
> The possibility of delaying the 3.0 release may be controversial, so I want
> to bring it up to the dev list to build consensus around it. The rationale
> for this is partly that much of this work has been outstanding for more than
> a year now. If it doesn't make it into 3.0, then it would be another 6 months
> before it would be in a release, and would be nearing 2 years to get the work
> done.
>
> Are there any objections to targeting 3.0 for this?
>
> In addition, much of the planning for multi-catalog support has been done to
> make v2 possible. Do we also want to include multi-catalog support?
>
> rb
>
> --
> Ryan Blue
> Software Engineer
> Netflix

- To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
[DISCUSS] Spark 3.0 and DataSourceV2
Hi everyone,

In the DSv2 sync last night, we had a discussion about roadmap and what the goal should be for getting the main features into Spark. We all agreed that 3.0 should be that goal, even if it means delaying the 3.0 release.

The possibility of delaying the 3.0 release may be controversial, so I want to bring it up to the dev list to build consensus around it. The rationale for this is partly that much of this work has been outstanding for more than a year now. If it doesn't make it into 3.0, then it would be another 6 months before it would be in a release, and would be nearing 2 years to get the work done.

Are there any objections to targeting 3.0 for this?

In addition, much of the planning for multi-catalog support has been done to make v2 possible. Do we also want to include multi-catalog support?

rb

--
Ryan Blue
Software Engineer
Netflix
DataSourceV2 sync notes - 20 Feb 2019
Here are my notes from the DSv2 sync last night. As always, if you have corrections, please reply with them. And if you’d like to be included on the invite to participate in the next sync (6 March), send me an email.

Here’s a quick summary of the topics where we had consensus last night:

- The behavior of v1 sources needs to be documented to come up with a migration plan
- Spark 3.0 should include DSv2, even if it would delay the release (pending community discussion and vote)
- Design for the v2 catalog plugin system
- V2 catalog approach of separate TableCatalog, FunctionCatalog, and ViewCatalog interfaces
- Common v2 Table metadata should be schema, partitioning, and a string-map of properties; leaving out sorting for now. (Ready to vote on the metadata SPIP.)

*Topics*:

- Issues raised by ORC v2 commit
- Migration to v2 sources
- Roadmap and current blockers
- Catalog plugin system
- Catalog API separate interfaces approach
- Catalog API metadata (schema, partitioning, and properties)
- Public catalog API proposal

*Notes*:

- Issues raised by ORC v2 commit
  - Ryan: Disabled the change to use v2 by default in the PR for overwrite plans: tests rely on CTAS, which is not implemented in v2.
  - Wenchen: Suggested using a StagedTable to work around not having CTAS finished. TableProvider could create a staged table.
  - Ryan: Using StagedTable doesn’t make sense to me. It was intended to solve a different problem (atomicity). Adding an interface to create a staged table either requires the same metadata as CTAS or requires a blank staged table, which isn’t the same concept: these staged tables would behave entirely differently than the ones for atomic operations. Better to spend time getting CTAS done and work through the long-term plan than to hack around it.
  - Second issue raised by the ORC work: how to support tables that use different validations.
  - Ryan: What Gengliang’s PRs are missing is a clear definition of which tables require different validation and what that validation should be. In some cases, CTAS is validated against existing data [Ed: this is PreprocessTableCreation] and in some cases, Append has no validation because the table doesn’t exist. What isn’t clear is when these validations are applied.
  - Ryan: Without knowing exactly how v1 works, we can’t mirror that behavior in v2. Building a way to turn off validation is going to be needed, but is insufficient without knowing when to apply it.
  - Ryan: We also don’t know if it will make sense to maintain all of these rules to mimic v1 behavior. In v1, CTAS and Append can both write to existing tables, but use different rules to validate. What are the differences between them? It is unlikely that Spark will support both as options, if that is even possible. [Ed: see the later discussion on migration that continues this.]
  - Gengliang: Using SaveMode is an option.
  - Ryan: Using SaveMode only appears to fix this, but doesn’t actually test v2. Using SaveMode appears to work because it disables all validation and uses code from v1 that will “create” tables by writing. But this isn’t helpful for the v2 goal of having defined and reliable behavior.
  - Gengliang: SaveMode is not correctly translated. Append could mean AppendData or CTAS.
  - Ryan: This is why we need to focus on finishing the v2 plans: so we can correctly translate the SaveMode into the right plan. That depends on having a catalog for CTAS and to check the existence of a table.
  - Wenchen: The catalog doesn’t support path tables, so how does this help?
  - Ryan: The multi-catalog identifiers proposal includes a way to pass paths as CatalogIdentifiers. [Ed: see PathIdentifier] This allows a catalog implementation to handle path-based tables. The identifier will also have a method to test whether the identifier is a path identifier, and catalogs are not required to support path identifiers.
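[Ed: The path-identifier idea can be sketched minimally: an identifier either names a table in a catalog namespace or carries a path, and exposes a test so catalogs that don’t support path-based tables can reject it. This is a toy Python illustration only — the names and shapes here are assumptions, not the actual proposal, which defines Java interfaces.]

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Identifier:
    """Toy table identifier: either a (namespace, name) pair or a raw path."""
    namespace: Tuple[str, ...] = ()
    name: Optional[str] = None
    path: Optional[str] = None

    def is_path_identifier(self) -> bool:
        # Catalogs that do not support path-based tables can check this
        # and raise instead of trying to resolve the identifier.
        return self.path is not None


table_id = Identifier(namespace=("prod",), name="events")
path_id = Identifier(path="/data/warehouse/events")
```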
- Migration to v2 sources
  - Hyukjin: Once the ORC upgrade is done, how will we move from v1 to v2?
  - Ryan: We will need to develop v1 and v2 in parallel. There are many code paths in v1 and we don’t know exactly what they do. We first need to know what they do and make a migration plan after that.
  - Hyukjin: What if there are many behavior differences? Will this require an API to opt in for each one?
  - Ryan: Without knowing how v1 behaves, we can only speculate. But I don’t think that we will want to support many of these special cases. That is a lot of work and maintenance.
  - Gengliang: When can we change the default to v2? Until we change the default, v2 is not tested. The v2 work is blocked by this.
  - Ryan: v2 work should not be
Re: [VOTE] Release Apache Spark 2.4.1 (RC2)
I am cutting a new rc4 with fix from Felix. Thanks. Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 0359BC9965359766 On Thu, Feb 21, 2019 at 8:57 AM Felix Cheung wrote: > > I merged the fix to 2.4. > > > > From: Felix Cheung > Sent: Wednesday, February 20, 2019 9:34 PM > To: DB Tsai; Spark dev list > Cc: Cesar Delgado > Subject: Re: [VOTE] Release Apache Spark 2.4.1 (RC2) > > Could you hold for a bit - I have one more fix to get in > > > > From: d_t...@apple.com on behalf of DB Tsai > Sent: Wednesday, February 20, 2019 12:25 PM > To: Spark dev list > Cc: Cesar Delgado > Subject: Re: [VOTE] Release Apache Spark 2.4.1 (RC2) > > Okay. Let's fail rc2, and I'll prepare rc3 with SPARK-26859. > > DB Tsai | Siri Open Source Technologies [not a contribution] | Apple, Inc > > > On Feb 20, 2019, at 12:11 PM, Marcelo Vanzin > > wrote: > > > > Just wanted to point out that > > https://issues.apache.org/jira/browse/SPARK-26859 is not in this RC, > > and is marked as a correctness bug. (The fix is in the 2.4 branch, > > just not in rc2.) > > > > On Wed, Feb 20, 2019 at 12:07 PM DB Tsai wrote: > >> > >> Please vote on releasing the following candidate as Apache Spark version > >> 2.4.1. > >> > >> The vote is open until Feb 24 PST and passes if a majority +1 PMC votes > >> are cast, with > >> a minimum of 3 +1 votes. > >> > >> [ ] +1 Release this package as Apache Spark 2.4.1 > >> [ ] -1 Do not release this package because ... > >> > >> To learn more about Apache Spark, please see http://spark.apache.org/ > >> > >> The tag to be voted on is v2.4.1-rc2 (commit > >> 229ad524cfd3f74dd7aa5fc9ba841ae223caa960): > >> https://github.com/apache/spark/tree/v2.4.1-rc2 > >> > >> The release files, including signatures, digests, etc. 
can be found at: > >> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc2-bin/ > >> > >> Signatures used for Spark RCs can be found in this file: > >> https://dist.apache.org/repos/dist/dev/spark/KEYS > >> > >> The staging repository for this release can be found at: > >> https://repository.apache.org/content/repositories/orgapachespark-1299/ > >> > >> The documentation corresponding to this release can be found at: > >> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc2-docs/ > >> > >> The list of bug fixes going into 2.4.1 can be found at the following URL: > >> https://issues.apache.org/jira/projects/SPARK/versions/2.4.1 > >> > >> FAQ > >> > >> = > >> How can I help test this release? > >> = > >> > >> If you are a Spark user, you can help us test this release by taking > >> an existing Spark workload and running on this release candidate, then > >> reporting any regressions. > >> > >> If you're working in PySpark you can set up a virtual env and install > >> the current RC and see if anything important breaks, in the Java/Scala > >> you can add the staging repository to your projects resolvers and test > >> with the RC (make sure to clean up the artifact cache before/after so > >> you don't end up building with a out of date RC going forward). > >> > >> === > >> What should happen to JIRA tickets still targeting 2.4.1? > >> === > >> > >> The current list of open tickets targeted at 2.4.1 can be found at: > >> https://issues.apache.org/jira/projects/SPARK and search for "Target > >> Version/s" = 2.4.1 > >> > >> Committers should look at those and triage. Extremely important bug > >> fixes, documentation, and API tweaks that impact compatibility should > >> be worked on immediately. Everything else please retarget to an > >> appropriate release. > >> > >> == > >> But my bug isn't fixed? > >> == > >> > >> In order to make timely releases, we will typically not hold the > >> release unless the bug in question is a regression from the previous > >> release. 
That being said, if there is something which is a regression > >> that has not been correctly targeted please ping me or a committer to > >> help target the issue. > >> > >> > >> DB Tsai | Siri Open Source Technologies [not a contribution] | Apple, Inc > >> > >> > >> - > >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > >> > > > > > > -- > > Marcelo > > > > - > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > > > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
Re: [VOTE] Release Apache Spark 2.4.1 (RC2)
I merged the fix to 2.4. From: Felix Cheung Sent: Wednesday, February 20, 2019 9:34 PM To: DB Tsai; Spark dev list Cc: Cesar Delgado Subject: Re: [VOTE] Release Apache Spark 2.4.1 (RC2) Could you hold for a bit - I have one more fix to get in From: d_t...@apple.com on behalf of DB Tsai Sent: Wednesday, February 20, 2019 12:25 PM To: Spark dev list Cc: Cesar Delgado Subject: Re: [VOTE] Release Apache Spark 2.4.1 (RC2) Okay. Let's fail rc2, and I'll prepare rc3 with SPARK-26859. DB Tsai | Siri Open Source Technologies [not a contribution] | Apple, Inc > On Feb 20, 2019, at 12:11 PM, Marcelo Vanzin > wrote: > > Just wanted to point out that > https://issues.apache.org/jira/browse/SPARK-26859 is not in this RC, > and is marked as a correctness bug. (The fix is in the 2.4 branch, > just not in rc2.) > > On Wed, Feb 20, 2019 at 12:07 PM DB Tsai wrote: >> >> Please vote on releasing the following candidate as Apache Spark version >> 2.4.1. >> >> The vote is open until Feb 24 PST and passes if a majority +1 PMC votes are >> cast, with >> a minimum of 3 +1 votes. >> >> [ ] +1 Release this package as Apache Spark 2.4.1 >> [ ] -1 Do not release this package because ... >> >> To learn more about Apache Spark, please see http://spark.apache.org/ >> >> The tag to be voted on is v2.4.1-rc2 (commit >> 229ad524cfd3f74dd7aa5fc9ba841ae223caa960): >> https://github.com/apache/spark/tree/v2.4.1-rc2 >> >> The release files, including signatures, digests, etc. 
can be found at: >> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc2-bin/ >> >> Signatures used for Spark RCs can be found in this file: >> https://dist.apache.org/repos/dist/dev/spark/KEYS >> >> The staging repository for this release can be found at: >> https://repository.apache.org/content/repositories/orgapachespark-1299/ >> >> The documentation corresponding to this release can be found at: >> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc2-docs/ >> >> The list of bug fixes going into 2.4.1 can be found at the following URL: >> https://issues.apache.org/jira/projects/SPARK/versions/2.4.1 >> >> FAQ >> >> = >> How can I help test this release? >> = >> >> If you are a Spark user, you can help us test this release by taking >> an existing Spark workload and running on this release candidate, then >> reporting any regressions. >> >> If you're working in PySpark you can set up a virtual env and install >> the current RC and see if anything important breaks, in the Java/Scala >> you can add the staging repository to your projects resolvers and test >> with the RC (make sure to clean up the artifact cache before/after so >> you don't end up building with a out of date RC going forward). >> >> === >> What should happen to JIRA tickets still targeting 2.4.1? >> === >> >> The current list of open tickets targeted at 2.4.1 can be found at: >> https://issues.apache.org/jira/projects/SPARK and search for "Target >> Version/s" = 2.4.1 >> >> Committers should look at those and triage. Extremely important bug >> fixes, documentation, and API tweaks that impact compatibility should >> be worked on immediately. Everything else please retarget to an >> appropriate release. >> >> == >> But my bug isn't fixed? >> == >> >> In order to make timely releases, we will typically not hold the >> release unless the bug in question is a regression from the previous >> release. 
That being said, if there is something which is a regression >> that has not been correctly targeted please ping me or a committer to >> help target the issue. >> >> >> DB Tsai | Siri Open Source Technologies [not a contribution] | Apple, Inc >> >> >> - >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >> > > > -- > Marcelo > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
Re: Thoughts on dataframe cogroup?
I am wondering whether other people have opinions or use cases for cogroup?

On Wed, Feb 20, 2019 at 5:03 PM Li Jin wrote:

> Alessandro,
>
> Thanks for the reply. I assume by "equi-join", you mean "equality full
> outer join".
>
> Two issues I see with an equi outer join are:
> (1) an equi outer join will give n * m rows for each key (n and m being the
> corresponding number of rows in df1 and df2 for each key)
> (2) the user needs to do some extra processing to transform the n * m rows
> back to the desired shape (two sub-dataframes with n and m rows)
>
> I think a full outer join is an inefficient way to implement cogroup. If the
> end goal is to have two separate dataframes for each key, why join them
> first and then unjoin them?
>
> On Wed, Feb 20, 2019 at 5:52 AM Alessandro Solimando <
> alessandro.solima...@gmail.com> wrote:
>
>> Hello,
>> I fail to see how an equi-join on the key columns is different from the
>> cogroup you propose.
>>
>> I think the accepted answer can shed some light:
>> https://stackoverflow.com/questions/43960583/whats-the-difference-between-join-and-cogroup-in-apache-spark
>>
>> Now you apply a udf on each iterable, one per key value (obtained with
>> cogroup).
>>
>> You can achieve the same by:
>> 1) joining df1 and df2 on the key you want,
>> 2) applying "groupby" on that key,
>> 3) finally applying a udaf (you can have a look here if you are not
>> familiar with them:
>> https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html)
>> that will process each group "in isolation".
>>
>> HTH,
>> Alessandro
>>
>> On Tue, 19 Feb 2019 at 23:30, Li Jin wrote:
>>
>>> Hi,
>>>
>>> We have been using PySpark's groupby().apply() quite a bit and it has
>>> been very helpful in integrating Spark with our existing pandas-heavy
>>> libraries.
>>>
>>> Recently, we have found more and more cases where groupby().apply() is
>>> not sufficient - in some cases, we want to group two dataframes by the same
>>> key, and apply a function which takes two pd.DataFrames (and also returns a
>>> pd.DataFrame) for each key. This feels very much like the "cogroup"
>>> operation in the RDD API.
>>>
>>> It would be great to be able to do something like this (not the actual
>>> API, just to explain the use case):
>>>
>>> @pandas_udf(return_schema, ...)
>>> def my_udf(pdf1, pdf2):
>>>     # pdf1 and pdf2 are the subsets of the original dataframes
>>>     # associated with a particular key
>>>     result = ...  # some code that uses pdf1 and pdf2
>>>     return result
>>>
>>> df3 = cogroup(df1, df2, key='some_key').apply(my_udf)
>>>
>>> I have searched around the problem and some people have suggested
>>> joining the tables first. However, it's often not the same pattern, and it
>>> is hard to get it to work using joins.
>>>
>>> I wonder what people's thoughts are on this?
>>>
>>> Li
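[Ed: To make the n * m point in this thread concrete, here is a pure-Python sketch of cogroup semantics over lists of row-dicts — not the proposed PySpark API; the function name and data shapes are illustrative. Each key yields the two groups intact, where a full outer equi-join would yield their cross product.]

```python
from collections import defaultdict


def cogroup(left, right, key):
    """Group two lists of row-dicts by `key`; yield (key, left_rows, right_rows)."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[row[key]][0].append(row)
    for row in right:
        groups[row[key]][1].append(row)
    for k in sorted(groups):
        left_rows, right_rows = groups[k]
        yield k, left_rows, right_rows


df1 = [{"k": 1, "v": 10}, {"k": 1, "v": 11}, {"k": 2, "v": 20}]
df2 = [{"k": 1, "w": 100}, {"k": 1, "w": 101}, {"k": 1, "w": 102}]

# cogroup hands each side over whole: key 1 -> 2 left rows and 3 right rows.
# A full outer join on k would instead produce 2 * 3 = 6 combined rows for
# key 1, which the udf would then have to "unjoin" into the original shapes.
sizes = {k: (len(l), len(r)) for k, l, r in cogroup(df1, df2, "k")}
```

This is the efficiency argument in the thread: the join-then-groupby workaround materializes the cross product per key before the UDF can undo it, while cogroup passes the two per-key groups through directly.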
Re: [VOTE] Release Apache Spark 2.4.1 (RC2)
That looks like a change to restore some behavior that was removed in 2.2. It's not directly relevant to a release vote on 2.4.1. See the existing discussion at https://github.com/apache/spark/pull/22144#issuecomment-432258536

It may indeed be a good thing to change, but just continue the discussion as you work on your PR.

On Thu, Feb 21, 2019 at 9:09 AM Parth Gandhi wrote:
>
> Hello,
> In https://issues.apache.org/jira/browse/SPARK-24935, I am getting
> requests from people that they were hoping for the fix to be merged in Spark
> 2.4.1. The concerned PR is here: https://github.com/apache/spark/pull/23778.
> I do not mind if we do not merge it for 2.4.1, and I do not want this to be
> considered a blocker for 2.4.1, but I would appreciate it if somebody
> could have a look and decide on the same. I have not written unit tests for the
> PR yet, as I am relatively new to Catalyst and was hoping for someone to
> confirm whether the fix is progressing in the right direction before
> proceeding ahead with the unit tests. Thank you.
>
> Regards,
> Parth Kamlesh Gandhi
>
> On Wed, Feb 20, 2019 at 11:34 PM Felix Cheung wrote:
>>
>> Could you hold for a bit - I have one more fix to get in
>>
>> From: d_t...@apple.com on behalf of DB Tsai
>> Sent: Wednesday, February 20, 2019 12:25 PM
>> To: Spark dev list
>> Cc: Cesar Delgado
>> Subject: Re: [VOTE] Release Apache Spark 2.4.1 (RC2)
>>
>> Okay. Let's fail rc2, and I'll prepare rc3 with SPARK-26859.
>>
>> DB Tsai | Siri Open Source Technologies [not a contribution] | Apple, Inc
>>
>> > On Feb 20, 2019, at 12:11 PM, Marcelo Vanzin wrote:
>> >
>> > Just wanted to point out that
>> > https://issues.apache.org/jira/browse/SPARK-26859 is not in this RC,
>> > and is marked as a correctness bug. (The fix is in the 2.4 branch,
>> > just not in rc2.)
>> >
>> > On Wed, Feb 20, 2019 at 12:07 PM DB Tsai wrote:
>> >>
>> >> Please vote on releasing the following candidate as Apache Spark version
>> >> 2.4.1.
>> >>
>> >> The vote is open until Feb 24 PST and passes if a majority of +1 PMC votes are cast, with a minimum of 3 +1 votes.
>> >>
>> >> [ ] +1 Release this package as Apache Spark 2.4.1
>> >> [ ] -1 Do not release this package because ...
>> >>
>> >> To learn more about Apache Spark, please see http://spark.apache.org/
>> >>
>> >> The tag to be voted on is v2.4.1-rc2 (commit 229ad524cfd3f74dd7aa5fc9ba841ae223caa960):
>> >> https://github.com/apache/spark/tree/v2.4.1-rc2
>> >>
>> >> The release files, including signatures, digests, etc., can be found at:
>> >> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc2-bin/
>> >>
>> >> Signatures used for Spark RCs can be found in this file:
>> >> https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >>
>> >> The staging repository for this release can be found at:
>> >> https://repository.apache.org/content/repositories/orgapachespark-1299/
>> >>
>> >> The documentation corresponding to this release can be found at:
>> >> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc2-docs/
>> >>
>> >> The list of bug fixes going into 2.4.1 can be found at the following URL:
>> >> https://issues.apache.org/jira/projects/SPARK/versions/2.4.1
>> >>
>> >> FAQ
>> >>
>> >> =========================
>> >> How can I help test this release?
>> >> =========================
>> >>
>> >> If you are a Spark user, you can help us test this release by taking an existing Spark workload, running it on this release candidate, and reporting any regressions.
>> >>
>> >> If you're working in PySpark, you can set up a virtual env, install the current RC, and see if anything important breaks. In Java/Scala, you can add the staging repository to your project's resolvers and test with the RC (make sure to clean up the artifact cache before and after so you don't end up building with an out-of-date RC going forward).
>> >>
>> >> =========================
>> >> What should happen to JIRA tickets still targeting 2.4.1?
>> >> =========================
>> >>
>> >> The current list of open tickets targeted at 2.4.1 can be found by searching https://issues.apache.org/jira/projects/SPARK for "Target Version/s" = 2.4.1.
>> >>
>> >> Committers should look at those and triage. Extremely important bug fixes, documentation, and API tweaks that impact compatibility should be worked on immediately. Everything else, please retarget to an appropriate release.
>> >>
>> >> =========================
>> >> But my bug isn't fixed?
>> >> =========================
>> >>
>> >> In order to make timely releases, we will typically not hold the release unless the bug in question is a regression from the previous release. That being said, if there is a regression that has not been correctly targeted, please ping me or a committer to help target the issue.
>> >>
>> >> DB Tsai | Siri Open Source Technologies [not a contribution] | Apple, Inc
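The Java/Scala testing step quoted above ("add the staging repository to your project's resolvers") amounts to a build-configuration change. A minimal sbt sketch, using the staging-repository URL from the vote email; the dependency coordinates shown are the usual Spark ones and are an assumption, not verified against the staged repository:

```scala
// build.sbt fragment (sketch): resolve the 2.4.1 RC artifacts from the
// staging repository named in the vote email, then test your project
// against them. Clear ~/.ivy2/cache (or the coursier cache) before and
// after so you don't keep building against a stale RC.
resolvers += "Apache Spark 2.4.1 RC2 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1299/"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.1"
```

Since this is a build-file fragment rather than a runnable program, it is only a template for a local test project.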
Re: [VOTE] SPIP: Identifiers for multi-catalog Spark
+1. This is in the right direction. The resolution rules and catalog APIs will need more discussion when we implement them. At this stage, we can disallow creating catalogs at runtime: runtime creation would complicate name resolution in a multi-session environment. For example, when one user creates a catalog in one session, other users' queries might return different results because table names resolve differently. We might need to investigate how other systems deal with this.

Cheers,
Xiao

Takeshi Yamamuro wrote on Wed, Feb 20, 2019 at 12:54 AM:
> +1
>
> On Wed, Feb 20, 2019 at 4:59 PM JackyLee wrote:
>> +1
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> --
> ---
> Takeshi Yamamuro
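The multi-session hazard described above can be illustrated with a small, purely hypothetical sketch. None of these names (`Session`, `resolve`, the catalog maps) are the proposed Spark APIs; this only shows why per-session runtime catalog registration can make the same query text resolve to different tables in different sessions:

```scala
// Hypothetical sketch: each session holds its own catalog registry,
// and an unqualified name is resolved by searching catalogs in
// registration order. A catalog added at runtime in one session can
// then shadow a table that other sessions still see.
object CatalogResolutionSketch {
  final case class Table(fullName: String)

  // catalogs: catalog name -> (table name -> table), in registration order
  final class Session(catalogs: Map[String, Map[String, Table]]) {
    def resolve(name: String): Option[Table] =
      catalogs.values.flatMap(_.get(name)).headOption
  }

  def main(args: Array[String]): Unit = {
    val shared = Map("t" -> Table("hive.db.t"))
    val s1 = new Session(Map("hive" -> shared))
    // Session 2 registered an extra catalog at runtime whose "t" shadows
    // the shared one, so the same query text resolves differently.
    val s2 = new Session(Map(
      "extra" -> Map("t" -> Table("extra.db.t")),
      "hive"  -> shared))
    println(s1.resolve("t").map(_.fullName)) // Some(hive.db.t)
    println(s2.resolve("t").map(_.fullName)) // Some(extra.db.t)
  }
}
```

This is exactly the kind of divergence the message above warns about, and why disallowing runtime catalog creation (or scoping it carefully) keeps resolution stable across sessions.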