Re: Please take a look at the draft of the Spark 3.1.1 release notes
Hi Hyukjin, Thanks for your effort. One question: do you automatically update the URLs to the Spark documentation in the "Change of behavior" section? Currently they refer to "https://spark.apache.org/docs/3.0.0/...". I think that they should refer to "https://spark.apache.org/docs/3.1.1/...". Regards, Kazuaki Ishizaki From: Hyukjin Kwon To: dev Cc: Dongjoon Hyun , Jungtaek Lim , Tom Graves Date: 2021/03/02 11:20 Subject:Re: Please take a look at the draft of the Spark 3.1.1 release notes Thanks guys for suggestions and fixes. Now I feel pretty confident about the release notes :-). I will start uploading and preparing to announce Spark 3.1.1. On Tue, Mar 2, 2021 at 7:29 AM, Tom Graves wrote: Thanks Hyukjin, overall they look good to me. Tom On Saturday, February 27, 2021, 05:00:42 PM CST, Jungtaek Lim < kabhwan.opensou...@gmail.com> wrote: Thanks Hyukjin! I've only looked into the SS part, and added a comment. Otherwise it looks great! On Sat, Feb 27, 2021 at 7:12 PM Dongjoon Hyun wrote: Thank you for sharing, Hyukjin! Dongjoon. On Sat, Feb 27, 2021 at 12:36 AM Hyukjin Kwon wrote: Hi all, I am preparing to publish and announce Spark 3.1.1. This is the draft of the release note, and I plan to edit a bit more and use it as the final release note. Please take a look and let me know if I missed any major changes or something. https://docs.google.com/document/d/1x6zzgRsZ4u1DgUh1XpGzX914CZbsHeRYpbqZ-PV6wdQ/edit?usp=sharing Thanks.
Re: Spark on JDK 14
Java 16 will also include the Vector API (incubator), which is a part of Project Panama, as shown in https://mail.openjdk.java.net/pipermail/panama-dev/2020-October/011149.html When the next LTS becomes available, we could exploit it in Spark. Kazuaki Ishizaki From: Dongjoon Hyun To: Sean Owen Cc: dev Date: 2020/10/29 11:34 Subject:Re: Spark on JDK 14 Thank you for the sharing, Sean. Although Java 14 is already EOL (Sep. 2020), that is important information because we are tracking the Java upstream. Bests, Dongjoon. On Wed, Oct 28, 2020 at 1:44 PM Sean Owen wrote: For kicks, I tried Spark on JDK 14. 11 -> 14 doesn't change much, not as much as 8 -> 9 (-> 11), and indeed, virtually all tests pass. For the interested, these two seem to fail: - ZooKeeperPersistenceEngine *** FAILED *** org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /spark/master_status - parsing hour with various patterns *** FAILED *** java.time.format.DateTimeParseException: Text '2009-12-12 12 am' could not be parsed: Invalid value for HourOfAmPm (valid values 0 - 11): 12 I'd expect that most applications would just work now on Spark 3 + Java 14. I'd guess the same is true for Java 16 even, but, we're probably focused on the LTS releases. Kris Mok pointed out that Project Panama (in Java 17 maybe?) might have implications as it changes off-heap memory access.
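The "parsing hour with various patterns" failure above comes straight from java.time's field ranges: pattern letter 'K' maps to HourOfAmPm, which only accepts 0-11, while 'h' maps to clock-hour-of-am-pm, which accepts 1-12. The following is a minimal standalone sketch of that constraint, not the Spark test itself, and it does not claim to pin down exactly what changed in JDK 14; the object name and sample text are illustrative.

```scala
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoField
import java.util.Locale

object HourPatternSketch {
  def main(args: Array[String]): Unit = {
    // 'h' maps to CLOCK_HOUR_OF_AMPM (1-12), so "12 AM" parses and
    // resolves to hour-of-day 0 (midnight).
    val clockHour = DateTimeFormatter.ofPattern("h a", Locale.US)
    println(clockHour.parse("12 AM").get(ChronoField.HOUR_OF_DAY)) // 0

    // 'K' maps to HOUR_OF_AMPM (0-11), so the same text fails with
    // "Invalid value for HourOfAmPm (valid values 0 - 11): 12".
    val hourOfAmPm = DateTimeFormatter.ofPattern("K a", Locale.US)
    hourOfAmPm.parse("12 AM") // throws java.time.format.DateTimeParseException
  }
}
```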
RE: [DISCUSS] naming policy of Spark configs
+1 if we add them to Alternative config. Kazuaki Ishizaki From: Takeshi Yamamuro To: Wenchen Fan Cc: Spark dev list Date: 2020/02/13 16:02 Subject:[EXTERNAL] Re: [DISCUSS] naming policy of Spark configs +1; the idea sounds reasonable. Bests, Takeshi On Thu, Feb 13, 2020 at 12:39 PM Wenchen Fan wrote: Hi Dongjoon, It's too much work to revisit all the configs that added in 3.0, but I'll revisit the recent commits that update config names and see if they follow the new policy. Hi Reynold, There are a few interval configs: spark.sql.streaming.fileSink.log.compactInterval spark.sql.streaming.continuous.executorPollIntervalMs I think it's better to put the interval unit in the config name, like `executorPollIntervalMs`. Also the config should be created with `.timeConf`, so that users can set values like "1 second", "2 minutes", etc. There is no config that uses date/timestamp as value AFAIK. Thanks, Wenchen On Thu, Feb 13, 2020 at 11:29 AM Jungtaek Lim < kabhwan.opensou...@gmail.com> wrote: +1 Thanks for the proposal. Looks very reasonable to me. On Thu, Feb 13, 2020 at 10:53 AM Hyukjin Kwon wrote: +1. 2020년 2월 13일 (목) 오전 9:30, Gengliang Wang < gengliang.w...@databricks.com>님이 작성: +1, this is really helpful. We should make the SQL configurations consistent and more readable. On Wed, Feb 12, 2020 at 3:33 PM Rubén Berenguel wrote: I love it, it will make configs easier to read and write. Thanks Wenchen. R On 13 Feb 2020, at 00:15, Dongjoon Hyun wrote: Thank you, Wenchen. The new policy looks clear to me. +1 for the explicit policy. So, are we going to revise the existing conf names before 3.0.0 release? Or, is it applied to new up-coming configurations from now? Bests, Dongjoon. On Wed, Feb 12, 2020 at 7:43 AM Wenchen Fan wrote: Hi all, I'd like to discuss the naming policy of Spark configs, as for now it depends on personal preference which leads to inconsistent namings. In general, the config name should be a noun that describes its meaning clearly. Good examples: spark.sql.session.timeZone spark.sql.streaming.continuous.executorQueueSize spark.sql.statistics.histogram.numBins Bad examples: spark.sql.defaultSizeInBytes (default size for what?) Also note that, config name has many parts, joined by dots. Each part is a namespace. Don't create namespace unnecessarily. Good example: spark.sql.execution.rangeExchange.sampleSizePerPartition spark.sql.execution.arrow.maxRecordsPerBatch Bad examples: spark.sql.windowExec.buffer.in.memory.threshold ("in" is not a useful namespace, better to be .buffer.inMemoryThreshold) For a big feature, usually we need to create an umbrella config to turn it on/off, and other configs for fine-grained controls. These configs should share the same namespace, and the umbrella config should be named like featureName.enabled. For example: spark.sql.cbo.enabled spark.sql.cbo.starSchemaDetection spark.sql.cbo.starJoinFTRatio spark.sql.cbo.joinReorder.enabled spark.sql.cbo.joinReorder.dp.threshold (BTW "dp" is not a good namespace ) spark.sql.cbo.joinReorder.card.weight (BTW "card" is not a good namespace ) For boolean configs, in general it should end with a verb, e.g. spark.sql.join.preferSortMergeJoin. If the config is for a feature and you can't find a good verb for the feature, featureName.enabled is also good. I'll update https://spark.apache.org/contributing.html after we reach a consensus here. Any comments are welcome! Thanks, Wenchen -- --- Takeshi Yamamuro
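For the interval configs discussed above, a declaration along these lines would cover both suggestions: keep the unit suffix in the name and declare it with `.timeConf` so users can also write values like "1 second" or "2 minutes". This is only a sketch against Spark's internal ConfigBuilder API as I understand it (such definitions live inside Spark's own source tree); the constant name, doc string, and default are illustrative rather than the actual SQLConf entry.

```scala
import java.util.concurrent.TimeUnit
import org.apache.spark.internal.config.ConfigBuilder

// Hypothetical declaration following the proposed naming policy:
// noun-style name, time unit kept in the name, and a time-typed config
// so values such as "100ms" or "2 minutes" are accepted.
val EXECUTOR_POLL_INTERVAL =
  ConfigBuilder("spark.sql.streaming.continuous.executorPollIntervalMs")
    .doc("Interval at which the continuous execution engine polls executors.")
    .timeConf(TimeUnit.MILLISECONDS)
    .createWithDefaultString("100ms")
```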
[ANNOUNCE] Announcing Apache Spark 2.3.4
We are happy to announce the availability of Spark 2.3.4! Spark 2.3.4 is a maintenance release containing stability fixes. This release is based on the branch-2.3 maintenance branch of Spark. We strongly recommend all 2.3.x users to upgrade to this stable release. To download Spark 2.3.4, head over to the download page: http://spark.apache.org/downloads.html To view the release notes: https://spark.apache.org/releases/spark-release-2-3-4.html We would like to acknowledge all community members for contributing to this release. This release would not have been possible without you. Kazuaki Ishizaki
Re: Welcoming some new committers and PMC members
Congrats! Well deserved. Kazuaki Ishizaki, From: Matei Zaharia To: dev Date: 2019/09/10 09:32 Subject:[EXTERNAL] Welcoming some new committers and PMC members Hi all, The Spark PMC recently voted to add several new committers and one PMC member. Join me in welcoming them to their new roles! New PMC member: Dongjoon Hyun New committers: Ryan Blue, Liang-Chi Hsieh, Gengliang Wang, Yuming Wang, Weichen Xu, Ruifeng Zheng The new committers cover lots of important areas including ML, SQL, and data sources, so it’s great to have them here. All the best, Matei and the Spark PMC - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
[VOTE][RESULT] Spark 2.3.4 (RC1)
Hi, All. The vote passes. Thanks to all who helped with this release 2.3.4 (the final 2.3.x)! I'll follow up later with a release announcement once everything is published. +1 (* = binding): Sean Owen* Dongjoon Hyun DB Tsai* Wenchen Fan* Marco Gaido Shrinidhi kanchi John Zhuge Marcelo Vanzin* +0: None -1: None Regards, Kazuaki Ishizaki
RE: [VOTE] Release Apache Spark 2.4.4 (RC3)
+1 Built and tested with `mvn -Pyarn -Phadoop-2.7 -Pkubernetes -Pkinesis-asl -Phive -Phive-thriftserver test` on OpenJDK 1.8.0_211 on Ubuntu 16.04 x86_64 Regards, Kazuaki Ishizaki From: Dongjoon Hyun To: dev Date: 2019/08/28 12:14 Subject:[EXTERNAL] Re: [VOTE] Release Apache Spark 2.4.4 (RC3) +1. - Checked checksums and signatures of artifacts. - Checked to have all binaries and maven repo. - Checked document generation (including a new change after RC2) - Build with `-Pyarn -Pmesos -Pkubernetes -Phive -Phive-thriftserver -Phadoop-2.6` on AdoptOpenJDK8_202. - Tested with both Scala-2.11/Scala-2.12 and both Python2/3. Python 2.7.15 with numpy 1.16.4, scipy 1.2.2, pandas 0.19.2, pyarrow 0.8.0 Python 3.6.4 with numpy 1.16.4, scipy 1.2.2, pandas 0.23.2, pyarrow 0.11.0 - Tested JDBC IT. Bests, Dongjoon. On Tue, Aug 27, 2019 at 4:05 PM Dongjoon Hyun wrote: Please vote on releasing the following candidate as Apache Spark version 2.4.4. The vote is open until August 30th 5PM PST and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1 Release this package as Apache Spark 2.4.4 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ The tag to be voted on is v2.4.4-rc3 (commit 7955b3962ac46b89564e0613db7bea98a1478bf2): https://github.com/apache/spark/tree/v2.4.4-rc3 The release files, including signatures, digests, etc. can be found at: https://dist.apache.org/repos/dist/dev/spark/v2.4.4-rc3-bin/ Signatures used for Spark RCs can be found in this file: https://dist.apache.org/repos/dist/dev/spark/KEYS The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1332/ The documentation corresponding to this release can be found at: https://dist.apache.org/repos/dist/dev/spark/v2.4.4-rc3-docs/ The list of bug fixes going into 2.4.4 can be found at the following URL: https://issues.apache.org/jira/projects/SPARK/versions/12345466 This release is using the release script of the tag v2.4.4-rc3. FAQ = How can I help test this release? = If you are a Spark user, you can help us test this release by taking an existing Spark workload and running on this release candidate, then reporting any regressions. If you're working in PySpark you can set up a virtual env and install the current RC and see if anything important breaks, in the Java/Scala you can add the staging repository to your projects resolvers and test with the RC (make sure to clean up the artifact cache before/after so you don't end up building with a out of date RC going forward). === What should happen to JIRA tickets still targeting 2.4.4? === The current list of open tickets targeted at 2.4.4 can be found at: https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" = 2.4.4 Committers should look at those and triage. Extremely important bug fixes, documentation, and API tweaks that impact compatibility should be worked on immediately. Everything else please retarget to an appropriate release. == But my bug isn't fixed? == In order to make timely releases, we will typically not hold the release unless the bug in question is a regression from the previous release. That being said, if there is something which is a regression that has not been correctly targeted please ping me or a committer to help target the issue.
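For the Java/Scala path in the FAQ ("add the staging repository to your projects resolvers"), a minimal sbt sketch is below: it points a build at the staging repository from this vote and depends on the RC artifacts, which are published under the final version number. The module chosen (spark-sql) is just an example.

```scala
// build.sbt -- test an application build against the 2.4.4 RC3 staging artifacts.
resolvers += "Apache Spark 2.4.4 RC3 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1332/"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4"
```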
Re: [VOTE] Release Apache Spark 2.3.4 (RC1)
Thank you for pointing out the problem. The characters and hyperlink point different URLs. Could you please access https://repository.apache.org/content/repositories/orgapachespark-1331/ as you see characters? Sorry for your inconvenience. Kazuaki Ishizaki, From: Takeshi Yamamuro To: Kazuaki Ishizaki Cc: Apache Spark Dev Date: 2019/08/27 08:49 Subject:Re: [VOTE] Release Apache Spark 2.3.4 (RC1) Hi, Thanks for the release manage! It seems the staging repository has not been exposed yet? https://repository.apache.org/content/repositories/orgapachespark-1328/ On Tue, Aug 27, 2019 at 5:28 AM Kazuaki Ishizaki wrote: Please vote on releasing the following candidate as Apache Spark version 2.3.4. The vote is open until August 29th 2PM PST and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1 Release this package as Apache Spark 2.3.4 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see https://spark.apache.org/ The tag to be voted on is v2.3.4-rc1 (commit 8c6f8150f3c6298ff4e1c7e06028f12d7eaf0210): https://github.com/apache/spark/tree/v2.3.4-rc1 The release files, including signatures, digests, etc. can be found at: https://dist.apache.org/repos/dist/dev/spark/v2.3.4-rc1-bin/ Signatures used for Spark RCs can be found in this file: https://dist.apache.org/repos/dist/dev/spark/KEYS The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1331/ The documentation corresponding to this release can be found at: https://dist.apache.org/repos/dist/dev/spark/v2.3.4-rc1-docs/ The list of bug fixes going into 2.3.4 can be found at the following URL: https://issues.apache.org/jira/projects/SPARK/versions/12344844 FAQ = How can I help test this release? = If you are a Spark user, you can help us test this release by taking an existing Spark workload and running on this release candidate, then reporting any regressions. If you're working in PySpark you can set up a virtual env and install the current RC and see if anything important breaks, in the Java/Scala you can add the staging repository to your projects resolvers and test with the RC (make sure to clean up the artifact cache before/after so you don't end up building with a out of date RC going forward). === What should happen to JIRA tickets still targeting 2.3.4? === The current list of open tickets targeted at 2.3.4 can be found at: https://issues.apache.org/jira/projects/SPARKand search for "Target Version/s" = 2.3.4 Committers should look at those and triage. Extremely important bug fixes, documentation, and API tweaks that impact compatibility should be worked on immediately. Everything else please retarget to an appropriate release. == But my bug isn't fixed? == In order to make timely releases, we will typically not hold the release unless the bug in question is a regression from the previous release. That being said, if there is something which is a regression that has not been correctly targeted please ping me or a committer to help target the issue. -- --- Takeshi Yamamuro
[VOTE] Release Apache Spark 2.3.4 (RC1)
Please vote on releasing the following candidate as Apache Spark version 2.3.4. The vote is open until August 29th 2PM PST and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1 Release this package as Apache Spark 2.3.4 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see https://spark.apache.org/ The tag to be voted on is v2.3.4-rc1 (commit 8c6f8150f3c6298ff4e1c7e06028f12d7eaf0210): https://github.com/apache/spark/tree/v2.3.4-rc1 The release files, including signatures, digests, etc. can be found at: https://dist.apache.org/repos/dist/dev/spark/v2.3.4-rc1-bin/ Signatures used for Spark RCs can be found in this file: https://dist.apache.org/repos/dist/dev/spark/KEYS The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1331/ The documentation corresponding to this release can be found at: https://dist.apache.org/repos/dist/dev/spark/v2.3.4-rc1-docs/ The list of bug fixes going into 2.3.4 can be found at the following URL: https://issues.apache.org/jira/projects/SPARK/versions/12344844 FAQ = How can I help test this release? = If you are a Spark user, you can help us test this release by taking an existing Spark workload and running on this release candidate, then reporting any regressions. If you're working in PySpark you can set up a virtual env and install the current RC and see if anything important breaks, in the Java/Scala you can add the staging repository to your projects resolvers and test with the RC (make sure to clean up the artifact cache before/after so you don't end up building with a out of date RC going forward). === What should happen to JIRA tickets still targeting 2.3.4? === The current list of open tickets targeted at 2.3.4 can be found at: https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" = 2.3.4 Committers should look at those and triage. Extremely important bug fixes, documentation, and API tweaks that impact compatibility should be worked on immediately. Everything else please retarget to an appropriate release. == But my bug isn't fixed? == In order to make timely releases, we will typically not hold the release unless the bug in question is a regression from the previous release. That being said, if there is something which is a regression that has not been correctly targeted please ping me or a committer to help target the issue.
RE: Release Spark 2.3.4
The following PRs regarding SPARK-28699 have been merged into branch-2.3. https://github.com/apache/spark/pull/25491 https://github.com/apache/spark/pull/25498 -> https://github.com/apache/spark/pull/25508 (backport to 2.3) I will cut the `2.3.4-rc1` tag during the weekend and start the 2.3.4 RC1 vote next Monday. Regards, Kazuaki Ishizaki From: "Kazuaki Ishizaki" To: "Kazuaki Ishizaki" Cc: Dilip Biswal , dev , Hyukjin Kwon , jzh...@apache.org, Takeshi Yamamuro , Xiao Li Date: 2019/08/20 13:12 Subject:[EXTERNAL] RE: Release Spark 2.3.4 Due to the recent correctness issue at SPARK-28699, I will delay the release for Spark 2.3.4 RC1 for a while. https://issues.apache.org/jira/browse/SPARK-28699?focusedCommentId=16910859&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16910859 Regards, Kazuaki Ishizaki From:"Kazuaki Ishizaki" To:Hyukjin Kwon Cc:Dilip Biswal , dev , jzh...@apache.org, Takeshi Yamamuro , Xiao Li Date:2019/08/19 11:17 Subject:[EXTERNAL] RE: Release Spark 2.3.4 Hi all, Thank you. I will prepare RC for 2.3.4 this week in parallel. It will be in parallel with RC for 2.4.4 managed by Dongjoon. Regards, Kazuaki Ishizaki From:Hyukjin Kwon To:Dilip Biswal Cc: jzh...@apache.org, dev , Kazuaki Ishizaki , Takeshi Yamamuro , Xiao Li Date:2019/08/17 16:37 Subject:[EXTERNAL] Re: Release Spark 2.3.4 +1 too On Sat, Aug 17, 2019 at 3:06 PM, Dilip Biswal wrote: +1 Regards, Dilip Biswal Tel: 408-463-4980 dbis...@us.ibm.com - Original message - From: John Zhuge To: Xiao Li Cc: Takeshi Yamamuro , Spark dev list < dev@spark.apache.org>, Kazuaki Ishizaki Subject: [EXTERNAL] Re: Release Spark 2.3.4 Date: Fri, Aug 16, 2019 4:33 PM +1 On Fri, Aug 16, 2019 at 4:25 PM Xiao Li wrote: +1 On Fri, Aug 16, 2019 at 4:11 PM Takeshi Yamamuro wrote: +1, too Bests, Takeshi On Sat, Aug 17, 2019 at 7:25 AM Dongjoon Hyun wrote: +1 for 2.3.4 release as the last release for `branch-2.3` EOL. Also, +1 for next week release. Bests, Dongjoon. On Fri, Aug 16, 2019 at 8:19 AM Sean Owen wrote: I think it's fine to do these in parallel, yes. Go ahead if you are willing. On Fri, Aug 16, 2019 at 9:48 AM Kazuaki Ishizaki wrote: > > Hi, All. > > Spark 2.3.3 was released six months ago (15th February, 2019) at http://spark.apache.org/news/spark-2-3-3-released.html. And, about 18 months have been passed after Spark 2.3.0 has been released (28th February, 2018). > As of today (16th August), there are 103 commits (69 JIRAs) in `branch-23` since 2.3.3. > > It would be great if we can have Spark 2.3.4. > If it is ok, shall we start `2.3.4 RC1` concurrent with 2.4.4 or after 2.4.4 will be released? > > A issue list in jira: https://issues.apache.org/jira/projects/SPARK/versions/12344844 > A commit list in github from the last release: https://github.com/apache/spark/compare/66fd9c34bf406a4b5f86605d06c9607752bd637a...branch-2.3 > The 8 correctness issues resolved in branch-2.3: > https://issues.apache.org/jira/browse/SPARK-26873?jql=project%20%3D%2012315420%20AND%20fixVersion%20%3D%2012344844%20AND%20labels%20in%20(%27correctness%27)%20ORDER%20BY%20priority%20DESC%2C%20key%20ASC > > Best Regards, > Kazuaki Ishizaki - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org -- --- Takeshi Yamamuro -- -- John Zhuge - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
RE: Release Spark 2.3.4
Due to the recent correctness issue at SPARK-28699, I will delay the release for Spark 2.3.4 RC1 for a while. https://issues.apache.org/jira/browse/SPARK-28699?focusedCommentId=16910859&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16910859 Regards, Kazuaki Ishizaki From: "Kazuaki Ishizaki" To: Hyukjin Kwon Cc: Dilip Biswal , dev , jzh...@apache.org, Takeshi Yamamuro , Xiao Li Date: 2019/08/19 11:17 Subject:[EXTERNAL] RE: Release Spark 2.3.4 Hi all, Thank you. I will prepare RC for 2.3.4 this week in parallel. It will be in parallel with RC for 2.4.4 managed by Dongjoon. Regards, Kazuaki Ishizaki From:Hyukjin Kwon To:Dilip Biswal Cc:jzh...@apache.org, dev , Kazuaki Ishizaki , Takeshi Yamamuro , Xiao Li Date:2019/08/17 16:37 Subject:[EXTERNAL] Re: Release Spark 2.3.4 +1 too 2019년 8월 17일 (토) 오후 3:06, Dilip Biswal 님이 작성 : +1 Regards, Dilip Biswal Tel: 408-463-4980 dbis...@us.ibm.com - Original message - From: John Zhuge To: Xiao Li Cc: Takeshi Yamamuro , Spark dev list < dev@spark.apache.org>, Kazuaki Ishizaki Subject: [EXTERNAL] Re: Release Spark 2.3.4 Date: Fri, Aug 16, 2019 4:33 PM +1 On Fri, Aug 16, 2019 at 4:25 PM Xiao Li wrote: +1 On Fri, Aug 16, 2019 at 4:11 PM Takeshi Yamamuro wrote: +1, too Bests, Takeshi On Sat, Aug 17, 2019 at 7:25 AM Dongjoon Hyun wrote: +1 for 2.3.4 release as the last release for `branch-2.3` EOL. Also, +1 for next week release. Bests, Dongjoon. On Fri, Aug 16, 2019 at 8:19 AM Sean Owen wrote: I think it's fine to do these in parallel, yes. Go ahead if you are willing. On Fri, Aug 16, 2019 at 9:48 AM Kazuaki Ishizaki wrote: > > Hi, All. > > Spark 2.3.3 was released six months ago (15th February, 2019) at http://spark.apache.org/news/spark-2-3-3-released.html. And, about 18 months have been passed after Spark 2.3.0 has been released (28th February, 2018). > As of today (16th August), there are 103 commits (69 JIRAs) in `branch-23` since 2.3.3. > > It would be great if we can have Spark 2.3.4. > If it is ok, shall we start `2.3.4 RC1` concurrent with 2.4.4 or after 2.4.4 will be released? > > A issue list in jira: https://issues.apache.org/jira/projects/SPARK/versions/12344844 > A commit list in github from the last release: https://github.com/apache/spark/compare/66fd9c34bf406a4b5f86605d06c9607752bd637a...branch-2.3 > The 8 correctness issues resolved in branch-2.3: > https://issues.apache.org/jira/browse/SPARK-26873?jql=project%20%3D%2012315420%20AND%20fixVersion%20%3D%2012344844%20AND%20labels%20in%20(%27correctness%27)%20ORDER%20BY%20priority%20DESC%2C%20key%20ASC > > Best Regards, > Kazuaki Ishizaki - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org -- --- Takeshi Yamamuro -- -- John Zhuge - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
RE: Release Spark 2.3.4
Hi all, Thank you. I will prepare RC for 2.3.4 this week in parallel. It will be in parallel with RC for 2.4.4 managed by Dongjoon. Regards, Kazuaki Ishizaki From: Hyukjin Kwon To: Dilip Biswal Cc: jzh...@apache.org, dev , Kazuaki Ishizaki , Takeshi Yamamuro , Xiao Li Date: 2019/08/17 16:37 Subject:[EXTERNAL] Re: Release Spark 2.3.4 +1 too 2019년 8월 17일 (토) 오후 3:06, Dilip Biswal 님이 작성 : +1 Regards, Dilip Biswal Tel: 408-463-4980 dbis...@us.ibm.com - Original message - From: John Zhuge To: Xiao Li Cc: Takeshi Yamamuro , Spark dev list < dev@spark.apache.org>, Kazuaki Ishizaki Subject: [EXTERNAL] Re: Release Spark 2.3.4 Date: Fri, Aug 16, 2019 4:33 PM +1 On Fri, Aug 16, 2019 at 4:25 PM Xiao Li wrote: +1 On Fri, Aug 16, 2019 at 4:11 PM Takeshi Yamamuro wrote: +1, too Bests, Takeshi On Sat, Aug 17, 2019 at 7:25 AM Dongjoon Hyun wrote: +1 for 2.3.4 release as the last release for `branch-2.3` EOL. Also, +1 for next week release. Bests, Dongjoon. On Fri, Aug 16, 2019 at 8:19 AM Sean Owen wrote: I think it's fine to do these in parallel, yes. Go ahead if you are willing. On Fri, Aug 16, 2019 at 9:48 AM Kazuaki Ishizaki wrote: > > Hi, All. > > Spark 2.3.3 was released six months ago (15th February, 2019) at http://spark.apache.org/news/spark-2-3-3-released.html. And, about 18 months have been passed after Spark 2.3.0 has been released (28th February, 2018). > As of today (16th August), there are 103 commits (69 JIRAs) in `branch-23` since 2.3.3. > > It would be great if we can have Spark 2.3.4. > If it is ok, shall we start `2.3.4 RC1` concurrent with 2.4.4 or after 2.4.4 will be released? > > A issue list in jira: https://issues.apache.org/jira/projects/SPARK/versions/12344844 > A commit list in github from the last release: https://github.com/apache/spark/compare/66fd9c34bf406a4b5f86605d06c9607752bd637a...branch-2.3 > The 8 correctness issues resolved in branch-2.3: > https://issues.apache.org/jira/browse/SPARK-26873?jql=project%20%3D%2012315420%20AND%20fixVersion%20%3D%2012344844%20AND%20labels%20in%20(%27correctness%27)%20ORDER%20BY%20priority%20DESC%2C%20key%20ASC > > Best Regards, > Kazuaki Ishizaki - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org -- --- Takeshi Yamamuro -- -- John Zhuge - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
Release Spark 2.3.4
Hi, All. Spark 2.3.3 was released six months ago (15th February, 2019) at http://spark.apache.org/news/spark-2-3-3-released.html. And about 18 months have passed since Spark 2.3.0 was released (28th February, 2018). As of today (16th August), there are 103 commits (69 JIRAs) in `branch-2.3` since 2.3.3. It would be great if we can have Spark 2.3.4. If it is ok, shall we start `2.3.4 RC1` concurrently with 2.4.4, or after 2.4.4 is released? An issue list in JIRA: https://issues.apache.org/jira/projects/SPARK/versions/12344844 A commit list in GitHub since the last release: https://github.com/apache/spark/compare/66fd9c34bf406a4b5f86605d06c9607752bd637a...branch-2.3 The 8 correctness issues resolved in branch-2.3: https://issues.apache.org/jira/browse/SPARK-26873?jql=project%20%3D%2012315420%20AND%20fixVersion%20%3D%2012344844%20AND%20labels%20in%20(%27correctness%27)%20ORDER%20BY%20priority%20DESC%2C%20key%20ASC Best Regards, Kazuaki Ishizaki
Re: Release Apache Spark 2.4.4
Sure, I will launch a separate e-mail thread for discussing 2.3.4 later. Regards, Kazuaki Ishizaki, Ph.D. From: Dongjoon Hyun To: Sean Owen , Kazuaki Ishizaki Cc: dev Date: 2019/08/16 05:10 Subject:[EXTERNAL] Re: Release Apache Spark 2.4.4 +1 for that. Kazuaki volunteered for 2.3.4 release last month. AFAIK, he has been preparing that. - https://lists.apache.org/thread.html/6fafeefb7715e8764ccfe5d30c90d7444378b5f4f383ec95e2f1d7de@%3Cdev.spark.apache.org%3E I believe we can handle them after 2.4.4 RC1 (or concurrently.) Hi, Kazuaki. Could you start a separate email thread for 2.3.4 release? Bests, Dongjoon. On Thu, Aug 15, 2019 at 8:43 AM Sean Owen wrote: While we're on the topic: In theory, branch 2.3 is meant to be unsupported as of right about now. There are 69 fixes in branch 2.3 since 2.3.3 was released in Februrary: https://issues.apache.org/jira/projects/SPARK/versions/12344844 Some look moderately important. Should we also, or first, cut 2.3.4 to end the 2.3.x line? On Tue, Aug 13, 2019 at 6:16 PM Dongjoon Hyun wrote: > > Hi, All. > > Spark 2.4.3 was released three months ago (8th May). > As of today (13th August), there are 112 commits (75 JIRAs) in `branch-24` since 2.4.3. > > It would be great if we can have Spark 2.4.4. > Shall we start `2.4.4 RC1` next Monday (19th August)? > > Last time, there was a request for K8s issue and now I'm waiting for SPARK-27900. > Please let me know if there is another issue. > > Thanks, > Dongjoon.
RE: Release Apache Spark 2.4.4
Thanks, Dongjoon! +1 Kazuaki Ishizaki, From: Hyukjin Kwon To: Takeshi Yamamuro Cc: Dongjoon Hyun , dev , User Date: 2019/08/14 09:21 Subject:[EXTERNAL] Re: Release Apache Spark 2.4.4 +1 2019년 8월 14일 (수) 오전 9:13, Takeshi Yamamuro 님 이 작성: Hi, Thanks for your notification, Dongjoon! I put some links for the other committers/PMCs to access the info easily: A commit list in github from the last release: https://github.com/apache/spark/compare/5ac2014e6c118fbeb1fe8e5c8064c4a8ee9d182a...branch-2.4 A issue list in jira: https://issues.apache.org/jira/projects/SPARK/versions/12345466#release-report-tab-body The 5 correctness issues resolved in branch-2.4: https://issues.apache.org/jira/browse/SPARK-27798?jql=project%20%3D%2012315420%20AND%20fixVersion%20%3D%2012345466%20AND%20labels%20in%20(%27correctness%27)%20ORDER%20BY%20priority%20DESC%2C%20key%20ASC Anyway, +1 Best, Takeshi On Wed, Aug 14, 2019 at 8:25 AM DB Tsai wrote: +1 On Tue, Aug 13, 2019 at 4:16 PM Dongjoon Hyun wrote: > > Hi, All. > > Spark 2.4.3 was released three months ago (8th May). > As of today (13th August), there are 112 commits (75 JIRAs) in `branch-24` since 2.4.3. > > It would be great if we can have Spark 2.4.4. > Shall we start `2.4.4 RC1` next Monday (19th August)? > > Last time, there was a request for K8s issue and now I'm waiting for SPARK-27900. > Please let me know if there is another issue. > > Thanks, > Dongjoon. - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org -- --- Takeshi Yamamuro
Re: Re: Release Apache Spark 2.4.4 before 3.0.0
Thank you Dongjoon for being a release manager. If the assumed dates are ok, I would like to volunteer for an 2.3.4 release manager. Best Regards, Kazuaki Ishizaki, From: Dongjoon Hyun To: dev , "user @spark" , Apache Spark PMC Date: 2019/07/13 07:18 Subject:[EXTERNAL] Re: Release Apache Spark 2.4.4 before 3.0.0 Thank you, Jacek. BTW, I added `@private` since we need PMC's help to make an Apache Spark release. Can I get more feedbacks from the other PMC members? Please me know if you have any concerns (e.g. Release date or Release manager?) As one of the community members, I assumed the followings (if we are on schedule). - 2.4.4 at the end of July - 2.3.4 at the end of August (since 2.3.0 was released at the end of February 2018) - 3.0.0 (possibily September?) - 3.1.0 (January 2020?) Bests, Dongjoon. On Thu, Jul 11, 2019 at 1:30 PM Jacek Laskowski wrote: Hi, Thanks Dongjoon Hyun for stepping up as a release manager! Much appreciated. If there's a volunteer to cut a release, I'm always to support it. In addition, the more frequent releases the better for end users so they have a choice to upgrade and have all the latest fixes or wait. It's their call not ours (when we'd keep them waiting). My big 2 yes'es for the release! Jacek On Tue, 9 Jul 2019, 18:15 Dongjoon Hyun, wrote: Hi, All. Spark 2.4.3 was released two months ago (8th May). As of today (9th July), there exist 45 fixes in `branch-2.4` including the following correctness or blocker issues. - SPARK-26038 Decimal toScalaBigInt/toJavaBigInteger not work for decimals not fitting in long - SPARK-26045 Error in the spark 2.4 release package with the spark-avro_2.11 dependency - SPARK-27798 from_avro can modify variables in other rows in local mode - SPARK-27907 HiveUDAF should return NULL in case of 0 rows - SPARK-28157 Make SHS clear KVStore LogInfo for the blacklist entries - SPARK-28308 CalendarInterval sub-second part should be padded before parsing It would be great if we can have Spark 2.4.4 before we are going to get busier for 3.0.0. If it's okay, I'd like to volunteer for an 2.4.4 release manager to roll it next Monday. (15th July). How do you think about this? Bests, Dongjoon.
Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support
+1 (non-binding) Kazuaki Ishizaki From: Bryan Cutler To: Bobby Evans Cc: Thomas graves , Spark dev list Date: 2019/05/09 03:20 Subject:Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support +1 (non-binding) On Tue, May 7, 2019 at 12:04 PM Bobby Evans wrote: I am +! On Tue, May 7, 2019 at 1:37 PM Thomas graves wrote: Hi everyone, I'd like to call for another vote on SPARK-27396 - SPIP: Public APIs for extended Columnar Processing Support. The proposal is to extend the support to allow for more columnar processing. We had previous vote and discussion threads and have updated the SPIP based on the comments to clarify a few things and reduce the scope. You can find the updated proposal in the jira at: https://issues.apache.org/jira/browse/SPARK-27396. Please vote as early as you can, I will leave the vote open until next Monday (May 13th), 2pm CST to give people plenty of time. [ ] +1: Accept the proposal as an official SPIP [ ] +0 [ ] -1: I don't think this is a good idea because ... Thanks! Tom Graves - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
Re: [DISCUSS] Spark Columnar Processing
This looks like an interesting discussion. Let me describe the current structure and the remaining issues. This is orthogonal to the cost-benefit trade-off discussion. The code generation basically consists of three parts: 1. Loading 2. Selection (map, filter, ...) 3. Projection 1. Columnar storage (e.g. Parquet, Orc, Arrow, and the table cache) is well abstracted by the ColumnVector ( https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java ) class. By combining it with ColumnarBatchScan, the whole-stage code generation generates code that gets values directly from the columnar storage if there is no row-based operation. Note: the current master does not support Arrow as a data source. However, I think it is not technically hard to support Arrow. 2. The current whole-stage codegen generates code for element-wise selection (excluding sort and join). The SIMDization or GPUization capability depends on a compiler that translates the code generated by the whole-stage codegen into native code. 3. The current Projection assumes row-oriented data storage; I think that is the part that Wenchen pointed out. My slides https://www.slideshare.net/ishizaki/making-hardware-accelerator-easier-to-use/41 may help clarify the above issues and a possible implementation. FYI: NVIDIA will present an approach to exploit GPUs with Arrow through Python at SAIS 2019 https://databricks.com/sparkaisummit/north-america/sessions-single-2019?id=110 . I think that it uses the Python UDF support with Arrow in Spark. P.S. I will give a presentation about in-memory data storage for Spark at SAIS 2019 https://databricks.com/sparkaisummit/north-america/sessions-single-2019?id=40 :) Kazuaki Ishizaki From: Wenchen Fan To: Bobby Evans Cc: Spark dev list Date: 2019/03/26 13:53 Subject:Re: [DISCUSS] Spark Columnar Processing Do you have some initial perf numbers? It seems fine to me to remain row-based inside Spark with whole-stage-codegen, and convert rows to columnar batches when communicating with external systems. On Mon, Mar 25, 2019 at 1:05 PM Bobby Evans wrote: This thread is to discuss adding in support for data frame processing using an in-memory columnar format compatible with Apache Arrow. My main goal in this is to lay the groundwork so we can add in support for GPU accelerated processing of data frames, but this feature has a number of other benefits. Spark currently supports Apache Arrow formatted data as an option to exchange data with python for pandas UDF processing. There has also been discussion around extending this to allow for exchanging data with other tools like pytorch, tensorflow, xgboost,... If Spark supports processing on Arrow compatible data it could eliminate the serialization/deserialization overhead when going between these systems. It also would allow for doing optimizations on a CPU with SIMD instructions similar to what Hive currently supports. Accelerated processing using a GPU is something that we will start a separate discussion thread on, but I wanted to set the context a bit. Jason Lowe, Tom Graves, and I created a prototype over the past few months to try and understand how to make this work. What we are proposing is based off of lessons learned when building this prototype, but we really wanted to get feedback early on from the community. We will file a SPIP once we can get agreement that this is a good direction to go in.
The current support for columnar processing lets a Parquet or Orc file format return a ColumnarBatch inside an RDD[InternalRow] using Scala’s type erasure. The code generation is aware that the RDD actually holds ColumnarBatchs and generates code to loop through the data in each batch as InternalRows. Instead, we propose a new set of APIs to work on an RDD[InternalColumnarBatch] instead of abusing type erasure. With this we propose adding in a Rule similar to how WholeStageCodeGen currently works. Each part of the physical SparkPlan would expose columnar support through a combination of traits and method calls. The rule would then decide when columnar processing would start and when it would end. Switching between columnar and row based processing is not free, so the rule would make a decision based off of an estimate of the cost to do the transformation and the estimated speedup in processing time. This should allow us to disable columnar support by simply disabling the rule that modifies the physical SparkPlan. It should be minimal risk to the existing row-based code path, as that code should not be touched, and in many cases could be reused to implement the columnar version. This also allows for small easily manageable patches. No huge patches that no one wants to review. As far as the memory layout is concerned OnHeapColumnVector and OffHeapColumnVector are already really close to being Apache Arrow compatible so shifting them over would be a relatively simple
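To make the row/column boundary described above concrete, here is a small sketch using the existing classes (not the proposed RDD[InternalColumnarBatch] API): data sits in a writable ColumnVector, and ColumnarBatch.rowIterator() is where it gets exposed as InternalRows, which is the kind of transition whose cost the proposed rule would have to estimate. Class and method names reflect the current code base as I understand it.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.vectorized.{ColumnVector, ColumnarBatch}

// Fill one on-heap column with three integers.
val col = new OnHeapColumnVector(3, IntegerType)
(0 until 3).foreach(i => col.putInt(i, i * 10))

// Wrap it in a batch and walk it as rows -- the row-based view over
// columnar storage that whole-stage codegen relies on today.
val batch = new ColumnarBatch(Array[ColumnVector](col))
batch.setNumRows(3)

val it: java.util.Iterator[InternalRow] = batch.rowIterator()
while (it.hasNext) {
  println(it.next().getInt(0)) // prints 0, 10, 20
}
```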
Re: Welcome Jose Torres as a Spark committer
Congratulations, Jose! Kazuaki Ishizaki From: Gengliang Wang To: dev Date: 2019/01/31 18:32 Subject:Re: Welcome Jose Torres as a Spark committer Congrats Jose! 在 2019年1月31日,上午6:51,Bryan Cutler 写道: Congrats Jose! On Tue, Jan 29, 2019, 10:48 AM Shixiong Zhu
Re: SPIP: SPARK-25728 Structured Intermediate Representation (Tungsten IR) for generating Java code
Hi all, I spent some time considering these great points. Sorry for my delay. I put comments in green into https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit Here is a summary of the comments: 1) For simplicity and expressiveness, introduce nodes to represent a structure (e.g. for, while) 2) For simplicity, measure some statistics (e.g. nodes / Java bytecode, memory consumption) 3) For ease of understanding, use simple APIs that mirror the original statements (op2, for, while, ...) We would appreciate it if you put any comments/suggestions on the Google Doc or the dev ML so that we can move forward. Kazuaki Ishizaki From: "Kazuaki Ishizaki" To: Reynold Xin Cc: dev , Takeshi Yamamuro , Xiao Li Date: 2018/10/31 00:56 Subject:Re: SPIP: SPARK-25728 Structured Intermediate Representation (Tungsten IR) for generating Java code Hi Reynold, Thank you for your comments. They are great points. 1) Yes, it is not easy to design the expressive and enough IR. We can learn concepts from good examples like HyPer, Weld, and others. They are expressive and not complicated. The detail cannot be captured yet, 2) To introduce another layer takes some time to learn new things. This SPIP tries to reduce learning time to preparing clean APIs for constructing generated code. I will try to add some examples for APIs that are equivalent to current string concatenations (e.g. "a" + " * " + "b" + " / " + "c"). It is important for us to learn from failures than learn from successes. We would appreciate it if you could list up failures that you have seen. Best Regards, Kazuaki Ishizaki From:Reynold Xin To:Kazuaki Ishizaki Cc:Xiao Li , dev , Takeshi Yamamuro Date:2018/10/26 03:46 Subject:Re: SPIP: SPARK-25728 Structured Intermediate Representation (Tungsten IR) for generating Java code I have some pretty serious concerns over this proposal. I agree that there are many things that can be improved, but at the same time I also think the cost of introducing a new IR in the middle is extremely high. Having participated in designing some of the IRs in other systems, I've seen more failures than successes. The failures typically come from two sources: (1) in general it is extremely difficult to design IRs that are both expressive enough and are simple enough; (2) typically another layer of indirection increases the complexity a lot more, beyond the level of understanding and expertise that most contributors can obtain without spending years in the code base and learning about all the gotchas. In either case, I'm not saying "no please don't do this". This is one of those cases in which the devils are in the details that cannot be captured by a high level document, and I want to explicitly express my concern here. On Thu, Oct 25, 2018 at 12:10 AM Kazuaki Ishizaki wrote: Hi Xiao, Thank you very much for becoming a shepherd. If you feel the discussion settles, we would appreciate it if you would start a voting. Regards, Kazuaki Ishizaki From:Xiao Li To:Kazuaki Ishizaki Cc:dev , Takeshi Yamamuro < linguin@gmail.com> Date:2018/10/22 16:31 Subject:Re: SPIP: SPARK-25728 Structured Intermediate Representation (Tungsten IR) for generating Java code Hi, Kazuaki, Thanks for your great SPIP! I am willing to be the shepherd of this SPIP. Cheers, Xiao On Mon, Oct 22, 2018 at 12:05 AM Kazuaki Ishizaki wrote: Hi Yamamuro-san, Thank you for your comments. This SPIP gets several valuable comments and feedback on Google Doc: https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing .
I hope that this SPIP could go forward based on these feedback. Based on this SPIP procedure http://spark.apache.org/improvement-proposals.html, can I ask one or more PMCs to become a shepherd of this SPIP? I would appreciate your kindness and cooperation. Best Regards, Kazuaki Ishizaki From:Takeshi Yamamuro To:Spark dev list Cc:ishiz...@jp.ibm.com Date:2018/10/15 12:12 Subject:Re: SPIP: SPARK-25728 Structured Intermediate Representation (Tungsten IR) for generating Java code Hi, ishizaki-san, Cool activity, I left some comments on the doc. best, takeshi On Mon, Oct 15, 2018 at 12:05 AM Kazuaki Ishizaki wrote: Hello community, I am writing this e-mail in order to start a discussion about adding structure intermediate representation for generating Java code from a program using DataFrame or Dataset API, in addition to the current String-based representation. This addition is based on the discussions in a thread at https://github.com/apache/spark/pull/21537#issuecomment-413268196 Please feel free to comment on the JIRA ticket or Google Doc. JIRA ticket: https://issues.apache.org/jira/browse/SPARK-25728 Google Doc
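To illustrate point 3 and the string-concatenation example mentioned in this thread ("a" + " * " + "b" + " / " + "c"), here is a purely hypothetical sketch. None of these node classes exist in Spark; they only contrast assembling generated Java code as strings with building it from typed, op2-style IR nodes as the SPIP proposes.

```scala
// Today: generated Java is assembled by string concatenation, so nothing
// stops us from producing malformed or mistyped code.
val a = "inputA"; val b = "inputB"; val c = "inputC"
val exprAsString = a + " * " + b + " / " + c

// Hypothetical structured IR: the same expression built from typed nodes,
// which a later pass could verify, optimize, and then print as Java source.
sealed trait Node
case class Ref(name: String) extends Node
case class Op2(op: String, left: Node, right: Node) extends Node

val exprAsIr: Node = Op2("/", Op2("*", Ref(a), Ref(b)), Ref(c))
```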
Re: Test and support only LTS JDK release?
This entry includes a good figure for support lifecycle. https://www.azul.com/products/zulu-and-zulu-enterprise/zulu-enterprise-java-support-options/ Kazuaki Ishizaki, From: Marcelo Vanzin To: Felix Cheung Cc: Ryan Blue , sn...@snazy.de, dev , Cesar Delgado Date: 2018/11/07 08:29 Subject:Re: Test and support only LTS JDK release? https://www.oracle.com/technetwork/java/javase/eol-135779.html On Tue, Nov 6, 2018 at 2:56 PM Felix Cheung wrote: > > Is there a list of LTS release that I can reference? > > > > From: Ryan Blue > Sent: Tuesday, November 6, 2018 1:28 PM > To: sn...@snazy.de > Cc: Spark Dev List; cdelg...@apple.com > Subject: Re: Test and support only LTS JDK release? > > +1 for supporting LTS releases. > > On Tue, Nov 6, 2018 at 11:48 AM Robert Stupp wrote: >> >> +1 on supporting LTS releases. >> >> VM distributors (RedHat, Azul - to name two) want to provide patches to LTS versions (i.e. into http://hg.openjdk.java.net/jdk-updates/jdk11u/ ). How that will play out in reality ... I don't know. Whether Oracle will contribute to that repo for 8 after it's EOL and 11 after the 6 month cycle ... we will see. Most Linux distributions promised(?) long-term support for Java 11 in their LTS releases (e.g. Ubuntu 18.04). I am not sure what that exactly means ... whether they will actively provide patches to OpenJDK or whether they just build from source. >> >> But considering that, I think it's definitely worth to at least keep an eye on Java 12 and 13 - even if those are just EA. Java 12 for example does already forbid some "dirty tricks" that are still possible in Java 11. >> >> >> On 11/6/18 8:32 PM, DB Tsai wrote: >> >> OpenJDK will follow Oracle's release cycle, https://openjdk.java.net/projects/jdk/ , a strict six months model. I'm not familiar with other non-Oracle VMs and Redhat support. >> >> DB Tsai | Siri Open Source Technologies [not a contribution] | Apple, Inc >> >> On Nov 6, 2018, at 11:26 AM, Reynold Xin wrote: >> >> What does OpenJDK do and other non-Oracle VMs? I know there was a lot of discussions from Redhat etc to support. >> >> >> On Tue, Nov 6, 2018 at 11:24 AM DB Tsai wrote: >>> >>> Given Oracle's new 6-month release model, I feel the only realistic option is to only test and support JDK such as JDK 11 LTS and future LTS release. I would like to have a discussion on this in Spark community. >>> >>> Thanks, >>> >>> DB Tsai | Siri Open Source Technologies [not a contribution] | Apple, Inc >>> >> >> -- >> Robert Stupp >> @snazy > > > > -- > Ryan Blue > Software Engineer > Netflix -- Marcelo - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
Re: SPIP: SPARK-25728 Structured Intermediate Representation (Tungsten IR) for generating Java code
Hi Reynold, Thank you for your comments. They are great points. 1) Yes, it is not easy to design the expressive and enough IR. We can learn concepts from good examples like HyPer, Weld, and others. They are expressive and not complicated. The detail cannot be captured yet, 2) To introduce another layer takes some time to learn new things. This SPIP tries to reduce learning time to preparing clean APIs for constructing generated code. I will try to add some examples for APIs that are equivalent to current string concatenations (e.g. "a" + " * " + "b" + " / " + "c"). It is important for us to learn from failures than learn from successes. We would appreciate it if you could list up failures that you have seen. Best Regards, Kazuaki Ishizaki From: Reynold Xin To: Kazuaki Ishizaki Cc: Xiao Li , dev , Takeshi Yamamuro Date: 2018/10/26 03:46 Subject:Re: SPIP: SPARK-25728 Structured Intermediate Representation (Tungsten IR) for generating Java code I have some pretty serious concerns over this proposal. I agree that there are many things that can be improved, but at the same time I also think the cost of introducing a new IR in the middle is extremely high. Having participated in designing some of the IRs in other systems, I've seen more failures than successes. The failures typically come from two sources: (1) in general it is extremely difficult to design IRs that are both expressive enough and are simple enough; (2) typically another layer of indirection increases the complexity a lot more, beyond the level of understanding and expertise that most contributors can obtain without spending years in the code base and learning about all the gotchas. In either case, I'm not saying "no please don't do this". This is one of those cases in which the devils are in the details that cannot be captured by a high level document, and I want to explicitly express my concern here. On Thu, Oct 25, 2018 at 12:10 AM Kazuaki Ishizaki wrote: Hi Xiao, Thank you very much for becoming a shepherd. If you feel the discussion settles, we would appreciate it if you would start a voting. Regards, Kazuaki Ishizaki From:Xiao Li To:Kazuaki Ishizaki Cc:dev , Takeshi Yamamuro < linguin@gmail.com> Date:2018/10/22 16:31 Subject:Re: SPIP: SPARK-25728 Structured Intermediate Representation (Tungsten IR) for generating Java code Hi, Kazuaki, Thanks for your great SPIP! I am willing to be the shepherd of this SPIP. Cheers, Xiao On Mon, Oct 22, 2018 at 12:05 AM Kazuaki Ishizaki wrote: Hi Yamamuro-san, Thank you for your comments. This SPIP gets several valuable comments and feedback on Google Doc: https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing . I hope that this SPIP could go forward based on these feedback. Based on this SPIP procedure http://spark.apache.org/improvement-proposals.html, can I ask one or more PMCs to become a shepherd of this SPIP? I would appreciate your kindness and cooperation. Best Regards, Kazuaki Ishizaki From:Takeshi Yamamuro To:Spark dev list Cc:ishiz...@jp.ibm.com Date:2018/10/15 12:12 Subject:Re: SPIP: SPARK-25728 Structured Intermediate Representation (Tungsten IR) for generating Java code Hi, ishizaki-san, Cool activity, I left some comments on the doc. 
best, takeshi On Mon, Oct 15, 2018 at 12:05 AM Kazuaki Ishizaki wrote: Hello community, I am writing this e-mail in order to start a discussion about adding structure intermediate representation for generating Java code from a program using DataFrame or Dataset API, in addition to the current String-based representation. This addition is based on the discussions in a thread at https://github.com/apache/spark/pull/21537#issuecomment-413268196 Please feel free to comment on the JIRA ticket or Google Doc. JIRA ticket: https://issues.apache.org/jira/browse/SPARK-25728 Google Doc: https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing Looking forward to hear your feedback Best Regards, Kazuaki Ishizaki -- --- Takeshi Yamamuro --
Re: SPIP: SPARK-25728 Structured Intermediate Representation (Tungsten IR) for generating Java code
Hi Xiao, Thank you very much for becoming a shepherd. If you feel the discussion settles, we would appreciate it if you would start a voting. Regards, Kazuaki Ishizaki From: Xiao Li To: Kazuaki Ishizaki Cc: dev , Takeshi Yamamuro Date: 2018/10/22 16:31 Subject:Re: SPIP: SPARK-25728 Structured Intermediate Representation (Tungsten IR) for generating Java code Hi, Kazuaki, Thanks for your great SPIP! I am willing to be the shepherd of this SPIP. Cheers, Xiao On Mon, Oct 22, 2018 at 12:05 AM Kazuaki Ishizaki wrote: Hi Yamamuro-san, Thank you for your comments. This SPIP gets several valuable comments and feedback on Google Doc: https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing . I hope that this SPIP could go forward based on these feedback. Based on this SPIP procedure http://spark.apache.org/improvement-proposals.html, can I ask one or more PMCs to become a shepherd of this SPIP? I would appreciate your kindness and cooperation. Best Regards, Kazuaki Ishizaki From:Takeshi Yamamuro To:Spark dev list Cc:ishiz...@jp.ibm.com Date:2018/10/15 12:12 Subject:Re: SPIP: SPARK-25728 Structured Intermediate Representation (Tungsten IR) for generating Java code Hi, ishizaki-san, Cool activity, I left some comments on the doc. best, takeshi On Mon, Oct 15, 2018 at 12:05 AM Kazuaki Ishizaki wrote: Hello community, I am writing this e-mail in order to start a discussion about adding structure intermediate representation for generating Java code from a program using DataFrame or Dataset API, in addition to the current String-based representation. This addition is based on the discussions in a thread at https://github.com/apache/spark/pull/21537#issuecomment-413268196 Please feel free to comment on the JIRA ticket or Google Doc. JIRA ticket: https://issues.apache.org/jira/browse/SPARK-25728 Google Doc: https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing Looking forward to hear your feedback Best Regards, Kazuaki Ishizaki -- --- Takeshi Yamamuro --
Re: SPIP: SPARK-25728 Structured Intermediate Representation (Tungsten IR) for generating Java code
Hi Yamamuro-san, Thank you for your comments. This SPIP gets several valuable comments and feedback on Google Doc: https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing . I hope that this SPIP could go forward based on these feedback. Based on this SPIP procedure http://spark.apache.org/improvement-proposals.html, can I ask one or more PMCs to become a shepherd of this SPIP? I would appreciate your kindness and cooperation. Best Regards, Kazuaki Ishizaki From: Takeshi Yamamuro To: Spark dev list Cc: ishiz...@jp.ibm.com Date: 2018/10/15 12:12 Subject:Re: SPIP: SPARK-25728 Structured Intermediate Representation (Tungsten IR) for generating Java code Hi, ishizaki-san, Cool activity, I left some comments on the doc. best, takeshi On Mon, Oct 15, 2018 at 12:05 AM Kazuaki Ishizaki wrote: Hello community, I am writing this e-mail in order to start a discussion about adding structure intermediate representation for generating Java code from a program using DataFrame or Dataset API, in addition to the current String-based representation. This addition is based on the discussions in a thread at https://github.com/apache/spark/pull/21537#issuecomment-413268196 Please feel free to comment on the JIRA ticket or Google Doc. JIRA ticket: https://issues.apache.org/jira/browse/SPARK-25728 Google Doc: https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing Looking forward to hear your feedback Best Regards, Kazuaki Ishizaki -- --- Takeshi Yamamuro
SPIP: SPARK-25728 Structured Intermediate Representation (Tungsten IR) for generating Java code
Hello community, I am writing this e-mail in order to start a discussion about adding structure intermediate representation for generating Java code from a program using DataFrame or Dataset API, in addition to the current String-based representation. This addition is based on the discussions in a thread at https://github.com/apache/spark/pull/21537#issuecomment-413268196 Please feel free to comment on the JIRA ticket or Google Doc. JIRA ticket: https://issues.apache.org/jira/browse/SPARK-25728 Google Doc: https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing Looking forward to hear your feedback Best Regards, Kazuaki Ishizaki
Re: Spark JIRA tags clarification and management
Of course, we would like to eliminate everything carrying the following tags: "flaky" or "flakytest". Kazuaki Ishizaki From: Hyukjin Kwon To: dev Cc: Xiao Li , Wenchen Fan Date: 2018/09/04 14:20 Subject:Re: Spark JIRA tags clarification and management Thanks, Reynold. +Adding Xiao and Wenchen who I saw often used tags. Would you have some tags you think we should document more? On Tue, Sep 4, 2018 at 9:27 AM, Reynold Xin wrote: The most common ones we do are: releasenotes correctness On Mon, Sep 3, 2018 at 6:23 PM Hyukjin Kwon wrote: Thanks, Felix and Reynold. Would you guys mind if I ask this to anyone who use the tags frequently? Frankly, I don't use the tags often .. On Tue, Sep 4, 2018 at 2:04 AM, Felix Cheung wrote: +1 good idea. There are a few for organizing but some also are critical to the release process, like rel note. Would be good to clarify. From: Reynold Xin Sent: Sunday, September 2, 2018 11:50 PM To: Hyukjin Kwon Cc: dev Subject: Re: Spark JIRA tags clarification and management It would be great to document the common ones. On Sun, Sep 2, 2018 at 11:49 PM Hyukjin Kwon wrote: Hi all, I lately noticed tags are often used to classify JIRAs. I was thinking we better explicitly document what tags are used and explain which tag means what. For instance, we documented "Contributing to JIRA Maintenance" at https://spark.apache.org/contributing.html before (thanks, Sean Owen) - this helps me a lot to managing JIRAs, and they are good standards for, at least, me to take an action. It doesn't necessarily mean we should clarify everything but it might be good to document tags used often. We can leave this for committer's scope as well, if that's preferred - I don't have a strong opinion on this. My point is, can we clarify this in the contributing guide so that we can reduce the maintenance cost?
Re: [SPARK ML] Minhash integer overflow
Of course, the hash value can just be negative. I had assumed it would be the result of a computation without overflow. When I checked another implementation, it performs the computation with int: https://github.com/ALShum/MinHashLSH/blob/master/LSH.java#L89 Jiayuan, did you compare the hash values generated by Spark with those generated by other implementations? Regards, Kazuaki Ishizaki From: Sean Owen To: jiayuanm Cc: dev@spark.apache.org Date: 2018/07/07 15:46 Subject:Re: [SPARK ML] Minhash integer overflow I think it probably still does its job; the hash value can just be negative. It is likely to be very slightly biased though. Because the intent doesn't seem to be to allow the overflow it's worth changing to use longs for the calculation. On Fri, Jul 6, 2018, 8:36 PM jiayuanm wrote: Hi everyone, I was playing around with LSH/Minhash module from spark ml module. I noticed that hash computation is done with Int (see https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala#L69 ). Since "a" and "b" are from a uniform distribution of [1, MinHashLSH.HASH_PRIME] and MinHashLSH.HASH_PRIME is close to Int.MaxValue, it's likely for the multiplication to cause Int overflow with a large sparse input vector. I wonder if this is a bug or intended. If it's a bug, one way to fix it is to compute hashes with Long and insert a couple of mod MinHashLSH.HASH_PRIME. Because MinHashLSH.HASH_PRIME is chosen to be smaller than sqrt(2^63 - 1), this won't overflow 64-bit integer. Another option is to use BigInteger. Let me know what you think. Thanks, Jiayuan -- Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
Re: [SPARK ML] Minhash integer overflow
Thank you for reporting this issue. I think this is a bug regarding integer overflow. IMHO, it would be good to compute the hashes with Long. Would it be possible to create a JIRA entry? Do you want to submit a pull request, too? Regards, Kazuaki Ishizaki From: jiayuanm To: dev@spark.apache.org Date: 2018/07/07 10:36 Subject:[SPARK ML] Minhash integer overflow Hi everyone, I was playing around with the LSH/Minhash module from the spark ml module. I noticed that the hash computation is done with Int (see https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala#L69 ). Since "a" and "b" are from a uniform distribution of [1, MinHashLSH.HASH_PRIME] and MinHashLSH.HASH_PRIME is close to Int.MaxValue, it's likely for the multiplication to cause Int overflow with a large sparse input vector. I wonder if this is a bug or intended. If it's a bug, one way to fix it is to compute hashes with Long and insert a couple of mod MinHashLSH.HASH_PRIME operations. Because MinHashLSH.HASH_PRIME is chosen to be smaller than sqrt(2^63 - 1), this won't overflow a 64-bit integer. Another option is to use BigInteger. Let me know what you think. Thanks, Jiayuan -- Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
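Below is a minimal sketch (Scala, not the actual Spark patch) of the fix discussed in this thread: perform the random-hash arithmetic in Long and reduce modulo the prime, so that the product of two values near Int.MaxValue cannot overflow. The constant and the function name here are illustrative; the real constant lives in MinHashLSH.HASH_PRIME.

```scala
object MinHashOverflowSketch {
  // Illustrative prime close to Int.MaxValue; the real constant is MinHashLSH.HASH_PRIME.
  val HashPrime: Long = 2038074743L

  // Hash one non-zero element index with random coefficients a and b.
  // Doing the multiplication in Long keeps the intermediate value exact;
  // the modulo result fits back into an Int because HashPrime < Int.MaxValue.
  def hashElem(a: Int, b: Int, elemIndex: Int): Int =
    (((1L + elemIndex) * a + b) % HashPrime).toInt

  def main(args: Array[String]): Unit = {
    val a = 2038074700       // drawn from [1, HASH_PRIME], i.e. possibly near the prime
    val b = 12345
    val largeIndex = 1000000 // a large sparse vector can easily have indices this big
    // Int arithmetic would overflow here; the Long version stays in range.
    println(hashElem(a, b, largeIndex))
  }
}
```

The same arithmetic done with Int, as in the line linked above, can wrap around to a negative intermediate value before the modulo is applied, which is the bias Sean mentions.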
Re: SparkR test failures in PR builder
I am not familiar with SparkR or CRAN. However, I remember that we had a similar situation before, and here is the great work from that time. Having just revisited this PR, I think that we have a similar situation (i.e. a format error) again. https://github.com/apache/spark/pull/20005 Any other comments are appreciated. Regards, Kazuaki Ishizaki From: Joseph Bradley To: dev Cc: Hossein Falaki Date: 2018/05/03 07:31 Subject:SparkR test failures in PR builder Hi all, Does anyone know why the PR builder keeps failing on SparkR's CRAN checks? I've seen this in a lot of unrelated PRs. E.g.: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90065/console Hossein spotted this line: ``` * checking CRAN incoming feasibility ...Error in .check_package_CRAN_incoming(pkgdir) : dims [product 24] do not match the length of object [0] ``` and suggested that it could be CRAN flakiness. I'm not familiar with CRAN, but do others have thoughts about how to fix this? Thanks! Joseph -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc.
Re: Welcome Zhenhua Wang as a Spark committer
Congratulations to Zhenhua! Kazuaki Ishizaki From: sujith chacko To: Denny Lee Cc: Spark dev list , Wenchen Fan , "叶先进" Date: 2018/04/02 14:37 Subject:Re: Welcome Zhenhua Wang as a Spark committer Congratulations zhenhua for this great achievement. On Mon, 2 Apr 2018 at 11:05 AM, Denny Lee wrote: Awesome - congrats Zhenhua! On Sun, Apr 1, 2018 at 10:33 PM 叶先进 wrote: Big congs. > On Apr 2, 2018, at 1:28 PM, Wenchen Fan wrote: > > Hi all, > > The Spark PMC recently added Zhenhua Wang as a committer on the project. Zhenhua is the major contributor of the CBO project, and has been contributing across several areas of Spark for a while, focusing especially on analyzer, optimizer in Spark SQL. Please join me in welcoming Zhenhua! > > Wenchen - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
Re: Welcoming some new committers
Congratulations to everyone! Kazuaki Ishizaki From: Takeshi Yamamuro To: Spark dev list Date: 2018/03/03 10:45 Subject:Re: Welcoming some new committers Congrats, all! On Sat, Mar 3, 2018 at 10:34 AM, Takuya UESHIN wrote: Congratulations and welcome! On Sat, Mar 3, 2018 at 10:21 AM, Xingbo Jiang wrote: Congratulations to everyone! 2018-03-03 8:51 GMT+08:00 Ilan Filonenko : Congrats to everyone! :) On Fri, Mar 2, 2018 at 7:34 PM Felix Cheung wrote: Congrats and welcome! From: Dongjoon Hyun Sent: Friday, March 2, 2018 4:27:10 PM To: Spark dev list Subject: Re: Welcoming some new committers Congrats to all! Bests, Dongjoon. On Fri, Mar 2, 2018 at 4:13 PM, Wenchen Fan wrote: Congratulations to everyone and welcome! On Sat, Mar 3, 2018 at 7:26 AM, Cody Koeninger wrote: Congrats to the new committers, and I appreciate the vote of confidence. On Fri, Mar 2, 2018 at 4:41 PM, Matei Zaharia wrote: > Hi everyone, > > The Spark PMC has recently voted to add several new committers to the project, based on their contributions to Spark 2.3 and other past work: > > - Anirudh Ramanathan (contributor to Kubernetes support) > - Bryan Cutler (contributor to PySpark and Arrow support) > - Cody Koeninger (contributor to streaming and Kafka support) > - Erik Erlandson (contributor to Kubernetes support) > - Matt Cheah (contributor to Kubernetes support and other parts of Spark) > - Seth Hendrickson (contributor to MLlib and PySpark) > > Please join me in welcoming Anirudh, Bryan, Cody, Erik, Matt and Seth as committers! > > Matei > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org -- Takuya UESHIN Tokyo, Japan http://twitter.com/ueshin -- --- Takeshi Yamamuro
Re: Whole-stage codegen and SparkPlan.newPredicate
Thank you for your correction :) I also made a mistake in my report: what I reported at first never occurs with the correct Java bean class. I can finally reproduce the problem that Jacek reported, even on master. In my environment, the problem occurs with or without whole-stage codegen. I updated the JIRA ticket. I am still working on this. Kazuaki Ishizaki From: Herman van Hövell tot Westerflier To: Kazuaki Ishizaki Cc: Jacek Laskowski , dev Date: 2018/01/02 04:12 Subject:Re: Whole-stage codegen and SparkPlan.newPredicate Wrong ticket: https://issues.apache.org/jira/browse/SPARK-22935 Thanks for working on this :) On Mon, Jan 1, 2018 at 2:22 PM, Kazuaki Ishizaki wrote: I ran the program in URL of stackoverflow with Spark 2.2.1 and master. I cannot see the exception even when I disabled whole-stage codegen. Am I wrong? We would appreciate it if you could create a JIRA entry with simple standalone repro. In addition to this report, I realized that this program produces incorrect results. I created a JIRA entry https://issues.apache.org/jira/browse/SPARK-22934. Best Regards, Kazuaki Ishizaki From: Herman van Hövell tot Westerflier To: Jacek Laskowski Cc: dev Date: 2017/12/31 21:44 Subject:Re: Whole-stage codegen and SparkPlan.newPredicate Hi Jacek, In this case whole stage code generation is turned off. However we still use code generation for a lot of other things: projections, predicates, orderings & encoders. You are currently seeing a compile time failure while generating a predicate. There is currently no easy way to turn code generation off entirely. The error itself is not great, but it still captures the problem in a relatively timely fashion. We should have caught this during analysis though. Can you file a ticket? - Herman On Sat, Dec 30, 2017 at 9:16 AM, Jacek Laskowski wrote: Hi, While working on an issue with Whole-stage codegen as reported @ https://stackoverflow.com/q/48026060/1305344 I found out that spark.sql.codegen.wholeStage=false does *not* turn whole-stage codegen off completely. It looks like SparkPlan.newPredicate [1] gets called regardless of the value of spark.sql.codegen.wholeStage property. $ ./bin/spark-shell --conf spark.sql.codegen.wholeStage=false ... scala> spark.sessionState.conf.wholeStageEnabled res7: Boolean = false That leads to an issue in the SO question with whole-stage codegen regardless of the value: ... at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:385) at org.apache.spark.sql.execution.FilterExec$$anonfun$18.apply(basicPhysicalOperators.scala:214) at org.apache.spark.sql.execution.FilterExec$$anonfun$18.apply(basicPhysicalOperators.scala:213) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:816) ... Is this a bug or does it work as intended? Why? [1] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala?utf8=%E2%9C%93#L386 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski
Re: Whole-stage codegen and SparkPlan.newPredicate
I ran the program in URL of stackoverflow with Spark 2.2.1 and master. I cannot see the exception even when I disabled whole-stage codegen. Am I wrong? We would appreciate it if you could create a JIRA entry with simple standalone repro. In addition to this report, I realized that this program produces incorrect results. I created a JIRA entry https://issues.apache.org/jira/browse/SPARK-22934. Best Regards, Kazuaki Ishizaki From: Herman van Hövell tot Westerflier To: Jacek Laskowski Cc: dev Date: 2017/12/31 21:44 Subject:Re: Whole-stage codegen and SparkPlan.newPredicate Hi Jacek, In this case whole stage code generation is turned off. However we still use code generation for a lot of other things: projections, predicates, orderings & encoders. You are currently seeing a compile time failure while generating a predicate. There is currently no easy way to turn code generation off entirely. The error itself is not great, but it still captures the problem in a relatively timely fashion. We should have caught this during analysis though. Can you file a ticket? - Herman On Sat, Dec 30, 2017 at 9:16 AM, Jacek Laskowski wrote: Hi, While working on an issue with Whole-stage codegen as reported @ https://stackoverflow.com/q/48026060/1305344 I found out that spark.sql.codegen.wholeStage=false does *not* turn whole-stage codegen off completely. It looks like SparkPlan.newPredicate [1] gets called regardless of the value of spark.sql.codegen.wholeStage property. $ ./bin/spark-shell --conf spark.sql.codegen.wholeStage=false ... scala> spark.sessionState.conf.wholeStageEnabled res7: Boolean = false That leads to an issue in the SO question with whole-stage codegen regardless of the value: ... at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:385) at org.apache.spark.sql.execution.FilterExec$$anonfun$18.apply(basicPhysicalOperators.scala:214) at org.apache.spark.sql.execution.FilterExec$$anonfun$18.apply(basicPhysicalOperators.scala:213) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:816) ... Is this a bug or does it work as intended? Why? [1] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala?utf8=%E2%9C%93#L386 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski
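For readers following along, here is a small spark-shell sketch (assuming Spark 2.x defaults) of the behavior described in this thread: disabling spark.sql.codegen.wholeStage only turns off the fused-stage path, while predicates, projections, orderings and encoders are still code-generated.

```scala
// Paste into spark-shell; `spark` is the SparkSession provided by the shell.
spark.conf.set("spark.sql.codegen.wholeStage", "false")

// The whole-stage flag is off...
println(spark.conf.get("spark.sql.codegen.wholeStage"))   // false

// ...but a simple filter still goes through SparkPlan.newPredicate, which
// generates and compiles a predicate class at runtime, so codegen-related
// compile failures can still surface even with the flag disabled.
println(spark.range(10).filter("id > 5").count())          // 4
```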
Re: Timeline for Spark 2.3
+1 for cutting a branch earlier. In some Asian countries, 1st, 2nd, and 3rd January are off. https://www.timeanddate.com/holidays/ How about 4th or 5th? Regards, Kazuaki Ishizaki From: Felix Cheung To: Michael Armbrust , Holden Karau Cc: Sameer Agarwal , Erik Erlandson , dev Date: 2017/12/21 04:48 Subject:Re: Timeline for Spark 2.3 +1 I think the earlier we cut a branch the better. From: Michael Armbrust Sent: Tuesday, December 19, 2017 4:41:44 PM To: Holden Karau Cc: Sameer Agarwal; Erik Erlandson; dev Subject: Re: Timeline for Spark 2.3 Do people really need to be around for the branch cut (modulo the person cutting the branch)? 1st or 2nd doesn't really matter to me, but I am +1 kicking this off as soon as we enter the new year :) Michael On Tue, Dec 19, 2017 at 4:39 PM, Holden Karau wrote: Sounds reasonable, although I'd choose the 2nd perhaps just since lots of folks are off on the 1st? On Tue, Dec 19, 2017 at 4:36 PM, Sameer Agarwal wrote: Let's aim for the 2.3 branch cut on 1st Jan and RC1 a week after that (i.e., week of 8th Jan)? On Fri, Dec 15, 2017 at 12:54 AM, Holden Karau wrote: So personally I’d be in favour or pushing to early January, doing a release over the holidays is a little rough with herding all of people to vote. On Thu, Dec 14, 2017 at 11:49 PM Erik Erlandson wrote: I wanted to check in on the state of the 2.3 freeze schedule. Original proposal was "late Dec", which is a bit open to interpretation. We are working to get some refactoring done on the integration testing for the Kubernetes back-end in preparation for testing upcoming release candidates, however holiday vacation time is about to begin taking its toll both on upstream reviewing and on the "downstream" spark-on-kube fork. If the freeze pushed into January, that would take some of the pressure off the kube back-end upstreaming. However, regardless, I was wondering if the dates could be clarified. Cheers, Erik On Mon, Nov 13, 2017 at 5:13 PM, dji...@dataxu.com wrote: Hi, What is the process to request an issue/fix to be included in the next release? Is there a place to vote for features? I am interested in https://issues.apache.org/jira/browse/SPARK-13127, to see if we can get Spark upgrade parquet to 1.9.0, which addresses the https://issues.apache.org/jira/browse/PARQUET-686. Can we include the fix in Spark 2.3 release? Thanks, Dong -- Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org -- Twitter: https://twitter.com/holdenkarau -- Sameer Agarwal Software Engineer | Databricks Inc. http://cs.berkeley.edu/~sameerag -- Twitter: https://twitter.com/holdenkarau
Re: [VOTE] Spark 2.2.1 (RC2)
+1 (non-binding) I tested it on Ubuntu 16.04 and OpenJDK8 on ppc64le. All of the tests for core/sql-core/sql-catalyst/mllib/mllib-local have passed. $ java -version openjdk version "1.8.0_131" OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11) OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode) % build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 -T 24 clean package install % build/mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 test -pl core -pl 'sql/core' -pl 'sql/catalyst' -pl mllib -pl mllib-local ... Run completed in 13 minutes, 54 seconds. Total number of tests run: 1118 Suites: completed 170, aborted 0 Tests: succeeded 1118, failed 0, canceled 0, ignored 6, pending 0 All tests passed. [INFO] [INFO] Reactor Summary: [INFO] [INFO] Spark Project Core . SUCCESS [17:13 min] [INFO] Spark Project ML Local Library . SUCCESS [ 6.065 s] [INFO] Spark Project Catalyst . SUCCESS [11:51 min] [INFO] Spark Project SQL .. SUCCESS [17:55 min] [INFO] Spark Project ML Library ... SUCCESS [17:05 min] [INFO] [INFO] BUILD SUCCESS [INFO] [INFO] Total time: 01:04 h [INFO] Finished at: 2017-11-30T01:48:15+09:00 [INFO] Final Memory: 128M/329M [INFO] [WARNING] The requested profile "hive" could not be activated because it does not exist. Kazuaki Ishizaki From: Dongjoon Hyun To: Hyukjin Kwon Cc: Spark dev list , Felix Cheung , Sean Owen Date: 2017/11/29 12:56 Subject:Re: [VOTE] Spark 2.2.1 (RC2) +1 (non-binding) RC2 is tested on CentOS, too. Bests, Dongjoon. On Tue, Nov 28, 2017 at 4:35 PM, Hyukjin Kwon wrote: +1 2017-11-29 8:18 GMT+09:00 Henry Robinson : (My vote is non-binding, of course). On 28 November 2017 at 14:53, Henry Robinson wrote: +1, tests all pass for me on Ubuntu 16.04. On 28 November 2017 at 10:36, Herman van Hövell tot Westerflier < hvanhov...@databricks.com> wrote: +1 On Tue, Nov 28, 2017 at 7:35 PM, Felix Cheung wrote: +1 Thanks Sean. Please vote! Tested various scenarios with R package. Ubuntu, Debian, Windows r-devel and release and on r-hub. Verified CRAN checks are clean (only 1 NOTE!) and no leaked files (.cache removed, /tmp clean) On Sun, Nov 26, 2017 at 11:55 AM Sean Owen wrote: Yes it downloads recent releases. The test worked for me on a second try, so I suspect a bad mirror. If this comes up frequently we can just add retry logic, as the closer.lua script will return different mirrors each time. The tests all pass for me on the latest Debian, so +1 for this release. (I committed the change to set -Xss4m for tests consistently, but this shouldn't block a release.) On Sat, Nov 25, 2017 at 12:47 PM Felix Cheung wrote: Ah sorry digging through the history it looks like this is changed relatively recently and should only download previous releases. Perhaps we are intermittently hitting a mirror that doesn’t have the files? https://github.com/apache/spark/commit/daa838b8886496e64700b55d1301d348f1d5c9ae On Sat, Nov 25, 2017 at 10:36 AM Felix Cheung wrote: Thanks Sean. For the second one, it looks like the HiveExternalCatalogVersionsSuite is trying to download the release tgz from the official Apache mirror, which won’t work unless the release is actually, released? val preferredMirror = Seq("wget", "https://www.apache.org/dyn/closer.lua?preferred=true";, "-q", "-O", "-").!!.trim val url = s "$preferredMirror/spark/spark-$version/spark-$version-bin-hadoop2.7.tgz" It’s proabbly getting an error page instead. 
On Sat, Nov 25, 2017 at 10:28 AM Sean Owen wrote: I hit the same StackOverflowError as in the previous RC test, but, pretty sure this is just because the increased thread stack size JVM flag isn't applied consistently. This seems to resolve it: https://github.com/apache/spark/pull/19820 This wouldn't block release IMHO. I am currently investigating this failure though -- seems like the mechanism that downloads Spark tarballs needs fixing, or updating, in the 2.2 branch? HiveExternalCatalogVersionsSuite: gzip: stdin: not in gzip format tar: Child returned status 1 tar: Error is not recoverable: exiting now *** RUN ABORTED *** java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "/tmp/test-spark/spark-2.0.2"): error=2, No such file or directory On Sat, Nov 25, 2017 at 12:34 AM Felix Cheung wrote: Please vote on releasing the following candidate as Apach
Re: [VOTE] Spark 2.1.2 (RC4)
+1 (non-binding) I tested it on Ubuntu 16.04 and OpenJDK8 on ppc64le. All of the tests for core/sql-core/sql-catalyst/mllib/mllib-local have passed. $ java -version openjdk version "1.8.0_131" OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11) OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode) % build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 -T 24 clean package install % build/mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 test -pl core -pl 'sql/core' -pl 'sql/catalyst' -pl mllib -pl mllib-local ... Run completed in 12 minutes, 19 seconds. Total number of tests run: 1035 Suites: completed 166, aborted 0 Tests: succeeded 1035, failed 0, canceled 0, ignored 5, pending 0 All tests passed. [INFO] [INFO] Reactor Summary: [INFO] [INFO] Spark Project Core . SUCCESS [17:13 min] [INFO] Spark Project ML Local Library . SUCCESS [ 5.759 s] [INFO] Spark Project Catalyst . SUCCESS [09:48 min] [INFO] Spark Project SQL .. SUCCESS [12:01 min] [INFO] Spark Project ML Library ... SUCCESS [15:16 min] [INFO] [INFO] BUILD SUCCESS [INFO] [INFO] Total time: 54:28 min [INFO] Finished at: 2017-10-03T23:53:33+09:00 [INFO] Final Memory: 112M/322M [INFO] [WARNING] The requested profile "hive" could not be activated because it does not exist. Kazuaki Ishizaki From: Dongjoon Hyun To: Spark dev list Date: 2017/10/03 23:23 Subject:Re: [VOTE] Spark 2.1.2 (RC4) +1 (non-binding) Dongjoon. On Tue, Oct 3, 2017 at 5:13 AM, Herman van Hövell tot Westerflier < hvanhov...@databricks.com> wrote: +1 On Tue, Oct 3, 2017 at 1:32 PM, Sean Owen wrote: +1 same as last RC. Tests pass, sigs and hashes are OK. On Tue, Oct 3, 2017 at 7:24 AM Holden Karau wrote: Please vote on releasing the following candidate as Apache Spark version 2.1.2. The vote is open until Saturday October 7th at 9:00 PST and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 2.1.2 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see https://spark.apache.org/ The tag to be voted on is v2.1.2-rc4 ( 2abaea9e40fce81cd4626498e0f5c28a70917499) List of JIRA tickets resolved in this release can be found with this filter. The release files, including signatures, digests, etc. can be found at: https://home.apache.org/~holden/spark-2.1.2-rc4-bin/ Release artifacts are signed with a key from: https://people.apache.org/~holden/holdens_keys.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1252 The documentation corresponding to this release can be found at: https://people.apache.org/~holden/spark-2.1.2-rc4-docs/ FAQ How can I help test this release? If you are a Spark user, you can help us test this release by taking an existing Spark workload and running on this release candidate, then reporting any regressions. If you're working in PySpark you can set up a virtual env and install the current RC and see if anything important breaks, in the Java/Scala you can add the staging repository to your projects resolvers and test with the RC (make sure to clean up the artifact cache before/after so you don't end up building with a out of date RC going forward). What should happen to JIRA tickets still targeting 2.1.2? Committers should look at those and triage. Extremely important bug fixes, documentation, and API tweaks that impact compatibility should be worked on immediately. Everything else please retarget to 2.1.3. But my bug isn't fixed!??! 
In order to make timely releases, we will typically not hold the release unless the bug in question is a regression from 2.1.1. That being said if there is something which is a regression form 2.1.1 that has not been correctly targeted please ping a committer to help target the issue (you can see the open issues listed as impacting Spark 2.1.1 & 2.1.2) What are the unresolved issues targeted for 2.1.2? At this time there are no open unresolved issues. Is there anything different about this release? This is the first release in awhile not built on the AMPLAB Jenkins. This is good because it means future releases can more easily be built and signed securely (and I've been updating the documentation in https://github.com/apache/spark-website/pull/66 as I progress), however the chances of a mistake are higher with any change like this. If there something you normal
Re: Welcoming Tejas Patil as a Spark committer
Congratulation Tejas! Kazuaki Ishizaki From: Matei Zaharia To: "dev@spark.apache.org" Date: 2017/09/30 04:58 Subject:Welcoming Tejas Patil as a Spark committer Hi all, The Spark PMC recently added Tejas Patil as a committer on the project. Tejas has been contributing across several areas of Spark for a while, focusing especially on scalability issues and SQL. Please join me in welcoming Tejas! Matei - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
Re: [VOTE] Spark 2.1.2 (RC2)
+1 (non-binding) I tested it on Ubuntu 16.04 and OpenJDK8 on ppc64le. All of the tests for core/sql-core/sql-catalyst/mllib/mllib-local have passed. $ java -version openjdk version "1.8.0_131" OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11) OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode) % build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 -T 24 clean package install % build/mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 test -pl core -pl 'sql/core' -pl 'sql/catalyst' -pl mllib -pl mllib-local ... Run completed in 12 minutes, 42 seconds. Total number of tests run: 1035 Suites: completed 166, aborted 0 Tests: succeeded 1035, failed 0, canceled 0, ignored 5, pending 0 All tests passed. [INFO] [INFO] Reactor Summary: [INFO] [INFO] Spark Project Core . SUCCESS [17:14 min] [INFO] Spark Project ML Local Library . SUCCESS [ 4.067 s] [INFO] Spark Project Catalyst . SUCCESS [08:23 min] [INFO] Spark Project SQL .. SUCCESS [10:50 min] [INFO] Spark Project ML Library ... SUCCESS [15:45 min] [INFO] [INFO] BUILD SUCCESS [INFO] [INFO] Total time: 52:20 min [INFO] Finished at: 2017-09-28T12:16:46+09:00 [INFO] Final Memory: 103M/309M [INFO] [WARNING] The requested profile "hive" could not be activated because it does not exist. Kazuaki Ishizaki From: Dongjoon Hyun To: Denny Lee Cc: Sean Owen , Holden Karau , "dev@spark.apache.org" Date: 2017/09/28 07:57 Subject:Re: [VOTE] Spark 2.1.2 (RC2) +1 (non-binding) Bests, Dongjoon. On Wed, Sep 27, 2017 at 7:54 AM, Denny Lee wrote: +1 (non-binding) On Wed, Sep 27, 2017 at 6:54 AM Sean Owen wrote: +1 I tested the source release. Hashes and signature (your signature) check out, project builds and tests pass with -Phadoop-2.7 -Pyarn -Phive -Pmesos on Debian 9. List of issues look good and there are no open issues at all for 2.1.2. Great work on improving the build process and docs. On Wed, Sep 27, 2017 at 5:47 AM Holden Karau wrote: Please vote on releasing the following candidate as Apache Spark version 2.1.2. The vote is open until Wednesday October 4th at 23:59 PST and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 2.1.2 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see https://spark.apache.org/ The tag to be voted on is v2.1.2-rc2 ( fabbb7f59e47590114366d14e15fbbff8c88593c) List of JIRA tickets resolved in this release can be found with this filter. The release files, including signatures, digests, etc. can be found at: https://home.apache.org/~holden/spark-2.1.2-rc2-bin/ Release artifacts are signed with a key from: https://people.apache.org/~holden/holdens_keys.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1251 The documentation corresponding to this release can be found at: https://people.apache.org/~holden/spark-2.1.2-rc2-docs/ FAQ How can I help test this release? If you are a Spark user, you can help us test this release by taking an existing Spark workload and running on this release candidate, then reporting any regressions. If you're working in PySpark you can set up a virtual env and install the current RC and see if anything important breaks, in the Java/Scala you can add the staging repository to your projects resolvers and test with the RC (make sure to clean up the artifact cache before/after so you don't end up building with a out of date RC going forward). What should happen to JIRA tickets still targeting 2.1.2? 
Committers should look at those and triage. Extremely important bug fixes, documentation, and API tweaks that impact compatibility should be worked on immediately. Everything else please retarget to 2.1.3. But my bug isn't fixed!??! In order to make timely releases, we will typically not hold the release unless the bug in question is a regression from 2.1.1. That being said if there is something which is a regression form 2.1.1 that has not been correctly targeted please ping a committer to help target the issue (you can see the open issues listed as impacting Spark 2.1.1 & 2.1.2) What are the unresolved issues targeted for 2.1.2? At this time there are no open unresolved issues. Is there anything different about this release? This is the first release in awhile not built on the AMPLAB Jenkins. This is good because it means future
Re: Welcoming Saisai (Jerry) Shao as a committer
Congratulations, Jerry! Kazuaki Ishizaki From: Hyukjin Kwon To: dev Date: 2017/08/29 12:24 Subject:Re: Welcoming Saisai (Jerry) Shao as a committer Congratulations! Very well deserved. 2017-08-29 11:41 GMT+09:00 Liwei Lin : Congratulations, Jerry! Cheers, Liwei On Tue, Aug 29, 2017 at 10:15 AM, 蒋星博 wrote: congs! Takeshi Yamamuro 于2017年8月28日 周一下午7:11写道: Congrats! On Tue, Aug 29, 2017 at 11:04 AM, zhichao wrote: Congratulations, Jerry! On Tue, Aug 29, 2017 at 9:57 AM, Weiqing Yang wrote: Congratulations, Jerry! On Mon, Aug 28, 2017 at 6:44 PM, Yanbo Liang wrote: Congratulations, Jerry. On Tue, Aug 29, 2017 at 9:42 AM, John Deng wrote: Congratulations, Jerry ! On 8/29/2017 09:28,Matei Zaharia wrote: Hi everyone, The PMC recently voted to add Saisai (Jerry) Shao as a committer. Saisai has been contributing to many areas of the project for a long time, so it ’s great to see him join. Join me in thanking and congratulating him! Matei - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org -- --- Takeshi Yamamuro
Re: Welcoming Hyukjin Kwon and Sameer Agarwal as committers
Congratulation, Hyukjin and Sameer, well deserved!! Kazuaki Ishizaki From: Matei Zaharia To: dev Date: 2017/08/08 00:53 Subject:Welcoming Hyukjin Kwon and Sameer Agarwal as committers Hi everyone, The Spark PMC recently voted to add Hyukjin Kwon and Sameer Agarwal as committers. Join me in congratulating both of them and thanking them for their contributions to the project! Matei - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
Re: [VOTE] Apache Spark 2.2.0 (RC6)
+1 (non-binding) I tested it on Ubuntu 16.04 and OpenJDK8 on ppc64le. All of the tests for core/sql-core/sql-catalyst/mllib/mllib-local have passed. $ java -version openjdk version "1.8.0_111" OpenJDK Runtime Environment (build 1.8.0_111-8u111-b14-2ubuntu0.16.04.2-b14) OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode) % build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 -T 24 clean package install % build/mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 test -pl core -pl 'sql/core' -pl 'sql/catalyst' -pl mllib -pl mllib-local ... Run completed in 15 minutes, 3 seconds. Total number of tests run: 1113 Suites: completed 170, aborted 0 Tests: succeeded 1113, failed 0, canceled 0, ignored 6, pending 0 All tests passed. [INFO] [INFO] Reactor Summary: [INFO] [INFO] Spark Project Core . SUCCESS [17:24 min] [INFO] Spark Project ML Local Library . SUCCESS [ 7.161 s] [INFO] Spark Project Catalyst . SUCCESS [11:55 min] [INFO] Spark Project SQL .. SUCCESS [18:38 min] [INFO] Spark Project ML Library ... SUCCESS [18:17 min] [INFO] [INFO] BUILD SUCCESS [INFO] [INFO] Total time: 01:06 h [INFO] Finished at: 2017-07-01T15:20:04+09:00 [INFO] Final Memory: 56M/591M [INFO] [WARNING] The requested profile "hive" could not be activated because it does not exist. Kazuaki Ishizaki From: Michael Armbrust To: "dev@spark.apache.org" Date: 2017/07/01 10:45 Subject:[VOTE] Apache Spark 2.2.0 (RC6) Please vote on releasing the following candidate as Apache Spark version 2.2.0. The vote is open until Friday, July 7th, 2017 at 18:00 PST and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 2.2.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see https://spark.apache.org/ The tag to be voted on is v2.2.0-rc6 ( a2c7b2133cfee7fa9abfaa2bfbfb637155466783) List of JIRA tickets resolved can be found with this filter. The release files, including signatures, digests, etc. can be found at: https://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc6-bin/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1245/ The documentation corresponding to this release can be found at: https://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc6-docs/ FAQ How can I help test this release? If you are a Spark user, you can help us test this release by taking an existing Spark workload and running on this release candidate, then reporting any regressions. What should happen to JIRA tickets still targeting 2.2.0? Committers should look at those and triage. Extremely important bug fixes, documentation, and API tweaks that impact compatibility should be worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1. But my bug isn't fixed!??! In order to make timely releases, we will typically not hold the release unless the bug in question is a regression from 2.1.1.
Re: [VOTE] Apache Spark 2.2.0 (RC4)
+1 (non-binding) I tested it on Ubuntu 16.04 and OpenJDK8 on ppc64le. All of the tests for core have passed. $ java -version openjdk version "1.8.0_111" OpenJDK Runtime Environment (build 1.8.0_111-8u111-b14-2ubuntu0.16.04.2-b14) OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode) $ build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 package install $ build/mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 test -pl core ... Run completed in 15 minutes, 30 seconds. Total number of tests run: 1959 Suites: completed 206, aborted 0 Tests: succeeded 1959, failed 0, canceled 4, ignored 8, pending 0 All tests passed. [INFO] [INFO] BUILD SUCCESS [INFO] [INFO] Total time: 17:16 min [INFO] Finished at: 2017-06-06T13:44:48+09:00 [INFO] Final Memory: 53M/510M [INFO] [WARNING] The requested profile "hive" could not be activated because it does not exist. Kazuaki Ishizaki From: Michael Armbrust To: "dev@spark.apache.org" Date: 2017/06/06 04:15 Subject:[VOTE] Apache Spark 2.2.0 (RC4) Please vote on releasing the following candidate as Apache Spark version 2.2.0. The vote is open until Thurs, June 8th, 2017 at 12:00 PST and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 2.2.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ The tag to be voted on is v2.2.0-rc4 ( 377cfa8ac7ff7a8a6a6d273182e18ea7dc25ce7e) List of JIRA tickets resolved can be found with this filter. The release files, including signatures, digests, etc. can be found at: http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc4-bin/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1241/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc4-docs/ FAQ How can I help test this release? If you are a Spark user, you can help us test this release by taking an existing Spark workload and running on this release candidate, then reporting any regressions. What should happen to JIRA tickets still targeting 2.2.0? Committers should look at those and triage. Extremely important bug fixes, documentation, and API tweaks that impact compatibility should be worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1. But my bug isn't fixed!??! In order to make timely releases, we will typically not hold the release unless the bug in question is a regression from 2.1.1.
Re: [build system] jenkins got itself wedged...
It looked well these days. However, it seems to go down slowly again... When I tried to see console log (e.g. https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77149/consoleFull ), a server returns "proxy error." Regards, Kazuaki Ishizaki From: shane knapp To: Sean Owen Cc: dev Date: 2017/05/20 09:43 Subject:Re: [build system] jenkins got itself wedged... last update of the week: things are looking great... we're GCing happily and staying well within our memory limits. i'm going to do one more restart after the two pull request builds finish to re-enable backups, and call it a weekend. :) shane On Fri, May 19, 2017 at 8:29 AM, shane knapp wrote: > this is hopefully my final email on the subject... :) > > things have seemed to settled down after my GC tuning, and system > load/cpu usage/memory has been nice and flat all night. i'll continue > to keep an eye on things but it looks like we've weathered the worst > part of the storm. > > On Thu, May 18, 2017 at 6:40 PM, shane knapp wrote: >> after needing another restart this afternoon, i did some homework and >> aggressively twiddled some GC settings[1]. since then, things have >> definitely smoothed out w/regards to memory and cpu usage spikes. >> >> i've attached a screenshot of slightly happier looking graphs. >> >> still keeping an eye on things, and hoping that i can go back to being >> a lurker... ;) >> >> shane >> >> 1 - https://jenkins.io/blog/2016/11/21/gc-tuning/ >> >> On Thu, May 18, 2017 at 11:20 AM, shane knapp wrote: >>> ok, more updates: >>> >>> 1) i audited all of the builds, and found that the spark-*-compile-* >>> and spark-*-test-* jobs were set to the identical cron time trigger, >>> so josh rosen and i updated them to run at H/5 (instead of */5). load >>> balancing ftw. >>> >>> 2) the jenkins master is now running on java8, which has moar bettar >>> GC management under the hood. >>> >>> i'll be keeping an eye on this today, and if we start seeing GC >>> overhead failures, i'll start doing more GC performance tuning. >>> thankfully, cloudbees has a relatively decent guide that i'll be >>> following here: https://jenkins.io/blog/2016/11/21/gc-tuning/ >>> >>> shane >>> >>> On Thu, May 18, 2017 at 8:39 AM, shane knapp wrote: >>>> yeah, i spoke too soon. jenkins is still misbehaving, but FINALLY i'm >>>> getting some error messages in the logs... looks like jenkins is >>>> thrashing on GC. >>>> >>>> now that i know what's up, i should be able to get this sorted today. >>>> >>>> On Thu, May 18, 2017 at 12:39 AM, Sean Owen wrote: >>>>> I'm not sure if it's related, but I still can't get Jenkins to test PRs. For >>>>> example, triggering it through the spark-prs.appspot.com UI gives me... >>>>> >>>>> https://spark-prs.appspot.com/trigger-jenkins/18012 >>>>> >>>>> Internal Server Error >>>>> >>>>> That might be from the appspot app though? >>>>> >>>>> But posting "Jenkins test this please" on PRs doesn't seem to work, and I >>>>> can't reach Jenkins: >>>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/ >>>>> >>>>> On Thu, May 18, 2017 at 12:44 AM shane knapp wrote: >>>>>> >>>>>> after another couple of restarts due to high load and system >>>>>> unresponsiveness, i finally found what is the most likely culprit: >>>>>> >>>>>> a typo in the jenkins config where the java heap size was configured. >>>>>> instead of -Xmx16g, we had -Dmx16G... which could easily explain the >>>>>> random and non-deterministic system hangs we've had over the past >>>>>> couple of years. 
>>>>>> >>>>>> anyways, it's been corrected and the master seems to be humming along, >>>>>> for real this time, w/o issue. i'll continue to keep an eye on this >>>>>> for the rest of the week, but things are looking MUCH better now. >>>>>> >>>>>> sorry again for the interruptions in service. >>>>>> >>>>>> shane >>>>>> >>>>
Re: [VOTE] Apache Spark 2.2.0 (RC2)
+1 (non-binding) I tested it on Ubuntu 16.04 and OpenJDK8 on ppc64le. All of the tests for core have passed. $ java -version openjdk version "1.8.0_111" OpenJDK Runtime Environment (build 1.8.0_111-8u111-b14-2ubuntu0.16.04.2-b14) OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode) $ build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 package install $ build/mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 test -pl core ... Run completed in 15 minutes, 12 seconds. Total number of tests run: 1940 Suites: completed 206, aborted 0 Tests: succeeded 1940, failed 0, canceled 4, ignored 8, pending 0 All tests passed. [INFO] [INFO] BUILD SUCCESS [INFO] [INFO] Total time: 16:51 min [INFO] Finished at: 2017-05-09T17:51:04+09:00 [INFO] Final Memory: 53M/514M [INFO] [WARNING] The requested profile "hive" could not be activated because it does not exist. Kazuaki Ishizaki, From: Michael Armbrust To: "dev@spark.apache.org" Date: 2017/05/05 02:08 Subject:[VOTE] Apache Spark 2.2.0 (RC2) Please vote on releasing the following candidate as Apache Spark version 2.2.0. The vote is open until Tues, May 9th, 2017 at 12:00 PST and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 2.2.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ The tag to be voted on is v2.2.0-rc2 ( 1d4017b44d5e6ad156abeaae6371747f111dd1f9) List of JIRA tickets resolved can be found with this filter. The release files, including signatures, digests, etc. can be found at: http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc2-bin/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1236/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc2-docs/ FAQ How can I help test this release? If you are a Spark user, you can help us test this release by taking an existing Spark workload and running on this release candidate, then reporting any regressions. What should happen to JIRA tickets still targeting 2.2.0? Committers should look at those and triage. Extremely important bug fixes, documentation, and API tweaks that impact compatibility should be worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1. But my bug isn't fixed!??! In order to make timely releases, we will typically not hold the release unless the bug in question is a regression from 2.1.1.
Re: [VOTE] Apache Spark 2.2.0 (RC1)
+1 (non-binding) I tested it on Ubuntu 16.04 and OpenJDK8 on ppc64le. All of the tests for core have passed.. $ java -version openjdk version "1.8.0_111" OpenJDK Runtime Environment (build 1.8.0_111-8u111-b14-2ubuntu0.16.04.2-b14) OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode) $ build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 package install $ build/mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 test -pl core ... Run completed in 15 minutes, 45 seconds. Total number of tests run: 1937 Suites: completed 205, aborted 0 Tests: succeeded 1937, failed 0, canceled 4, ignored 8, pending 0 All tests passed. [INFO] [INFO] BUILD SUCCESS [INFO] [INFO] Total time: 17:26 min [INFO] Finished at: 2017-04-29T02:23:08+09:00 [INFO] Final Memory: 53M/491M [INFO] ---- Kazuaki Ishizaki, From: Michael Armbrust To: "dev@spark.apache.org" Date: 2017/04/28 03:32 Subject:[VOTE] Apache Spark 2.2.0 (RC1) Please vote on releasing the following candidate as Apache Spark version 2.2.0. The vote is open until Tues, May 2nd, 2017 at 12:00 PST and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 2.2.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ The tag to be voted on is v2.2.0-rc1 ( 8ccb4a57c82146c1a8f8966c7e64010cf5632cb6) List of JIRA tickets resolved can be found with this filter. The release files, including signatures, digests, etc. can be found at: http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-bin/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1235/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-docs/ FAQ How can I help test this release? If you are a Spark user, you can help us test this release by taking an existing Spark workload and running on this release candidate, then reporting any regressions. What should happen to JIRA tickets still targeting 2.2.0? Committers should look at those and triage. Extremely important bug fixes, documentation, and API tweaks that impact compatibility should be worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1. But my bug isn't fixed!??! In order to make timely releases, we will typically not hold the release unless the bug in question is a regression from 2.1.1.
Re: [VOTE] Apache Spark 2.1.1 (RC4)
+1 (non-binding) I tested it on Ubuntu 16.04 and OpenJDK8 on ppc64le. All of the tests for core have passed.. $ java -version openjdk version "1.8.0_111" OpenJDK Runtime Environment (build 1.8.0_111-8u111-b14-2ubuntu0.16.04.2-b14) OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode) $ build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 package install $ build/mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 test -pl core ... Total number of tests run: 1788 Suites: completed 198, aborted 0 Tests: succeeded 1788, failed 0, canceled 4, ignored 8, pending 0 All tests passed. [INFO] [INFO] BUILD SUCCESS [INFO] [INFO] Total time: 16:30 min [INFO] Finished at: 2017-04-29T01:02:29+09:00 [INFO] Final Memory: 54M/576M [INFO] Regards, Kazuaki Ishizaki, From: Michael Armbrust To: "dev@spark.apache.org" Date: 2017/04/27 09:30 Subject:[VOTE] Apache Spark 2.1.1 (RC4) Please vote on releasing the following candidate as Apache Spark version 2.1.1. The vote is open until Sat, April 29th, 2018 at 18:00 PST and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 2.1.1 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ The tag to be voted on is v2.1.1-rc4 ( 267aca5bd5042303a718d10635bc0d1a1596853f) List of JIRA tickets resolved can be found with this filter. The release files, including signatures, digests, etc. can be found at: http://home.apache.org/~pwendell/spark-releases/spark-2.1.1-rc4-bin/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1232/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-releases/spark-2.1.1-rc4-docs/ FAQ How can I help test this release? If you are a Spark user, you can help us test this release by taking an existing Spark workload and running on this release candidate, then reporting any regressions. What should happen to JIRA tickets still targeting 2.1.1? Committers should look at those and triage. Extremely important bug fixes, documentation, and API tweaks that impact compatibility should be worked on immediately. Everything else please retarget to 2.1.2 or 2.2.0. But my bug isn't fixed!??! In order to make timely releases, we will typically not hold the release unless the bug in question is a regression from 2.1.0. What happened to RC1? There were issues with the release packaging and as a result was skipped.
Re: [VOTE] Apache Spark 2.1.1 (RC3)
+1 (non-binding) I tested it on Ubuntu 16.04 and openjdk8 on ppc64le. All of the tests for core have passed.. $ java -version openjdk version "1.8.0_111" OpenJDK Runtime Environment (build 1.8.0_111-8u111-b14-2ubuntu0.16.04.2-b14) OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode) $ build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 package install $ build/mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 test -pl core ... Total number of tests run: 1788 Suites: completed 198, aborted 0 Tests: succeeded 1788, failed 0, canceled 4, ignored 8, pending 0 All tests passed. [INFO] [INFO] BUILD SUCCESS [INFO] [INFO] Total time: 16:38 min [INFO] Finished at: 2017-04-19T18:17:43+09:00 [INFO] Final Memory: 56M/672M [INFO] Regards, Kazuaki Ishizaki, From: Michael Armbrust To: "dev@spark.apache.org" Date: 2017/04/19 04:00 Subject:[VOTE] Apache Spark 2.1.1 (RC3) Please vote on releasing the following candidate as Apache Spark version 2.1.1. The vote is open until Fri, April 21st, 2018 at 13:00 PST and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 2.1.1 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ The tag to be voted on is v2.1.1-rc3 ( 2ed19cff2f6ab79a718526e5d16633412d8c4dd4) List of JIRA tickets resolved can be found with this filter. The release files, including signatures, digests, etc. can be found at: http://home.apache.org/~pwendell/spark-releases/spark-2.1.1-rc3-bin/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1230/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-releases/spark-2.1.1-rc3-docs/ FAQ How can I help test this release? If you are a Spark user, you can help us test this release by taking an existing Spark workload and running on this release candidate, then reporting any regressions. What should happen to JIRA tickets still targeting 2.1.1? Committers should look at those and triage. Extremely important bug fixes, documentation, and API tweaks that impact compatibility should be worked on immediately. Everything else please retarget to 2.1.2 or 2.2.0. But my bug isn't fixed!??! In order to make timely releases, we will typically not hold the release unless the bug in question is a regression from 2.1.0. What happened to RC1? There were issues with the release packaging and as a result was skipped.
Re: [VOTE] Apache Spark 2.1.1 (RC2)
Thank you. Yes, it is not a regression. 2.1.0 would have this failure, too. Regards, Kazuaki Ishizaki From: Sean Owen To: Kazuaki Ishizaki/Japan/IBM@IBMJP, Michael Armbrust Cc: "dev@spark.apache.org" Date: 2017/04/02 18:18 Subject:Re: [VOTE] Apache Spark 2.1.1 (RC2) That backport is fine, for another RC even in my opinion, but it's not a regression. It's a JDK bug really. 2.1.0 would have failed too. On Sun, Apr 2, 2017 at 8:20 AM Kazuaki Ishizaki wrote: -1 (non-binding) I tested it on Ubuntu 16.04 and openjdk8 on ppc64le. I got several errors. I expect that this backport (https://github.com/apache/spark/pull/17509) will be integrated into Spark 2.1.1.
Re: [VOTE] Apache Spark 2.1.1 (RC2)
-1 (non-binding) I tested it on Ubuntu 16.04 and openjdk8 on ppc64le. I got several errors. I expect that this backport (https://github.com/apache/spark/pull/17509) will be integrated into Spark 2.1.1. $ java -version openjdk version "1.8.0_111" OpenJDK Runtime Environment (build 1.8.0_111-8u111-b14-2ubuntu0.16.04.2-b14) OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode) $ build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 package install $ build/mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 test -pl core ... --- T E S T S --- OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0 Running org.apache.spark.memory.TaskMemoryManagerSuite Tests run: 6, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 0.445 sec <<< FAILURE! - in org.apache.spark.memory.TaskMemoryManagerSuite encodePageNumberAndOffsetOffHeap(org.apache.spark.memory.TaskMemoryManagerSuite) Time elapsed: 0.007 sec <<< ERROR! java.lang.IllegalArgumentException: requirement failed: No support for unaligned Unsafe. Set spark.memory.offHeap.enabled to false. at org.apache.spark.memory.TaskMemoryManagerSuite.encodePageNumberAndOffsetOffHeap(TaskMemoryManagerSuite.java:48) offHeapConfigurationBackwardsCompatibility(org.apache.spark.memory.TaskMemoryManagerSuite) Time elapsed: 0.013 sec <<< ERROR! java.lang.IllegalArgumentException: requirement failed: No support for unaligned Unsafe. Set spark.memory.offHeap.enabled to false. at org.apache.spark.memory.TaskMemoryManagerSuite.offHeapConfigurationBackwardsCompatibility(TaskMemoryManagerSuite.java:138) Running org.apache.spark.io.NioBufferedFileInputStreamSuite Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.029 sec - in org.apache.spark.io.NioBufferedFileInputStreamSuite Running org.apache.spark.unsafe.map.BytesToBytesMapOnHeapSuite Tests run: 13, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 4.708 sec <<< FAILURE! - in org.apache.spark.unsafe.map.BytesToBytesMapOnHeapSuite testPeakMemoryUsed(org.apache.spark.unsafe.map.BytesToBytesMapOnHeapSuite) Time elapsed: 0.006 sec <<< FAILURE! java.lang.AssertionError: expected:<16648> but was:<16912> Running org.apache.spark.unsafe.map.BytesToBytesMapOffHeapSuite Tests run: 13, Failures: 0, Errors: 13, Skipped: 0, Time elapsed: 0.043 sec <<< FAILURE! - in org.apache.spark.unsafe.map.BytesToBytesMapOffHeapSuite failureToGrow(org.apache.spark.unsafe.map.BytesToBytesMapOffHeapSuite) Time elapsed: 0.002 sec <<< ERROR! java.lang.IllegalArgumentException: requirement failed: No support for unaligned Unsafe. Set spark.memory.offHeap.enabled to false. ... Tests run: 207, Failures: 7, Errors: 16, Skipped: 0 Kazuaki Ishizaki From: Michael Armbrust To: "dev@spark.apache.org" Date: 2017/03/31 08:10 Subject:[VOTE] Apache Spark 2.1.1 (RC2) Please vote on releasing the following candidate as Apache Spark version 2.1.0. The vote is open until Sun, April 2nd, 2018 at 16:30 PST and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 2.1.1 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ The tag to be voted on is v2.1.1-rc2 ( 02b165dcc2ee5245d1293a375a31660c9d4e1fa6) List of JIRA tickets resolved can be found with this filter. The release files, including signatures, digests, etc. 
can be found at: http://home.apache.org/~pwendell/spark-releases/spark-2.1.1-rc2-bin/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1227/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-releases/spark-2.1.1-rc2-docs/ FAQ How can I help test this release? If you are a Spark user, you can help us test this release by taking an existing Spark workload and running on this release candidate, then reporting any regressions. What should happen to JIRA tickets still targeting 2.1.1? Committers should look at those and triage. Extremely important bug fixes, documentation, and API tweaks that impact compatibility should be worked on immediately. Everything else please retarget to 2.1.2 or 2.2.0. But my bug isn't fixed!??! In order to make timely releases, we will typically not hold the release unless the bug in question is a regression from 2.1.0. What happened to RC1? There were issues with the release packaging and as a result was skipped.
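As a side note for user applications on platforms without unaligned Unsafe support, the error message in the logs above points at a configuration workaround; a minimal sketch (standard Spark option names, otherwise illustrative) follows. The test failures themselves still need the backported fix mentioned above.

```scala
// A minimal sketch, assuming a plain local application: keep execution memory
// on-heap, as the "Set spark.memory.offHeap.enabled to false" message suggests.
import org.apache.spark.sql.SparkSession

object OnHeapOnly {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("on-heap-only")
      .master("local[*]")
      .config("spark.memory.offHeap.enabled", "false") // the default, made explicit
      .getOrCreate()
    println(spark.range(10).count()) // simple sanity check
    spark.stop()
  }
}
```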
Re: Why are DataFrames always read with nullable=True?
Hi, Regarding the read path for nullable, adding a data cleaning step seems to be under consideration, as Xiao said at https://www.mail-archive.com/user@spark.apache.org/msg39233.html. Here is a PR https://github.com/apache/spark/pull/17293 to add the data cleaning step that throws an exception if a null exists in a non-null column. Any comments are appreciated. Kazuaki Ishizaki From: Jason White To: dev@spark.apache.org Date: 2017/03/21 06:31 Subject:Why are DataFrames always read with nullable=True? If I create a dataframe in Spark with non-nullable columns, and then save that to disk as a Parquet file, the columns are properly marked as non-nullable. I confirmed this using parquet-tools. Then, when loading it back, Spark forces the nullable back to True. https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L378 If I remove the `.asNullable` part, Spark performs exactly as I'd like by default, picking up the data using the schema either in the Parquet file or provided by me. This particular LoC goes back a year now, and I've seen a variety of discussions about this issue. In particular with Michael here: https://www.mail-archive.com/user@spark.apache.org/msg39230.html. Those seemed to be discussing writing, not reading, though, and writing is already supported now. Is this functionality still desirable? Is it potentially not applicable for all file formats and situations (e.g. HDFS/Parquet)? Would it be suitable to pass an option to the DataFrameReader to disable this functionality? -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Why-are-DataFrames-always-read-with-nullable-True-tp21207.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com. - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
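To make the behavior in question concrete, here is a small spark-shell sketch (the path is made up) showing how the nullability written to Parquet is discarded on read. Whether supplying an explicit schema preserves nullable = false has varied across Spark versions, so treat the last step as something to verify rather than rely on.

```scala
import org.apache.spark.sql.types._

val dir = "/tmp/nullable-demo"                       // hypothetical path
val df = spark.range(10).toDF("id")                  // "id" is non-nullable here
println(df.schema("id").nullable)                    // false
df.write.mode("overwrite").parquet(dir)

// On read, Spark forces the column back to nullable = true (the .asNullable call
// discussed in this thread), even though the Parquet footer marks it required.
val loaded = spark.read.parquet(dir)
println(loaded.schema("id").nullable)                // true

// Supplying the schema explicitly is the obvious workaround to try, but check
// the resulting schema on your version before depending on it.
val explicit = spark.read
  .schema(StructType(Seq(StructField("id", LongType, nullable = false))))
  .parquet(dir)
println(explicit.schema("id").nullable)
```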
Re: A DataFrame cache bug
Hi, Thank you for pointing out the JIRA. I think that this JIRA suggests that you insert "spark.catalog.refreshByPath(dir)". val dir = "/tmp/test" spark.range(100).write.mode("overwrite").parquet(dir) val df = spark.read.parquet(dir) df.count // output 100 which is correct f(df).count // output 89 which is correct spark.range(1000).write.mode("overwrite").parquet(dir) spark.catalog.refreshByPath(dir) // insert a NEW statement val df1 = spark.read.parquet(dir) df1.count // output 1000 which is correct; in fact other operations except df1.filter("id>10") return correct results. f(df1).count // output 89 which is incorrect Regards, Kazuaki Ishizaki From: gen tang To: dev@spark.apache.org Date: 2017/02/22 15:02 Subject:Re: A DataFrame cache bug Hi All, I might find a related issue on jira: https://issues.apache.org/jira/browse/SPARK-15678 This issue is closed, maybe we should reopen it. Thanks Cheers Gen On Wed, Feb 22, 2017 at 1:57 PM, gen tang wrote: Hi All, I found a strange bug which is related to reading data from an updated path and the cache operation. Please consider the following code: import org.apache.spark.sql.DataFrame def f(data: DataFrame): DataFrame = { val df = data.filter("id>10") df.cache df.count df } f(spark.range(100).asInstanceOf[DataFrame]).count // output 89 which is correct f(spark.range(1000).asInstanceOf[DataFrame]).count // output 989 which is correct val dir = "/tmp/test" spark.range(100).write.mode("overwrite").parquet(dir) val df = spark.read.parquet(dir) df.count // output 100 which is correct f(df).count // output 89 which is correct spark.range(1000).write.mode("overwrite").parquet(dir) val df1 = spark.read.parquet(dir) df1.count // output 1000 which is correct; in fact other operations except df1.filter("id>10") return correct results. f(df1).count // output 89 which is incorrect In fact when we use df1.filter("id>10"), Spark will however use the old cached DataFrame Any idea? Thanks a lot Cheers Gen
Re: welcoming Takuya Ueshin as a new Apache Spark committer
Congrats! Kazuaki Ishizaki From: Reynold Xin To: "dev@spark.apache.org" Date: 2017/02/14 04:18 Subject:welcoming Takuya Ueshin as a new Apache Spark committer Hi all, Takuya-san has recently been elected an Apache Spark committer. He's been active in the SQL area and writes very small, surgical patches that are high quality. Please join me in congratulating Takuya-san!
Re: Spark performance tests
Hi, You may find several micro-benchmarks under https://github.com/apache/spark/tree/master/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark . Regards, Kazuaki Ishizaki From: Prasun Ratn To: Apache Spark Dev Date: 2017/01/10 12:52 Subject:Spark performance tests Hi, Are there performance tests or microbenchmarks for Spark - especially directed towards the CPU-specific parts? I looked at spark-perf but that doesn't seem to have been updated recently. Thanks Prasun
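As a rough starting point, a generic timing sketch (this is not Spark's internal Benchmark utility; it assumes a spark-shell session) for measuring a CPU-bound operation with a warm-up run:

  def time[T](label: String)(body: => T): T = {
    body                                          // warm-up run for JIT compilation
    val start = System.nanoTime()
    val result = body
    println(f"$label: ${(System.nanoTime() - start) / 1e6}%.1f ms")
    result
  }

  time("sum of squares over 100M rows") {
    spark.range(100000000L).selectExpr("sum(id * id)").collect()
  }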
Re: Quick request: prolific PR openers, review your open PRs
Sure, I updated the status of some of my PRs. Regards, Kazuaki Ishizaki From: Sean Owen To: dev Date: 2017/01/04 21:37 Subject:Quick request: prolific PR openers, review your open PRs Just saw that there are many people with >= 8 open PRs. Some are legitimately in flight but many are probably stale. To set a good example, would (everyone) mind flicking through what they've got open and seeing if some PRs are stale and should be closed? https://spark-prs.appspot.com/users Username / Open PRs: viirya 13, hhbyyh 12, zhengruifeng 12, HyukjinKwon 12, maropu 10, kiszk 10, yanboliang 10, cloud-fan 8, jerryshao 8
Sharing data in columnar storage between two applications
Here is an interesting discussion about sharing data in columnar storage between two applications. https://github.com/apache/spark/pull/15219#issuecomment-265835049 One of the ideas is to prepare interfaces (or traits) only for read or only for write. Each application can then implement only the class for what it wants to do (e.g. read or write). For example, FiloDB wants to provide a columnar storage that can be read from Spark. In that case, it would be easy to implement only the read APIs for Spark. The two classes (one for read, one for write) can be prepared separately. However, it may lead to an incompatibility with ColumnarBatch. ColumnarBatch keeps a set of ColumnVectors that can be read or written. The ColumnVector class should have read and write APIs. How can we plug in a new ColumnVector that has only read APIs? Here is an example that causes the incompatibility: https://gist.github.com/kiszk/00ab7d0c69f0e598e383cdc8e72bcc4d Another possible idea is that both applications support Apache Arrow APIs. Other approaches could also exist. What approach would be good for all applications? Regards, Kazuaki Ishizaki
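To make the read/write split concrete, a hedged sketch of the idea (these trait names are illustrative and are not Spark's actual ColumnVector API):

  // A consumer such as Spark only needs the read-side contract.
  trait ReadableColumnVector {
    def isNullAt(rowId: Int): Boolean
    def getInt(rowId: Int): Int
    def getDouble(rowId: Int): Double
  }

  // Writers additionally implement the mutation contract.
  trait WritableColumnVector extends ReadableColumnVector {
    def putNull(rowId: Int): Unit
    def putInt(rowId: Int, value: Int): Unit
    def putDouble(rowId: Int, value: Double): Unit
  }

  // An external store (e.g. FiloDB) could expose its data to Spark by
  // implementing only ReadableColumnVector, never the write side.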
Re: Reduce memory usage of UnsafeInMemorySorter
The line that I pointed out would work correctly. This is because the type of this division is double, and d2i correctly handles overflow cases. Kazuaki Ishizaki From: Nicholas Chammas To: Kazuaki Ishizaki/Japan/IBM@IBMJP, Reynold Xin Cc: Spark dev list Date: 2016/12/08 10:56 Subject:Re: Reduce memory usage of UnsafeInMemorySorter Unfortunately, I don't have a repro, and I'm only seeing this at scale. But I was able to get around the issue by fiddling with the distribution of my data before asking GraphFrames to process it. (I think that's where the error was being thrown from.) On Wed, Dec 7, 2016 at 7:32 AM Kazuaki Ishizaki wrote: I do not have a repro either. But, when I took a quick browse at the file 'UnsafeInMemorySorter.java', I am afraid of a cast issue similar to https://issues.apache.org/jira/browse/SPARK-18458 at the following line. https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java#L156 Regards, Kazuaki Ishizaki From: Reynold Xin To: Nicholas Chammas Cc: Spark dev list Date: 2016/12/07 14:27 Subject:Re: Reduce memory usage of UnsafeInMemorySorter This is not supposed to happen. Do you have a repro? On Tue, Dec 6, 2016 at 6:11 PM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: [Re-titling thread.] OK, I see that the exception from my original email is being triggered from this part of UnsafeInMemorySorter: https://github.com/apache/spark/blob/v2.0.2/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java#L209-L212 So I can ask a more refined question now: How can I ensure that UnsafeInMemorySorter has room to insert new records? In other words, how can I ensure that hasSpaceForAnotherRecord() returns a true value? Do I need: More, smaller partitions? More memory per executor? Some Java or Spark option enabled? etc. I’m running Spark 2.0.2 on Java 7 and YARN. Would Java 8 help here? (Unfortunately, I cannot upgrade at this time, but it would be good to know regardless.) This is morphing into a user-list question, so accept my apologies. Since I can’t find any information anywhere else about this, and the question is about internals like UnsafeInMemorySorter, I hope this is OK here.
Nick On Mon, Dec 5, 2016 at 9:11 AM Nicholas Chammas nicholas.cham...@gmail.com wrote: I was testing out a new project at scale on Spark 2.0.2 running on YARN, and my job failed with an interesting error message: TaskSetManager: Lost task 37.3 in stage 31.0 (TID 10684, server.host.name ): java.lang.IllegalStateException: There is no space for new record 05:27:09.573 at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.insertRecord(UnsafeInMemorySorter.java:211) 05:27:09.574 at org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:127) 05:27:09.574 at org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:244) 05:27:09.575 at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source) 05:27:09.575 at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) 05:27:09.576 at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) 05:27:09.576 at org.apache.spark.sql.execution.WholeStageCodegenExec$anonfun$8$anon$1.hasNext(WholeStageCodegenExec.scala:370) 05:27:09.577 at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:408) 05:27:09.577 at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) 05:27:09.577 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) 05:27:09.578 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) 05:27:09.578 at org.apache.spark.scheduler.Task.run(Task.scala:86) 05:27:09.578 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) 05:27:09.579 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 05:27:09.579 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 05:27:09.579 at java.lang.Thread.run(Thread.java:745) I’ve never seen this before, and searching on Google/DDG/JIRA doesn’t yield any results. There are no other errors coming from that executor, whether related to memory, storage space, or otherwise. Could this be a bug? If so, how would I narrow down the source? Otherwise, how might I work around the issue? Nick ? ?
Re: Reduce memory usage of UnsafeInMemorySorter
I do not have a repro, too. But, when I took a quick browse at the file 'UnsafeInMemorySort.java', I am afraid about the similar cast issue like https://issues.apache.org/jira/browse/SPARK-18458 at the following line. https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java#L156 Regards, Kazuaki Ishizaki From: Reynold Xin To: Nicholas Chammas Cc: Spark dev list Date: 2016/12/07 14:27 Subject:Re: Reduce memory usage of UnsafeInMemorySorter This is not supposed to happen. Do you have a repro? On Tue, Dec 6, 2016 at 6:11 PM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: [Re-titling thread.] OK, I see that the exception from my original email is being triggered from this part of UnsafeInMemorySorter: https://github.com/apache/spark/blob/v2.0.2/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java#L209-L212 So I can ask a more refined question now: How can I ensure that UnsafeInMemorySorter has room to insert new records? In other words, how can I ensure that hasSpaceForAnotherRecord() returns a true value? Do I need: More, smaller partitions? More memory per executor? Some Java or Spark option enabled? etc. I’m running Spark 2.0.2 on Java 7 and YARN. Would Java 8 help here? (Unfortunately, I cannot upgrade at this time, but it would be good to know regardless.) This is morphing into a user-list question, so accept my apologies. Since I can’t find any information anywhere else about this, and the question is about internals like UnsafeInMemorySorter, I hope this is OK here. Nick On Mon, Dec 5, 2016 at 9:11 AM Nicholas Chammas nicholas.cham...@gmail.com wrote: I was testing out a new project at scale on Spark 2.0.2 running on YARN, and my job failed with an interesting error message: TaskSetManager: Lost task 37.3 in stage 31.0 (TID 10684, server.host.name ): java.lang.IllegalStateException: There is no space for new record 05:27:09.573 at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.insertRecord(UnsafeInMemorySorter.java:211) 05:27:09.574 at org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:127) 05:27:09.574 at org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:244) 05:27:09.575 at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source) 05:27:09.575 at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) 05:27:09.576 at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) 05:27:09.576 at org.apache.spark.sql.execution.WholeStageCodegenExec$anonfun$8$anon$1.hasNext(WholeStageCodegenExec.scala:370) 05:27:09.577 at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:408) 05:27:09.577 at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) 05:27:09.577 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) 05:27:09.578 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) 05:27:09.578 at org.apache.spark.scheduler.Task.run(Task.scala:86) 05:27:09.578 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) 05:27:09.579 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 05:27:09.579 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 05:27:09.579 at java.lang.Thread.run(Thread.java:745) I’ve never seen this before, and searching on Google/DDG/JIRA doesn’t yield any results. There are no other errors coming from that executor, whether related to memory, storage space, or otherwise. Could this be a bug? If so, how would I narrow down the source? Otherwise, how might I work around the issue? Nick
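For reference, a small sketch of the kind of integer-overflow-before-widening problem this thread is worried about (illustrative numbers, not the actual UnsafeInMemorySorter code):

  val numRecords = 300 * 1000 * 1000          // 300 million entries

  // Multiplying two Ints overflows before the result is widened to Long...
  val wrongBytes: Long = numRecords * 8       // negative value due to Int overflow
  // ...while widening first gives the intended size.
  val rightBytes: Long = numRecords.toLong * 8

  println(wrongBytes)   // -1894967296
  println(rightBytes)   // 2400000000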
Re: Cache'ing performance
Hi, Good point. I have just measured performance with "spark.sql.inMemoryColumnarStorage.compressed=false." It improved the performance compared with the default. However, it is still slower than the RDD version in my environment. It seems to be consistent with the PR https://github.com/apache/spark/pull/11956. This PR shows room for performance improvement for float/double values that are not compressed. Kazuaki Ishizaki From: linguin@gmail.com To: Maciej Bryński Cc: Spark dev list Date: 2016/08/28 11:30 Subject:Re: Cache'ing performance Hi, How does the performance difference change when turning off compression? It is enabled by default. // maropu Sent from iPhone On 2016/08/28 10:13, Kazuaki Ishizaki wrote: Hi, I think that it is a performance issue in both DataFrame and Dataset cache. It is not only due to Encoders. The DataFrame version "spark.range(Int.MaxValue).toDF.cache().count()" is also slow. While a cache for DataFrame and Dataset is stored as a columnar format with some compressed data representation, we have revealed there is room to improve performance. We have already created pull requests to address them. These pull requests are under review. https://github.com/apache/spark/pull/11956 https://github.com/apache/spark/pull/14091 We would appreciate your feedback on these pull requests. Best Regards, Kazuaki Ishizaki From: Maciej Bryński To: Spark dev list Date: 2016/08/28 05:40 Subject:Cache'ing performance Hi, I did some benchmarking of the cache function today. RDD sc.parallelize(0 until Int.MaxValue).cache().count() Datasets spark.range(Int.MaxValue).cache().count() For me, Datasets were 2 times slower. Results (3 nodes, 20 cores and 48GB RAM each) RDD - 6s Datasets - 13.5 s Is that expected behavior for Datasets and Encoders? Regards, -- Maciek Bryński
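For anyone who wants to reproduce the comparison above, a sketch of the measurement in a spark-shell session (the config key spark.sql.inMemoryColumnarStorage.compressed is the one discussed in this thread; the timing harness itself is illustrative):

  // Must be set before the cached columnar buffers are built.
  spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "false")

  val start = System.nanoTime()
  spark.range(Int.MaxValue).toDF.cache().count()
  println(s"DataFrame cache + count: ${(System.nanoTime() - start) / 1e9} s")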
Re: Cache'ing performance
Hi, I think that it is a performance issue in both DataFrame and Dataset cache. It is not only due to Encoders. The DataFrame version "spark.range(Int.MaxValue).toDF.cache().count()" is also slow. While a cache for DataFrame and Dataset is stored as a columnar format with some compressed data representation, we have revealed there is room to improve performance. We have already created pull requests to address them. These pull requests are under review. https://github.com/apache/spark/pull/11956 https://github.com/apache/spark/pull/14091 We would appreciate your feedback on these pull requests. Best Regards, Kazuaki Ishizaki From: Maciej Bryński To: Spark dev list Date: 2016/08/28 05:40 Subject:Cache'ing performance Hi, I did some benchmarking of the cache function today. RDD sc.parallelize(0 until Int.MaxValue).cache().count() Datasets spark.range(Int.MaxValue).cache().count() For me, Datasets were 2 times slower. Results (3 nodes, 20 cores and 48GB RAM each) RDD - 6s Datasets - 13.5 s Is that expected behavior for Datasets and Encoders? Regards, -- Maciek Bryński
Question about equality of o.a.s.sql.Row
Dear all, I have three questions about equality of org.apache.spark.sql.Row. (1) If a Row has a complex type (e.g. Array), is the following behavior expected? If two Rows have the same array instance, Row.equals returns true in the second assert. If two Rows have different array instances (a1 and a2) that have the same array elements, Row.equals returns false in the third assert. val a1 = Array(3, 4) val a2 = Array(3, 4) val r1 = Row(a1) val r2 = Row(a2) assert(a1.sameElements(a2)) // SUCCESS assert(Row(a1).equals(Row(a1))) // SUCCESS assert(Row(a1).equals(Row(a2))) // FAILURE This is because two objects are compared by "o1 != o2" instead of "o1.equals(o2)" at https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala#L408 (2) If (1) is expected, where is this behavior described or defined? I cannot find the description in the API document. https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/Row.html https://home.apache.org/~pwendell/spark-releases/spark-2.0.0-preview-docs/api/scala/index.html#org.apache.spark.sql.Row (3) If (1) is expected, is there any recommended way to write code that checks equality of two Rows that have an Array or other complex types (e.g. Map)? Best Regards, Kazuaki Ishizaki, @kiszk
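Until this is clarified, one possible way to compare such Rows element-wise is a small helper like the following (illustrative, not a Spark API; it handles top-level arrays only):

  import org.apache.spark.sql.Row

  def rowsEqual(r1: Row, r2: Row): Boolean =
    r1.length == r2.length && (0 until r1.length).forall { i =>
      (r1.get(i), r2.get(i)) match {
        case (x: Array[_], y: Array[_]) => x.toSeq == y.toSeq   // element-wise comparison
        case (x, y)                     => x == y
      }
    }

  assert(rowsEqual(Row(Array(3, 4)), Row(Array(3, 4))))   // true, unlike Row.equals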
Re: How to access the off-heap representation of cached data in Spark 2.0
Hi, According to my understanding, the contents of df.cache() are currently kept on the Java heap as a set of byte arrays in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala#L58 . Data is accessed by using sun.misc.Unsafe APIs. Data may be compressed in some cases. CachedBatch is private, and this representation may be changed in the future. In general, it is not easy to access this data by using a C/C++ API. Regards, Kazuaki Ishizaki From: Jacek Laskowski To: "jpivar...@gmail.com" Cc: dev Date: 2016/05/29 08:18 Subject:Re: How to access the off-heap representation of cached data in Spark 2.0 Hi Jim, There's no C++ API in Spark to access the off-heap data. Moreover, I also think "off-heap" has an overloaded meaning in Spark - for tungsten and to persist your data off-heap (it's all about memory but for different purposes and with client- and internal API). That's my limited understanding of the things (and I'm not even sure how trustworthy it is). Use with extreme caution. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Sat, May 28, 2016 at 5:29 PM, jpivar...@gmail.com wrote: > Is this not the place to ask such questions? Where can I get a hint as to how > to access the new off-heap cache, or C++ API, if it exists? I'm willing to > do my own research, but I have to have a place to start. (In fact, this is > the first step in that research.) > > Thanks, > -- Jim
Recent Jenkins always fails in specific two tests
I realized that recent Jenkins builds for different pull requests always fail in the following two tests: "SPARK-8020: set sql conf in spark conf" "SPARK-9757 Persist Parquet relation with decimal column" Here are examples. https://github.com/apache/spark/pull/11956 (consoleFull: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56058/consoleFull ) https://github.com/apache/spark/pull/12259 (consoleFull: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56056/consoleFull ) https://github.com/apache/spark/pull/12450 (consoleFull: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56051/consoleFull ) https://github.com/apache/spark/pull/12453 (consoleFull: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56050/consoleFull ) https://github.com/apache/spark/pull/12257 (consoleFull: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56061/consoleFull ) https://github.com/apache/spark/pull/12451 (consoleFull: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56045/consoleFull ) I have just realized that the latest master also causes the same two failures on the AMPLab Jenkins. https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/627/ https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/625/ Since they seem to be related to the failures in recent pull requests, I created two JIRA entries. https://issues.apache.org/jira/browse/SPARK-14689 https://issues.apache.org/jira/browse/SPARK-14690 Best regards, Kazuaki Ishizaki
RE: Using CUDA within Spark / boosting linear algebra
Hi Alexander, The goal of our columnar storage is to effectively drive GPUs in Spark. One of the important items is to effectively and easily enable highly-tuned libraries for GPU such as BIDMach. We will enable BIDMach with our columnar storage. On the other hand, it is not an easy task to scale BIDMach with the current Spark. I expect that this talk would help us. http://conferences.oreilly.com/strata/hadoop-big-data-ca/public/schedule/detail/47565 We appreciate your great feedback. Best Regards, Kazuaki Ishizaki, Ph.D., Senior research staff member, IBM Research - Tokyo From: "Ulanov, Alexander" To: Kazuaki Ishizaki/Japan/IBM@IBMJP, "dev@spark.apache.org" , Joseph Bradley Cc: John Canny , "Evan R. Sparks" , Xiangrui Meng , Sam Halliday Date: 2016/01/22 04:20 Subject:RE: Using CUDA within Spark / boosting linear algebra Hi Kazuaki, Indeed, moving data to/from GPU is costly and this benchmark summarizes the costs for moving different data sizes with regards to matrices multiplication. These costs are paid for the convenience of using the standard BLAS API that Nvidia NVBLAS provides. The thing is that there are no code changes required (in Spark), one just needs to reference BLAS implementation with the system variable. Naturally, hardware-specific implementation will always be faster than default. The benchmark results show that fact by comparing jCuda (by means of BIDMat) and NVBLAS. However, it also shows that it is worth using NVBLAS for large matrices because it can take advantage of several GPUs and it will be faster despite the copying overhead. That is also a known thing advertised by Nvidia. By the way, I don’t think that the column/row friendly format is an issue, because one can use transposed matrices to fit the required format. I believe that is just a software preference. My suggestion with regards to your prototype would be to make comparisons with Spark’s implementation of logistic regression (that does not take advantage of GPU) and also with BIDMach’s (that takes advantage of GPUs). It will give the users a better understanding of your implementation's performance. Currently you compare it with Spark’s example logistic regression implementation that is supposed to be a reference for learning Spark rather than benchmarking its performance. Best regards, Alexander From: Kazuaki Ishizaki [mailto:ishiz...@jp.ibm.com] Sent: Thursday, January 21, 2016 3:34 AM To: dev@spark.apache.org; Ulanov, Alexander; Joseph Bradley Cc: John Canny; Evan R. Sparks; Xiangrui Meng; Sam Halliday Subject: RE: Using CUDA within Spark / boosting linear algebra Dear all, >>>> Hi Alexander, >>>> >>>> Using GPUs with Spark would be very exciting. Small comment: >>>> Concerning your question earlier about keeping data stored on the >>>> GPU rather than having to move it between main memory and GPU >>>> memory on each iteration, I would guess this would be critical to >>>> getting good performance. If you could do multiple local >>>> iterations before aggregating results, then the cost of data >>>> movement to the GPU could be amortized (and I believe that is done >>>> in practice). Having Spark be aware of the GPU and using it as another part of memory sounds like a much bigger undertaking. >>>> >>>> Joseph As Joseph pointed out before, there are two potential issues to efficiently exploit GPUs in Spark. 
(1) the cost of data movement between CPU and GPU (2) the cost of encoding/decoding between current row-format and GPU-friendly column format Our prototype http://kiszk.github.io/spark-gpu/addresses these two issues by supporting data partition caching in GPU device memory and by providing binary column storage for data partition. We really appreciate it if you would give us comments, suggestions, or feedback. Best Regards Kazuaki Ishizaki From:"Ulanov, Alexander" To:Sam Halliday , John Canny < ca...@berkeley.edu> Cc:Xiangrui Meng , "dev@spark.apache.org" < dev@spark.apache.org>, Joseph Bradley , "Evan R. Sparks" Date:2016/01/21 11:07 Subject:RE: Using CUDA within Spark / boosting linear algebra Hi Everyone, I’ve updated the benchmark and done experiments with new hardware with 2x Nvidia Tesla K80 (physically 4x Tesla K40) and 2x modern Haswell CPU Intel E5-2650 v3 @ 2.30GHz. This time I computed average and median of 10 runs for each of experiment and approximated FLOPS. Results are available at google docs (old experiments are in the other 2 sheets): https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing Benchmark code: https://github.com/avulanov/scala-blas Best regards, Alexander From: Sam Halliday [mailto:sam.halli...@gmail.com]
RE: Using CUDA within Spark / boosting linear algebra
Hi Allen, Thank you for your feedback. An API to launch GPU kernels with JCuda is our first step. One purpose of releasing our prototype is to get feedback. In the future, we may use other wrappers instead of JCuda. We would very much appreciate it if you would suggest or propose APIs to effectively exploit GPUs such as BIDMat in Spark. If we would run BIDMat with our columnar storage, the performance boost would be good as others reported. Best Regards, Kazuaki Ishizaki From: "Allen Zhang" To: Kazuaki Ishizaki/Japan/IBM@IBMJP Cc: "dev@spark.apache.org" , "Ulanov, Alexander" , "Joseph Bradley" , "John Canny" , "Evan R. Sparks" , "Xiangrui Meng" , "Sam Halliday" Date: 2016/01/21 21:05 Subject:RE: Using CUDA within Spark / boosting linear algebra Hi Kazuaki, Jcuda is actually a wrapper of the **pure** CUDA, as your wiki page shows that 3.15x performance boost of logistic regression seems slower than BIDMat-cublas or pure CUDA. Could you elaborate on why you chose Jcuda other than JNI to call CUDA directly? Regards, Allen Zhang At 2016-01-21 19:34:14, "Kazuaki Ishizaki" wrote: Dear all, >>>> Hi Alexander, >>>> >>>> Using GPUs with Spark would be very exciting. Small comment: >>>> Concerning your question earlier about keeping data stored on the >>>> GPU rather than having to move it between main memory and GPU >>>> memory on each iteration, I would guess this would be critical to >>>> getting good performance. If you could do multiple local >>>> iterations before aggregating results, then the cost of data >>>> movement to the GPU could be amortized (and I believe that is done >>>> in practice). Having Spark be aware of the GPU and using it as another part of memory sounds like a much bigger undertaking. >>>> >>>> Joseph As Joseph pointed out before, there are two potential issues to efficiently exploit GPUs in Spark. (1) the cost of data movement between CPU and GPU (2) the cost of encoding/decoding between current row-format and GPU-friendly column format Our prototype http://kiszk.github.io/spark-gpu/ addresses these two issues by supporting data partition caching in GPU device memory and by providing binary column storage for data partition. We really appreciate it if you would give us comments, suggestions, or feedback. Best Regards Kazuaki Ishizaki From: "Ulanov, Alexander" To: Sam Halliday , John Canny < ca...@berkeley.edu> Cc: Xiangrui Meng , "dev@spark.apache.org" < dev@spark.apache.org>, Joseph Bradley , "Evan R. Sparks" Date: 2016/01/21 11:07 Subject:RE: Using CUDA within Spark / boosting linear algebra Hi Everyone, I’ve updated the benchmark and done experiments with new hardware with 2x Nvidia Tesla K80 (physically 4x Tesla K40) and 2x modern Haswell CPU Intel E5-2650 v3 @ 2.30GHz. This time I computed average and median of 10 runs for each of experiment and approximated FLOPS. Results are available at google docs (old experiments are in the other 2 sheets): https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing Benchmark code: https://github.com/avulanov/scala-blas Best regards, Alexander From: Sam Halliday [mailto:sam.halli...@gmail.com] Sent: Thursday, March 26, 2015 9:27 AM To: John Canny Cc: Xiangrui Meng; dev@spark.apache.org; Joseph Bradley; Evan R. Sparks; Ulanov, Alexander Subject: Re: Using CUDA within Spark / boosting linear algebra John, I have to disagree with you there. Dense matrices come up a lot in industry, although your personal experience may be different. 
On 26 Mar 2015 16:20, "John Canny" wrote: I mentioned this earlier in the thread, but I'll put it out again. Dense BLAS are not very important for most machine learning workloads: at least for non-image workloads in industry (and for image processing you would probably want a deep learning/SGD solution with convolution kernels). e.g. it was only relevant for 1/7 of our recent benchmarks, which should be a reasonable sample. What really matters is sparse BLAS performance. BIDMat is still an order of magnitude faster there. Those kernels are only in BIDMat, since NVIDIAs sparse BLAS dont perform well on power-law data. Its also the case that the overall performance of an algorithm is determined by the slowest kernel, not the fastest. If the goal is to get closer to BIDMach's performance on typical problems, you need to make sure that every kernel goes at comparable speed. So the real question is how much faster MLLib routines do on a complete problem with/without GPU acceleration. For BIDMach, its close to a factor of 10. But that required running en
RE: Using CUDA within Spark / boosting linear algebra
Dear all, >>>> Hi Alexander, >>>> >>>> Using GPUs with Spark would be very exciting. Small comment: >>>> Concerning your question earlier about keeping data stored on the >>>> GPU rather than having to move it between main memory and GPU >>>> memory on each iteration, I would guess this would be critical to >>>> getting good performance. If you could do multiple local >>>> iterations before aggregating results, then the cost of data >>>> movement to the GPU could be amortized (and I believe that is done >>>> in practice). Having Spark be aware of the GPU and using it as another part of memory sounds like a much bigger undertaking. >>>> >>>> Joseph As Joseph pointed out before, there are two potential issues to efficiently exploit GPUs in Spark. (1) the cost of data movement between CPU and GPU (2) the cost of encoding/decoding between current row-format and GPU-friendly column format Our prototype http://kiszk.github.io/spark-gpu/ addresses these two issues by supporting data partition caching in GPU device memory and by providing binary column storage for data partition. We really appreciate it if you would give us comments, suggestions, or feedback. Best Regards Kazuaki Ishizaki From: "Ulanov, Alexander" To: Sam Halliday , John Canny Cc: Xiangrui Meng , "dev@spark.apache.org" , Joseph Bradley , "Evan R. Sparks" Date: 2016/01/21 11:07 Subject:RE: Using CUDA within Spark / boosting linear algebra Hi Everyone, I’ve updated the benchmark and done experiments with new hardware with 2x Nvidia Tesla K80 (physically 4x Tesla K40) and 2x modern Haswell CPU Intel E5-2650 v3 @ 2.30GHz. This time I computed average and median of 10 runs for each of experiment and approximated FLOPS. Results are available at google docs (old experiments are in the other 2 sheets): https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing Benchmark code: https://github.com/avulanov/scala-blas Best regards, Alexander From: Sam Halliday [mailto:sam.halli...@gmail.com] Sent: Thursday, March 26, 2015 9:27 AM To: John Canny Cc: Xiangrui Meng; dev@spark.apache.org; Joseph Bradley; Evan R. Sparks; Ulanov, Alexander Subject: Re: Using CUDA within Spark / boosting linear algebra John, I have to disagree with you there. Dense matrices come up a lot in industry, although your personal experience may be different. On 26 Mar 2015 16:20, "John Canny" wrote: I mentioned this earlier in the thread, but I'll put it out again. Dense BLAS are not very important for most machine learning workloads: at least for non-image workloads in industry (and for image processing you would probably want a deep learning/SGD solution with convolution kernels). e.g. it was only relevant for 1/7 of our recent benchmarks, which should be a reasonable sample. What really matters is sparse BLAS performance. BIDMat is still an order of magnitude faster there. Those kernels are only in BIDMat, since NVIDIAs sparse BLAS dont perform well on power-law data. Its also the case that the overall performance of an algorithm is determined by the slowest kernel, not the fastest. If the goal is to get closer to BIDMach's performance on typical problems, you need to make sure that every kernel goes at comparable speed. So the real question is how much faster MLLib routines do on a complete problem with/without GPU acceleration. For BIDMach, its close to a factor of 10. But that required running entirely on the GPU, and making sure every kernel is close to its limit. 
-John If you think nvblas would be helpful, you should try it in some end-to-end benchmarks. On 3/25/15, 6:23 PM, Evan R. Sparks wrote: Yeah, much more reasonable - nice to know that we can get full GPU performance from breeze/netlib-java - meaning there's no compelling performance reason to switch out our current linear algebra library (at least as far as this benchmark is concerned). Instead, it looks like a user guide for configuring Spark/MLlib to use the right BLAS library will get us most of the way there. Or, would it make sense to finally ship openblas compiled for some common platforms (64-bit linux, windows, mac) directly with Spark - hopefully eliminating the jblas warnings once and for all for most users? (Licensing is BSD) Or am I missing something? On Wed, Mar 25, 2015 at 6:03 PM, Ulanov, Alexander < alexander.ula...@hp.com> wrote: As everyone suggested, the results were too good to be true, so I double-checked them. It turns that nvblas did not do multiplication due to parameter NVBLAS_TILE_DIM from "nvblas.conf" and returned zero matrix. My previously posted results with nvblas are matrices copying only. The default NVBLAS_TILE_DIM==2048
RE: Support off-loading computations to a GPU
Hi Alexander, Thank you for your interest. We used an LR derived from a Spark sample program https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/SparkLR.scala (not from mllib or ml). Here are the Scala source files for the GPU and non-GPU versions. GPU: https://github.com/kiszk/spark-gpu/blob/dev/examples/src/main/scala/org/apache/spark/examples/SparkGPULR.scala non-GPU: https://github.com/kiszk/spark-gpu/blob/dev/examples/src/main/scala/org/apache/spark/examples/SparkLR.scala Best Regards, Kazuaki Ishizaki From: "Ulanov, Alexander" To: Kazuaki Ishizaki/Japan/IBM@IBMJP, "dev@spark.apache.org" Date: 2016/01/05 06:13 Subject:RE: Support off-loading computations to a GPU Hi Kazuaki, Sounds very interesting! Could you elaborate on your benchmark with regards to logistic regression (LR)? Did you compare your implementation with the current implementation of LR in Spark? Best regards, Alexander From: Kazuaki Ishizaki [mailto:ishiz...@jp.ibm.com] Sent: Sunday, January 03, 2016 7:52 PM To: dev@spark.apache.org Subject: Support off-loading computations to a GPU Dear all, We reopened the existing JIRA entry https://issues.apache.org/jira/browse/SPARK-3785 to support off-loading computations to a GPU by adding a description for our prototype. We are working to effectively and easily exploit GPUs on Spark at http://github.com/kiszk/spark-gpu. Please also visit our project page http://kiszk.github.io/spark-gpu/. For now, we added a new format for a partition in an RDD, which is a column-based structure in an array format, in addition to the current Iterator[T] format with Seq[T]. This reduces data serialization/deserialization and copy overhead between CPU and GPU. Our prototype achieved more than 3x performance improvement for a simple logistic regression program using a NVIDIA K40 card. This JIRA entry (SPARK-3785) includes a link to a design document. We are very glad to hear valuable feedback/suggestions/comments and to have great discussions to exploit GPUs in Spark. Best Regards, Kazuaki Ishizaki
Re: Support off-loading computations to a GPU
Hi Allen, Thank you for your interest. For a quick start, I prepared a new page "Quick Start" at https://github.com/kiszk/spark-gpu/wiki/Quick-Start. You can install the package with two lines and run a sample program with one line. By "off-loading" we mean exploiting the GPU for executing a Spark task. For this, it is necessary to map a task into GPU kernels (while the current version requires a programmer to write CUDA code, future versions will prepare GPU code from a Spark program automatically). Executing GPU kernels requires data copies between the CPU and the GPU. To reduce the data copy overhead, our prototype keeps data as a binary representation in an RDD using a column format. The current version does not let you specify the number of CUDA cores for a job with a command-line option. There are two ways to specify GPU resources. 1) to specify the number of GPU cards by setting CUDA_VISIBLE_DEVICES in conf/spark-env.sh (refer to http://devblogs.nvidia.com/parallelforall/cuda-pro-tip-control-gpu-visibility-cuda_visible_devices/ ) 2) to specify the number of CUDA threads for processing a partition in a program as https://github.com/kiszk/spark-gpu/blob/dev/examples/src/main/scala/org/apache/spark/examples/SparkGPULR.scala#L89 (Sorry for no documentation now). We are glad to support requested features and are looking forward to getting pull requests. Best Regards, Kazuaki Ishizaki From: "Allen Zhang" To: Kazuaki Ishizaki/Japan/IBM@IBMJP Cc: dev@spark.apache.org Date: 2016/01/04 13:29 Subject:Re: Support off-loading computations to a GPU Hi Kazuaki, I am looking at http://kiszk.github.io/spark-gpu/ , can you point me to the kick-start scripts so that I can give it a go? To be more specific, what does *"off-loading"* mean? Does it aim to reduce the copy overhead between CPU and GPU? I am a newbie for GPU, how can I specify how many GPU cores I want to use (like --executor-cores)? At 2016-01-04 11:52:01, "Kazuaki Ishizaki" wrote: Dear all, We reopened the existing JIRA entry https://issues.apache.org/jira/browse/SPARK-3785 to support off-loading computations to a GPU by adding a description for our prototype. We are working to effectively and easily exploit GPUs on Spark at http://github.com/kiszk/spark-gpu. Please also visit our project page http://kiszk.github.io/spark-gpu/. For now, we added a new format for a partition in an RDD, which is a column-based structure in an array format, in addition to the current Iterator[T] format with Seq[T]. This reduces data serialization/deserialization and copy overhead between CPU and GPU. Our prototype achieved more than 3x performance improvement for a simple logistic regression program using a NVIDIA K40 card. This JIRA entry (SPARK-3785) includes a link to a design document. We are very glad to hear valuable feedback/suggestions/comments and to have great discussions to exploit GPUs in Spark. Best Regards, Kazuaki Ishizaki
Re: Support off-loading computations to a GPU
Based on the suggestion, I created a new JIRA entry https://issues.apache.org/jira/browse/SPARK-12620 for this instead of reopening the existing JIRA. Best Regards, Kazuaki Ishizaki From: Kazuaki Ishizaki/Japan/IBM@IBMJP To: dev@spark.apache.org Date: 2016/01/04 12:54 Subject:Support off-loading computations to a GPU Dear all, We reopened the existing JIRA entry https://issues.apache.org/jira/browse/SPARK-3785 to support off-loading computations to a GPU by adding a description for our prototype. We are working to effectively and easily exploit GPUs on Spark at http://github.com/kiszk/spark-gpu. Please also visit our project page http://kiszk.github.io/spark-gpu/. For now, we added a new format for a partition in an RDD, which is a column-based structure in an array format, in addition to the current Iterator[T] format with Seq[T]. This reduces data serialization/deserialization and copy overhead between CPU and GPU. Our prototype achieved more than 3x performance improvement for a simple logistic regression program using a NVIDIA K40 card. This JIRA entry (SPARK-3785) includes a link to a design document. We are very glad to hear valuable feedback/suggestions/comments and to have great discussions to exploit GPUs in Spark. Best Regards, Kazuaki Ishizaki
Support off-loading computations to a GPU
Dear all, We reopened the existing JIRA entry https://issues.apache.org/jira/browse/SPARK-3785 to support off-loading computations to a GPU by adding a description for our prototype. We are working to effectively and easily exploit GPUs on Spark at http://github.com/kiszk/spark-gpu. Please also visit our project page http://kiszk.github.io/spark-gpu/. For now, we added a new format for a partition in an RDD, which is a column-based structure in an array format, in addition to the current Iterator[T] format with Seq[T]. This reduces data serialization/deserialization and copy overhead between CPU and GPU. Our prototype achieved more than 3x performance improvement for a simple logistic regression program using a NVIDIA K40 card. This JIRA entry (SPARK-3785) includes a link to a design document. We are very glad to hear valuable feedback/suggestions/comments and to have great discussions to exploit GPUs in Spark. Best Regards, Kazuaki Ishizaki
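As a very rough illustration of the "column-based structure in an array format" idea (the names and layout below are illustrative, not the prototype's actual API):

  // Keep a partition as one primitive array per column, instead of an
  // Iterator[T] of row objects, so the whole partition can be copied to
  // GPU device memory in a single bulk transfer.
  case class ColumnarPartition(ids: Array[Long], values: Array[Double])

  def toColumnar(rows: Iterator[(Long, Double)]): ColumnarPartition = {
    val (ids, values) = rows.toArray.unzip
    ColumnarPartition(ids, values)
  }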
Re: latest Spark build error
This is because building Spark requires Maven 3.3.3 or later. http://spark.apache.org/docs/latest/building-spark.html Regards, Kazuaki Ishizaki From: salexln To: dev@spark.apache.org Date: 2015/12/25 15:52 Subject:latest Spark build error Hi all, I'm getting a build error when trying to build a clean version of the latest Spark. I did the following 1) git clone https://github.com/apache/spark.git 2) build/mvn -DskipTests clean package But I get the following error: Spark Project Parent POM .. FAILURE [2.338s] ... BUILD FAILURE ... [ERROR] Failed to execute goal org.apache.maven.plugins:maven-enforcer-plugin:1.4:enforce (enforce-versions) on project spark-parent_2.10: Some Enforcer rules have failed. Look above for specific messages explaining why the rule failed. -> [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch. [ERROR] Re-run Maven using the -X switch to enable full debug logging. [ERROR] [ERROR] For more information about the errors and possible solutions, please read the following articles: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException I'm running Lubuntu 14.04 with the following: java version "1.7.0_91" OpenJDK Runtime Environment (IcedTea 2.6.3) (7u91-2.6.3-0ubuntu0.14.04.1) OpenJDK 64-Bit Server VM (build 24.91-b01, mixed mode) Apache Maven 3.0.5
Re: Shared memory between C++ process and Spark
Is this JIRA entry related to what you want? https://issues.apache.org/jira/browse/SPARK-10399 Regards, Kazuaki Ishizaki From: Jia To: Dewful Cc: "user @spark" , dev@spark.apache.org, Robin East Date: 2015/12/08 03:17 Subject:Re: Shared memory between C++ process and Spark Thanks, Dewful! My impression is that Tachyon is a very nice in-memory file system that can connect to multiple storage systems. However, because our data is also held in memory, I suspect that connecting to Spark directly may be more efficient in performance. But definitely I need to look at Tachyon more carefully, in case it has a very efficient C++ binding mechanism. Best Regards, Jia On Dec 7, 2015, at 11:46 AM, Dewful wrote: Maybe looking into something like Tachyon would help, I see some sample c++ bindings, not sure how much of the current functionality they support... Hi, Robin, Thanks for your reply and thanks for copying my question to the user mailing list. Yes, we have a distributed C++ application, that will store data on each node in the cluster, and we hope to leverage Spark to do more fancy analytics on those data. But we need high performance, that’s why we want shared memory. Suggestions will be highly appreciated! Best Regards, Jia On Dec 7, 2015, at 10:54 AM, Robin East wrote: -dev, +user (this is not a question about development of Spark itself so you’ll get more answers in the user mailing list) First up let me say that I don’t really know how this could be done - I’m sure it would be possible with enough tinkering but it’s not clear what you are trying to achieve. Spark is a distributed processing system, it has multiple JVMs running on different machines that each run a small part of the overall processing. Unless you have some sort of idea to have multiple C++ processes collocated with the distributed JVMs, using named memory-mapped files doesn’t make architectural sense. --- Robin East Spark GraphX in Action Michael Malak and Robin East Manning Publications Co. http://www.manning.com/books/spark-graphx-in-action On 6 Dec 2015, at 20:43, Jia wrote: Dears, for one project, I need to implement something so Spark can read data from a C++ process. To provide high performance, I really hope to implement this through shared memory between the C++ process and the Java JVM process. It seems it may be possible to use named memory-mapped files and JNI to do this, but I wonder whether there are any existing efforts or a more efficient approach to do this? Thank you very much! Best Regards, Jia
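As a starting point for the "named memory-mapped file" idea mentioned above, a minimal JVM-side sketch (the file path and data layout are illustrative; the C++ process would map the same file and write into it):

  import java.nio.channels.FileChannel
  import java.nio.file.{Paths, StandardOpenOption}

  // Map a file that the C++ process also maps into its address space.
  val channel = FileChannel.open(Paths.get("/dev/shm/shared_data"), StandardOpenOption.READ)
  val buffer  = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size())

  // Read an 8-byte double written by the producer at offset 0.
  val firstValue = buffer.getDouble(0)
  println(firstValue)
  channel.close()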