Re: [VOTE] Release Spark 3.4.1 (RC1)
+1

Builds and runs fine on Java 17, macOS.

$ ./dev/change-scala-version.sh 2.13
$ mvn \
    -Pkubernetes,hadoop-cloud,hive,hive-thriftserver,scala-2.13,volcano,connect \
    -DskipTests \
    clean install
$ python/run-tests --parallelism=1 --testnames 'pyspark.sql.session SparkSession.sql'
...
Tests passed in 28 seconds

Pozdrawiam,
Jacek Laskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

On Tue, Jun 20, 2023 at 4:41 AM Dongjoon Hyun wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 3.4.1.
>
> The vote is open until June 23rd 1AM (PST) and passes if a majority +1 PMC
> votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.4.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v3.4.1-rc1 (commit
> 6b1ff22dde1ead51cbf370be6e48a802daae58b6):
> https://github.com/apache/spark/tree/v3.4.1-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.4.1-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1443/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.4.1-rc1-docs/
>
> The list of bug fixes going into 3.4.1 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12352874
>
> This release is using the release script of the tag v3.4.1-rc1.
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running it on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks; in Java/Scala,
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.4.1?
> ===
>
> The current list of open tickets targeted at 3.4.1 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.4.1
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
Re: [VOTE][SPIP] PySpark Test Framework
+0 Pozdrawiam, Jacek Laskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Wed, Jun 21, 2023 at 5:11 PM Amanda Liu wrote: > Hi all, > > I'd like to start the vote for SPIP: PySpark Test Framework. > > The high-level summary for the SPIP is that it proposes an official test > framework for PySpark. Currently, there are only disparate open-source > repos and blog posts for PySpark testing resources. We can streamline and > simplify the testing process by incorporating test features, such as a > PySpark Test Base class (which allows tests to share Spark sessions) and > test util functions (for example, asserting dataframe and schema equality). > > *SPIP doc:* > https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v > > *JIRA ticket:* https://issues.apache.org/jira/browse/SPARK-44042 > > *Discussion thread:* > https://lists.apache.org/thread/trwgbgn3ycoj8b8k8lkxko2hql23o41n > > Please vote on the SPIP for the next 72 hours: > [ ] +1: Accept the proposal as an official SPIP > [ ] +0 > [ ] -1: I don’t think this is a good idea because __. > > Thank you! > > Best, > Amanda Liu >
Re: [DISCUSS] SPIP: Python Data Source API
Hi Allison and devs, Although I was against this idea at first sight (probably because I'm a Scala dev), I think it could work as long as there are people who'd be interested in such an API. Were there any? I'm just curious. I've seen no emails requesting it. I also doubt that Python devs would like to work on new data sources but support their wishes wholeheartedly :) Pozdrawiam, Jacek Laskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Fri, Jun 16, 2023 at 6:14 AM Allison Wang wrote: > Hi everyone, > > I would like to start a discussion on “Python Data Source API”. > > This proposal aims to introduce a simple API in Python for Data Sources. > The idea is to enable Python developers to create data sources without > having to learn Scala or deal with the complexities of the current data > source APIs. The goal is to make a Python-based API that is simple and easy > to use, thus making Spark more accessible to the wider Python developer > community. This proposed approach is based on the recently introduced > Python user-defined table functions with extensions to support data sources. > > *SPIP Doc*: > https://docs.google.com/document/d/1oYrCKEKHzznljYfJO4kx5K_Npcgt1Slyfph3NEk7JRU/edit?usp=sharing > > *SPIP JIRA*: https://issues.apache.org/jira/browse/SPARK-44076 > > Looking forward to your feedback. > > Thanks, > Allison >
Re: [VOTE] Release Apache Spark 3.4.0 (RC7)
+1

* Built fine with Scala 2.13 and
  -Pkubernetes,hadoop-cloud,hive,hive-thriftserver,scala-2.13,volcano
* Ran some demos on Java 17
* Mac mini / Apple M2 Pro / Ventura 13.3.1

Pozdrawiam,
Jacek Laskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

On Sat, Apr 8, 2023 at 1:30 AM Xinrong Meng wrote:

> Please vote on releasing the following candidate (RC7) as Apache Spark
> version 3.4.0.
>
> The vote is open until 11:59pm Pacific time *April 12th* and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.4.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.4.0-rc7 (commit
> 87a5442f7ed96b11051d8a9333476d080054e5a0):
> https://github.com/apache/spark/tree/v3.4.0-rc7
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc7-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1441
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc7-docs/
>
> The list of bug fixes going into 3.4.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12351465
>
> This release is using the release script of the tag v3.4.0-rc7.
>
> FAQ
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running it on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks; in Java/Scala,
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.4.0?
> ===
> The current list of open tickets targeted at 3.4.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.4.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
> Thanks,
> Xinrong Meng
Re: [VOTE] Release Apache Spark 3.4.0 (RC5)
+1

Compiled on Java 17 with Scala 2.13 on macOS and ran some basic code.

Pozdrawiam,
Jacek Laskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

On Thu, Mar 30, 2023 at 10:21 AM Xinrong Meng wrote:

> Please vote on releasing the following candidate (RC5) as Apache Spark
> version 3.4.0.
>
> The vote is open until 11:59pm Pacific time *April 4th* and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.4.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is *v3.4.0-rc5* (commit
> f39ad617d32a671e120464e4a75986241d72c487):
> https://github.com/apache/spark/tree/v3.4.0-rc5
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc5-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1439
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc5-docs/
>
> The list of bug fixes going into 3.4.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12351465
>
> This release is using the release script of the tag v3.4.0-rc5.
>
> FAQ
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running it on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks; in Java/Scala,
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.4.0?
> ===
> The current list of open tickets targeted at 3.4.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.4.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
> Thanks,
> Xinrong Meng
Re: starter tasks for new contributors
Hey Maxim,

Great idea, kudos!

Pozdrawiam,
Jacek Laskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

On Fri, Mar 17, 2023 at 2:18 PM Maxim Gekk wrote:

> Hello Devs,
>
> If you want to contribute to Apache Spark but don't know where to start,
> I would like to propose new starter tasks for you. All the tasks below,
> under the umbrella JIRA
> https://issues.apache.org/jira/browse/SPARK-37935, aim to improve Spark
> error messages:
> - SPARK-42838: Assign a name to the error class _LEGACY_ERROR_TEMP_2000
> - SPARK-42839: Assign a name to the error class _LEGACY_ERROR_TEMP_2003
> - SPARK-42840: Assign a name to the error class _LEGACY_ERROR_TEMP_2004
> - SPARK-42841: Assign a name to the error class _LEGACY_ERROR_TEMP_2005
> - SPARK-42842: Assign a name to the error class _LEGACY_ERROR_TEMP_2006
> - SPARK-42843: Assign a name to the error class _LEGACY_ERROR_TEMP_2007
> - SPARK-42844: Assign a name to the error class _LEGACY_ERROR_TEMP_2008
> - SPARK-42845: Assign a name to the error class _LEGACY_ERROR_TEMP_2010
> - SPARK-42846: Assign a name to the error class _LEGACY_ERROR_TEMP_2011
> - SPARK-42847: Assign a name to the error class _LEGACY_ERROR_TEMP_2013
>
> Improving Spark errors makes you more familiar with the Spark code base,
> and you will directly impact the user experience with Spark SQL.
>
> If you have any questions, please leave comments in the JIRAs, and feel
> free to ping me (and other contributors) in your PRs.
>
> Maxim Gekk
>
> Software Engineer
>
> Databricks, Inc.
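For readers wondering what "assigning a name to an error class" amounts to, here is a toy model in Python (the registry, class name, and template below are illustrative assumptions, not Spark's actual files; in Spark the work happens in the JSON registry of error classes and its call sites): the message template stays the same, and only the placeholder key like `_LEGACY_ERROR_TEMP_2000` becomes a descriptive name.

```python
# Toy model of an error-class registry: a name maps to a message template with
# <placeholder> tokens. Renaming a legacy class means replacing a key such as
# "_LEGACY_ERROR_TEMP_2000" with a descriptive one (shown here), while keeping
# the template and updating the call sites that raise it.
error_classes = {
    # before (hypothetical): "_LEGACY_ERROR_TEMP_2000": "The value <value> is ..."
    "INVALID_PARAMETER_VALUE": "The value <value> is invalid for parameter <param>.",
}


def error_message(error_class: str, **params: str) -> str:
    """Render an error-class template by substituting <placeholder> tokens."""
    message = error_classes[error_class]
    for name, value in params.items():
        message = message.replace(f"<{name}>", value)
    return message
```

A descriptive name lets users search the documentation for the error class, which is exactly why the starter tasks matter for user experience.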
Re: [ANNOUNCE] Apache Spark 3.3.1 released
Yoohoo! Thanks Yuming for driving this release. A tiny step for Spark, a
huge one for my clients (who are still on 3.2.1 or even older :))

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

On Wed, Oct 26, 2022 at 8:22 AM Yuming Wang wrote:

> We are happy to announce the availability of Apache Spark 3.3.1!
>
> Spark 3.3.1 is a maintenance release containing stability fixes. This
> release is based on the branch-3.3 maintenance branch of Spark. We strongly
> recommend all 3.3 users upgrade to this stable release.
>
> To download Spark 3.3.1, head over to the download page:
> https://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-3-1.html
>
> We would like to acknowledge all community members for contributing to
> this release. This release would not have been possible without you.
Re: [VOTE] Release Spark 3.2.0 (RC6)
Hi,

I don't want to hijack the voting thread, but given I faced
https://issues.apache.org/jira/browse/SPARK-36904 in RC6, I wonder if it's
a -1.

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

On Wed, Sep 29, 2021 at 10:28 PM Mridul Muralidharan wrote:

>
> Yi Wu helped identify an issue
> <https://issues.apache.org/jira/browse/SPARK-36892> which causes
> correctness issues (duplication) and hangs - waiting for validation to
> complete before submitting a patch.
>
> Regards,
> Mridul
>
> On Wed, Sep 29, 2021 at 11:34 AM Holden Karau wrote:
>
>> PySpark smoke tests pass; I'm going to do a last pass through the JIRAs
>> before my vote though.
>>
>> On Wed, Sep 29, 2021 at 8:54 AM Sean Owen wrote:
>>
>>> +1 looks good to me as before, now that a few recent issues are
>>> resolved.
>>>
>>> On Tue, Sep 28, 2021 at 10:45 AM Gengliang Wang wrote:
>>>
>>>> Please vote on releasing the following candidate as Apache Spark
>>>> version 3.2.0.
>>>>
>>>> The vote is open until 11:59pm Pacific time September 30 and passes if
>>>> a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>>
>>>> [ ] +1 Release this package as Apache Spark 3.2.0
>>>> [ ] -1 Do not release this package because ...
>>>>
>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>
>>>> The tag to be voted on is v3.2.0-rc6 (commit
>>>> dde73e2e1c7e55c8e740cb159872e081ddfa7ed6):
>>>> https://github.com/apache/spark/tree/v3.2.0-rc6
>>>>
>>>> The release files, including signatures, digests, etc.
>>>> can be found at:
>>>> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc6-bin/
>>>>
>>>> Signatures used for Spark RCs can be found in this file:
>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>
>>>> The staging repository for this release can be found at:
>>>> https://repository.apache.org/content/repositories/orgapachespark-1393
>>>>
>>>> The documentation corresponding to this release can be found at:
>>>> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc6-docs/
>>>>
>>>> The list of bug fixes going into 3.2.0 can be found at the following
>>>> URL:
>>>> https://issues.apache.org/jira/projects/SPARK/versions/12349407
>>>>
>>>> This release is using the release script of the tag v3.2.0-rc6.
>>>>
>>>> FAQ
>>>>
>>>> =
>>>> How can I help test this release?
>>>> =
>>>> If you are a Spark user, you can help us test this release by taking
>>>> an existing Spark workload and running it on this release candidate,
>>>> then reporting any regressions.
>>>>
>>>> If you're working in PySpark you can set up a virtual env and install
>>>> the current RC and see if anything important breaks; in Java/Scala,
>>>> you can add the staging repository to your project's resolvers and test
>>>> with the RC (make sure to clean up the artifact cache before/after so
>>>> you don't end up building with an out-of-date RC going forward).
>>>>
>>>> ===
>>>> What should happen to JIRA tickets still targeting 3.2.0?
>>>> ===
>>>> The current list of open tickets targeted at 3.2.0 can be found at:
>>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>>> Version/s" = 3.2.0
>>>>
>>>> Committers should look at those and triage. Extremely important bug
>>>> fixes, documentation, and API tweaks that impact compatibility should
>>>> be worked on immediately. Everything else please retarget to an
>>>> appropriate release.
>>>>
>>>> ==
>>>> But my bug isn't fixed?
>>>> ==
>>>> In order to make timely releases, we will typically not hold the
>>>> release unless the bug in question is a regression from the previous
>>>> release. That being said, if there is something which is a regression
>>>> that has not been correctly targeted please ping me or a committer to
>>>> help target the issue.
>>>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9>
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
v3.2.0-rc6 and org.postgresql.Driver was not found in the CLASSPATH
Hi,

Just ran a freshly-built 3.2.0 RC6 and faced an issue (that seems to have
been reported earlier on SO):

> The specified datastore driver ("org.postgresql.Driver") was not found in
> the CLASSPATH

More details in https://issues.apache.org/jira/browse/SPARK-36904

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski
Re: [SQL][AQE] Advice needed: a trivial code change with a huge reading impact?
Salut Sean !

Merci beaucoup mon ami Sean ! That's exactly the answer I hoped for. Thank
you!

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

On Wed, Sep 8, 2021 at 1:53 PM Sean Owen wrote:

> That does seem pointless. The body could just be .flatten()-ed to achieve
> the same result. Maybe it was just written that way for symmetry with the
> block above. You could open a PR to change it.
>
> On Wed, Sep 8, 2021 at 4:31 AM Jacek Laskowski wrote:
>
>> Hi Spark Devs,
>>
>> I'm curious what your take on this code [1] would be if you were me
>> trying to understand it:
>>
>> (0 until 1).flatMap { _ =>
>>   (splitPoints :+ numMappers).sliding(2).map {
>>     case Seq(start, end) => CoalescedMapperPartitionSpec(start, end, numReducers)
>>   }
>> }
>>
>> There's something important going on here but it's so convoluted that my
>> Scala coding skills seem not enough (not to mention AQE skills
>> themselves).
>>
>> I'm tempted to change (0 until 1) to Seq(0), but Seq(0).flatMap feels
>> awkward too. Is this Seq(0).flatMap even needed?! Even with no
>> splitPoints we've got numReducers > 0.
>>
>> Looks like the above is as simple as:
>>
>> (splitPoints :+ numMappers).sliding(2).map {
>>   case Seq(start, end) => CoalescedMapperPartitionSpec(start, end, numReducers)
>> }
>>
>> Correct?
>>
>> I'm mentally used up and can't seem to think straight. Would a PR with
>> such a change be acceptable? (Sean, I'm looking at you :D)
>>
>> [1]
>> https://github.com/apache/spark/blob/8d817dcf3084d56da22b909d578a644143f775d5/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeShuffleWithLocalRead.scala#L89-L93
>>
>> Pozdrawiam,
>> Jacek Laskowski
>>
>> https://about.me/JacekLaskowski
>> "The Internals Of" Online Books <https://books.japila.pl/>
>> Follow me on https://twitter.com/jaceklaskowski
[SQL][AQE] Advice needed: a trivial code change with a huge reading impact?
Hi Spark Devs,

I'm curious what your take on this code [1] would be if you were me trying
to understand it:

(0 until 1).flatMap { _ =>
  (splitPoints :+ numMappers).sliding(2).map {
    case Seq(start, end) => CoalescedMapperPartitionSpec(start, end, numReducers)
  }
}

There's something important going on here but it's so convoluted that my
Scala coding skills seem not enough (not to mention AQE skills themselves).

I'm tempted to change (0 until 1) to Seq(0), but Seq(0).flatMap feels
awkward too. Is this Seq(0).flatMap even needed?! Even with no splitPoints
we've got numReducers > 0.

Looks like the above is as simple as:

(splitPoints :+ numMappers).sliding(2).map {
  case Seq(start, end) => CoalescedMapperPartitionSpec(start, end, numReducers)
}

Correct?

I'm mentally used up and can't seem to think straight. Would a PR with such
a change be acceptable? (Sean, I'm looking at you :D)

[1]
https://github.com/apache/spark/blob/8d817dcf3084d56da22b909d578a644143f775d5/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeShuffleWithLocalRead.scala#L89-L93

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski
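The equivalence in question can be checked mechanically. Here is a small Python model of the same collection shapes (the names mirror the Scala snippet; this is an illustration of the refactoring, not Spark code): a flatMap over a one-element range contributes nothing, so dropping it leaves the result unchanged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoalescedMapperPartitionSpec:
    start: int
    end: int
    num_reducers: int


def sliding_pairs(xs):
    """Python analogue of Scala's .sliding(2): consecutive overlapping pairs."""
    return list(zip(xs, xs[1:]))


split_points, num_mappers, num_reducers = [0, 3, 6], 10, 4
bounds = split_points + [num_mappers]

# Original shape: the real computation wrapped in a flatMap over a
# single-element range, mirroring Scala's (0 until 1).flatMap { _ => ... }.
original = [
    spec
    for _ in range(1)
    for spec in (CoalescedMapperPartitionSpec(s, e, num_reducers)
                 for s, e in sliding_pairs(bounds))
]

# Simplified shape: the degenerate outer loop removed entirely.
simplified = [CoalescedMapperPartitionSpec(s, e, num_reducers)
              for s, e in sliding_pairs(bounds)]

assert original == simplified
```

With three split points and ten mappers, both shapes produce the same three partition specs covering [0, 3), [3, 6), and [6, 10), which supports Sean's point that the wrapper could simply be flattened away.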
Re: [SQL] s.s.a.coalescePartitions.parallelismFirst true but recommends false
Thanks Wenchen. If it's ever asked on SO, I'm simply gonna quote you :)

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

On Tue, Sep 7, 2021 at 6:58 AM Wenchen Fan wrote:

> This is correct. It's true by default so that AQE doesn't have a
> performance regression. If you run a benchmark, larger parallelism usually
> means better performance. However, it's recommended to set it to false, so
> that AQE can give better resource utilization, which is good for a busy
> Spark cluster.
>
> On Fri, Sep 3, 2021 at 7:33 PM Jacek Laskowski wrote:
>
>> Hi,
>>
>> Found this new spark.sql.adaptive.coalescePartitions.parallelismFirst
>> config property [1] with the default value `true`, but the description
>> says the opposite:
>>
>> > It's recommended to set this config to false
>>
>> Is this OK and am I misreading it?
>>
>> [1]
>> https://github.com/apache/spark/blob/54cca7f82ecf23e062bb4f6d68697abec2dbcc5b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L519-L530
>>
>> Pozdrawiam,
>> Jacek Laskowski
>>
>> https://about.me/JacekLaskowski
>> "The Internals Of" Online Books <https://books.japila.pl/>
>> Follow me on https://twitter.com/jaceklaskowski
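The trade-off Wenchen describes can be sketched as a toy cost model (a hypothetical simplification, not Spark's actual coalescing algorithm or parameter names): with parallelism-first behavior, coalescing keeps at least the cluster's default parallelism; without it, it targets the advisory partition size, yielding fewer but fuller partitions.

```python
import math


def target_partition_count(total_bytes: int, advisory_bytes: int,
                           default_parallelism: int,
                           parallelism_first: bool) -> int:
    """Toy model of AQE partition coalescing (illustrative, not Spark's code).

    parallelism_first=True keeps at least `default_parallelism` partitions,
    favoring parallelism (and benchmark numbers); False targets the advisory
    partition size, favoring resource utilization on a busy cluster.
    """
    by_size = max(1, math.ceil(total_bytes / advisory_bytes))
    return max(by_size, default_parallelism) if parallelism_first else by_size
```

For example, 64 MB of shuffle data with a 64 MB advisory size coalesces to a single partition when utilization wins, but stays at the default parallelism (say, 8) when parallelism wins, which is why the default avoids benchmark regressions while `false` is recommended for shared clusters.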
[SQL] s.s.a.coalescePartitions.parallelismFirst true but recommends false
Hi,

Found this new spark.sql.adaptive.coalescePartitions.parallelismFirst
config property [1] with the default value `true`, but the description says
the opposite:

> It's recommended to set this config to false

Is this OK and am I misreading it?

[1]
https://github.com/apache/spark/blob/54cca7f82ecf23e062bb4f6d68697abec2dbcc5b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L519-L530

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski
[SQL] When do SQLConf vals get their own accessor defs?
Hi,

Just found something I'd consider an inconsistency in how SQLConf constants
(vals) get their own accessor methods for the current value (defs).

I thought that internal config prop vals might not have defs (and that
would make sense), but
DYNAMIC_PARTITION_PRUNING_PRUNING_SIDE_EXTRA_FILTER_RATIO [1] (with
SQLConf.dynamicPartitionPruningPruningSideExtraFilterRatio [2]) seems to
contradict it.

On the other hand, ADAPTIVE_AUTO_BROADCASTJOIN_THRESHOLD [3], which is new
in 3.2.0 and seems important, does not have a corresponding def to access
the current value. Why?

[1]
https://github.com/apache/spark/blob/54cca7f82ecf23e062bb4f6d68697abec2dbcc5b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L344
[2]
https://github.com/apache/spark/blob/54cca7f82ecf23e062bb4f6d68697abec2dbcc5b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L3532
[3]
https://github.com/apache/spark/blob/54cca7f82ecf23e062bb4f6d68697abec2dbcc5b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L638

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski
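For readers unfamiliar with the pattern the email asks about, here is a minimal Python model of it (the class, key, and default are illustrative assumptions, not Spark's SQLConf code): each configuration is declared once as an entry constant (the Scala `val`), and some entries additionally get a convenience accessor (the Scala `def`) that returns the current value.

```python
# Toy model of the SQLConf declaration/accessor split, not Spark's actual code.

class ConfigEntry:
    """A config declaration: a key plus its default value (the Scala `val`)."""
    def __init__(self, key: str, default):
        self.key = key
        self.default = default


class MiniSQLConf:
    EXTRA_FILTER_RATIO = ConfigEntry(
        "demo.dynamicPartitionPruning.pruningSideExtraFilterRatio", 0.04)

    def __init__(self):
        self._settings = {}

    def set_conf(self, entry: ConfigEntry, value) -> None:
        self._settings[entry.key] = value

    def get_conf(self, entry: ConfigEntry):
        return self._settings.get(entry.key, entry.default)

    # The accessor (the Scala `def`): callers read the current value without
    # naming the ConfigEntry constant themselves. Some entries have one, some
    # (like the ADAPTIVE_AUTO_BROADCASTJOIN_THRESHOLD case in the email) don't.
    @property
    def dynamic_partition_pruning_extra_filter_ratio(self) -> float:
        return self.get_conf(self.EXTRA_FILTER_RATIO)
```

The accessor is only a convenience over `get_conf`, which is consistent with the observed inconsistency: an entry works fine without one, so whether a `def` exists likely just reflects whether some call site wanted the shorthand.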
Re: [VOTE] Release Spark 3.2.0 (RC1)
Hi Yi Wu,

Looks like the issue got resolved as Won't Fix. How about your -1?

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

On Mon, Aug 23, 2021 at 4:58 AM Yi Wu wrote:

> -1. I found a bug (https://issues.apache.org/jira/browse/SPARK-36558) in
> push-based shuffle, which could lead to a job hang.
>
> Bests,
> Yi
>
> On Sat, Aug 21, 2021 at 1:05 AM Gengliang Wang wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 3.2.0.
>>
>> The vote is open until 11:59pm Pacific time Aug 25 and passes if a
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.2.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v3.2.0-rc1 (commit
>> 6bb3523d8e838bd2082fb90d7f3741339245c044):
>> https://github.com/apache/spark/tree/v3.2.0-rc1
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1388
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-docs/
>>
>> The list of bug fixes going into 3.2.0 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12349407
>>
>> This release is using the release script of the tag v3.2.0-rc1.
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running it on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks; in Java/Scala,
>> you can add the staging repository to your project's resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out-of-date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 3.2.0?
>> ===
>> The current list of open tickets targeted at 3.2.0 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 3.2.0
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
Re: [VOTE] Release Spark 3.2.0 (RC1)
Hi Gengliang,

Yay! Thank you! Java 11 with the following MAVEN_OPTS worked fine:

$ echo $MAVEN_OPTS
-Xss64m -Xmx4g -XX:ReservedCodeCacheSize=1g

$ ./build/mvn \
    -Pyarn,kubernetes,hadoop-cloud,hive,hive-thriftserver \
    -DskipTests \
    clean install
...
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 22:02 min
[INFO] Finished at: 2021-08-22T13:09:25+02:00
[INFO] ------------------------------------------------------------------------

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

On Sun, Aug 22, 2021 at 12:45 PM Jacek Laskowski wrote:

> Hi Gengliang,
>
> With Java 8 the build worked fine. No other changes. I'm going to give
> Java 11 a try with the options you mentioned.
>
> $ java -version
> openjdk version "1.8.0_292"
> OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_292-b10)
> OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.292-b10, mixed mode)
>
> BTW, shouldn't the page [1] be updated to reflect this? This is what I
> followed.
>
> [1]
> https://spark.apache.org/docs/latest/building-spark.html#setting-up-mavens-memory-usage
>
> Thanks
> Pozdrawiam,
> Jacek Laskowski
>
> https://about.me/JacekLaskowski
> "The Internals Of" Online Books <https://books.japila.pl/>
> Follow me on https://twitter.com/jaceklaskowski
>
> On Sun, Aug 22, 2021 at 8:29 AM Gengliang Wang wrote:
>
>> Hi Jacek,
>>
>> The current GitHub Actions CI for Spark contains a Java 11 build. The
>> build is successful with the options "-Xss64m -Xmx2g
>> -XX:ReservedCodeCacheSize=1g":
>>
>> https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L506
>> The default Java stack size is small and we have to raise it for the
>> Spark build with the option "-Xss64m".
>> >> On Sat, Aug 21, 2021 at 9:33 PM Jacek Laskowski wrote: >> >>> Hi, >>> >>> I've been building the tag and I'm facing the >>> following StackOverflowError: >>> >>> Exception in thread "main" java.lang.StackOverflowError >>> at >>> scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:275) >>> at >>> scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:133) >>> at >>> scala.reflect.api.Trees$Transformer.$anonfun$transformStats$1(Trees.scala:2597) >>> at scala.reflect.api.Trees$Transformer.transformStats(Trees.scala:2595) >>> at >>> scala.tools.nsc.transform.ExtensionMethods$Extender.transformStats(ExtensionMethods.scala:280) >>> at >>> scala.tools.nsc.transform.ExtensionMethods$Extender.transformStats(ExtensionMethods.scala:133) >>> at scala.reflect.internal.Trees.itransform(Trees.scala:1430) >>> at scala.reflect.internal.Trees.itransform$(Trees.scala:1400) >>> at scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28) >>> at scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28) >>> at scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563) >>> at >>> scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:57) >>> at >>> scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:275) >>> at >>> scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:133) >>> at scala.reflect.internal.Trees.itransform(Trees.scala:1409) >>> at scala.reflect.internal.Trees.itransform$(Trees.scala:1400) >>> at scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28) >>> at scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28) >>> at scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563) >>> at >>> scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:57) >>> at >>> 
scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:275) >>> at >>> scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:133) >>> ... >>> >>> The command I use: >>> >>> ./build/mvn \ >>> -Pyarn,kubernetes,hadoop-cloud,hive,hive-thriftserver \ >>>
Re: [VOTE] Release Spark 3.2.0 (RC1)
Hi Gengliang,

With Java 8 the build worked fine. No other changes. I'm going to give Java 11 a try with the options you mentioned.

$ java -version
openjdk version "1.8.0_292"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_292-b10)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.292-b10, mixed mode)

BTW, shouldn't the page [1] be updated to reflect this? This is what I followed.

[1] https://spark.apache.org/docs/latest/building-spark.html#setting-up-mavens-memory-usage

Thanks
Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

<https://twitter.com/jaceklaskowski>

On Sun, Aug 22, 2021 at 8:29 AM Gengliang Wang wrote:

> Hi Jacek,
>
> The current GitHub Actions CI for Spark contains a Java 11 build. The build
> is successful with the options "-Xss64m -Xmx2g -XX:ReservedCodeCacheSize=1g":
>
> https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L506
> The default Java stack size is small and we have to raise it for the Spark
> build with the option "-Xss64m".
> > On Sat, Aug 21, 2021 at 9:33 PM Jacek Laskowski wrote: > >> Hi, >> >> I've been building the tag and I'm facing the >> following StackOverflowError: >> >> Exception in thread "main" java.lang.StackOverflowError >> at >> scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:275) >> at >> scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:133) >> at >> scala.reflect.api.Trees$Transformer.$anonfun$transformStats$1(Trees.scala:2597) >> at scala.reflect.api.Trees$Transformer.transformStats(Trees.scala:2595) >> at >> scala.tools.nsc.transform.ExtensionMethods$Extender.transformStats(ExtensionMethods.scala:280) >> at >> scala.tools.nsc.transform.ExtensionMethods$Extender.transformStats(ExtensionMethods.scala:133) >> at scala.reflect.internal.Trees.itransform(Trees.scala:1430) >> at scala.reflect.internal.Trees.itransform$(Trees.scala:1400) >> at scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28) >> at scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28) >> at scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563) >> at >> scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:57) >> at >> scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:275) >> at >> scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:133) >> at scala.reflect.internal.Trees.itransform(Trees.scala:1409) >> at scala.reflect.internal.Trees.itransform$(Trees.scala:1400) >> at scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28) >> at scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28) >> at scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563) >> at >> scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:57) >> at >> scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:275) 
>> at >> scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:133) >> ... >> >> The command I use: >> >> ./build/mvn \ >> -Pyarn,kubernetes,hadoop-cloud,hive,hive-thriftserver \ >> -DskipTests \ >> clean install >> >> $ java --version >> openjdk 11.0.11 2021-04-20 >> OpenJDK Runtime Environment AdoptOpenJDK-11.0.11+9 (build 11.0.11+9) >> OpenJDK 64-Bit Server VM AdoptOpenJDK-11.0.11+9 (build 11.0.11+9, mixed >> mode) >> >> $ ./build/mvn -v >> Using `mvn` from path: /usr/local/bin/mvn >> Apache Maven 3.8.1 (05c21c65bdfed0f71a2f2ada8b84da59348c4c5d) >> Maven home: /usr/local/Cellar/maven/3.8.1/libexec >> Java version: 11.0.11, vendor: AdoptOpenJDK, runtime: >> /Users/jacek/.sdkman/candidates/java/11.0.11.hs-adpt >> Default locale: en_PL, platform encoding: UTF-8 >> OS name: "mac os x", version: "11.5", arch: "x86_64", family: "mac" >> >> $ echo $MAVEN_OPTS >> -Xmx8g -XX:ReservedCodeCacheSize=1g >> >> Pozdrawiam, >> Jacek Laskowski >> >> https://about.me/JacekLaskowski >> "The Internals Of" Online Books <https://books.japila.pl/> >> Follow me on https://twitter.com/jaceklaskowski >> >> <https://twitter.com/jaceklaskowski> >> >> >> On Fri, Aug 20, 2021 at 7:05 PM Gengliang
Re: [VOTE] Release Spark 3.2.0 (RC1)
Hi,

I've been building the tag and I'm facing the following StackOverflowError:

Exception in thread "main" java.lang.StackOverflowError
at scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:275)
at scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:133)
at scala.reflect.api.Trees$Transformer.$anonfun$transformStats$1(Trees.scala:2597)
at scala.reflect.api.Trees$Transformer.transformStats(Trees.scala:2595)
at scala.tools.nsc.transform.ExtensionMethods$Extender.transformStats(ExtensionMethods.scala:280)
at scala.tools.nsc.transform.ExtensionMethods$Extender.transformStats(ExtensionMethods.scala:133)
at scala.reflect.internal.Trees.itransform(Trees.scala:1430)
at scala.reflect.internal.Trees.itransform$(Trees.scala:1400)
at scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
at scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
at scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563)
at scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:57)
at scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:275)
at scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:133)
at scala.reflect.internal.Trees.itransform(Trees.scala:1409)
at scala.reflect.internal.Trees.itransform$(Trees.scala:1400)
at scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
at scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
at scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563)
at scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:57)
at scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:275)
at scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:133)
...
The command I use:

./build/mvn \
-Pyarn,kubernetes,hadoop-cloud,hive,hive-thriftserver \
-DskipTests \
clean install

$ java --version
openjdk 11.0.11 2021-04-20
OpenJDK Runtime Environment AdoptOpenJDK-11.0.11+9 (build 11.0.11+9)
OpenJDK 64-Bit Server VM AdoptOpenJDK-11.0.11+9 (build 11.0.11+9, mixed mode)

$ ./build/mvn -v
Using `mvn` from path: /usr/local/bin/mvn
Apache Maven 3.8.1 (05c21c65bdfed0f71a2f2ada8b84da59348c4c5d)
Maven home: /usr/local/Cellar/maven/3.8.1/libexec
Java version: 11.0.11, vendor: AdoptOpenJDK, runtime: /Users/jacek/.sdkman/candidates/java/11.0.11.hs-adpt
Default locale: en_PL, platform encoding: UTF-8
OS name: "mac os x", version: "11.5", arch: "x86_64", family: "mac"

$ echo $MAVEN_OPTS
-Xmx8g -XX:ReservedCodeCacheSize=1g

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

<https://twitter.com/jaceklaskowski>

On Fri, Aug 20, 2021 at 7:05 PM Gengliang Wang wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 3.2.0.
>
> The vote is open until 11:59pm Pacific time Aug 25 and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.2.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.2.0-rc1 (commit
> 6bb3523d8e838bd2082fb90d7f3741339245c044):
> https://github.com/apache/spark/tree/v3.2.0-rc1
>
> The release files, including signatures, digests, etc.
can be found at: > https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-bin/ > > Signatures used for Spark RCs can be found in this file: > https://dist.apache.org/repos/dist/dev/spark/KEYS > > The staging repository for this release can be found at: > https://repository.apache.org/content/repositories/orgapachespark-1388 > > The documentation corresponding to this release can be found at: > https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-docs/ > > The list of bug fixes going into 3.2.0 can be found at the following URL: > https://issues.apache.org/jira/projects/SPARK/versions/12349407 > > This release is using the release script of the tag v3.2.0-rc1. > > > FAQ > > = > How can I help test this release? > = > If you are a Spark user, you can help us test this release by taking > an existing Spark workload and running on this release candidate, then > reporting any regressions. > > If you're working in PySpark you can set up a virtual env and install > the current RC and see if anything important breaks, in the Java/Scala > you can add the staging repository to your projects resolvers and test > with the RC (make
TreeNode.exists?
Hi,

I've run into code like the following a couple of times already ([1]):

val isCommand = plan.find {
  case _: Command | _: ParsedStatement | _: InsertIntoDir => true
  case _ => false
}.isDefined

I think this and the other such places beg (scream?) for a TreeNode.exists that could do the simplest thing possible: find(f).isDefined, or even collectFirst(...).isDefined. It'd surely help a lot for people like myself reading the code. WDYT?

[1] https://github.com/apache/spark/pull/33671/files#diff-4d16a733f8741de9a4b839ee7c356c3e9b439b4facc70018f5741da1e930c6a8R51-R54

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

<https://twitter.com/jaceklaskowski>
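For illustration, here is a minimal, self-contained sketch of the idea. TreeNodeLike and Node are my own toy stand-ins for Catalyst's TreeNode (not the actual Spark classes); the point is only that an exists helper can delegate directly to find:

```scala
// Toy stand-in for Catalyst's TreeNode, just enough to demonstrate
// the proposed helper: exists(f) == find(f).isDefined.
trait TreeNodeLike[T <: TreeNodeLike[T]] { self: T =>
  def children: Seq[T]

  // Pre-order search for the first node satisfying the predicate.
  def find(f: T => Boolean): Option[T] =
    if (f(this)) Some(this)
    else children.foldLeft(Option.empty[T])((found, c) => found.orElse(c.find(f)))

  // The proposed intent-revealing wrapper.
  def exists(f: T => Boolean): Boolean = find(f).isDefined
}

final case class Node(label: String, children: Seq[Node] = Nil)
  extends TreeNodeLike[Node]

val plan = Node("Project", Seq(Node("Filter", Seq(Node("Relation")))))
val hasFilter  = plan.exists(_.label == "Filter")
val hasCommand = plan.exists(_.label == "Command")
```

With such a helper, the snippet above would collapse to plan.exists { case _: Command | _: ParsedStatement | _: InsertIntoDir => true; case _ => false }, with no .isDefined noise at the call site.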
Re: [ANNOUNCE] Apache Spark 3.1.2 released
Big shout-out to you, Dongjoon! Thank you. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Wed, Jun 2, 2021 at 2:59 AM Dongjoon Hyun wrote: > We are happy to announce the availability of Spark 3.1.2! > > Spark 3.1.2 is a maintenance release containing stability fixes. This > release is based on the branch-3.1 maintenance branch of Spark. We strongly > recommend all 3.1 users to upgrade to this stable release. > > To download Spark 3.1.2, head over to the download page: > https://spark.apache.org/downloads.html > > To view the release notes: > https://spark.apache.org/releases/spark-release-3-1-2.html > > We would like to acknowledge all community members for contributing to this > release. This release would not have been possible without you. > > Dongjoon Hyun >
Should AggregationIterator.initializeBuffer be moved down to SortBasedAggregationIterator?
Hi,

Just found out that the only purpose of AggregationIterator.initializeBuffer is to keep SortBasedAggregationIterator happy [1]. Shouldn't it be moved down to SortBasedAggregationIterator to make things clear(er)?

[1] https://github.com/apache/spark/search?q=initializeBuffer

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

<https://twitter.com/jaceklaskowski>
Purpose of OffsetHolder as a LeafNode?
Hi,

Just stumbled upon OffsetHolder [1] and I'm curious why it's a LeafNode. What logical plan could it be part of?

[1] https://github.com/apache/spark/blob/1a6708918b32e821bff26a00d2d8b7236b29515f/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L633

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

<https://twitter.com/jaceklaskowski>
Re: Welcoming six new Apache Spark committers
Hi, Congrats to all of you committers! Wishing you all the best (commits)! Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Fri, Mar 26, 2021 at 9:21 PM Matei Zaharia wrote: > Hi all, > > The Spark PMC recently voted to add several new committers. Please join me > in welcoming them to their new role! Our new committers are: > > - Maciej Szymkiewicz (contributor to PySpark) > - Max Gekk (contributor to Spark SQL) > - Kent Yao (contributor to Spark SQL) > - Attila Zsolt Piros (contributor to decommissioning and Spark on > Kubernetes) > - Yi Wu (contributor to Spark Core and SQL) > - Gabor Somogyi (contributor to Streaming and security) > > All six of them contributed to Spark 3.1 and we’re very excited to have > them join as committers. > > Matei and the Spark PMC > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > >
Re: [VOTE] Release Spark 3.1.1 (RC2)
Hi Hyukjin, FYI: cloud-fan commented 3 hours ago: thanks, merging to master/3.1! https://github.com/apache/spark/pull/31550#issuecomment-781977920 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Sat, Feb 13, 2021 at 3:46 AM Hyukjin Kwon wrote: > I have received a ping about a new blocker, a regression on a temporary > function in CTE - worked before but now it's broken ( > https://github.com/apache/spark/pull/31550). Thank you @Peter Toth > > I tend to treat this as a legitimate blocker. I will cut another RC right > after this fix if we're all good with it. > > 2021년 2월 11일 (목) 오전 9:20, Takeshi Yamamuro 님이 작성: > >> +1 >> >> I looked around the jira tickets and I think there is no explicit blocker >> issue on the Spark SQL component. >> Also, I ran the tests on AWS envs and I couldn't find any issue there, >> too. >> >> Bests, >> Takeshi >> >> On Thu, Feb 11, 2021 at 7:37 AM Mridul Muralidharan >> wrote: >> >>> >>> Signatures, digests, etc check out fine. >>> Checked out tag and build/tested with -Pyarn -Phadoop-2.7 -Phive >>> -Phive-thriftserver -Pmesos -Pkubernetes >>> >>> I keep getting test failures >>> with org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite: removing this >>> suite gets the build through though - does anyone have suggestions on how >>> to fix it ? >>> Perhaps a local problem at my end ? >>> >>> >>> Regards, >>> Mridul >>> >>> >>> >>> On Mon, Feb 8, 2021 at 6:24 PM Hyukjin Kwon wrote: >>> >>>> Please vote on releasing the following candidate as Apache Spark >>>> version 3.1.1. >>>> >>>> The vote is open until February 15th 5PM PST and passes if a majority >>>> +1 PMC votes are cast, with a minimum of 3 +1 votes. 
>>>> >>>> Note that it is 7 days this time because it is a holiday season in >>>> several countries including South Korea (where I live), China etc., and I >>>> would like to make sure people do not miss it because it is a holiday >>>> season. >>>> >>>> [ ] +1 Release this package as Apache Spark 3.1.1 >>>> [ ] -1 Do not release this package because ... >>>> >>>> To learn more about Apache Spark, please see http://spark.apache.org/ >>>> >>>> The tag to be voted on is v3.1.1-rc2 (commit >>>> cf0115ac2d60070399af481b14566f33d22ec45e): >>>> https://github.com/apache/spark/tree/v3.1.1-rc2 >>>> >>>> The release files, including signatures, digests, etc. can be found at: >>>> <https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc1-bin/> >>>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc2-bin/ >>>> >>>> Signatures used for Spark RCs can be found in this file: >>>> https://dist.apache.org/repos/dist/dev/spark/KEYS >>>> >>>> The staging repository for this release can be found at: >>>> https://repository.apache.org/content/repositories/orgapachespark-1365 >>>> >>>> The documentation corresponding to this release can be found at: >>>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc2-docs/ >>>> >>>> The list of bug fixes going into 3.1.1 can be found at the following >>>> URL: >>>> https://s.apache.org/41kf2 >>>> >>>> This release is using the release script of the tag v3.1.1-rc2. >>>> >>>> FAQ >>>> >>>> === >>>> What happened to 3.1.0? >>>> === >>>> >>>> There was a technical issue during Apache Spark 3.1.0 preparation, and >>>> it was discussed and decided to skip 3.1.0. >>>> Please see >>>> https://spark.apache.org/news/next-official-release-spark-3.1.1.html for >>>> more details. >>>> >>>> = >>>> How can I help test this release? >>>> = >>>> >>>> If you are a Spark user, you can help us test this release by taking >>>> an existing Spark workload and running on this release candidate, then >>>> reporting any regressions. 
>>>> >>>> If you're working in PySpark you
Re: [DISCUSS] Add RocksDB StateStore
Hi,

I'm "okay to add RocksDB StateStore as external module". See no reason not to.

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

<https://twitter.com/jaceklaskowski>

On Tue, Feb 2, 2021 at 9:32 AM Liang-Chi Hsieh wrote:

> Hi devs,
>
> In Spark structured streaming, we need a state store for state management
> in stateful operators such as streaming aggregates, joins, etc. We have
> one and only one state store implementation now. It is an in-memory
> hashmap backed up in an HDFS-compliant file system at the end of every
> micro-batch.
>
> As it basically uses an in-memory map to store states, memory consumption
> is a serious issue and state store size is limited by the size of the
> executor memory. Moreover, a state store using more memory means it may
> impact the performance of task execution that requires memory too.
>
> Internally we see more streaming applications that require large state in
> stateful operations. For such requirements, we need a StateStore that
> does not rely on memory to store states.
>
> This seems to be also true externally, as several other major streaming
> frameworks already use RocksDB for state management. RocksDB is an
> embedded DB and streaming engines can use it to store state instead of
> memory storage.
>
> So it seems to me it is proven to be a good choice for large state usage.
> But Spark SS still lacks a built-in state store for the requirement.
>
> Previously there was one attempt, SPARK-28120, to add a RocksDB
> StateStore into Spark SS. IIUC, it was pushed back due to two concerns:
> extra code maintenance cost and the RocksDB dependency it introduces.
>
> For the first concern, as more users require the feature, it should be
> highly used code in SS and more developers will look at it. For the
> second one, we propose (SPARK-34198) to add it as an external module to
> relieve the dependency concern.
>
> Because it was pushed back previously, I'm going to raise this discussion
> to know what people think about it now, in advance of submitting any
> code.
>
> I think there might be some possible opinions:
>
> 1. okay to add RocksDB StateStore into sql core module
> 2. not okay for 1, but okay to add RocksDB StateStore as external module
> 3. either 1 or 2 is okay
> 4. not okay to add RocksDB StateStore, no matter into sql core or as
> external module
>
> Please let us know if you have some thoughts.
>
> Thank you.
>
> Liang-Chi Hsieh
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
Re: How to contribute the code
Hi,

http://spark.apache.org/contributing.html ?

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

<https://twitter.com/jaceklaskowski>

On Sat, Jan 30, 2021 at 3:55 PM hammerCS <1624067...@qq.com> wrote:

> I have been learning Spark for one year. Now I want to get the hang of
> Spark by contributing code, but the codebase is so massive that I don't
> know where to start.
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
[K8S] ExecutorPodsWatchSnapshotSource with no spark-exec-inactive label in 3.1?
Hi,

The spark-exec-inactive label is not used when watching executor pods in ExecutorPodsWatchSnapshotSource [1], unlike in PollRunnable [2]. Any reasons? Am I correct to consider it a bug?

[1] https://github.com/apache/spark/blob/270a9a3cac25f3e799460320d0fc94ccd7ecfaea/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsWatchSnapshotSource.scala#L39-L42
[2] https://github.com/apache/spark/blob/ee624821a903a263c844ed849d6833df8e9ad43e/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsPollingSnapshotSource.scala#L62

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

<https://twitter.com/jaceklaskowski>
Re: Should 3.1.0 config props be 3.1.1 (as s.k.executor.missingPodDetectDelta)?
Hi Hyukjin, Agreed. I asked to see if I'm not missing anything. Thank you. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Sat, Jan 23, 2021 at 1:01 PM Hyukjin Kwon wrote: > I think we can just leave it as is. We have the unofficial 3.1.0 release > with its corresponding git tag so 3.1.0 mark isn't completely useless. > > Also changing means we should go through JIRAs and change the version > properties, fixing conflict, etc. which I don't think is worthwhile. > > On Sat, 23 Jan 2021, 20:43 Jacek Laskowski, wrote: > >> Hi, >> >> Just noticed a couple of configuration properties marked >> as version("3.1.0") in Spark on Kubernetes' Config. There's one >> spark.kubernetes.executor.missingPodDetectDelta property marked as 3.1.1. >> >> That triggered this question in my mind whether all together should be >> 3.1.1? >> >> (We could leave it as is as an "easter egg"-like thing too) >> >> Pozdrawiam, >> Jacek Laskowski >> >> https://about.me/JacekLaskowski >> "The Internals Of" Online Books <https://books.japila.pl/> >> Follow me on https://twitter.com/jaceklaskowski >> >> <https://twitter.com/jaceklaskowski> >> >
Should 3.1.0 config props be 3.1.1 (as s.k.executor.missingPodDetectDelta)?
Hi,

Just noticed a couple of configuration properties marked as version("3.1.0") in Spark on Kubernetes' Config. There's one property, spark.kubernetes.executor.missingPodDetectDelta, marked as 3.1.1.

That triggered the question whether they should all be 3.1.1.

(We could leave it as is as an "easter egg"-like thing too.)

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

<https://twitter.com/jaceklaskowski>
Re: [VOTE] Release Spark 3.1.1 (RC1)
Hi,

+1 (non-binding)

1. Built locally using AdoptOpenJDK (build 11.0.9+11) with -Pyarn,kubernetes,hive-thriftserver,scala-2.12 -DskipTests
2. Ran batch and streaming demos using Spark on Kubernetes (minikube) using spark-shell (client deploy mode) and spark-submit --deploy-mode cluster

I reported a non-blocking issue with "the only developer Matei" (https://issues.apache.org/jira/browse/SPARK-34158).

Found a minor non-blocking (but annoying) issue in Spark on k8s that's different from 3.0.1. It should really be silenced like the other debug messages in ExecutorPodsAllocator:

21/01/19 12:23:26 DEBUG ExecutorPodsAllocator: ResourceProfile Id: 0 pod allocation status: 2 running, 0 pending. 0 unacknowledged.
21/01/19 12:23:27 DEBUG ExecutorPodsAllocator: ResourceProfile Id: 0 pod allocation status: 2 running, 0 pending. 0 unacknowledged.
21/01/19 12:23:28 DEBUG ExecutorPodsAllocator: ResourceProfile Id: 0 pod allocation status: 2 running, 0 pending. 0 unacknowledged.
21/01/19 12:23:29 DEBUG ExecutorPodsAllocator: ResourceProfile Id: 0 pod allocation status: 2 running, 0 pending. 0 unacknowledged.

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

<https://twitter.com/jaceklaskowski>

On Mon, Jan 18, 2021 at 1:06 PM Hyukjin Kwon wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 3.1.1.
>
> The vote is open until January 22nd 4PM PST and passes if a majority +1
> PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.1.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.1.1-rc1 (commit
> 53fe365edb948d0e05a5ccb62f349cd9fcb4bb5d):
> https://github.com/apache/spark/tree/v3.1.1-rc1
>
> The release files, including signatures, digests, etc.
can be found at: > https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc1-bin/ > > Signatures used for Spark RCs can be found in this file: > https://dist.apache.org/repos/dist/dev/spark/KEYS > > The staging repository for this release can be found at: > https://repository.apache.org/content/repositories/orgapachespark-1364 > > The documentation corresponding to this release can be found at: > https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc1-docs/ > > The list of bug fixes going into 3.1.1 can be found at the following URL: > https://s.apache.org/41kf2 > > This release is using the release script of the tag v3.1.1-rc1. > > FAQ > > === > What happened to 3.1.0? > === > > There was a technical issue during Apache Spark 3.1.0 preparation, and it > was discussed and decided to skip 3.1.0. > Please see > https://spark.apache.org/news/next-official-release-spark-3.1.1.html for > more details. > > = > How can I help test this release? > = > > If you are a Spark user, you can help us test this release by taking > an existing Spark workload and running on this release candidate, then > reporting any regressions. > > If you're working in PySpark you can set up a virtual env and install > the current RC via "pip install > https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc1-bin/pyspark-3.1.1.tar.gz > " > and see if anything important breaks. > In the Java/Scala, you can add the staging repository to your projects > resolvers and test > with the RC (make sure to clean up the artifact cache before/after so > you don't end up building with an out of date RC going forward). > > === > What should happen to JIRA tickets still targeting 3.1.1? > === > > The current list of open tickets targeted at 3.1.1 can be found at: > https://issues.apache.org/jira/projects/SPARK and search for "Target > Version/s" = 3.1.1 > > Committers should look at those and triage. 
Extremely important bug > fixes, documentation, and API tweaks that impact compatibility should > be worked on immediately. Everything else please retarget to an > appropriate release. > > == > But my bug isn't fixed? > == > > In order to make timely releases, we will typically not hold the > release unless the bug in question is a regression from the previous > release. That being said, if there is something which is a regression > that has not been correctly targeted please ping me or a committer to > help target the issue. > >
[K8S] KUBERNETES_EXECUTOR_REQUEST_CORES
Hi,

I'm curious why this line [1] uses sparkConf to find KUBERNETES_EXECUTOR_REQUEST_CORES while the next [2] goes through kubernetesConf.

(I'm not going to mention its getOrElse case here, and will wait until I get an OK to change this due to the above "misuse".)

[1] https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala#L71
[2] https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala#L72

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

<https://twitter.com/jaceklaskowski>
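To make the asymmetry concrete, here is a toy sketch. SparkConfLike and KubernetesConfLike are hypothetical stand-ins of my own (not the actual SparkConf/KubernetesConf classes); the point is that when the wrapper merely delegates to the wrapped conf, both read paths resolve the same value, so mixing the two styles on adjacent lines looks accidental rather than intentional:

```scala
// Toy model of the two read paths described above.
final case class SparkConfLike(settings: Map[String, String]) {
  def get(key: String): Option[String] = settings.get(key)
}
final case class KubernetesConfLike(sparkConf: SparkConfLike) {
  // The wrapper ultimately resolves entries from the wrapped conf.
  def get(key: String): Option[String] = sparkConf.get(key)
}

val conf = KubernetesConfLike(SparkConfLike(Map(
  "spark.kubernetes.executor.request.cores" -> "2")))

// Style [1]: reach through to the raw conf.
val viaSparkConf = conf.sparkConf.get("spark.kubernetes.executor.request.cores")
// Style [2]: go through the wrapper.
val viaWrapper = conf.get("spark.kubernetes.executor.request.cores")
```

Since both styles return the same value, consistency would suggest picking one of them throughout BasicExecutorFeatureStep.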
Re: [VOTE] Release Spark 3.1.0 (RC1)
Hi Sean, +1 to leave it. Makes so much more sense (as that's what really happened and the history of Apache Spark is...irreversible). Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Thu, Jan 7, 2021 at 3:44 PM Sean Owen wrote: > While we can delete the tag, maybe just leave it. As a general rule we > would not remove anything pushed to the main git repo. > > On Thu, Jan 7, 2021 at 8:31 AM Jacek Laskowski wrote: > >> Hi, >> >> BTW, wondering aloud. Since it was agreed to skip 3.1.0 and go ahead with >> 3.1.1, what's gonna happen with v3.1.0 tag [1]? Is it going away and we'll >> see 3.1.1-rc1? >> >> [1] https://github.com/apache/spark/tree/v3.1.0-rc1 >> >> Pozdrawiam, >> Jacek Laskowski >> >> https://about.me/JacekLaskowski >> "The Internals Of" Online Books <https://books.japila.pl/> >> Follow me on https://twitter.com/jaceklaskowski >> >> <https://twitter.com/jaceklaskowski> >> >>>
Re: [VOTE] Release Spark 3.1.0 (RC1)
Hi, BTW, wondering aloud. Since it was agreed to skip 3.1.0 and go ahead with 3.1.1, what's gonna happen with v3.1.0 tag [1]? Is it going away and we'll see 3.1.1-rc1? [1] https://github.com/apache/spark/tree/v3.1.0-rc1 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Thu, Jan 7, 2021 at 3:12 PM Jacek Laskowski wrote: > Hi, > > I'm just reading this now. I'm for 3.1.1 with no 3.1.0 but the news that > we're skipping that particular release. Gonna be more fun! :) > > Pozdrawiam, > Jacek Laskowski > > https://about.me/JacekLaskowski > "The Internals Of" Online Books <https://books.japila.pl/> > Follow me on https://twitter.com/jaceklaskowski > > <https://twitter.com/jaceklaskowski> > > > On Thu, Jan 7, 2021 at 6:13 AM Hyukjin Kwon wrote: > >> Thank you Holden and Wenchen! >> >> Let me: >> - prepare a PR for news in spark-website first about 3.1.0 accident late >> tonight (in KST) >> - and start to prepare 3.1.1 probably in few more days like next monday >> in case other people have different thoughts >> >> >> >> 2021년 1월 7일 (목) 오후 2:04, Holden Karau 님이 작성: >> >>> I think that posting the 3.1.0 maven release was an accident and we're >>> going to 3.1.1 RCs is the right step forward. >>> I'd ask for maybe a day before cutting the 3.1.1 release, I think >>> https://issues.apache.org/jira/browse/SPARK-34018 is also a blocker (at >>> first I thought it was just a test issue, but Dongjoon pointed out the NPE >>> happens in prod too). >>> >>> I'd also like to echo the: it's totally ok we all make mistakes >>> especially in partially manual & partially automated environments, I've >>> created a bunch of RCs labels without recognizing they were getting pushed >>> automatically. 
>>> >>> On Wed, Jan 6, 2021 at 8:57 PM Wenchen Fan wrote: >>> >>>> I agree with Jungtaek that people are likely to be biased when testing >>>> 3.1.0. At least this will not be the same community-blessed release as >>>> previous ones, because the voting is already affected by the fact that >>>> 3.1.0 is already in maven central. Skipping 3.1.0 sounds better to me. >>>> >>>> On Thu, Jan 7, 2021 at 12:54 PM Hyukjin Kwon >>>> wrote: >>>> >>>>> Okay, let me just start to prepare 3.1.1. I think that will address >>>>> all concerns except that 3.1.0 will remain in Maven as incomplete. >>>>> By right, removal in the Maven repo is disallowed. Overwrite is >>>>> possible as far as I know but other mirrors that maintain cache will get >>>>> affected. >>>>> Maven is one of the downstream publish channels, and we haven't >>>>> officially announced and published it to Apache repo anyway. >>>>> I will prepare to upload news in spark-website to explain that 3.1.0 >>>>> is incompletely published because there was something wrong during the >>>>> release process, and we go to 3.1.1 right away. >>>>> Are we all good with this? >>>>> >>>>> >>>>> >>>>> 2021년 1월 7일 (목) 오후 1:11, Hyukjin Kwon 님이 작성: >>>>> >>>>>> I think that It would be great though if we have a clear blocker that >>>>>> makes the release pointless if we want to drop this RC practically given >>>>>> that we will schedule 3.1.1 faster - non-regression bug fixes will be >>>>>> delivered to end users relatively fast. >>>>>> That would make it clear which option we should take. I personally >>>>>> don't mind dropping 3.1.0 as well; we'll have to wait for the INFRA >>>>>> team's >>>>>> response anyway. >>>>>> >>>>>> >>>>>> 2021년 1월 7일 (목) 오후 1:03, Sean Owen 님이 작성: >>>>>> >>>>>>> I don't agree the first two are blockers for reasons I gave earlier. >>>>>>> Those two do look like important issues - are they regressions from >>>>>>> 3.0.1? 
>>>>>>> I do agree we'd probably cut a new RC for those in any event, so >>>>>>> agree with the plan to drop 3.1.0 (if the Maven release can't be >>>>>>>
Re: [VOTE] Release Spark 3.1.0 (RC1)
Hi, I'm just reading this now. I'm for 3.1.1 with no 3.1.0 but the news that we're skipping that particular release. Gonna be more fun! :) Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Thu, Jan 7, 2021 at 6:13 AM Hyukjin Kwon wrote: > Thank you Holden and Wenchen! > > Let me: > - prepare a PR for news in spark-website first about 3.1.0 accident late > tonight (in KST) > - and start to prepare 3.1.1 probably in few more days like next monday in > case other people have different thoughts > > > > 2021년 1월 7일 (목) 오후 2:04, Holden Karau 님이 작성: > >> I think that posting the 3.1.0 maven release was an accident and we're >> going to 3.1.1 RCs is the right step forward. >> I'd ask for maybe a day before cutting the 3.1.1 release, I think >> https://issues.apache.org/jira/browse/SPARK-34018 is also a blocker (at >> first I thought it was just a test issue, but Dongjoon pointed out the NPE >> happens in prod too). >> >> I'd also like to echo the: it's totally ok we all make mistakes >> especially in partially manual & partially automated environments, I've >> created a bunch of RCs labels without recognizing they were getting pushed >> automatically. >> >> On Wed, Jan 6, 2021 at 8:57 PM Wenchen Fan wrote: >> >>> I agree with Jungtaek that people are likely to be biased when testing >>> 3.1.0. At least this will not be the same community-blessed release as >>> previous ones, because the voting is already affected by the fact that >>> 3.1.0 is already in maven central. Skipping 3.1.0 sounds better to me. >>> >>> On Thu, Jan 7, 2021 at 12:54 PM Hyukjin Kwon >>> wrote: >>> >>>> Okay, let me just start to prepare 3.1.1. I think that will address all >>>> concerns except that 3.1.0 will remain in Maven as incomplete. >>>> By right, removal in the Maven repo is disallowed. 
Overwrite is >>>> possible as far as I know but other mirrors that maintain cache will get >>>> affected. >>>> Maven is one of the downstream publish channels, and we haven't >>>> officially announced and published it to Apache repo anyway. >>>> I will prepare to upload news in spark-website to explain that 3.1.0 is >>>> incompletely published because there was something wrong during the release >>>> process, and we go to 3.1.1 right away. >>>> Are we all good with this? >>>> >>>> >>>> >>>> 2021년 1월 7일 (목) 오후 1:11, Hyukjin Kwon 님이 작성: >>>> >>>>> I think that It would be great though if we have a clear blocker that >>>>> makes the release pointless if we want to drop this RC practically given >>>>> that we will schedule 3.1.1 faster - non-regression bug fixes will be >>>>> delivered to end users relatively fast. >>>>> That would make it clear which option we should take. I personally >>>>> don't mind dropping 3.1.0 as well; we'll have to wait for the INFRA team's >>>>> response anyway. >>>>> >>>>> >>>>> 2021년 1월 7일 (목) 오후 1:03, Sean Owen 님이 작성: >>>>> >>>>>> I don't agree the first two are blockers for reasons I gave earlier. >>>>>> Those two do look like important issues - are they regressions from >>>>>> 3.0.1? >>>>>> I do agree we'd probably cut a new RC for those in any event, so >>>>>> agree with the plan to drop 3.1.0 (if the Maven release can't be >>>>>> overwritten) >>>>>> >>>>>> On Wed, Jan 6, 2021 at 9:38 PM Dongjoon Hyun >>>>>> wrote: >>>>>> >>>>>>> Before we discover the pre-uploaded artifacts, both Jungtaek and >>>>>>> Hyukjin already made two blockers shared here. >>>>>>> IIUC, it meant implicitly RC1 failure at that time. >>>>>>> >>>>>>> In addition to that, there are two correctness issues. So, I made up >>>>>>> my mind to cast -1 for this RC1 before joining this thread. >>>>>>> >>>>>>> SPARK-34011 ALTER TABLE .. RENAME TO PARTITION doesn't refresh cache >>>>>>> (committed after tagging) >>>>>>> SPARK-34027 ALTER TABLE .. 
RECOVER PARTITIONS doesn't refresh cache >>>>>>> (PR is under review) >>>>>>> >>>>>>> Although the above issues are not regression, those are enough for >>>>>>> me to give -1 for 3.1.0 RC1. >>>>>>> >>>>>>> On Wed, Jan 6, 2021 at 3:52 PM Sean Owen wrote: >>>>>>> >>>>>>>> I just don't see a reason to believe there's a rush? just test it >>>>>>>> as normal? I did, you can too, etc. >>>>>>>> Or specifically what blocks the current RC? >>>>>>>> >>>>>>> >> >> -- >> Twitter: https://twitter.com/holdenkarau >> Books (Learning Spark, High Performance Spark, etc.): >> https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9> >> YouTube Live Streams: https://www.youtube.com/user/holdenkarau >> >
Re: [VOTE] Release Spark 3.1.0 (RC1)
Hi, I'm curious why Spark 3.1.0 is already available in repo1.maven.org? https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.12/3.1.0/ Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Wed, Jan 6, 2021 at 1:01 AM Hyukjin Kwon wrote: > Please vote on releasing the following candidate as Apache Spark version > 3.1.0. > > The vote is open until January 8th 4PM PST and passes if a majority +1 PMC > votes are cast, with a minimum of 3 +1 votes. > > [ ] +1 Release this package as Apache Spark 3.1.0 > [ ] -1 Do not release this package because ... > > To learn more about Apache Spark, please see http://spark.apache.org/ > > The tag to be voted on is v3.1.0-rc1 (commit > 97340c1e34cfd84de445b6b7545cfa466a1baaf6): > https://github.com/apache/spark/tree/v3.1.0-rc1 > > The release files, including signatures, digests, etc. can be found at: > https://dist.apache.org/repos/dist/dev/spark/v3.1.0-rc1-bin/ > > Signatures used for Spark RCs can be found in this file: > https://dist.apache.org/repos/dist/dev/spark/KEYS > > The staging repository for this release can be found at: > https://repository.apache.org/content/repositories/orgapachespark-1363/ > > The documentation corresponding to this release can be found at: > https://dist.apache.org/repos/dist/dev/spark/v3.1.0-rc1-docs/ > > The list of bug fixes going into 3.1.0 can be found at the following URL: > https://s.apache.org/ldzzl > > This release is using the release script of the tag v3.1.0-rc1. > > FAQ > > > = > How can I help test this release? > = > > If you are a Spark user, you can help us test this release by taking > an existing Spark workload and running on this release candidate, then > reporting any regressions. 
> > If you're working in PySpark you can set up a virtual env and install > the current RC via "pip install > https://dist.apache.org/repos/dist/dev/spark/v3.1.0-rc1-bin/pyspark-3.1.0.tar.gz > " > and see if anything important breaks. > In the Java/Scala, you can add the staging repository to your projects > resolvers and test > with the RC (make sure to clean up the artifact cache before/after so > you don't end up building with an out of date RC going forward). > > === > What should happen to JIRA tickets still targeting 3.1.0? > === > > The current list of open tickets targeted at 3.1.0 can be found at: > https://issues.apache.org/jira/projects/SPARK and search for "Target > Version/s" = 3.1.0 > > Committers should look at those and triage. Extremely important bug > fixes, documentation, and API tweaks that impact compatibility should > be worked on immediately. Everything else please retarget to an > appropriate release. > > == > But my bug isn't fixed? > == > > In order to make timely releases, we will typically not hold the > release unless the bug in question is a regression from the previous > release. That being said, if there is something which is a regression > that has not been correctly targeted please ping me or a committer to > help target the issue. > >
Re: [3.0.1] ExecutorMonitor.onJobStart and StageInfo.shuffleDepId that's never used?
Hi, Sorry. A false alarm. Got mistaken with what IDEA calls "unused" may not really be unused. It is (re)assigned in StageInfo.fromStage for a ShuffleMapStage [1] and then caught in ExecutorMonitor [2] (since it's a SparkListener). [1] https://github.com/apache/spark/blob/094563384478a402c36415edf04ee7b884a34fc9/core/src/main/scala/org/apache/spark/scheduler/StageInfo.scala#L108 [2] https://github.com/apache/spark/blob/78df2caec8c94c31e5c9ddc30ed8acb424084181/core/src/main/scala/org/apache/spark/scheduler/dynalloc/ExecutorMonitor.scala#L179 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Wed, Dec 30, 2020 at 3:34 PM Jacek Laskowski wrote: > Hi, > > It's been a while. Glad to be back Sparkians! > > I've been exploring ExecutorMonitor.onJobStart in 3.0.1 and noticed that > it uses StageInfo.shuffleDepId [1] that is None by default and moreover > never "written to" according to IntelliJ IDEA. > > Is this the case and intentional? > > I'm wondering how much IDEA knows about codegen and that's where it's used > (?) > > I've just stumbled upon it and before I spend more time on this I thought > I'd ask (perhaps it's going to change in 3.1?). Help appreciated. > > [1] > https://github.com/apache/spark/blob/78df2caec8c94c31e5c9ddc30ed8acb424084181/core/src/main/scala/org/apache/spark/scheduler/dynalloc/ExecutorMonitor.scala#L179 > > Pozdrawiam, > Jacek Laskowski > > https://about.me/JacekLaskowski > "The Internals Of" Online Books <https://books.japila.pl/> > Follow me on https://twitter.com/jaceklaskowski > > <https://twitter.com/jaceklaskowski> >
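The "false alarm" pattern above — a field that an IDE flags as never written because the only write happens inside a factory method, keyed on the concrete stage type — can be sketched in plain Scala. All names below are simplified stand-ins for the real `StageInfo`/`ShuffleMapStage` classes, not Spark's actual code:

```scala
// Hypothetical model of the StageInfo.shuffleDepId pattern: the field looks
// "unused" at its declaration because it is only assigned inside the
// companion's factory method, and only for one stage type.
sealed trait Stage
case class ShuffleMapStage(shuffleDepId: Int) extends Stage
case object ResultStage extends Stage

class StageInfo(val name: String) {
  // None by default; only StageInfo.fromStage ever writes to it.
  var shuffleDepId: Option[Int] = None
}

object StageInfo {
  def fromStage(stage: Stage): StageInfo = {
    val info = new StageInfo("stage")
    stage match {
      case ShuffleMapStage(depId) => info.shuffleDepId = Some(depId) // the "hidden" write
      case ResultStage            => () // stays None
    }
    info
  }
}
```

A listener (here, `ExecutorMonitor`) then reads the field off the `StageInfo` it receives, which is why the write is invisible at the declaration site.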
[3.0.1] ExecutorMonitor.onJobStart and StageInfo.shuffleDepId that's never used?
Hi, It's been a while. Glad to be back Sparkians! I've been exploring ExecutorMonitor.onJobStart in 3.0.1 and noticed that it uses StageInfo.shuffleDepId [1] that is None by default and moreover never "written to" according to IntelliJ IDEA. Is this the case and intentional? I'm wondering how much IDEA knows about codegen and that's where it's used (?) I've just stumbled upon it and before I spend more time on this I thought I'd ask (perhaps it's going to change in 3.1?). Help appreciated. [1] https://github.com/apache/spark/blob/78df2caec8c94c31e5c9ddc30ed8acb424084181/core/src/main/scala/org/apache/spark/scheduler/dynalloc/ExecutorMonitor.scala#L179 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski>
Re: Incorrect Scala version for Spark 2.4.x releases in the docs?
Thanks Sean for such a quick response! Let me propose a fix for the docs. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Thu, Sep 17, 2020 at 4:16 PM Sean Owen wrote: > The pre-built binary distros should use 2.11 in 2.4.x. Artifacts for > both Scala versions are available, yes. > Yeah I think it should really say you can use 2.11 or 2.12. > > On Thu, Sep 17, 2020 at 9:12 AM Jacek Laskowski wrote: > > > > Hi, > > > > Just found this paragraph in > http://spark.apache.org/docs/2.4.6/index.html#downloading: > > > > "Spark runs on Java 8, Python 2.7+/3.4+ and R 3.1+. For the Scala API, > Spark 2.4.6 uses Scala 2.12. You will need to use a compatible Scala > version (2.12.x)." > > > > That seems to contradict the version of Scala in the pom.xml [1] which > is 2.11.12. I think this says that Spark 2.4.6 uses Scala 2.12 by default > which is incorrect to me. Am I missing something? > > > > My question is what's the official Scala version of Spark 2.4.6 (and > others in 2.4.x release line)? > > > > (I do know that Spark 2.4.x could be compiled with Scala 2.12, but that > requires scala-2.12 profile [2] to be enabled) > > > > [1] https://github.com/apache/spark/blob/v2.4.6/pom.xml#L158 > > [2] https://github.com/apache/spark/blob/v2.4.6/pom.xml#L2830 > > > > Pozdrawiam, > > Jacek Laskowski > > > > https://about.me/JacekLaskowski > > "The Internals Of" Online Books > > Follow me on https://twitter.com/jaceklaskowski > > >
Incorrect Scala version for Spark 2.4.x releases in the docs?
Hi, Just found this paragraph in http://spark.apache.org/docs/2.4.6/index.html#downloading: "Spark runs on Java 8, Python 2.7+/3.4+ and R 3.1+. For the Scala API, Spark 2.4.6 uses Scala 2.12. You will need to use a compatible Scala version (2.12.x)." That seems to contradict the version of Scala in the pom.xml [1] which is 2.11.12. I think this says that Spark 2.4.6 uses Scala 2.12 by default which is incorrect to me. Am I missing something? My question is what's the official Scala version of Spark 2.4.6 (and others in 2.4.x release line)? (I do know that Spark 2.4.x could be compiled with Scala 2.12, but that requires scala-2.12 profile [2] to be enabled) [1] https://github.com/apache/spark/blob/v2.4.6/pom.xml#L158 [2] https://github.com/apache/spark/blob/v2.4.6/pom.xml#L2830 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski>
Why is V2SessionCatalog not a CatalogExtension?
Hi, Just started exploring Catalog Plugin API and noticed these two classes: CatalogExtension and V2SessionCatalog. Why is V2SessionCatalog not a CatalogExtension? - V2SessionCatalog extends TableCatalog with SupportsNamespaces [1] - CatalogExtension extends TableCatalog, SupportsNamespaces [2] [1] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala#L41 [2] https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/CatalogExtension.java#L33 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski>
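One way to see the distinction the question raises — sketched here with simplified stand-in traits, not the real interfaces — is that `CatalogExtension` exists for user catalogs that *wrap* the session catalog (hence its extra delegate hook), while `V2SessionCatalog` plays the role of the delegate itself, so it implements the same two parents directly:

```scala
// Minimal stand-ins for the Catalog Plugin API shapes (names mirror Spark's,
// but these are simplified sketches for illustration only).
trait CatalogPlugin { def name: String }
trait TableCatalog extends CatalogPlugin
trait SupportsNamespaces extends CatalogPlugin

// CatalogExtension adds a hook to receive the catalog it wraps.
trait CatalogExtension extends TableCatalog with SupportsNamespaces {
  def setDelegateCatalog(delegate: CatalogPlugin): Unit
}

// The session catalog is what gets wrapped; it has nothing to delegate to,
// so it extends the two parents directly rather than CatalogExtension.
class V2SessionCatalog extends TableCatalog with SupportsNamespaces {
  def name: String = "spark_catalog"
}

// A hypothetical user catalog that wraps the session catalog.
class MyCatalog extends CatalogExtension {
  private var delegate: Option[CatalogPlugin] = None
  def name: String = "my_catalog"
  def setDelegateCatalog(d: CatalogPlugin): Unit = delegate = Some(d)
  def delegateName: Option[String] = delegate.map(_.name)
}
```

Under that reading, making `V2SessionCatalog` a `CatalogExtension` would be circular: the delegate would itself expect a delegate.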
Why time difference while registering a new BlockManager (using BlockManagerMasterEndpoint)?
Hi, Just noticed an inconsistency between times when a BlockManager is about to be registered [1][2] and the time listeners are going to be informed [3], and got curious whether it's intentional or not. Why is the `time` value not used for SparkListenerBlockManagerAdded message? [1] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala#L453 [2] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala#L478 [3] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala#L481 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski>
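The inconsistency being asked about boils down to whether a timestamp is captured once and threaded through, or read from the clock a second time when the listener event is posted. A toy model of the two patterns (all names are hypothetical, not Spark's actual code):

```scala
// Toy model of the timing question around BlockManager registration.
final case class BlockManagerAdded(time: Long, id: String)

class Bus {
  var posted: List[BlockManagerAdded] = Nil
  def post(e: BlockManagerAdded): Unit = posted ::= e
}

// Consistent: listeners see exactly the registration timestamp.
def registerConsistent(bus: Bus, id: String, clock: () => Long): Long = {
  val time = clock()                     // captured once...
  bus.post(BlockManagerAdded(time, id))  // ...and reused for the event
  time
}

// Inconsistent (the pattern the mail points at): two separate clock reads
// can disagree.
def registerInconsistent(bus: Bus, id: String, clock: () => Long): Long = {
  val time = clock()
  bus.post(BlockManagerAdded(clock(), id)) // second read, may differ
  time
}
```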
Re: BlockManager and ShuffleManager = can getLocalBytes be ever used for shuffle blocks?
Thanks Yi Wu! Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Sat, Apr 18, 2020 at 12:17 PM wuyi wrote: > Hi Jacek, > > code in link [2] is the out of date. The commit > > https://github.com/apache/spark/commit/32ec528e63cb768f85644282978040157c3c2fb7 > has already removed unreachable branch. > > > Best, > Yi Wu > > > > -- > Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > >
ShuffleMapStage and pendingPartitions vs isAvailable or findMissingPartitions?
Hi, I found that ShuffleMapStage has this (apparently superfluous) pendingPartitions registry [1] for DAGScheduler and the description says: " /** * Partitions that either haven't yet been computed, or that were computed on an executor * that has since been lost, so should be re-computed. This variable is used by the * DAGScheduler to determine when a stage has completed. Task successes in both the active * attempt for the stage or in earlier attempts for this stage can cause paritition ids to get * removed from pendingPartitions. As a result, this variable may be inconsistent with the pending * tasks in the TaskSetManager for the active attempt for the stage (the partitions stored here * will always be a subset of the partitions that the TaskSetManager thinks are pending). */ " I'm curious why there is a need for this pendingPartitions since isAvailable or findMissingPartitions (using MapOutputTrackerMaster) know it already and I think are even more up-to-date. Why is there this extra registry? [1] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/ShuffleMapStage.scala#L60 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski>
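One reason the two views can coexist — sketched as a toy model, not Spark's actual bookkeeping — is that `pendingPartitions` answers "has the current attempt finished?", while the map-output view answers "which partitions still need (re)computing?". After an executor loss invalidates an output, the attempt can be complete while outputs are missing:

```scala
// Toy model: two different progress views of a shuffle map stage.
class ShuffleMapStageModel(numPartitions: Int) {
  // DAGScheduler-style view: tasks of the active attempt still outstanding.
  val pendingPartitions = scala.collection.mutable.Set.from(0 until numPartitions)
  // MapOutputTracker-style view: partition -> executor holding its output.
  private val outputs = scala.collection.mutable.Map.empty[Int, String]

  def taskSucceeded(partition: Int, executor: String): Unit = {
    pendingPartitions -= partition
    outputs(partition) = executor
  }

  def executorLost(executor: String): Unit =
    outputs.filterInPlace((_, e) => e != executor)

  def attemptFinished: Boolean = pendingPartitions.isEmpty
  def findMissingPartitions: Seq[Int] =
    (0 until numPartitions).filterNot(outputs.contains)
}
```

In this model the two registries legitimately disagree, which may be why the scheduler keeps both.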
BlockManager and ShuffleManager = can getLocalBytes be ever used for shuffle blocks?
Hi, While trying to understand the relationship of BlockManager and ShuffleManager I found that ShuffleManager is used for shuffle block data [1] (and that makes sense). What I found quite surprising is that BlockManager can call getLocalBytes for non-shuffle blocks that in turn does...fetching shuffle blocks (!) [2]. That begs the question, who could want this? In other words, what other components would want to call BlockManager.getLocalBytes for shuffle blocks? The only caller is TorrentBroadcast which is not for shuffle blocks, is it? In other words, getLocalBytes should NOT be bothering itself with fetching local shuffle blocks as that's a responsibility of someone else, i.e. ShuffleManager. Is this correct? I'd appreciate any help you can provide so I can understand the Spark core better. Thanks! [1] https://github.com/apache/spark/blob/31734399d57f3c128e66b0f97ef83eb4c9165978/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L382 [2] https://github.com/apache/spark/blob/31734399d57f3c128e66b0f97ef83eb4c9165978/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L637 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski>
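The dispatch being questioned can be reduced to a few lines — a "get local bytes" entry point that special-cases shuffle blocks and hands them to a shuffle resolver instead of the ordinary block store. The classes below are simplified stand-ins, not the real `BlockManager` API:

```scala
// Toy model of the getLocalBytes dispatch described in the mail.
sealed trait BlockId { def isShuffle: Boolean }
case class ShuffleBlockId(mapId: Int) extends BlockId { val isShuffle = true }
case class BroadcastBlockId(id: Int) extends BlockId { val isShuffle = false }

class BlockManagerModel(
    shuffleResolver: BlockId => String, // stands in for ShuffleManager's resolver
    blockStore: BlockId => String) {    // stands in for the local disk/memory store

  def getLocalBytes(blockId: BlockId): String =
    if (blockId.isShuffle) shuffleResolver(blockId) // the surprising branch
    else blockStore(blockId)
}
```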
Re: InferFiltersFromConstraints logical optimization rule and Optimizer.defaultBatches?
Hi Jungtaek, Thanks a lot for your answer. What you're saying reflects my understanding perfectly. It's a small change, but it makes understanding where rules are used much simpler (= less confusing). I'll propose a PR and see where it goes from there. Thanks! Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Wed, Apr 15, 2020 at 7:55 AM Jungtaek Lim wrote: > Please correct me if I'm missing something. At a glance, your statements > look correct if I understand correctly. I guess it might be simply missed, > but it sounds as pretty trivial one as only a line can be removed safely > which won't affect anything. (filterNot should be retained even we remove > the line to prevent extended rules to break this) > > On Sun, Apr 12, 2020 at 9:54 PM Jacek Laskowski wrote: > >> Hi, >> >> I'm curious why there is a need to include InferFiltersFromConstraints >> logical optimization in operatorOptimizationRuleSet value [1] that seems to >> be only to exclude it later [2]? >> >> In other words, I think that simply removing InferFiltersFromConstraints >> from operatorOptimizationRuleSet value [1] would make no change (except >> removing a "dead code"). >> >> Does this make sense? Could I be missing something? >> >> [1] >> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L80 >> [2] >> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L115 >> >> Pozdrawiam, >> Jacek Laskowski >> >> https://about.me/JacekLaskowski >> "The Internals Of" Online Books <https://books.japila.pl/> >> Follow me on https://twitter.com/jaceklaskowski >> >> <https://twitter.com/jaceklaskowski> >> >
InferFiltersFromConstraints logical optimization rule and Optimizer.defaultBatches?
Hi, I'm curious why there is a need to include InferFiltersFromConstraints logical optimization in operatorOptimizationRuleSet value [1] that seems to be only to exclude it later [2]? In other words, I think that simply removing InferFiltersFromConstraints from operatorOptimizationRuleSet value [1] would make no change (except removing a "dead code"). Does this make sense? Could I be missing something? [1] https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L80 [2] https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L115 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski>
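The observation reduces to plain collections: a rule is added to the base rule set and then immediately excluded via `filterNot`, so (assuming nothing else reads the base set) dropping it from the base set up front is behavior-preserving. The rule names below are illustrative, not the full optimizer list:

```scala
// The pattern in question, reduced to a Seq of rule names.
val inferFilters = "InferFiltersFromConstraints"
val operatorOptimizationRuleSet =
  Seq("PushDownPredicates", inferFilters, "CollapseProject")

// What the optimizer batch effectively runs after the exclusion:
val batchRules = operatorOptimizationRuleSet.filterNot(_ == inferFilters)

// Removing the rule from the base set up front yields the same batch
// (the filterNot stays, as Jungtaek notes, to guard extended rule sets):
val trimmedRuleSet = Seq("PushDownPredicates", "CollapseProject")
```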
Re: [DOCS] Spark SQL Upgrading Guide
Hi, Never mind. Found this [1]: > This config is deprecated and it will be removed in 3.0.0. And so it has :) Thanks and sorry for the trouble. [1] https://github.com/apache/spark/blob/830a4ec59b86253f18eb7dfd6ed0bbe0d7920e5b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L1306-L1307 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Sat, Feb 15, 2020 at 7:44 PM Jacek Laskowski wrote: > Hi, > > Just noticed that > http://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html (Spark > 2.4.5) has formatting issues in "Upgrading from Spark SQL 2.4.3 to 2.4.4" > [1] which got fixed in master [2]. That's OK. > > What made me wonder was the other change to the section "Upgrading from > Spark SQL 2.4 to 2.4.5" [3] that had the following item included: > > "Starting from 2.4.5, SQL configurations are effective also when a Dataset > is converted to an RDD and its plan is executed due to action on the > derived RDD. The previous behavior can be restored setting > spark.sql.legacy.rdd.applyConf to false: in this case, SQL configurations > are ignored for operations performed on a RDD derived from a Dataset." > > Why was this removed in master [4]? It was mentioned in "Notable changes" > of Spark Release 2.4.5 [5]. 
> > [1] > http://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html#upgrading-from-spark-sql-243-to-244 > [2] > https://github.com/apache/spark/blob/master/docs/sql-migration-guide.md#upgrading-from-spark-sql-243-to-244 > [3] > http://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html#upgrading-from-spark-sql-24-to-245 > [4] > https://github.com/apache/spark/blob/master/docs/sql-migration-guide.md#upgrading-from-spark-sql-244-to-245 > [5] http://spark.apache.org/releases/spark-release-2-4-5.html > > Pozdrawiam, > Jacek Laskowski > > https://about.me/JacekLaskowski > "The Internals Of" Online Books <https://books.japila.pl/> > Follow me on https://twitter.com/jaceklaskowski > > <https://twitter.com/jaceklaskowski> >
[DOCS] Spark SQL Upgrading Guide
Hi, Just noticed that http://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html (Spark 2.4.5) has formatting issues in "Upgrading from Spark SQL 2.4.3 to 2.4.4" [1] which got fixed in master [2]. That's OK. What made me wonder was the other change to the section "Upgrading from Spark SQL 2.4 to 2.4.5" [3] that had the following item included: "Starting from 2.4.5, SQL configurations are effective also when a Dataset is converted to an RDD and its plan is executed due to action on the derived RDD. The previous behavior can be restored setting spark.sql.legacy.rdd.applyConf to false: in this case, SQL configurations are ignored for operations performed on a RDD derived from a Dataset." Why was this removed in master [4]? It was mentioned in "Notable changes" of Spark Release 2.4.5 [5]. [1] http://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html#upgrading-from-spark-sql-243-to-244 [2] https://github.com/apache/spark/blob/master/docs/sql-migration-guide.md#upgrading-from-spark-sql-243-to-244 [3] http://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html#upgrading-from-spark-sql-24-to-245 [4] https://github.com/apache/spark/blob/master/docs/sql-migration-guide.md#upgrading-from-spark-sql-244-to-245 [5] http://spark.apache.org/releases/spark-release-2-4-5.html Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski>
Does StreamingSymmetricHashJoinExec work with watermark? I don't think so
Hi, I think watermark does not work for StreamingSymmetricHashJoinExec because of the following: 1. leftKeys and rightKeys have no spark.watermarkDelayMs metadata entry at planning [1] 2. Since the left and right keys had no watermark delay at planning the code [2] won't find it at execution Is my understanding correct? If not, can you point me at examples with watermark on 1) join keys and 2) values ? Merci beaucoup. [1] https://github.com/apache/spark/blob/v3.0.0-preview/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L477-L478 [2] https://github.com/apache/spark/blob/v3.0.0-preview/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinHelper.scala#L156-L164 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski The Internals of Spark SQL https://bit.ly/spark-sql-internals The Internals of Spark Structured Streaming https://bit.ly/spark-structured-streaming The Internals of Apache Kafka https://bit.ly/apache-kafka-internals Follow me at https://twitter.com/jaceklaskowski
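The check being pointed at can be sketched abstractly: the planner looks for a watermark-delay entry in the *key* attributes' metadata, so a watermark declared only on a value column would be invisible to it. The classes below are a toy model (the metadata key string is the one referenced in the mail; everything else is hypothetical):

```scala
// Toy model of the planner-side lookup for watermark delay on join keys.
case class Attribute(name: String, metadata: Map[String, Long])

val delayKey = "spark.watermarkDelayMs" // the metadata entry named in the mail

def watermarkDelayOnKeys(keys: Seq[Attribute]): Option[Long] =
  keys.flatMap(_.metadata.get(delayKey)).headOption

val keyWithoutDelay = Attribute("id", Map.empty)
val valueWithDelay  = Attribute("eventTime", Map(delayKey -> 10000L))
```

If the keys never carry the entry, the lookup comes back empty regardless of what the value columns declare, which matches the reading in the mail.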
Re: [SS][2.4.4] Confused with "WatermarkTracker: Event time watermark didn't move"?
Hi HeartSaVioR, > It might be due to empty batch Yeah...that's my understanding too. It's for a streaming aggregation in Append output mode so that's possible. I'll have a closer look at it. Thanks much for keeping up with this and the other questions. Much appreciated! Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski The Internals of Spark SQL https://bit.ly/spark-sql-internals The Internals of Spark Structured Streaming https://bit.ly/spark-structured-streaming The Internals of Apache Kafka https://bit.ly/apache-kafka-internals Follow me at https://twitter.com/jaceklaskowski On Mon, Oct 14, 2019 at 12:42 AM Jungtaek Lim wrote: > It might be due to empty batch (activated when there're stateful > operator(s) and the previous batch advances watermark), which has no input > so no moving watermark. > > Did you only turn on DEBUG for WatermarkTracker? If you turn on DEBUG for > MicroBatchExecution as well, it would log "Completed batch " so if > I'm not missing, it should be logged between updating event-time watermark > and watermark didn't move. You can attach streaming query listener and get > more information about batches. > > Thanks, > Jungtaek Lim (HeartSaVioR) > > On Tue, Oct 8, 2019 at 6:12 PM Jacek Laskowski wrote: > >> Hi, >> >> I haven't spent much time on it, but the following DEBUG message >> from WatermarkTracker sparked my interest :) >> >> I ran a streaming aggregation in Append mode and got the messages: >> >> 19/10/08 10:48:56 DEBUG WatermarkTracker: Observed event time stats 0: >> EventTimeStats(15000,1000,8000.0,2) >> 19/10/08 10:48:56 INFO WatermarkTracker: Updating event-time watermark >> from 0 to 5000 ms >> 19/10/08 10:48:56 DEBUG WatermarkTracker: Event time watermark didn't >> move: 5000 < 5000 >> >> I think the DEBUG message "Event time watermark didn't move" seems >> incorrect given that the query has just started and "Observed event time >> stats". 
It's true that the event-time watermark didn't move if it was 5000 >> before, but it was not as it has just started from scratch (no checkpointed >> state). >> >> Can anyone shed some light on this? I'll be digging deeper in a bit, but >> am hoping to get some more info before. Thanks! >> >> Pozdrawiam, >> Jacek Laskowski >> >> https://about.me/JacekLaskowski >> The Internals of Spark SQL https://bit.ly/spark-sql-internals >> The Internals of Spark Structured Streaming >> https://bit.ly/spark-structured-streaming >> The Internals of Apache Kafka https://bit.ly/apache-kafka-internals >> Follow me at https://twitter.com/jaceklaskowski >> >>
Re: [SS] number of output rows metric for streaming aggregation (StateStoreSaveExec) in Append output mode not measured?
Hi, That was really quick! #impressed Thanks HeartSaVioR for such prompt response! I'm fine with the current state of the issue = no need to change anything. Whatever makes Spark more shiny WFM! Jacek On Sun, 13 Oct 2019, 09:19 Jungtaek Lim, wrote: > Filed SPARK-29450 [1] and raised a patch [2]. Please let me know if you > would like to be assigned as a reporter of SPARK-29450. > > 1. https://issues.apache.org/jira/browse/SPARK-29450 > 2. https://github.com/apache/spark/pull/26104 > > On Sun, Oct 13, 2019 at 4:06 PM Jungtaek Lim > wrote: > >> Thanks for reporting. That might be possible it could be intentionally >> excluded as it could cause some confusion before introducing empty batch >> (given output rows are irrelevant to the input rows in current batch), but >> given we have empty batch I'm not seeing the reason why we don't deal with >> it. I'll file and submit a patch. >> >> Btw, there's a metric bug with empty batch as well - see SPARK-29314 [1] >> which I've submitted a patch recently. >> >> Thanks, >> Jungtaek Lim (HeartSaVioR) >> >> 1. http://issues.apache.org/jira/browse/SPARK-29314 >> >> >> On Sun, Oct 13, 2019 at 1:12 AM Jacek Laskowski wrote: >> >>> Hi, >>> >>> I use Spark 2.4.4 >>> >>> I've just noticed that the number of output rows metric >>> of StateStoreSaveExec physical operator does not seem to be measured for >>> Append output mode. In other words, whatever happens before or >>> after StateStoreSaveExec operator the metric is always 0. >>> >>> It is measured for the other modes - Complete and Update. >>> >>> See >>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala#L329-L365 >>> >>> Is this intentional? Why? 
>>> >>> Pozdrawiam, >>> Jacek Laskowski >>> >>> https://about.me/JacekLaskowski >>> The Internals of Spark SQL https://bit.ly/spark-sql-internals >>> The Internals of Spark Structured Streaming >>> https://bit.ly/spark-structured-streaming >>> The Internals of Apache Kafka https://bit.ly/apache-kafka-internals >>> Follow me at https://twitter.com/jaceklaskowski >>> >>>
[SS] number of output rows metric for streaming aggregation (StateStoreSaveExec) in Append output mode not measured?
Hi, I use Spark 2.4.4 I've just noticed that the number of output rows metric of StateStoreSaveExec physical operator does not seem to be measured for Append output mode. In other words, whatever happens before or after StateStoreSaveExec operator the metric is always 0. It is measured for the other modes - Complete and Update. See https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala#L329-L365 Is this intentional? Why? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski The Internals of Spark SQL https://bit.ly/spark-sql-internals The Internals of Spark Structured Streaming https://bit.ly/spark-structured-streaming The Internals of Apache Kafka https://bit.ly/apache-kafka-internals Follow me at https://twitter.com/jaceklaskowski
Re: [SS] How to create a streaming DataFrame (for a custom Source in Spark 2.4.4 / MicroBatch / DSv1)?
Hi, Thanks so much for such a thorough conversation. Enjoyed it very much. > Source/Sink traits are in org.apache.spark.sql.execution and thus they are private. That would explain why I couldn't find scaladocs. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski The Internals of Spark SQL https://bit.ly/spark-sql-internals The Internals of Spark Structured Streaming https://bit.ly/spark-structured-streaming The Internals of Apache Kafka https://bit.ly/apache-kafka-internals Follow me at https://twitter.com/jaceklaskowski On Wed, Oct 9, 2019 at 7:46 AM Wenchen Fan wrote: > > Would you mind if I ask what the conditions are for being a public API? > > The module doesn't matter, but the package matters. We have many public > APIs in the catalyst module as well. (e.g. DataType) > > There are 3 packages in Spark SQL that are meant to be private: > 1. org.apache.spark.sql.catalyst > 2. org.apache.spark.sql.execution > 3. org.apache.spark.sql.internal > > You can check out the full list of private packages of Spark in > project/SparkBuild.scala#Unidoc#ignoreUndocumentedPackages > > Basically, classes/interfaces that don't appear in the official Spark API > doc are private. > > Source/Sink traits are in org.apache.spark.sql.execution and thus they are > private. > > On Tue, Oct 8, 2019 at 6:19 AM Jungtaek Lim > wrote: > >> Would you mind if I ask what the conditions are for being a public API? Source/Sink >> traits are not marked as @DeveloperApi, but they're defined as public and >> located in sql-core, so they're not even semantically private (as with catalyst), which >> easily gives the signal that they're public APIs. >> >> Also, if I'm not missing something here, creating a streaming DataFrame via RDD[Row] >> is not available even as a private API. 
There are some other approaches >> using private APIs: 1) SQLContext.internalCreateDataFrame - as it requires >> RDD[InternalRow], they would also depend on catalyst and have to deal with >> InternalRow, which the Spark community seems to want to change >> eventually; 2) Dataset.ofRows - it requires a LogicalPlan, which is also in >> catalyst. So they not only need to apply the "package hack" but also need to >> depend on catalyst. >> >> >> On Mon, Oct 7, 2019 at 9:45 PM Wenchen Fan wrote: >> >>> AFAIK there is no public streaming data source API before DS v2. The >>> Source and Sink API is private and is only for builtin streaming sources. >>> Advanced users can still implement custom stream sources with private Spark >>> APIs (you can put your classes under the org.apache.spark.sql package to >>> access the private methods). >>> >>> That said, DS v2 is the first public streaming data source API. It's >>> really hard to design a stable, efficient and flexible data source API that >>> is unified between batch and streaming. DS v2 has evolved a lot in the >>> master branch and hopefully there will be no big breaking changes anymore. >>> >>> >>> On Sat, Oct 5, 2019 at 12:24 PM Jungtaek Lim < >>> kabhwan.opensou...@gmail.com> wrote: >>> >>>> I remembered an actual case from a developer who implements a custom data >>>> source. >>>> >>>> >>>> https://lists.apache.org/thread.html/c1a210510b48bb1fea89828c8e2f5db8c27eba635e0079a97b0c7faf@%3Cdev.spark.apache.org%3E >>>> >>>> Quoting here: >>>> We started implementing DSv2 in the 2.4 branch, but quickly discovered >>>> that the DSv2 in 3.0 was a complete breaking change (to the point where it >>>> could have been named DSv3 and it wouldn’t have come as a surprise). Since >>>> the DSv2 in 3.0 has a compatibility layer for DSv1 datasources, we decided >>>> to fall back to DSv1 in order to ease the future transition to Spark 3. 
>>>> >>>> Given that DSv2 for Spark 2.x and 3.x have diverged a lot, a realistic solution >>>> for dealing with the DSv2 breaking change is to keep DSv1 as a temporary solution, >>>> even once DSv2 for 3.x is available. They need some time to make the >>>> transition. >>>> >>>> I'll file an issue to support streaming data sources on DSv1 and >>>> submit a patch unless someone objects. >>>> >>>> >>>> On Wed, Oct 2, 2019 at 4:08 PM Jacek Laskowski wrote: >>>> >>>>> Hi Jungtaek, >>>>> >>>>> Thanks a lot for your very prompt response! >>>>> >>>>> > Looks like it's missing, or intended to force custom streaming >>>>> source implemented as DSv2.
[SS][2.4.4] Confused with "WatermarkTracker: Event time watermark didn't move"?
Hi, I haven't spent much time on it, but the following DEBUG message from WatermarkTracker sparked my interest :) I ran a streaming aggregation in Append mode and got the following messages:

19/10/08 10:48:56 DEBUG WatermarkTracker: Observed event time stats 0: EventTimeStats(15000,1000,8000.0,2)
19/10/08 10:48:56 INFO WatermarkTracker: Updating event-time watermark from 0 to 5000 ms
19/10/08 10:48:56 DEBUG WatermarkTracker: Event time watermark didn't move: 5000 < 5000

I think the DEBUG message "Event time watermark didn't move" is incorrect given that the query has just started and "Observed event time stats". It's true that the event-time watermark didn't move if it was 5000 before, but it was not, as the query has just started from scratch (no checkpointed state). Can anyone shed some light on this? I'll be digging deeper in a bit, but I'm hoping to get some more info beforehand. Thanks! Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski The Internals of Spark SQL https://bit.ly/spark-sql-internals The Internals of Spark Structured Streaming https://bit.ly/spark-structured-streaming The Internals of Apache Kafka https://bit.ly/apache-kafka-internals Follow me at https://twitter.com/jaceklaskowski
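[Editor's illustration] One way to see how such a message can appear right after an update is a minimal Python sketch of watermark bookkeeping. This is an assumption-laden stand-in, not Spark source: with the observed max of 15000 ms and a 10-second delay, the candidate watermark is 5000 ms; if the comparison is re-run in the same batch after the update, the "didn't move" branch fires with equal values, printing the confusing "5000 < 5000".

```python
# Hedged sketch (not Spark source) of watermark-update bookkeeping.
# Class name and structure are illustrative assumptions.

class WatermarkTracker:
    def __init__(self, delay_ms):
        self.delay_ms = delay_ms
        self.watermark_ms = 0  # fresh query, no checkpointed state

    def update(self, observed_max_ms):
        candidate = max(observed_max_ms - self.delay_ms, 0)
        if candidate > self.watermark_ms:
            print(f"Updating event-time watermark from "
                  f"{self.watermark_ms} to {candidate} ms")
            self.watermark_ms = candidate
        else:
            # Fires even when candidate == watermark, yielding the
            # odd-looking "5000 < 5000" wording from the report.
            print(f"Event time watermark didn't move: "
                  f"{candidate} < {self.watermark_ms}")

t = WatermarkTracker(delay_ms=10000)
t.update(15000)  # 0 -> 5000
t.update(15000)  # same stats re-checked: "didn't move: 5000 < 5000"
```

If Spark's real code path is similar, the message text (a strict "<") is misleading rather than the watermark logic itself being wrong.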
Re: [SS] How to create a streaming DataFrame (for a custom Source in Spark 2.4.4 / MicroBatch / DSv1)?
Hi Jungtaek, Thanks a lot for your very prompt response! > Looks like it's missing, or intended to force custom streaming source implemented as DSv2. That's exactly my understanding = no more DSv1 data sources. That, however, is not consistent with the official message, is it? Spark 2.4.4 does not actually say "we're abandoning DSv1", and people may not really want to jump on DSv2 since it's not recommended (unless I missed that). I love surprises (as that's where people pay more for consulting :)), but not necessarily before public talks (with one at SparkAISummit in two weeks!). It's going to be challenging! I hope I won't spread the wrong word. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski The Internals of Spark SQL https://bit.ly/spark-sql-internals The Internals of Spark Structured Streaming https://bit.ly/spark-structured-streaming The Internals of Apache Kafka https://bit.ly/apache-kafka-internals Follow me at https://twitter.com/jaceklaskowski On Wed, Oct 2, 2019 at 6:16 AM Jungtaek Lim wrote: > Looks like it's missing, or intended to force custom streaming source > implemented as DSv2. > > I'm not sure the Spark community wants to expand the DSv1 API: I could propose the > change if we get some support here. > > To the Spark community: given we're bringing major changes to DSv2, someone might > want to rely on DSv1 while the transition from the old DSv2 to the new DSv2 happens and > the new DSv2 gets stabilized. Would we like to provide the necessary changes to > DSv1? > > Thanks, > Jungtaek Lim (HeartSaVioR) > > On Wed, Oct 2, 2019 at 4:27 AM Jacek Laskowski wrote: > >> Hi, >> >> I think I've got stuck and without your help I won't move any further. >> Please help. >> >> I'm with Spark 2.4.4 and am developing a streaming Source (DSv1, >> MicroBatch) and in the getBatch phase, when requested for a DataFrame, there is >> this assert [1] I can't seem to get past with any DataFrame I managed to >> create, as it's not streaming. 
>> >> assert(batch.isStreaming, >> s"DataFrame returned by getBatch from $source did not have >> isStreaming=true\n" + >> s"${batch.queryExecution.logical}") >> >> [1] >> https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L439-L441 >> >> All I could find is private[sql], >> e.g. SQLContext.internalCreateDataFrame(..., isStreaming = true) [2] or [3] >> >> [2] >> https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L422-L428 >> [3] >> https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L62-L81 >> >> Pozdrawiam, >> Jacek Laskowski >> >> https://about.me/JacekLaskowski >> The Internals of Spark SQL https://bit.ly/spark-sql-internals >> The Internals of Spark Structured Streaming >> https://bit.ly/spark-structured-streaming >> The Internals of Apache Kafka https://bit.ly/apache-kafka-internals >> Follow me at https://twitter.com/jaceklaskowski >> >>
[SS] How to create a streaming DataFrame (for a custom Source in Spark 2.4.4 / MicroBatch / DSv1)?
Hi, I think I've got stuck and without your help I won't move any further. Please help. I'm on Spark 2.4.4 and am developing a streaming Source (DSv1, MicroBatch), and in the getBatch phase, when requested for a DataFrame, there is this assert [1] I can't seem to get past with any DataFrame I managed to create, as it's not streaming:

assert(batch.isStreaming,
  s"DataFrame returned by getBatch from $source did not have isStreaming=true\n" +
  s"${batch.queryExecution.logical}")

[1] https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L439-L441

All I could find is private[sql], e.g. SQLContext.internalCreateDataFrame(..., isStreaming = true) [2] or [3]

[2] https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L422-L428
[3] https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L62-L81

Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski The Internals of Spark SQL https://bit.ly/spark-sql-internals The Internals of Spark Structured Streaming https://bit.ly/spark-structured-streaming The Internals of Apache Kafka https://bit.ly/apache-kafka-internals Follow me at https://twitter.com/jaceklaskowski
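[Editor's illustration] A Spark-free Python miniature of the guard being discussed may help: the DataFrame a Source returns must carry a streaming flag on its logical plan, and frames built via the public creation paths don't have it set. The classes here are illustrative stand-ins, not Spark API.

```python
# Hedged sketch of the isStreaming guard in MicroBatchExecution.
# LogicalPlan/DataFrame here are toy stand-ins, not Spark classes.

class LogicalPlan:
    def __init__(self, is_streaming):
        self.is_streaming = is_streaming

class DataFrame:
    def __init__(self, plan):
        self.plan = plan

def run_batch(source_name, batch):
    # Mirrors the assert quoted in the email above.
    assert batch.plan.is_streaming, (
        f"DataFrame returned by getBatch from {source_name} "
        "did not have isStreaming=true"
    )
    return "ok"

# A frame built the ordinary (batch) way is not streaming, so the guard
# trips -- which is why only the private
# SQLContext.internalCreateDataFrame(..., isStreaming = true) path works.
batch = DataFrame(LogicalPlan(is_streaming=False))
try:
    run_batch("MySource", batch)
except AssertionError as e:
    print("guard tripped:", e)
```

The flag lives on the plan, not the data, so no amount of transforming a batch DataFrame flips it — the plan has to be created streaming in the first place.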
Re: Welcoming some new committers and PMC members
Hi, What great news! Congrats to all those awarded, and to the community for voting them in! p.s. I think it should go to the user mailing list too. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski The Internals of Spark SQL https://bit.ly/spark-sql-internals The Internals of Spark Structured Streaming https://bit.ly/spark-structured-streaming The Internals of Apache Kafka https://bit.ly/apache-kafka-internals Follow me at https://twitter.com/jaceklaskowski On Tue, Sep 10, 2019 at 2:32 AM Matei Zaharia wrote: > Hi all, > > The Spark PMC recently voted to add several new committers and one PMC > member. Join me in welcoming them to their new roles! > > New PMC member: Dongjoon Hyun > > New committers: Ryan Blue, Liang-Chi Hsieh, Gengliang Wang, Yuming Wang, > Weichen Xu, Ruifeng Zheng > > The new committers cover lots of important areas including ML, SQL, and > data sources, so it’s great to have them here. All the best, > > Matei and the Spark PMC > > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > >
Re: Why two netty libs?
Hi, Thanks much for the answers. Learning Spark every day! Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski The Internals of Spark SQL https://bit.ly/spark-sql-internals The Internals of Spark Structured Streaming https://bit.ly/spark-structured-streaming The Internals of Apache Kafka https://bit.ly/apache-kafka-internals Follow me at https://twitter.com/jaceklaskowski On Wed, Sep 4, 2019 at 3:15 PM Sean Owen wrote: > Yes that's right. I don't think Spark's usage of ZK needs any ZK > server, so it's safe to exclude in Spark (at least, so far so good!) > > On Wed, Sep 4, 2019 at 8:06 AM Steve Loughran > wrote: > > > > Zookeeper client is/was netty 3, AFAIK, so if you want to use it for > anything, it ends up on the CP > > > > On Tue, Sep 3, 2019 at 5:18 PM Shixiong(Ryan) Zhu < > shixi...@databricks.com> wrote: > >> > >> Yep, historical reasons. And Netty 4 is under another namespace, so we > can use Netty 3 and Netty 4 in the same JVM. > >> > >> On Tue, Sep 3, 2019 at 6:15 AM Sean Owen wrote: > >>> > >>> It was for historical reasons; some other transitive dependencies > needed it. > >>> I actually was just able to exclude Netty 3 last week from master. > >>> Spark uses Netty 4. > >>> > >>> On Tue, Sep 3, 2019 at 6:59 AM Jacek Laskowski > wrote: > >>> > > >>> > Hi, > >>> > > >>> > Just noticed that Spark 2.4.x uses two netty deps of different > versions. Why? > >>> > > >>> > jars/netty-all-4.1.17.Final.jar > >>> > jars/netty-3.9.9.Final.jar > >>> > > >>> > Shouldn't one be excluded or perhaps shaded? 
> >>> > > >>> > Pozdrawiam, > >>> > Jacek Laskowski > >>> > > >>> > https://about.me/JacekLaskowski > >>> > The Internals of Spark SQL https://bit.ly/spark-sql-internals > >>> > The Internals of Spark Structured Streaming > https://bit.ly/spark-structured-streaming > >>> > The Internals of Apache Kafka https://bit.ly/apache-kafka-internals > >>> > Follow me at https://twitter.com/jaceklaskowski > >>> > > >>> > >>> - > >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > >>> > >> -- > >> > >> Best Regards, > >> > >> Ryan > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > >
Why two netty libs?
Hi, Just noticed that Spark 2.4.x uses two netty deps of different versions. Why? jars/netty-all-4.1.17.Final.jar jars/netty-3.9.9.Final.jar Shouldn't one be excluded or perhaps shaded? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski The Internals of Spark SQL https://bit.ly/spark-sql-internals The Internals of Spark Structured Streaming https://bit.ly/spark-structured-streaming The Internals of Apache Kafka https://bit.ly/apache-kafka-internals Follow me at https://twitter.com/jaceklaskowski
Re: [SS] KafkaSource doesn't use KafkaSourceInitialOffsetWriter for initial offsets?
Hi Devs, Thanks all for a very prompt response! That was insanely quick. Merci beaucoup! :) Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski The Internals of Spark SQL https://bit.ly/spark-sql-internals The Internals of Spark Structured Streaming https://bit.ly/spark-structured-streaming The Internals of Apache Kafka https://bit.ly/apache-kafka-internals Follow me at https://twitter.com/jaceklaskowski On Mon, Aug 26, 2019 at 4:05 PM Jungtaek Lim wrote: > Thanks! The patch is here: https://github.com/apache/spark/pull/25583 > > On Mon, Aug 26, 2019 at 11:02 PM Gabor Somogyi > wrote: > >> Just checked this and it's a copy-paste :) It works properly when >> KafkaSourceInitialOffsetWriter is used. Pull me in if a review is needed. >> >> BR, >> G >> >> >> On Mon, Aug 26, 2019 at 3:57 PM Jungtaek Lim wrote: >> >>> Nice finding! I don't see any reason not to use >>> KafkaSourceInitialOffsetWriter from KafkaSource, as they're identical. I >>> guess it was copied and pasted some time ago and not addressed yet. >>> As you haven't submitted a patch, I'll submit one shortly, >>> mentioning you for credit. I'd close mine and wait for your patch if you plan to do >>> it. Please let me know. >>> >>> Thanks, >>> Jungtaek Lim (HeartSaVioR) >>> >>> >>> On Mon, Aug 26, 2019 at 8:03 PM Jacek Laskowski wrote: >>> >>>> Hi, >>>> >>>> Just found out that KafkaSource [1] does not >>>> use KafkaSourceInitialOffsetWriter (of KafkaMicroBatchStream) [2] for >>>> initial offsets. >>>> >>>> Any reason for that? Should I report an issue? Just checking as I'm >>>> on 2.4.3 exclusively and have no idea what's coming in 3.0. 
>>>> >>>> [1] >>>> https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala#L102 >>>> >>>> [2] >>>> https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchStream.scala#L281 >>>> >>>> Pozdrawiam, >>>> Jacek Laskowski >>>> >>>> https://about.me/JacekLaskowski >>>> The Internals of Spark SQL https://bit.ly/spark-sql-internals >>>> The Internals of Spark Structured Streaming >>>> https://bit.ly/spark-structured-streaming >>>> The Internals of Apache Kafka https://bit.ly/apache-kafka-internals >>>> Follow me at https://twitter.com/jaceklaskowski >>>> >>>> >>> >>> -- >>> Name : Jungtaek Lim >>> Blog : http://medium.com/@heartsavior >>> Twitter : http://twitter.com/heartsavior >>> LinkedIn : http://www.linkedin.com/in/heartsavior >>> >> > > -- > Name : Jungtaek Lim > Blog : http://medium.com/@heartsavior > Twitter : http://twitter.com/heartsavior > LinkedIn : http://www.linkedin.com/in/heartsavior >
[SS] KafkaSource doesn't use KafkaSourceInitialOffsetWriter for initial offsets?
Hi, Just found out that KafkaSource [1] does not use KafkaSourceInitialOffsetWriter (of KafkaMicroBatchStream) [2] for initial offsets. Any reason for that? Should I report an issue? Just checking out as I'm with 2.4.3 exclusively and have no idea what's coming for 3.0. [1] https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala#L102 [2] https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchStream.scala#L281 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski The Internals of Spark SQL https://bit.ly/spark-sql-internals The Internals of Spark Structured Streaming https://bit.ly/spark-structured-streaming The Internals of Apache Kafka https://bit.ly/apache-kafka-internals Follow me at https://twitter.com/jaceklaskowski
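[Editor's illustration] For readers unfamiliar with these metadata logs, here is a small, hedged Python sketch of the idea behind a versioned initial-offsets log — the kind of helper the thread says was copy-pasted in KafkaSource instead of reusing KafkaSourceInitialOffsetWriter. The file layout (a "v1" header followed by JSON) and the function names are assumptions for illustration only, not Spark's actual format.

```python
import json
import io

# Hedged sketch of a versioned initial-offsets log: a version header guards
# against reading a log written by an incompatible implementation.
VERSION = "v1"

def write_initial_offsets(f, offsets):
    f.write(VERSION + "\n")
    f.write(json.dumps(offsets))

def read_initial_offsets(f):
    header = f.readline().strip()
    if header != VERSION:
        raise ValueError(f"unsupported log version: {header}")
    return json.loads(f.read())

buf = io.StringIO()
write_initial_offsets(buf, {"topic-0": 42, "topic-1": 7})
buf.seek(0)
print(read_initial_offsets(buf))
```

Having two hand-rolled copies of such read/write logic is exactly the kind of duplication that drifts apart, which is why reusing one writer class is the obvious fix.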
Re: Release Apache Spark 2.4.4 before 3.0.0
Hi, Thanks Dongjoon Hyun for stepping up as a release manager! Much appreciated. If there's a volunteer to cut a release, I'm always ready to support it. In addition, the more frequent the releases, the better for end users, so they have a choice to upgrade and have all the latest fixes, or wait. It's their call, not ours (if we kept them waiting). My big 2 yes'es for the release! Jacek On Tue, 9 Jul 2019, 18:15 Dongjoon Hyun, wrote: > Hi, All. > > Spark 2.4.3 was released two months ago (8th May). > > As of today (9th July), there exist 45 fixes in `branch-2.4` including the > following correctness or blocker issues. > > - SPARK-26038 Decimal toScalaBigInt/toJavaBigInteger not work for > decimals not fitting in long > - SPARK-26045 Error in the spark 2.4 release package with the > spark-avro_2.11 dependency > - SPARK-27798 from_avro can modify variables in other rows in local > mode > - SPARK-27907 HiveUDAF should return NULL in case of 0 rows > - SPARK-28157 Make SHS clear KVStore LogInfo for the blacklist entries > - SPARK-28308 CalendarInterval sub-second part should be padded before > parsing > > It would be great if we can have Spark 2.4.4 before we are going to get > busier with 3.0.0. > If it's okay, I'd like to volunteer as the 2.4.4 release manager and roll > it next Monday (15th July). > What do you think? > > Bests, > Dongjoon. >
Re: [SS] Why EventTimeStatsAccum for event-time watermark not a named accumulator?
Hi, After thinking about it some more, I may have found the reason why EventTimeStatsAccum is not exposed as a named accumulator. The reason is that it's an internal part of how the event-time watermark works and should not be exposed via the web UI as if it were part of a Spark app (which is what the web UI is meant for). With that being said, I'm wondering why EventTimeStatsAccum is not a SQL metric then? With that, it'd be in the web UI, but just in the physical plan of a streaming query. WDYT? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski The Internals of Spark SQL https://bit.ly/spark-sql-internals The Internals of Spark Structured Streaming https://bit.ly/spark-structured-streaming The Internals of Apache Kafka https://bit.ly/apache-kafka-internals Follow me at https://twitter.com/jaceklaskowski On Mon, Jun 10, 2019 at 8:59 PM Jacek Laskowski wrote: > Hi, > > I'm curious why EventTimeStatsAccum is not a named accumulator (not to > mention SQLMetric) so the event-time watermark could be monitored in the web UI? > > I've changed the code of the EventTimeWatermarkExec physical operator to > register EventTimeStatsAccum as a named accumulator, and the values are > properly propagated back to the driver and the web UI. It seems to be > working fine (and it's just one day of coding). > > It was fairly easy to get a very initial prototype, so I'm wondering why > it hasn't been included. Has this been considered? Should I file a > JIRA ticket and send a pull request for review? Please guide me, as I found it > very helpful (and surprisingly easy to implement, so I'm worried I'm missing > something important). Thanks. > > Pozdrawiam, > Jacek Laskowski > > https://about.me/JacekLaskowski > The Internals of Spark SQL https://bit.ly/spark-sql-internals > The Internals of Spark Structured Streaming > https://bit.ly/spark-structured-streaming > The Internals of Apache Kafka https://bit.ly/apache-kafka-internals > Follow me at https://twitter.com/jaceklaskowski > >
[SS] Why EventTimeStatsAccum for event-time watermark not a named accumulator?
Hi, I'm curious why EventTimeStatsAccum is not a named accumulator (not to mention SQLMetric) so the event-time watermark could be monitored in the web UI? I've changed the code of the EventTimeWatermarkExec physical operator to register EventTimeStatsAccum as a named accumulator, and the values are properly propagated back to the driver and the web UI. It seems to be working fine (and it's just one day of coding). It was fairly easy to get a very initial prototype, so I'm wondering why it hasn't been included. Has this been considered? Should I file a JIRA ticket and send a pull request for review? Please guide me, as I found it very helpful (and surprisingly easy to implement, so I'm worried I'm missing something important). Thanks. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski The Internals of Spark SQL https://bit.ly/spark-sql-internals The Internals of Spark Structured Streaming https://bit.ly/spark-structured-streaming The Internals of Apache Kafka https://bit.ly/apache-kafka-internals Follow me at https://twitter.com/jaceklaskowski
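[Editor's illustration] For context, the statistics such an accumulator carries are simple per-batch aggregates. Below is a hedged Python stand-in (not Spark source) for the (max, min, avg, count) bookkeeping; the field order mirrors the EventTimeStats(max, min, avg, count) rendering that shows up in Spark logs, and the values chosen reproduce a sample like EventTimeStats(15000,1000,8000.0,2).

```python
# Hedged sketch of event-time stats bookkeeping; the class is an
# illustrative stand-in for Spark's EventTimeStats/EventTimeStatsAccum.

class EventTimeStats:
    def __init__(self):
        self.max = float("-inf")
        self.min = float("inf")
        self.avg = 0.0
        self.count = 0

    def add(self, event_time_ms):
        self.max = max(self.max, event_time_ms)
        self.min = min(self.min, event_time_ms)
        self.count += 1
        # Running mean, updated incrementally per observed event time.
        self.avg += (event_time_ms - self.avg) / self.count

s = EventTimeStats()
for t in (15000, 1000):
    s.add(t)
print(s.max, s.min, s.avg, s.count)  # 15000 1000 8000.0 2
```

Since these four numbers are all that is needed, surfacing them either as a named accumulator or as a SQL metric seems mechanically straightforward, which matches the observation that the prototype took only a day.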
[SS] ContinuousExecution.commit and excessive JSON serialization?
Hi, Why does ContinuousExecution.commit serialize an offset to JSON format just before deserializing it back (from JSON to an offset)? [1] val offset = sources(0).deserializeOffset(offsetLog.get(epoch).get.offsets(0).get.json) [1] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala#L341 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski The Internals of Spark SQL https://bit.ly/spark-sql-internals The Internals of Spark Structured Streaming https://bit.ly/spark-structured-streaming The Internals of Apache Kafka https://bit.ly/apache-kafka-internals Follow me at https://twitter.com/jaceklaskowski
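[Editor's illustration] The pattern being questioned, in miniature and hedged Python (not Spark source): an offset object already in memory is rendered to JSON only to be parsed right back. One reason the round trip can still make sense is that a metadata log may hold a generic serialized wrapper, so going through the source's own deserializeOffset guarantees the source-specific type; the classes below are assumptions for illustration.

```python
import json

# Hedged sketch of the serialize-then-immediately-deserialize pattern.
class KafkaOffset:
    def __init__(self, partitions):
        self.partitions = partitions

    @property
    def json(self):
        # Render the offset to its JSON form, as Spark offsets do.
        return json.dumps(self.partitions, sort_keys=True)

def deserialize_offset(s):
    # Stand-in for a source's deserializeOffset.
    return KafkaOffset(json.loads(s))

stored = KafkaOffset({"t-0": 10})
# The round trip in ContinuousExecution.commit, in miniature:
offset = deserialize_offset(stored.json)
print(offset.partitions)
```

If the stored object were always already the source-specific type, the round trip would indeed be redundant — which appears to be the crux of the question above.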
FileSourceScanExec.doExecute - when is this executed if ever?
Hi, I may have asked this question before, but it seems I forgot or can't find the answer. When is FileSourceScanExec.doExecute executed, if ever? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Kafka Streams https://bit.ly/mastering-kafka-streams Follow me at https://twitter.com/jaceklaskowski
Re: Is there a way to read a Parquet File as ColumnarBatch?
Hi Priyanka, I've been exploring this part of Spark SQL and could help a little bit. > but for some reason it never hit the breakpoints I placed in these classes. Was this for local[*]? I ran "SPARK_SUBMIT_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005" ./bin/spark-shell" and attached IDEA to debug the code. I used Spark 2.4.1 (with Scala 2.12) and it worked fine for the following queries: spark.range(5).write.save("hello") spark.read.parquet("hello").show > has anyone tried returning all the data as ColumnarBatch? Is there any reading material you can point me to? You may find some information in the internals book at https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-ColumnarBatch.html It's WIP. Let me know what part to explore in more detail. I'd do this momentarily (as I'm exploring parquet data source in more detail as we speak). Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Kafka Streams https://bit.ly/mastering-kafka-streams Follow me at https://twitter.com/jaceklaskowski On Mon, Apr 22, 2019 at 10:29 AM Priyanka Gomatam wrote: > Hi, > > I am new to Spark and have been playing around with the Parquet reader > code. I have two questions: > >1. I saw the code that starts at DataSourceScanExec class, and moves >on to the ParquetFileFormat class and does a VectorizedParquetRecordReader. >I tried doing a spark.read.parquet(…) and debugged through the code, but >for some reason it never hit the breakpoints I placed in these classes. >Perhaps I am doing something wrong, but is there a certain versioning for >parquet readers that I am missing out on? How do I make the code take the >DataSourceScanExec -> … -> ParquetReader … -> VectorizedParqeutRecordRead … >route? >2. 
If I do manage to make it take the above path, I see there is a >point at which the data is filled into ColumnarBatch objects, has anyone >tried returning all the data as ColumnarBatch? Is there any reading >material you can point me to? > > Thanks in advance, this will be super helpful for me! >
Re: Sort order in bucketing in a custom datasource
Hi, I don't think so. I can't think of an interface (trait) that would give that information to the Catalyst optimizer. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Kafka Streams https://bit.ly/mastering-kafka-streams Follow me at https://twitter.com/jaceklaskowski On Tue, Apr 16, 2019 at 6:31 PM Long, Andrew wrote: > Hey Friends, > > > > Is it possible to specify the sort order or bucketing in a way that can be > used by the optimizer in spark? > > > > Cheers Andrew >
Re: Static functions
Hi Jean, I thought the functions have already been tagged? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Kafka Streams https://bit.ly/mastering-kafka-streams Follow me at https://twitter.com/jaceklaskowski On Sun, Feb 10, 2019 at 11:48 PM Jean Georges Perrin wrote: > Hey guys, > > We have 381 static functions now (including the deprecated). I am trying > to sort them out by group/tag them. > > So far, I have: > >- Array >- Conversion >- Date >- Math > - Trigo (sub group of maths) >- Security >- Streaming >- String >- Technical > > Do you see more categories? Tags? > > Thanks! > > jg > > — > Jean Georges Perrin / @jgperrin > >
Re: [SS] FlatMapGroupsWithStateExec with no commitTimeMs metric?
Thanks Jungtaek Lim! Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Kafka Streams https://bit.ly/mastering-kafka-streams Follow me at https://twitter.com/jaceklaskowski On Mon, Nov 26, 2018 at 7:26 AM Jungtaek Lim wrote: > Looks like just a kind of missing spot. Just crafted the patch and now in > progress of testing. > > Once done with testing I'll file a new issue and submit a patch. > > Thanks, > Jungtaek Lim (HeartSaVioR) > > 2018년 11월 26일 (월) 오전 4:49, Burak Yavuz 님이 작성: > >> Probably just oversight. Anyone is welcome to add it :) >> >> On Sun, Nov 25, 2018 at 8:55 AM Jacek Laskowski wrote: >> >>> Hi, >>> >>> Why is FlatMapGroupsWithStateExec not measuring the time taken on state >>> commit [1](like StreamingDeduplicateExec [2] and StreamingGlobalLimitExec >>> [3])? Is this on purpose? Why? Thanks. >>> >>> [1] >>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FlatMapGroupsWithStateExec.scala#L135 >>> >>> [2] >>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala#L482 >>> >>> [3] >>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingGlobalLimitExec.scala#L87 >>> >>> Pozdrawiam, >>> Jacek Laskowski >>> >>> https://about.me/JacekLaskowski >>> Mastering Spark SQL https://bit.ly/mastering-spark-sql >>> Spark Structured Streaming https://bit.ly/spark-structured-streaming >>> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams >>> Follow me at https://twitter.com/jaceklaskowski >>> >>
[SS] FlatMapGroupsWithStateExec with no commitTimeMs metric?
Hi, Why is FlatMapGroupsWithStateExec not measuring the time taken on state commit [1](like StreamingDeduplicateExec [2] and StreamingGlobalLimitExec [3])? Is this on purpose? Why? Thanks. [1] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FlatMapGroupsWithStateExec.scala#L135 [2] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala#L482 [3] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingGlobalLimitExec.scala#L87 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Kafka Streams https://bit.ly/mastering-kafka-streams Follow me at https://twitter.com/jaceklaskowski
Re: Is spark.sql.codegen.factoryMode property really for tests only?
Hi Marco, Many thanks for such a quick response. With that, I'll direct my curiosity into a different direction. Thanks! Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Kafka Streams https://bit.ly/mastering-kafka-streams Follow me at https://twitter.com/jaceklaskowski On Fri, Nov 16, 2018 at 1:44 PM Marco Gaido wrote: > Hi Jacek, > > I do believe it is correct. Please check the method you mentioned > (CodeGeneratorWithInterpretedFallback.createObject): the value is relevant > only if Utils.isTesting. > > Thanks, > Marco > > Il giorno ven 16 nov 2018 alle ore 13:28 Jacek Laskowski > ha scritto: > >> Hi, >> >> While reviewing the changes in 2.4 I stumbled >> upon spark.sql.codegen.factoryMode internal configuration property [1]. The >> doc says: >> >> > Note that this config works only for tests. >> >> Is that correct? I've got some doubts. >> >> I found that it's used in UnsafeProjection.create [2] (through >> CodeGeneratorWithInterpretedFallback.createObject) which is used outside >> the tests and so made me think if "this config works only for tests" part >> is correct. >> >> Are my doubts correct? If not, what am I missing? Thanks. >> >> [1] >> https://github.com/apache/spark/blob/v2.4.0/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L758-L767 >> >> [2] >> https://github.com/apache/spark/blob/v2.4.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Projection.scala#L159 >> >> Pozdrawiam, >> Jacek Laskowski >> >> https://about.me/JacekLaskowski >> Mastering Spark SQL https://bit.ly/mastering-spark-sql >> Spark Structured Streaming https://bit.ly/spark-structured-streaming >> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams >> Follow me at https://twitter.com/jaceklaskowski >> >
Is spark.sql.codegen.factoryMode property really for tests only?
Hi,

While reviewing the changes in 2.4 I stumbled upon the spark.sql.codegen.factoryMode internal configuration property [1]. The doc says:

> Note that this config works only for tests.

Is that correct? I've got some doubts.

I found that it's used in UnsafeProjection.create [2] (through CodeGeneratorWithInterpretedFallback.createObject), which is used outside of tests, and so made me wonder whether the "this config works only for tests" part is correct.

Are my doubts correct? If not, what am I missing? Thanks.

[1] https://github.com/apache/spark/blob/v2.4.0/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L758-L767

[2] https://github.com/apache/spark/blob/v2.4.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Projection.scala#L159

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski
How to know all the issues resolved for 2.4.0?
Hi,

I've been trying to find out the issues that are part of 2.4.0 and used the following query:

project = SPARK AND resolution in (Resolved, Done, Fixed) and "Target Version/s" = "2.4.0"

I got 202 issues. Is that correct?

What's the difference between the Resolution statuses Resolved, Done, and Fixed? When is an issue marked as each of them?

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski
Why does spark.range(1).write.mode("overwrite").saveAsTable("t1") throw an Exception?
Hi,

Just ran into it today and wonder whether it's a bug or something I may have missed before.

scala> spark.version
res21: String = 2.3.2

// that's OK
scala> spark.range(1).write.saveAsTable("t1")
org.apache.spark.sql.AnalysisException: Table `t1` already exists.;
  at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:408)
  at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:393)
  ... 51 elided

// Let's overwrite it then
// An exception?! Why?!
scala> spark.range(1).write.mode("overwrite").saveAsTable("t1")
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:208)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:208)
  at scala.Option.getOrElse(Option.scala:121)
  ...

// If the above works properly, why does the following work fine (and not throw an exception)?
scala> spark.range(1).write.saveAsTable("t10")

p.s. I was not sure whether I should be sending the question to dev or users, so please accept my apologies if it went to the wrong mailing list.

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski
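[Editor's note] For what it's worth, a hedged workaround sketch (untested against 2.3.2): recreating the table instead of overwriting it sidesteps the overwrite path that apparently tries to infer the Parquet schema from the existing table location:

```scala
// Untested workaround sketch: drop and recreate the table so saveAsTable
// takes the same code path as the initial write, avoiding schema inference
// against the existing location.
spark.sql("DROP TABLE IF EXISTS t1")
spark.range(1).write.saveAsTable("t1")
```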
Re: welcome a new batch of committers
Wow! That's a nice bunch of contributors. Congrats to all new committers. I've had a tough time following all the contributions already, and with this crew it's gonna be nearly impossible.

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Wed, Oct 3, 2018 at 10:59 AM Reynold Xin wrote:
> Hi all,
>
> The Apache Spark PMC has recently voted to add several new committers to
> the project, for their contributions:
>
> - Shane Knapp (contributor to infra)
> - Dongjoon Hyun (contributor to ORC support and other parts of Spark)
> - Kazuaki Ishizaki (contributor to Spark SQL)
> - Xingbo Jiang (contributor to Spark Core and SQL)
> - Yinan Li (contributor to Spark on Kubernetes)
> - Takeshi Yamamuro (contributor to Spark SQL)
>
> Please join me in welcoming them!
Re: saveAsTable in 2.3.2 throws IOException while 2.3.1 works fine?
Hi,

OK. Sorry for the noise. I don't know why it started working, but I cannot reproduce it anymore. Sorry for a false alarm (but I could promise it didn't work and I changed nothing). Back to work...

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Mon, Oct 1, 2018 at 8:45 AM Jyoti Ranjan Mahapatra wrote:
> Hi Jacek,
>
> The issue might not be very widespread. I couldn’t reproduce it. Can you
> see if I am doing anything incorrect in the below queries?
>
> scala> spark.range(10).write.saveAsTable("t1")
>
> scala> spark.sql("describe formatted t1").show(100, false)
> +----------------------------+-----------------------------------+-------+
> |col_name                    |data_type                          |comment|
> +----------------------------+-----------------------------------+-------+
> |id                          |bigint                             |null   |
> |                            |                                   |       |
> |# Detailed Table Information|                                   |       |
> |Database                    |default                            |       |
> |Table                       |t1                                 |       |
> |Owner                       |jyotima                            |       |
> |Created Time                |Sun Sep 30 23:40:46 PDT 2018       |       |
> |Last Access                 |Wed Dec 31 16:00:00 PST 1969       |       |
> |Created By                  |Spark 2.3.2                        |       |
> |Type                        |MANAGED                            |       |
> |Provider                    |parquet                            |       |
> |Table Properties            |[transient_lastDdlTime=1538376046] |       |
> |Statistics                  |3008 bytes                         |       |
> |Location                    |file:/home/jyotima/repo/tmp/spark2.3.2/spark-2.3.2-bin-hadoop2.7/spark-warehouse/t1| |
> |Serde Library               |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe| |
> |InputFormat                 |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat| |
> |OutputFormat                |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat| |
> |Storage Properties          |[serialization.format=1]           |       |
> +----------------------------+-----------------------------------+-------+
>
> scala> spark.version
> res4: String = 2.3.2
>
> Thanks,
> Jyoti
>
> *From:* Jacek Laskowski
> *Sent:* Sunday, September 30, 2018 11:28 PM
> *To:* Sean Owen
> *Cc:*
dev
> *Subject:* Re: saveAsTable in 2.3.2 throws IOException while 2.3.1 works fine?
>
> Hi Sean,
>
> Thanks again for helping me to remain sane and that the issue is not
> imaginary :)
>
> I'd expect to be spark-warehouse in the directory where spark-shell is
> executed (which is what has always been used for the metastore).
>
> I'm reviewing all the changes between 2.3.1..2.3.2 to find anything
> relevant. I'm surprised nobody's reported it before. That worries me (or
> simply says that all the enterprise deployments simply use YARN with Hive?)
>
> Pozdrawiam,
> Jacek Laskowski
>
> https://about.me/JacekLaskowski
> Mastering Spark SQL https://bit.ly/mastering-spark-sql
> Spark Structured Streaming https://bit.ly/spark-structured-streaming
Re: saveAsTable in 2.3.2 throws IOException while 2.3.1 works fine?
Hi Sean,

Thanks again for helping me remain sane and confirming that the issue is not imaginary :)

I'd expect spark-warehouse to be in the directory where spark-shell is executed (which is what has always been used for the metastore).

I'm reviewing all the changes between 2.3.1 and 2.3.2 to find anything relevant. I'm surprised nobody's reported it before. That worries me (or simply says that all the enterprise deployments use YARN with Hive?)

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Sun, Sep 30, 2018 at 10:25 PM Sean Owen wrote:
> Hm, changes in the behavior of the default warehouse dir sound
> familiar, but anything I could find was resolved well before 2.3.1
> even. I don't know of a change here. What location are you expecting?
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315420&version=12343289
>
> On Sun, Sep 30, 2018 at 1:38 PM Jacek Laskowski wrote:
> >
> > Hi Sean,
> >
> > I thought so too, but the path "file:/user/hive/warehouse/" should not
> > have been used in the first place, should it? I'm running it in spark-shell
> > 2.3.2. Why would there be any changes between 2.3.1 and 2.3.2 that I just
> > downloaded and one worked fine while the other did not? I had to downgrade
> > to 2.3.1 because of this (and do want to figure out why 2.3.2 behaves in a
> > different way).
> >
> > The part of the stack trace is below.
> >
> > ➜  spark-2.3.2-bin-hadoop2.7 ./bin/spark-shell
> > 2018-09-30 17:43:49 WARN NativeCodeLoader:62 - Unable to load
> > native-hadoop library for your platform... using builtin-java classes
> > where applicable
> > Setting default log level to "WARN".
> > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
> > setLogLevel(newLevel).
> > Spark context Web UI available at http://192.168.0.186:4040
> > Spark context available as 'sc' (master = local[*], app id = local-1538322235135).
> > Spark session available as 'spark'.
> > Welcome to
> >       ____              __
> >      / __/__  ___ _____/ /__
> >     _\ \/ _ \/ _ `/ __/ '_/
> >    /___/ .__/\_,_/_/ /_/\_\   version 2.3.2
> >       /_/
> >
> > Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_171)
> > Type in expressions to have them evaluated.
> > Type :help for more information.
> >
> > scala> spark.version
> > res0: String = 2.3.2
> >
> > scala> spark.range(1).write.saveAsTable("demo")
> > 2018-09-30 17:44:27 WARN ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException
> > 2018-09-30 17:44:28 ERROR FileOutputCommitter:314 - Mkdirs failed to create file:/user/hive/warehouse/demo/_temporary/0
> > 2018-09-30 17:44:28 ERROR Utils:91 - Aborting task
> > java.io.IOException: Mkdirs failed to create file:/user/hive/warehouse/demo/_temporary/0/_temporary/attempt_20180930174428__m_07_0 (exists=false, cwd=file:/Users/jacek/dev/apps/spark-2.3.2-bin-hadoop2.7)
> >   at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:455)
> >   at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:440)
> >   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)
> >   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:892)
> >   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:789)
> >   at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:241)
> >   at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:342)
> >   at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:302)
> >   at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
> >   at
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:151)
> >   at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:367)
> >   at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:378)
> >   at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:26
Re: saveAsTable in 2.3.2 throws IOException while 2.3.1 works fine?
Hi Sean,

I thought so too, but the path "file:/user/hive/warehouse/" should not have been used in the first place, should it? I'm running it in spark-shell 2.3.2. Why would there be any change between 2.3.1 and 2.3.2, which I just downloaded, such that one worked fine while the other did not? I had to downgrade to 2.3.1 because of this (and do want to figure out why 2.3.2 behaves differently).

The relevant part of the stack trace is below.

➜  spark-2.3.2-bin-hadoop2.7 ./bin/spark-shell
2018-09-30 17:43:49 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://192.168.0.186:4040
Spark context available as 'sc' (master = local[*], app id = local-1538322235135).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.2
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_171)
Type in expressions to have them evaluated.
Type :help for more information.
scala> spark.version
res0: String = 2.3.2

scala> spark.range(1).write.saveAsTable("demo")
2018-09-30 17:44:27 WARN ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException
2018-09-30 17:44:28 ERROR FileOutputCommitter:314 - Mkdirs failed to create file:/user/hive/warehouse/demo/_temporary/0
2018-09-30 17:44:28 ERROR Utils:91 - Aborting task
java.io.IOException: Mkdirs failed to create file:/user/hive/warehouse/demo/_temporary/0/_temporary/attempt_20180930174428__m_07_0 (exists=false, cwd=file:/Users/jacek/dev/apps/spark-2.3.2-bin-hadoop2.7)
  at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:455)
  at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:440)
  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)
  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:892)
  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:789)
  at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:241)
  at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:342)
  at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:302)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:151)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:367)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:378)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
  at
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
  at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1415)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:109)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Sat, Sep 29, 2018 at 9:50 PM Sean Owen wrote:
> Looks like a permission issue? Are you sure that isn't the difference,
> first?
>
> On Sat, Sep 29, 2018, 1:54 PM Jacek Laskowski wrote:
>
>> Hi,
>>
>> The following query fails in 2.3.2:
>>
>> scala> spark.range(10).write.saveAsTable("t1")
>> ...
>> 2018-09-29 20:48:06 ERROR FileOutputCommitter:314 - Mkdirs fai
saveAsTable in 2.3.2 throws IOException while 2.3.1 works fine?
Hi,

The following query fails in 2.3.2:

scala> spark.range(10).write.saveAsTable("t1")
...
2018-09-29 20:48:06 ERROR FileOutputCommitter:314 - Mkdirs failed to create file:/user/hive/warehouse/bucketed/_temporary/0
2018-09-29 20:48:07 ERROR Utils:91 - Aborting task
java.io.IOException: Mkdirs failed to create file:/user/hive/warehouse/bucketed/_temporary/0/_temporary/attempt_20180929204807__m_03_0 (exists=false, cwd=file:/Users/jacek/dev/apps/spark-2.3.2-bin-hadoop2.7)
  at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:455)
  at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:440)
  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)

While it works fine in 2.3.1. Could anybody explain the change in behaviour in 2.3.2? The commit / the JIRA issue would be even nicer. Thanks.

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski
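[Editor's note] One hedged way to narrow this down (a diagnostic sketch, untested, assuming the default warehouse location is what changed between the two releases): pin spark.sql.warehouse.dir explicitly and check whether both versions then behave the same:

```scala
import org.apache.spark.sql.SparkSession

// Diagnostic sketch (untested): pin the warehouse directory so the write
// no longer depends on whatever default 2.3.1 vs 2.3.2 resolves.
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "spark-warehouse") // relative to the cwd
  .getOrCreate()

spark.range(10).write.saveAsTable("t1")
```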
Re: Need help with HashAggregateExec, TungstenAggregationIterator and UnsafeFixedWidthAggregationMap
Hi Herman,

Right. No @deprecated, but something that would tell people who review the code "be extra careful since you're reading code that is no longer in use" for SparkPlans that do support WSCG. That would help a lot, as I've been tricked a few times already while trying to understand something I should not have been bothered with. Thanks Russ and Herman for helping me get my thinking right. That will also help my Spark clients, esp. during Spark SQL workshops!

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Sat, Sep 8, 2018 at 3:53 PM Herman van Hovell wrote:
> ...pressed send too early...
>
> Moreover, we can't always use whole-stage code generation. In that case
> we fall back to Volcano-style execution, and chain together doExecute()
> calls.
>
> On Sat, Sep 8, 2018 at 3:51 PM Herman van Hovell wrote:
>
>> SparkPlan.doExecute() is the only way you can execute a physical SQL
>> plan, so it should *not* be marked as deprecated. Whole-stage code
>> generation collapses a subtree of SparkPlans (that support whole-stage
>> code generation) into a single WholeStageCodegenExec physical plan.
>> During execution we call doExecute() on the WholeStageCodegenExec node.
>>
>> On Sat, Sep 8, 2018 at 11:55 AM Jacek Laskowski wrote:
>>
>>> Thanks Russ! That helps a lot.
>>>
>>> On the other hand makes reviewing the codebase of Spark SQL slightly
>>> harder since Java code generation is so much about string concatenation :(
>>>
>>> p.s. Should all the code in doExecute be considered and marked
>>> @deprecated?
>>> >>> Pozdrawiam, >>> Jacek Laskowski >>> >>> https://about.me/JacekLaskowski >>> Mastering Spark SQL https://bit.ly/mastering-spark-sql >>> Spark Structured Streaming https://bit.ly/spark-structured-streaming >>> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams >>> Follow me at https://twitter.com/jaceklaskowski >>> >>> >>> On Fri, Sep 7, 2018 at 10:05 PM Russell Spitzer < >>> russell.spit...@gmail.com> wrote: >>> >>>> That's my understanding :) doExecute is for non-codegen while doProduce >>>> and Consume are for generating code >>>> >>>> On Fri, Sep 7, 2018 at 2:59 PM Jacek Laskowski wrote: >>>> >>>>> Hi Devs, >>>>> >>>>> Sorry for bothering you with my questions (and concerns), but I really >>>>> need to understand this piece of code (= my personal challenge :)) >>>>> >>>>> Is this true that SparkPlan.doExecute (to "execute" a physical >>>>> operator) is only used when whole-stage code gen is disabled (which is not >>>>> by default)? May I call this execution path traditional (even >>>>> "old-fashioned")? >>>>> >>>>> Is this true that these days SparkPlan.doProduce and >>>>> SparkPlan.doConsume (and others) are used for "executing" a physical >>>>> operator (i.e. to generate the Java source code) since whole-stage code >>>>> generation is enabled and is currently the proper execution path? >>>>> >>>>> p.s. This SparkPlan.doExecute is used to trigger whole-stage code gen >>>>> by WholeStageCodegenExec (and InputAdapter), but that's all the code that >>>>> is to be executed by doExecute, isn't it? 
>>>>> >>>>> Pozdrawiam, >>>>> Jacek Laskowski >>>>> >>>>> https://about.me/JacekLaskowski >>>>> Mastering Spark SQL https://bit.ly/mastering-spark-sql >>>>> Spark Structured Streaming https://bit.ly/spark-structured-streaming >>>>> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams >>>>> Follow me at https://twitter.com/jaceklaskowski >>>>> >>>>> >>>>> On Fri, Sep 7, 2018 at 7:24 PM Jacek Laskowski >>>>> wrote: >>>>> >>>>>> Hi Spark Devs, >>>>>> >>>>>> I really need your help understanding the relationship >>>>>> between HashAggregateExec, TungstenAggregationIterator and >>>>>> UnsafeFixedWidthAggregationMap. >>>>>> >>>>>> W
Re: Need help with HashAggregateExec, TungstenAggregationIterator and UnsafeFixedWidthAggregationMap
Thanks Russ! That helps a lot.

On the other hand, it makes reviewing the Spark SQL codebase slightly harder, since Java code generation is so much about string concatenation :(

p.s. Should all the code in doExecute be considered and marked @deprecated?

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Fri, Sep 7, 2018 at 10:05 PM Russell Spitzer wrote:
> That's my understanding :) doExecute is for non-codegen while doProduce
> and doConsume are for generating code
>
> On Fri, Sep 7, 2018 at 2:59 PM Jacek Laskowski wrote:
>
>> Hi Devs,
>>
>> Sorry for bothering you with my questions (and concerns), but I really
>> need to understand this piece of code (= my personal challenge :))
>>
>> Is this true that SparkPlan.doExecute (to "execute" a physical operator)
>> is only used when whole-stage code gen is disabled (which is not by
>> default)? May I call this execution path traditional (even "old-fashioned")?
>>
>> Is this true that these days SparkPlan.doProduce and SparkPlan.doConsume
>> (and others) are used for "executing" a physical operator (i.e. to generate
>> the Java source code) since whole-stage code generation is enabled and is
>> currently the proper execution path?
>>
>> p.s. This SparkPlan.doExecute is used to trigger whole-stage code gen
>> by WholeStageCodegenExec (and InputAdapter), but that's all the code that
>> is to be executed by doExecute, isn't it?
>> >> Pozdrawiam, >> Jacek Laskowski >> >> https://about.me/JacekLaskowski >> Mastering Spark SQL https://bit.ly/mastering-spark-sql >> Spark Structured Streaming https://bit.ly/spark-structured-streaming >> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams >> Follow me at https://twitter.com/jaceklaskowski >> >> >> On Fri, Sep 7, 2018 at 7:24 PM Jacek Laskowski wrote: >> >>> Hi Spark Devs, >>> >>> I really need your help understanding the relationship >>> between HashAggregateExec, TungstenAggregationIterator and >>> UnsafeFixedWidthAggregationMap. >>> >>> While exploring UnsafeFixedWidthAggregationMap and how it's used I've >>> noticed that it's for HashAggregateExec and TungstenAggregationIterator >>> exclusively. And given that TungstenAggregationIterator is used exclusively >>> in HashAggregateExec and the use of UnsafeFixedWidthAggregationMap in both >>> seems to be almost the same (if not the same), I've got a question I cannot >>> seem to answer myself. >>> >>> Since HashAggregateExec supports Whole-Stage Codegen >>> HashAggregateExec.doExecute won't be used at all, but doConsume and >>> doProduce (unless codegen is disabled). Is that correct? >>> >>> If so, TungstenAggregationIterator is not used at all, but >>> UnsafeFixedWidthAggregationMap is used directly instead (in the Java code >>> that uses createHashMap or finishAggregate). Is that correct? >>> >>> Pozdrawiam, >>> Jacek Laskowski >>> >>> https://about.me/JacekLaskowski >>> Mastering Spark SQL https://bit.ly/mastering-spark-sql >>> Spark Structured Streaming https://bit.ly/spark-structured-streaming >>> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams >>> Follow me at https://twitter.com/jaceklaskowski >>> >>
Re: Need help with HashAggregateExec, TungstenAggregationIterator and UnsafeFixedWidthAggregationMap
Hi Devs,

Sorry for bothering you with my questions (and concerns), but I really need to understand this piece of code (= my personal challenge :))

Is it true that SparkPlan.doExecute (to "execute" a physical operator) is only used when whole-stage code gen is disabled (which it is not by default)? May I call this execution path traditional (even "old-fashioned")?

Is it true that these days SparkPlan.doProduce and SparkPlan.doConsume (and others) are used for "executing" a physical operator (i.e. to generate the Java source code), since whole-stage code generation is enabled and is currently the proper execution path?

p.s. This SparkPlan.doExecute is used to trigger whole-stage code gen by WholeStageCodegenExec (and InputAdapter), but that's all the code that is to be executed by doExecute, isn't it?

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Fri, Sep 7, 2018 at 7:24 PM Jacek Laskowski wrote:
> Hi Spark Devs,
>
> I really need your help understanding the relationship
> between HashAggregateExec, TungstenAggregationIterator and
> UnsafeFixedWidthAggregationMap.
>
> While exploring UnsafeFixedWidthAggregationMap and how it's used I've
> noticed that it's for HashAggregateExec and TungstenAggregationIterator
> exclusively. And given that TungstenAggregationIterator is used exclusively
> in HashAggregateExec and the use of UnsafeFixedWidthAggregationMap in both
> seems to be almost the same (if not the same), I've got a question I cannot
> seem to answer myself.
>
> Since HashAggregateExec supports Whole-Stage Codegen,
> HashAggregateExec.doExecute won't be used at all, but doConsume and
> doProduce (unless codegen is disabled). Is that correct?
> > If so, TungstenAggregationIterator is not used at all, but > UnsafeFixedWidthAggregationMap is used directly instead (in the Java code > that uses createHashMap or finishAggregate). Is that correct? > > Pozdrawiam, > Jacek Laskowski > > https://about.me/JacekLaskowski > Mastering Spark SQL https://bit.ly/mastering-spark-sql > Spark Structured Streaming https://bit.ly/spark-structured-streaming > Mastering Kafka Streams https://bit.ly/mastering-kafka-streams > Follow me at https://twitter.com/jaceklaskowski >
Need help with HashAggregateExec, TungstenAggregationIterator and UnsafeFixedWidthAggregationMap
Hi Spark Devs,

I really need your help understanding the relationship between HashAggregateExec, TungstenAggregationIterator and UnsafeFixedWidthAggregationMap.

While exploring UnsafeFixedWidthAggregationMap and how it's used, I've noticed that it's used by HashAggregateExec and TungstenAggregationIterator exclusively. And given that TungstenAggregationIterator is used exclusively in HashAggregateExec, and the use of UnsafeFixedWidthAggregationMap in both seems to be almost the same (if not the same), I've got a question I cannot seem to answer myself.

Since HashAggregateExec supports Whole-Stage Codegen, HashAggregateExec.doExecute won't be used at all, but doConsume and doProduce will (unless codegen is disabled). Is that correct?

If so, TungstenAggregationIterator is not used at all, but UnsafeFixedWidthAggregationMap is used directly instead (in the Java code that uses createHashMap or finishAggregate). Is that correct?

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski
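[Editor's note] A quick way to observe both execution paths discussed in this thread is to flip spark.sql.codegen.wholeStage and compare the plans (a sketch; in explain output, operators collapsed into WholeStageCodegenExec are marked with a '*' prefix):

```scala
// With whole-stage codegen enabled (the default), HashAggregateExec is
// collapsed into WholeStageCodegenExec, so doProduce/doConsume generate
// the Java code that drives the aggregation.
spark.conf.set("spark.sql.codegen.wholeStage", true)
spark.range(10).groupBy().count().explain()

// With it disabled, the doExecute() path runs instead (per the thread,
// backed by TungstenAggregationIterator for HashAggregateExec).
spark.conf.set("spark.sql.codegen.wholeStage", false)
spark.range(10).groupBy().count().explain()
```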
Why is View logical operator not a UnaryNode explicitly?
Hi,

I've just come across the View logical operator, which does not explicitly extend UnaryNode (i.e. there is no "extends UnaryNode"). Why?

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala?utf8=%E2%9C%93#L460-L463

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski
Same code in DataFrameWriter.runCommand and Dataset.withAction?
Hi, I'm curious why Spark SQL uses two different methods for the seemingly very same code? * DataFrameWriter.runCommand --> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L663 * Dataset.withAction --> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L3317 It looks like the relationship is as follows: DataFrameWriter.runCommand == Dataset.withAction(_.execute) Should one be removed for the other? I'd first change runCommand to use withAction(_.execute) or even remove runCommand altogether. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Kafka Streams https://bit.ly/mastering-kafka-streams Follow me at https://twitter.com/jaceklaskowski
Re: Why is SQLImplicits an abstract class rather than a trait?
Hi Assaf,

No idea (and I don't remember ever wondering about it before), but why not do this (untested):

trait MySparkTestTrait {
  lazy val spark: SparkSession = SparkSession.builder().getOrCreate() // <-- you sure you don't need master?
  import spark.implicits._
}

Wouldn't that import work?

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Sun, Aug 5, 2018 at 5:34 PM, assaf.mendelson wrote:
> Hi all,
>
> I have been playing a bit with SQLImplicits and noticed that it is an
> abstract class. I was wondering why is that? It has no constructor.
>
> Because it is an abstract class, a test trait cannot extend it and still
> be a trait.
>
> Consider the following:
>
> trait MySparkTestTrait extends SQLImplicits {
>   lazy val spark: SparkSession = SparkSession.builder().getOrCreate()
>   protected override def _sqlContext: SQLContext = spark.sqlContext
> }
>
> This would mean that I could do something like this:
>
> class MyTestClass extends FunSuite with MySparkTestTrait {
>   test("SomeTest") {
>     // use spark implicits without needing to do import spark.implicits._
>   }
> }
>
> Is there a reason for this being an abstract class?
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
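[Editor's note] One pattern that works around the abstract-class limitation (an untested sketch, modeled on how Spark's own test utilities expose implicits) is to keep the trait but let an inner object extend SQLImplicits, since an object can extend an abstract class even though a trait cannot:

```scala
import org.apache.spark.sql.{SQLContext, SQLImplicits, SparkSession}

trait MySparkTestTrait {
  lazy val spark: SparkSession =
    SparkSession.builder().master("local[*]").getOrCreate()

  // An inner object can extend the abstract class, so the trait
  // remains a trait and tests can still mix it in.
  object testImplicits extends SQLImplicits {
    protected override def _sqlContext: SQLContext = spark.sqlContext
  }
}
```

A test mixing in the trait then does `import testImplicits._` inside the test class body.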
Re: Am I crazy, or does the binary distro not have Kafka integration?
Hi Sean,

For years, I'd say, you've had to specify --packages to get the Kafka-related jars on the classpath. I simply got used to this annoyance (as did others). Could it be that it's an external package (although an integral part of Spark)?!

I'm very glad you've brought it up, since I think the Kafka data source is so important that it should be included in spark-shell and spark-submit by default. THANKS!

Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Sat, Aug 4, 2018 at 9:56 PM, Sean Owen wrote:
> Let's take this to https://issues.apache.org/jira/browse/SPARK-25026 -- I
> provisionally marked this a Blocker, as if it's correct, then the release
> is missing an important piece and we'll want to remedy that ASAP. I still
> have this feeling I am missing something. The classes really aren't there
> in the release but ... *nobody* noticed all this time? I guess maybe
> Spark-Kafka users may be using a vendor distro that does package these bits.
>
> On Sat, Aug 4, 2018 at 10:48 AM Sean Owen wrote:
>
>> I was debugging why a Kafka-based streaming app doesn't seem to find
>> Kafka-related integration classes when run standalone from our latest 2.3.1
>> release, and noticed that there doesn't seem to be any Kafka-related jars
>> from Spark in the distro.
>> In jars/, I see:
>>
>> spark-catalyst_2.11-2.3.1.jar
>> spark-core_2.11-2.3.1.jar
>> spark-graphx_2.11-2.3.1.jar
>> spark-hive-thriftserver_2.11-2.3.1.jar
>> spark-hive_2.11-2.3.1.jar
>> spark-kubernetes_2.11-2.3.1.jar
>> spark-kvstore_2.11-2.3.1.jar
>> spark-launcher_2.11-2.3.1.jar
>> spark-mesos_2.11-2.3.1.jar
>> spark-mllib-local_2.11-2.3.1.jar
>> spark-mllib_2.11-2.3.1.jar
>> spark-network-common_2.11-2.3.1.jar
>> spark-network-shuffle_2.11-2.3.1.jar
>> spark-repl_2.11-2.3.1.jar
>> spark-sketch_2.11-2.3.1.jar
>> spark-sql_2.11-2.3.1.jar
>> spark-streaming_2.11-2.3.1.jar
>> spark-tags_2.11-2.3.1.jar
>> spark-unsafe_2.11-2.3.1.jar
>> spark-yarn_2.11-2.3.1.jar
>>
>> I checked make-distribution.sh, and it copies a bunch of JARs into the
>> distro, but does not seem to touch the kafka modules.
>>
>> Am I crazy or missing something obvious -- those should be in the
>> release, right?
Qs on Dataset API -- groups of createXXXTempViews and XXXcheckpoint methods
Hi,

I'd appreciate your help with the following two questions about the Dataset API:

1. Why do the Dataset methods createTempView, createOrReplaceTempView, createGlobalTempView and createOrReplaceGlobalTempView not return a DataFrame? They seem to be neither actions nor transformations (which is probably the reason for the separate "basic" group). I wonder why they don't return a DataFrame. Where's the catch?

2. Why are localCheckpoint and checkpoint in the "basic" group? They would fit among the untyped transformations, since they do transform a Dataset into a DataFrame.

Again, a bit of help would be much appreciated.

Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski
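[Editor's note: a small, untested sketch making the asymmetry in the two questions concrete; `spark` is assumed to be an active SparkSession.]

```scala
val df = spark.range(5).toDF("id")

// createOrReplaceTempView registers the DataFrame in the catalog as a side
// effect and returns Unit, so it cannot be chained like a transformation:
df.createOrReplaceTempView("ids")
val viaView = spark.table("ids") // the view has to be read back explicitly

// localCheckpoint, by contrast, does return a Dataset and chains naturally:
val truncated = df.localCheckpoint().filter("id > 2")
```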
Re: JDBC Data Source and customSchema option but DataFrameReader.assertNoSpecifiedSchema?
Hi Joseph,

Thanks for your explanation. It makes a lot of sense, and I found http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases which gives more detail.

With that, and after I reviewed the code, the customSchema option simply overrides the data types of the fields in a relation schema [1][2]. I think the option's name should include the word "override" to convey the exact meaning, shouldn't it?

With that said, I think the description of the customSchema option may be slightly incorrect. For example, it says: "The custom schema to use for reading data from JDBC connectors", and although it is used for reading, it merely overrides the data types and may not match the fields at all. Is that correct? The word "type" only appears in the following sentence: "You can also specify partial fields, and the others use the default type mapping." But that begs another question: what is "the default type mapping"? That was one of my questions when I first found the option.

What do you think about the following description of the customSchema option? You're welcome to make further changes if needed.

customSchema - Specifies the custom data types of the read schema (that is used at load time). customSchema is a comma-separated list of field definitions with column names and their data types in a canonical SQL representation, e.g. id DECIMAL(38, 0), name STRING. customSchema defines the data types of the columns that override the data types inferred from the table schema and follows this pattern:

colTypeList
    : colType (',' colType)*
    ;

colType
    : identifier dataType (COMMENT STRING)?
    ;

dataType
    : complex=ARRAY '<' dataType '>'                             #complexDataType
    | complex=MAP '<' dataType ',' dataType '>'                  #complexDataType
    | complex=STRUCT ('<' complexColTypeList? '>' | NEQ)         #complexDataType
    | identifier ('(' INTEGER_VALUE (',' INTEGER_VALUE)* ')')?   #primitiveDataType
    ;

Should I file a JIRA task for this?

[1] https://github.com/apache/spark/blob/v2.3.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala?utf8=%E2%9C%93#L116-L118
[2] https://github.com/apache/spark/blob/v2.3.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L785-L788

Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Mon, Jul 16, 2018 at 4:27 PM, Joseph Torres wrote:
> I guess the question is partly about the semantics of
> DataFrameReader.schema. If it's supposed to mean "the loaded dataframe will
> definitely have exactly this schema", that doesn't quite match the behavior
> of the customSchema option. If it's only meant to be an arbitrary schema
> input which the source can interpret however it wants, it'd be fine.
>
> The second semantic is IMO more useful, so I'm in favor here.
>
> On Mon, Jul 16, 2018 at 3:43 AM, Jacek Laskowski wrote:
>
>> Hi,
>>
>> I think there is a sort of inconsistency in how DataFrameReader.jdbc
>> deals with a user-defined schema, as it makes sure that there's no
>> user-specified schema [1][2] yet allows for setting one via the
>> customSchema option [3]. Why is that so? Has this been merely overlooked?
>>
>> I think assertNoSpecifiedSchema should be removed from
>> DataFrameReader.jdbc and support for DataFrameReader.schema for jdbc
>> should be added (with the customSchema option marked as deprecated, to be
>> removed in 2.4 or 3.0).
>>
>> Should I file an issue in Spark JIRA and do the changes? WDYT?
>>
>> [1] https://github.com/apache/spark/blob/v2.3.1/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala?utf8=%E2%9C%93#L249
>> [2] https://github.com/apache/spark/blob/v2.3.1/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala?utf8=%E2%9C%93#L320
>> [3] https://github.com/apache/spark/blob/v2.3.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala#L167
>>
>> Pozdrawiam,
>> Jacek Laskowski
>>
>> https://about.me/JacekLaskowski
>> Mastering Spark SQL https://bit.ly/mastering-spark-sql
>> Spark Structured Streaming https://bit.ly/spark-structured-streaming
>> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
>> Follow me at https://twitter.com/jaceklaskowski
JDBC Data Source and customSchema option but DataFrameReader.assertNoSpecifiedSchema?
Hi,

I think there is a sort of inconsistency in how DataFrameReader.jdbc deals with a user-defined schema, as it makes sure that there's no user-specified schema [1][2] yet allows for setting one via the customSchema option [3]. Why is that so? Has this been merely overlooked?

I think assertNoSpecifiedSchema should be removed from DataFrameReader.jdbc and support for DataFrameReader.schema for jdbc should be added (with the customSchema option marked as deprecated, to be removed in 2.4 or 3.0).

Should I file an issue in Spark JIRA and do the changes? WDYT?

[1] https://github.com/apache/spark/blob/v2.3.1/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala?utf8=%E2%9C%93#L249
[2] https://github.com/apache/spark/blob/v2.3.1/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala?utf8=%E2%9C%93#L320
[3] https://github.com/apache/spark/blob/v2.3.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala#L167

Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski
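[Editor's note: a minimal, untested sketch of how the customSchema option under discussion is used; the JDBC URL and table name are placeholders.]

```scala
// customSchema does not replace the schema wholesale -- it overrides the
// data types that would otherwise be inferred from the JDBC metadata,
// which is why it coexists awkwardly with DataFrameReader.schema.
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/test") // placeholder
  .option("dbtable", "people")                       // placeholder
  .option("customSchema", "id DECIMAL(38, 0), name STRING")
  .load()
```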
Re: [ANNOUNCE] Announcing Apache Spark 2.3.1
Hi Marcelo,

How do we announce it on twitter at https://twitter.com/apachespark? And how do we make that part of the release process?

Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Mon, Jun 11, 2018 at 9:47 PM, Marcelo Vanzin wrote:
> We are happy to announce the availability of Spark 2.3.1!
>
> Apache Spark 2.3.1 is a maintenance release, based on the branch-2.3
> maintenance branch of Spark. We strongly recommend all 2.3.x users to
> upgrade to this stable release.
>
> To download Spark 2.3.1, head over to the download page:
> http://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-2-3-1.html
>
> We would like to acknowledge all community members for contributing to
> this release. This release would not have been possible without you.
>
> --
> Marcelo
>
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Re: [SQL] Purpose of RuntimeReplaceable unevaluable unary expressions?
Yay! That's right!!! Thanks, Reynold. Such a short answer with so much information.

Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Wed, May 30, 2018 at 8:10 PM, Reynold Xin wrote:
> SQL expressions?
>
> On Wed, May 30, 2018 at 11:09 AM Jacek Laskowski wrote:
>
>> Hi,
>>
>> I've been exploring RuntimeReplaceable expressions [1] and have been
>> wondering what their purpose is.
>>
>> Quoting the scaladoc [2]:
>>
>> > An expression that gets replaced at runtime (currently by the
>> optimizer) into a different expression for evaluation. This is mainly used
>> to provide compatibility with other databases.
>>
>> For example, the ParseToTimestamp expression is a RuntimeReplaceable
>> expression and is replaced by Cast(left, TimestampType) or
>> Cast(UnixTimestamp(left, format), TimestampType) per the to_timestamp
>> function (there are two variants).
>>
>> My question is: why is this RuntimeReplaceable better than simply using
>> the Casts as the implementation of the to_timestamp functions?
>>
>> def to_timestamp(s: Column, fmt: String): Column = withExpr {
>>   // pseudocode
>>   Cast(UnixTimestamp(left, format), TimestampType)
>> }
>>
>> What's wrong with the above implementation compared to the current one?
>>
>> [1] https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L275
>> [2] https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L266-L267
>>
>> Pozdrawiam,
>> Jacek Laskowski
>>
>> https://about.me/JacekLaskowski
>> Mastering Spark SQL https://bit.ly/mastering-spark-sql
>> Spark Structured Streaming https://bit.ly/spark-structured-streaming
>> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
>> Follow me at https://twitter.com/jaceklaskowski
[SQL] Purpose of RuntimeReplaceable unevaluable unary expressions?
Hi, I've been exploring RuntimeReplaceable expressions [1] and have been wondering what their purpose is. Quoting the scaladoc [2]: > An expression that gets replaced at runtime (currently by the optimizer) into a different expression for evaluation. This is mainly used to provide compatibility with other databases. For example, ParseToTimestamp expression is a RuntimeReplaceable expression and it is replaced by Cast(left, TimestampType) or Cast(UnixTimestamp(left, format), TimestampType) per to_timestamp function (there are two variants). My question is why is this RuntimeReplaceable better than simply using the Casts as the implementation of to_timestamp functions? def to_timestamp(s: Column, fmt: String): Column = withExpr { // pseudocode Cast(UnixTimestamp(left, format), TimestampType) } What's wrong with the above implementation compared to the current one? [1] https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L275 [2] https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L266-L267 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Kafka Streams https://bit.ly/mastering-kafka-streams Follow me at https://twitter.com/jaceklaskowski
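[Editor's note: the alternative implementation asked about, spelled out slightly more fully. This is pseudocode-level and untested; `withExpr` is assumed to be the private helper used throughout functions.scala, and the summary of Reynold's eventual reply is an interpretation, not his exact words.]

```scala
// Pseudocode: implement to_timestamp directly with Cast/UnixTimestamp,
// bypassing the RuntimeReplaceable ParseToTimestamp node entirely.
def to_timestamp(s: Column, fmt: String): Column = withExpr {
  Cast(UnixTimestamp(s.expr, Literal(fmt)), TimestampType)
}
// One reading of the "SQL expressions?" answer: a RuntimeReplaceable node
// registers the function as a named expression for the SQL parser/analyzer,
// so SQL queries (not just the Scala DSL) can call to_timestamp, and plans
// can show the friendly name before the optimizer rewrites it into Casts.
```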
Re: [SQL] Understanding RewriteCorrelatedScalarSubquery optimization (and TreeNode.transform)
Thanks, Herman! You've helped me a lot (and I'm going to quote your fine explanation in my gitbook when needed! :))

p.s. I also found this today, which helped me too --> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2728434780191932/1483312212640900/6987336228780374/latest.html

What about tests? Don't you think the method should have tests? I'm writing small code snippets anyway while exploring it, and have been wondering how to contribute them back to the Spark project given that the method is private. It looks like I should instead focus on the methods of Expression or even QueryPlan (as that's what triggered my question). Thanks.

Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Mon, May 28, 2018 at 11:33 AM, Herman van Hövell tot Westerflier <her...@databricks.com> wrote:
> Hi Jacek,
>
> RewriteCorrelatedScalarSubquery rewrites a plan containing a scalar
> subquery into a left join and a projection/filter/aggregate.
>
> For example:
>
> SELECT a_id,
>        (SELECT MAX(value)
>         FROM tbl_b
>         WHERE tbl_b.b_id = tbl_a.a_id) AS max_value
> FROM tbl_a
>
> will be rewritten into something like this:
>
> SELECT a_id,
>        agg.max_value
> FROM tbl_a
> JOIN (SELECT b_id,
>              MAX(value) AS max_value
>       FROM tbl_b
>       GROUP BY b_id) agg
>   ON agg.b_id = tbl_a.a_id
>
> The particular function you refer to extracts all correlated scalar
> subqueries by adding them to the subqueries buffer and rewrites the
> expression to use the output of the subquery (the s.plan.output.head bit).
> This returned expression then replaces the original expression in the
> operator (Aggregate, Project or Filter) it came from, completing the
> rewrite for that operator.
> The extracted subqueries are planned as LEFT OUTER JOINs below the operator.
>
> I hope this makes sense.
>
> Herman
>
> On Sun, May 27, 2018 at 9:43 PM Jacek Laskowski wrote:
>
>> Hi,
>>
>> I'm trying to understand the RewriteCorrelatedScalarSubquery optimization
>> and how extractCorrelatedScalarSubqueries [1] works. I don't understand
>> how "The expression is rewritten and returned." is done. How is the
>> expression rewritten?
>>
>> Since it's private, it's not even possible to write tests, and that got me
>> thinking: how do you go about code like this? How do you know whether it
>> works fine or not? Any help would be appreciated.
>>
>> [1] https://github.com/apache/spark/blob/branch-2.3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala?utf8=%E2%9C%93#L290-L299
>>
>> Pozdrawiam,
>> Jacek Laskowski
>>
>> https://about.me/JacekLaskowski
>> Mastering Spark SQL https://bit.ly/mastering-spark-sql
>> Spark Structured Streaming https://bit.ly/spark-structured-streaming
>> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
>> Follow me at https://twitter.com/jaceklaskowski
[SQL] Two ScalarSubquery expressions?! Could we have ScalarSubqueryExec instead?
Hi,

I've just been bitten by finding out that there are two ScalarSubquery expressions: one is a SubqueryExpression [1] and the other is an ExecSubqueryExpression [2]. Could we have the ExecSubqueryExpression subclass called ScalarSubqueryExec, to follow the naming convention of logical and physical plans? That'd help a lot (and I doubt I'm the only one confused). Thanks.

[1] https://github.com/apache/spark/blob/branch-2.3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala?utf8=%E2%9C%93#L246
[2] https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala?utf8=%E2%9C%93#L46

Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski
[SQL] Understanding RewriteCorrelatedScalarSubquery optimization (and TreeNode.transform)
Hi, I'm trying to understand RewriteCorrelatedScalarSubquery optimization and how extractCorrelatedScalarSubqueries [1] works. I don't understand how "The expression is rewritten and returned." is done. How is the expression rewritten? Since it's private it's not even possible to write tests and that got me thinking how you go about code like this? How do you know whether it works fine or not? Any help? I'd appreciate. [1] https://github.com/apache/spark/blob/branch-2.3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala?utf8=%E2%9C%93#L290-L299 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Kafka Streams https://bit.ly/mastering-kafka-streams Follow me at https://twitter.com/jaceklaskowski
Re: Spark version for Mesos 0.27.0
Hi, Mesos 0.27.0?! That's been a while. I'd search for the changes to pom.xml and see when the mesos dependency version changed. That'd give you the most precise answer. I think it could've been 1.5 or older. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Kafka Streams https://bit.ly/mastering-kafka-streams Follow me at https://twitter.com/jaceklaskowski On Fri, May 25, 2018 at 1:29 PM, Thodoris Zois wrote: > Hello, > > Could you please tell me which version of Spark works with Apache Mesos > version 0.27.0? (I cannot find anything on docs at github) > > Thank you very much, > Thodoris Zois > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > >
Repeated FileSourceScanExec.metrics from ColumnarBatchScan.metrics
Hi,

I'm wondering why the metrics are repeated in FileSourceScanExec.metrics [1], since it is a ColumnarBatchScan [2] and so inherits the two metrics numOutputRows and scanTime from ColumnarBatchScan.metrics [3]. Shouldn't FileSourceScanExec.metrics then be as follows:

override lazy val metrics = super.metrics ++ Map(
  "numFiles" -> SQLMetrics.createMetric(sparkContext, "number of files"),
  "metadataTime" -> SQLMetrics.createMetric(sparkContext, "metadata time (ms)"))

I'd like to send a pull request with a fix if no one objects. Anyone?

[1] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L315-L319
[2] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L164
[3] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ColumnarBatchScan.scala#L38-L40

Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski
InMemoryTableScanExec.inputRDD and buffers (RDD[CachedBatch])
Hi, Is there any reason why InMemoryTableScanExec.inputRDD does not use the buffers local value [1] for the non-batch case [2]? Just curious as I ran into it and thought I'd do a tiny refactoring. [1] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala?utf8=%E2%9C%93#L105 [2] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala?utf8=%E2%9C%93#L125 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Kafka Streams https://bit.ly/mastering-kafka-streams Follow me at https://twitter.com/jaceklaskowski