Re: [VOTE] Release Spark 3.4.1 (RC1)
+1

Builds and runs fine on Java 17, macOS.

$ ./dev/change-scala-version.sh 2.13
$ mvn \
    -Pkubernetes,hadoop-cloud,hive,hive-thriftserver,scala-2.13,volcano,connect \
    -DskipTests \
    clean install
$ python/run-tests --parallelism=1 --testnames 'pyspark.sql.session SparkSession.sql'
...
Tests passed in 28 seconds

Pozdrawiam,
Jacek Laskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

On Tue, Jun 20, 2023 at 4:41 AM Dongjoon Hyun wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 3.4.1.
>
> The vote is open until June 23rd 1AM (PST) and passes if a majority +1 PMC
> votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.4.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v3.4.1-rc1 (commit
> 6b1ff22dde1ead51cbf370be6e48a802daae58b6):
> https://github.com/apache/spark/tree/v3.4.1-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.4.1-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1443/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.4.1-rc1-docs/
>
> The list of bug fixes going into 3.4.1 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12352874
>
> This release is using the release script of the tag v3.4.1-rc1.
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running it on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks; in Java/Scala,
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.4.1?
> ===
>
> The current list of open tickets targeted at 3.4.1 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.4.1
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
Re: [VOTE][SPIP] PySpark Test Framework
+0 Pozdrawiam, Jacek Laskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Wed, Jun 21, 2023 at 5:11 PM Amanda Liu wrote: > Hi all, > > I'd like to start the vote for SPIP: PySpark Test Framework. > > The high-level summary for the SPIP is that it proposes an official test > framework for PySpark. Currently, there are only disparate open-source > repos and blog posts for PySpark testing resources. We can streamline and > simplify the testing process by incorporating test features, such as a > PySpark Test Base class (which allows tests to share Spark sessions) and > test util functions (for example, asserting dataframe and schema equality). > > *SPIP doc:* > https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v > > *JIRA ticket:* https://issues.apache.org/jira/browse/SPARK-44042 > > *Discussion thread:* > https://lists.apache.org/thread/trwgbgn3ycoj8b8k8lkxko2hql23o41n > > Please vote on the SPIP for the next 72 hours: > [ ] +1: Accept the proposal as an official SPIP > [ ] +0 > [ ] -1: I don’t think this is a good idea because __. > > Thank you! > > Best, > Amanda Liu >
Re: [DISCUSS] SPIP: Python Data Source API
Hi Allison and devs, Although I was against this idea at first sight (probably because I'm a Scala dev), I think it could work as long as there are people who'd be interested in such an API. Were there any? I'm just curious. I've seen no emails requesting it. I also doubt that Python devs would like to work on new data sources but support their wishes wholeheartedly :) Pozdrawiam, Jacek Laskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Fri, Jun 16, 2023 at 6:14 AM Allison Wang wrote: > Hi everyone, > > I would like to start a discussion on “Python Data Source API”. > > This proposal aims to introduce a simple API in Python for Data Sources. > The idea is to enable Python developers to create data sources without > having to learn Scala or deal with the complexities of the current data > source APIs. The goal is to make a Python-based API that is simple and easy > to use, thus making Spark more accessible to the wider Python developer > community. This proposed approach is based on the recently introduced > Python user-defined table functions with extensions to support data sources. > > *SPIP Doc*: > https://docs.google.com/document/d/1oYrCKEKHzznljYfJO4kx5K_Npcgt1Slyfph3NEk7JRU/edit?usp=sharing > > *SPIP JIRA*: https://issues.apache.org/jira/browse/SPARK-44076 > > Looking forward to your feedback. > > Thanks, > Allison >
Re: [VOTE] Release Apache Spark 3.4.0 (RC7)
+1

* Built fine with Scala 2.13 and
  -Pkubernetes,hadoop-cloud,hive,hive-thriftserver,scala-2.13,volcano
* Ran some demos on Java 17
* Mac mini / Apple M2 Pro / Ventura 13.3.1

Pozdrawiam,
Jacek Laskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

On Sat, Apr 8, 2023 at 1:30 AM Xinrong Meng wrote:

> Please vote on releasing the following candidate (RC7) as Apache Spark
> version 3.4.0.
>
> The vote is open until 11:59pm Pacific time *April 12th* and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.4.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.4.0-rc7 (commit
> 87a5442f7ed96b11051d8a9333476d080054e5a0):
> https://github.com/apache/spark/tree/v3.4.0-rc7
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc7-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1441
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc7-docs/
>
> The list of bug fixes going into 3.4.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12351465
>
> This release is using the release script of the tag v3.4.0-rc7.
>
> FAQ
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running it on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks; in Java/Scala,
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.4.0?
> ===
> The current list of open tickets targeted at 3.4.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.4.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
> Thanks,
> Xinrong Meng
Re: [VOTE] Release Apache Spark 3.4.0 (RC5)
+1

Compiled on Java 17 with Scala 2.13 on macOS and ran some basic code.

Pozdrawiam,
Jacek Laskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

On Thu, Mar 30, 2023 at 10:21 AM Xinrong Meng wrote:

> Please vote on releasing the following candidate (RC5) as Apache Spark
> version 3.4.0.
>
> The vote is open until 11:59pm Pacific time *April 4th* and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.4.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is *v3.4.0-rc5* (commit
> f39ad617d32a671e120464e4a75986241d72c487):
> https://github.com/apache/spark/tree/v3.4.0-rc5
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc5-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1439
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc5-docs/
>
> The list of bug fixes going into 3.4.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12351465
>
> This release is using the release script of the tag v3.4.0-rc5.
>
> FAQ
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running it on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks; in Java/Scala,
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.4.0?
> ===
> The current list of open tickets targeted at 3.4.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.4.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
> Thanks,
> Xinrong Meng
Re: starter tasks for new contributors
Hey Maxim,

Great idea, kudos!

Pozdrawiam,
Jacek Laskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

On Fri, Mar 17, 2023 at 2:18 PM Maxim Gekk wrote:

> Hello Devs,
>
> If you want to contribute to Apache Spark but don't know where to start,
> I would like to propose new starter tasks for you. All the tasks below,
> under the umbrella JIRA
> https://issues.apache.org/jira/browse/SPARK-37935, aim to improve Spark
> error messages:
> - SPARK-42838: Assign a name to the error class _LEGACY_ERROR_TEMP_2000
> - SPARK-42839: Assign a name to the error class _LEGACY_ERROR_TEMP_2003
> - SPARK-42840: Assign a name to the error class _LEGACY_ERROR_TEMP_2004
> - SPARK-42841: Assign a name to the error class _LEGACY_ERROR_TEMP_2005
> - SPARK-42842: Assign a name to the error class _LEGACY_ERROR_TEMP_2006
> - SPARK-42843: Assign a name to the error class _LEGACY_ERROR_TEMP_2007
> - SPARK-42844: Assign a name to the error class _LEGACY_ERROR_TEMP_2008
> - SPARK-42845: Assign a name to the error class _LEGACY_ERROR_TEMP_2010
> - SPARK-42846: Assign a name to the error class _LEGACY_ERROR_TEMP_2011
> - SPARK-42847: Assign a name to the error class _LEGACY_ERROR_TEMP_2013
>
> Improving Spark errors makes you more familiar with the Spark code base,
> and you will directly impact the user experience with Spark SQL.
>
> If you have any questions, please leave comments in the JIRAs, and feel
> free to ping me (and other contributors) in your PRs.
>
> Maxim Gekk
>
> Software Engineer
>
> Databricks, Inc.
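For readers wondering what "assigning a name to an error class" amounts to, here is a toy model in Python (the registry, class name, and template below are illustrative assumptions, not Spark's actual files; in Spark the work happens in the JSON registry of error classes and its call sites): the message template stays the same, and only the placeholder key like `_LEGACY_ERROR_TEMP_2000` becomes a descriptive name.

```python
# Toy model of an error-class registry: a name maps to a message template with
# <placeholder> tokens. Renaming a legacy class means replacing a key such as
# "_LEGACY_ERROR_TEMP_2000" with a descriptive one (shown here), while keeping
# the template and updating the call sites that raise it.
error_classes = {
    # before (hypothetical): "_LEGACY_ERROR_TEMP_2000": "The value <value> is ..."
    "INVALID_PARAMETER_VALUE": "The value <value> is invalid for parameter <param>.",
}


def error_message(error_class: str, **params: str) -> str:
    """Render an error-class template by substituting <placeholder> tokens."""
    message = error_classes[error_class]
    for name, value in params.items():
        message = message.replace(f"<{name}>", value)
    return message
```

A descriptive name lets users search the documentation for the error class, which is exactly why the starter tasks matter for user experience.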
Re: [ANNOUNCE] Apache Spark 3.3.1 released
Yoohoo! Thanks Yuming for driving this release. A tiny step for Spark, a
huge one for my clients (who are still on 3.2.1 or even older :))

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

On Wed, Oct 26, 2022 at 8:22 AM Yuming Wang wrote:

> We are happy to announce the availability of Apache Spark 3.3.1!
>
> Spark 3.3.1 is a maintenance release containing stability fixes. This
> release is based on the branch-3.3 maintenance branch of Spark. We strongly
> recommend all 3.3 users upgrade to this stable release.
>
> To download Spark 3.3.1, head over to the download page:
> https://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-3-1.html
>
> We would like to acknowledge all community members for contributing to
> this release. This release would not have been possible without you.
Re: [VOTE] Release Spark 3.2.0 (RC6)
Hi,

I don't want to hijack the voting thread, but given I faced
https://issues.apache.org/jira/browse/SPARK-36904 in RC6, I wonder if it's
a -1.

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

On Wed, Sep 29, 2021 at 10:28 PM Mridul Muralidharan wrote:

>
> Yi Wu helped identify an issue
> <https://issues.apache.org/jira/browse/SPARK-36892> which causes
> correctness issues (duplication) and hangs - waiting for validation to
> complete before submitting a patch.
>
> Regards,
> Mridul
>
> On Wed, Sep 29, 2021 at 11:34 AM Holden Karau wrote:
>
>> PySpark smoke tests pass; I'm going to do a last pass through the JIRAs
>> before my vote though.
>>
>> On Wed, Sep 29, 2021 at 8:54 AM Sean Owen wrote:
>>
>>> +1 looks good to me as before, now that a few recent issues are
>>> resolved.
>>>
>>> On Tue, Sep 28, 2021 at 10:45 AM Gengliang Wang wrote:
>>>
>>>> Please vote on releasing the following candidate as Apache Spark
>>>> version 3.2.0.
>>>>
>>>> The vote is open until 11:59pm Pacific time September 30 and passes if
>>>> a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>>
>>>> [ ] +1 Release this package as Apache Spark 3.2.0
>>>> [ ] -1 Do not release this package because ...
>>>>
>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>
>>>> The tag to be voted on is v3.2.0-rc6 (commit
>>>> dde73e2e1c7e55c8e740cb159872e081ddfa7ed6):
>>>> https://github.com/apache/spark/tree/v3.2.0-rc6
>>>>
>>>> The release files, including signatures, digests, etc.
>>>> can be found at:
>>>> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc6-bin/
>>>>
>>>> Signatures used for Spark RCs can be found in this file:
>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>
>>>> The staging repository for this release can be found at:
>>>> https://repository.apache.org/content/repositories/orgapachespark-1393
>>>>
>>>> The documentation corresponding to this release can be found at:
>>>> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc6-docs/
>>>>
>>>> The list of bug fixes going into 3.2.0 can be found at the following
>>>> URL:
>>>> https://issues.apache.org/jira/projects/SPARK/versions/12349407
>>>>
>>>> This release is using the release script of the tag v3.2.0-rc6.
>>>>
>>>> FAQ
>>>>
>>>> =
>>>> How can I help test this release?
>>>> =
>>>> If you are a Spark user, you can help us test this release by taking
>>>> an existing Spark workload and running it on this release candidate,
>>>> then reporting any regressions.
>>>>
>>>> If you're working in PySpark you can set up a virtual env and install
>>>> the current RC and see if anything important breaks; in Java/Scala,
>>>> you can add the staging repository to your project's resolvers and test
>>>> with the RC (make sure to clean up the artifact cache before/after so
>>>> you don't end up building with an out-of-date RC going forward).
>>>>
>>>> ===
>>>> What should happen to JIRA tickets still targeting 3.2.0?
>>>> ===
>>>> The current list of open tickets targeted at 3.2.0 can be found at:
>>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>>> Version/s" = 3.2.0
>>>>
>>>> Committers should look at those and triage. Extremely important bug
>>>> fixes, documentation, and API tweaks that impact compatibility should
>>>> be worked on immediately. Everything else please retarget to an
>>>> appropriate release.
>>>>
>>>> ==
>>>> But my bug isn't fixed?
>>>> ==
>>>> In order to make timely releases, we will typically not hold the
>>>> release unless the bug in question is a regression from the previous
>>>> release. That being said, if there is something which is a regression
>>>> that has not been correctly targeted please ping me or a committer to
>>>> help target the issue.
>>>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9>
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
v3.2.0-rc6 and org.postgresql.Driver was not found in the CLASSPATH
Hi,

Just ran a freshly-built 3.2.0 RC6 and faced an issue (that seems to have
been reported earlier on SO):

> The specified datastore driver ("org.postgresql.Driver") was not found in
> the CLASSPATH

More details in https://issues.apache.org/jira/browse/SPARK-36904

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski
Re: [SQL][AQE] Advice needed: a trivial code change with a huge reading impact?
Salut Sean !

Merci beaucoup mon ami Sean ! That's exactly the answer I hoped for. Thank
you!

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

On Wed, Sep 8, 2021 at 1:53 PM Sean Owen wrote:

> That does seem pointless. The body could just be .flatten()-ed to achieve
> the same result. Maybe it was just written that way for symmetry with the
> block above. You could open a PR to change it.
>
> On Wed, Sep 8, 2021 at 4:31 AM Jacek Laskowski wrote:
>
>> Hi Spark Devs,
>>
>> I'm curious what your take on this code [1] would be if you were me
>> trying to understand it:
>>
>> (0 until 1).flatMap { _ =>
>>   (splitPoints :+ numMappers).sliding(2).map {
>>     case Seq(start, end) => CoalescedMapperPartitionSpec(start, end, numReducers)
>>   }
>> }
>>
>> There's something important going on here but it's so convoluted that my
>> Scala coding skills seem not enough (not to mention AQE skills
>> themselves).
>>
>> I'm tempted to change (0 until 1) to Seq(0), but Seq(0).flatMap feels
>> awkward too. Is this Seq(0).flatMap even needed?! Even with no
>> splitPoints we've got numReducers > 0.
>>
>> Looks like the above is as simple as:
>>
>> (splitPoints :+ numMappers).sliding(2).map {
>>   case Seq(start, end) => CoalescedMapperPartitionSpec(start, end, numReducers)
>> }
>>
>> Correct?
>>
>> I'm mentally used up and can't seem to think straight. Would a PR with
>> such a change be acceptable? (Sean, I'm looking at you :D)
>>
>> [1]
>> https://github.com/apache/spark/blob/8d817dcf3084d56da22b909d578a644143f775d5/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeShuffleWithLocalRead.scala#L89-L93
>>
>> Pozdrawiam,
>> Jacek Laskowski
>>
>> https://about.me/JacekLaskowski
>> "The Internals Of" Online Books <https://books.japila.pl/>
>> Follow me on https://twitter.com/jaceklaskowski
[SQL][AQE] Advice needed: a trivial code change with a huge reading impact?
Hi Spark Devs,

I'm curious what your take on this code [1] would be if you were me trying
to understand it:

(0 until 1).flatMap { _ =>
  (splitPoints :+ numMappers).sliding(2).map {
    case Seq(start, end) => CoalescedMapperPartitionSpec(start, end, numReducers)
  }
}

There's something important going on here but it's so convoluted that my
Scala coding skills seem not enough (not to mention AQE skills themselves).

I'm tempted to change (0 until 1) to Seq(0), but Seq(0).flatMap feels
awkward too. Is this Seq(0).flatMap even needed?! Even with no splitPoints
we've got numReducers > 0.

Looks like the above is as simple as:

(splitPoints :+ numMappers).sliding(2).map {
  case Seq(start, end) => CoalescedMapperPartitionSpec(start, end, numReducers)
}

Correct?

I'm mentally used up and can't seem to think straight. Would a PR with such
a change be acceptable? (Sean, I'm looking at you :D)

[1]
https://github.com/apache/spark/blob/8d817dcf3084d56da22b909d578a644143f775d5/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeShuffleWithLocalRead.scala#L89-L93

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski
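The equivalence in question can be checked mechanically. Here is a small Python model of the same collection shapes (the names mirror the Scala snippet; this is an illustration of the refactoring, not Spark code): a flatMap over a one-element range contributes nothing, so dropping it leaves the result unchanged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoalescedMapperPartitionSpec:
    start: int
    end: int
    num_reducers: int


def sliding_pairs(xs):
    """Python analogue of Scala's .sliding(2): consecutive overlapping pairs."""
    return list(zip(xs, xs[1:]))


split_points, num_mappers, num_reducers = [0, 3, 6], 10, 4
bounds = split_points + [num_mappers]

# Original shape: the real computation wrapped in a flatMap over a
# single-element range, mirroring Scala's (0 until 1).flatMap { _ => ... }.
original = [
    spec
    for _ in range(1)
    for spec in (CoalescedMapperPartitionSpec(s, e, num_reducers)
                 for s, e in sliding_pairs(bounds))
]

# Simplified shape: the degenerate outer loop removed entirely.
simplified = [CoalescedMapperPartitionSpec(s, e, num_reducers)
              for s, e in sliding_pairs(bounds)]

assert original == simplified
```

With three split points and ten mappers, both shapes produce the same three partition specs covering [0, 3), [3, 6), and [6, 10), which supports Sean's point that the wrapper could simply be flattened away.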
Re: [SQL] s.s.a.coalescePartitions.parallelismFirst true but recommends false
Thanks Wenchen. If it's ever asked on SO, I'm simply gonna quote you :)

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

On Tue, Sep 7, 2021 at 6:58 AM Wenchen Fan wrote:

> This is correct. It's true by default so that AQE doesn't have a
> performance regression. If you run a benchmark, larger parallelism usually
> means better performance. However, it's recommended to set it to false, so
> that AQE can give better resource utilization, which is good for a busy
> Spark cluster.
>
> On Fri, Sep 3, 2021 at 7:33 PM Jacek Laskowski wrote:
>
>> Hi,
>>
>> Found this new spark.sql.adaptive.coalescePartitions.parallelismFirst
>> config property [1] with the default value `true`, but the description
>> says the opposite:
>>
>> > It's recommended to set this config to false
>>
>> Is this OK and am I misreading it?
>>
>> [1]
>> https://github.com/apache/spark/blob/54cca7f82ecf23e062bb4f6d68697abec2dbcc5b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L519-L530
>>
>> Pozdrawiam,
>> Jacek Laskowski
>>
>> https://about.me/JacekLaskowski
>> "The Internals Of" Online Books <https://books.japila.pl/>
>> Follow me on https://twitter.com/jaceklaskowski
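The trade-off Wenchen describes can be sketched as a toy cost model (a hypothetical simplification, not Spark's actual coalescing algorithm or parameter names): with parallelism-first behavior, coalescing keeps at least the cluster's default parallelism; without it, it targets the advisory partition size, yielding fewer but fuller partitions.

```python
import math


def target_partition_count(total_bytes: int, advisory_bytes: int,
                           default_parallelism: int,
                           parallelism_first: bool) -> int:
    """Toy model of AQE partition coalescing (illustrative, not Spark's code).

    parallelism_first=True keeps at least `default_parallelism` partitions,
    favoring parallelism (and benchmark numbers); False targets the advisory
    partition size, favoring resource utilization on a busy cluster.
    """
    by_size = max(1, math.ceil(total_bytes / advisory_bytes))
    return max(by_size, default_parallelism) if parallelism_first else by_size
```

For example, 64 MB of shuffle data with a 64 MB advisory size coalesces to a single partition when utilization wins, but stays at the default parallelism (say, 8) when parallelism wins, which is why the default avoids benchmark regressions while `false` is recommended for shared clusters.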
[SQL] s.s.a.coalescePartitions.parallelismFirst true but recommends false
Hi,

Found this new spark.sql.adaptive.coalescePartitions.parallelismFirst
config property [1] with the default value `true`, but the description says
the opposite:

> It's recommended to set this config to false

Is this OK and am I misreading it?

[1]
https://github.com/apache/spark/blob/54cca7f82ecf23e062bb4f6d68697abec2dbcc5b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L519-L530

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski
[SQL] When do SQLConf vals get their own accessor defs?
Hi,

Just found something I'd consider an inconsistency in how SQLConf constants
(vals) get their own accessor methods for the current value (defs).

I thought that internal config prop vals might not have defs (and that
would make sense), but
DYNAMIC_PARTITION_PRUNING_PRUNING_SIDE_EXTRA_FILTER_RATIO [1] (with
SQLConf.dynamicPartitionPruningPruningSideExtraFilterRatio [2]) seems to
contradict it.

On the other hand, ADAPTIVE_AUTO_BROADCASTJOIN_THRESHOLD [3], which is new
in 3.2.0 and seems important, does not have a corresponding def to access
the current value. Why?

[1]
https://github.com/apache/spark/blob/54cca7f82ecf23e062bb4f6d68697abec2dbcc5b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L344
[2]
https://github.com/apache/spark/blob/54cca7f82ecf23e062bb4f6d68697abec2dbcc5b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L3532
[3]
https://github.com/apache/spark/blob/54cca7f82ecf23e062bb4f6d68697abec2dbcc5b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L638

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski
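For readers unfamiliar with the pattern the email asks about, here is a minimal Python model of it (the class, key, and default are illustrative assumptions, not Spark's SQLConf code): each configuration is declared once as an entry constant (the Scala `val`), and some entries additionally get a convenience accessor (the Scala `def`) that returns the current value.

```python
# Toy model of the SQLConf declaration/accessor split, not Spark's actual code.

class ConfigEntry:
    """A config declaration: a key plus its default value (the Scala `val`)."""
    def __init__(self, key: str, default):
        self.key = key
        self.default = default


class MiniSQLConf:
    EXTRA_FILTER_RATIO = ConfigEntry(
        "demo.dynamicPartitionPruning.pruningSideExtraFilterRatio", 0.04)

    def __init__(self):
        self._settings = {}

    def set_conf(self, entry: ConfigEntry, value) -> None:
        self._settings[entry.key] = value

    def get_conf(self, entry: ConfigEntry):
        return self._settings.get(entry.key, entry.default)

    # The accessor (the Scala `def`): callers read the current value without
    # naming the ConfigEntry constant themselves. Some entries have one, some
    # (like the ADAPTIVE_AUTO_BROADCASTJOIN_THRESHOLD case in the email) don't.
    @property
    def dynamic_partition_pruning_extra_filter_ratio(self) -> float:
        return self.get_conf(self.EXTRA_FILTER_RATIO)
```

The accessor is only a convenience over `get_conf`, which is consistent with the observed inconsistency: an entry works fine without one, so whether a `def` exists likely just reflects whether some call site wanted the shorthand.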
Re: [VOTE] Release Spark 3.2.0 (RC1)
Hi Yi Wu,

Looks like the issue got resolved as Won't Fix. How about your -1?

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

On Mon, Aug 23, 2021 at 4:58 AM Yi Wu wrote:

> -1. I found a bug (https://issues.apache.org/jira/browse/SPARK-36558) in
> push-based shuffle, which could lead to a job hang.
>
> Bests,
> Yi
>
> On Sat, Aug 21, 2021 at 1:05 AM Gengliang Wang wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 3.2.0.
>>
>> The vote is open until 11:59pm Pacific time Aug 25 and passes if a
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.2.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v3.2.0-rc1 (commit
>> 6bb3523d8e838bd2082fb90d7f3741339245c044):
>> https://github.com/apache/spark/tree/v3.2.0-rc1
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1388
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-docs/
>>
>> The list of bug fixes going into 3.2.0 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12349407
>>
>> This release is using the release script of the tag v3.2.0-rc1.
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running it on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks; in Java/Scala,
>> you can add the staging repository to your project's resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out-of-date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 3.2.0?
>> ===
>> The current list of open tickets targeted at 3.2.0 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 3.2.0
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
Re: [VOTE] Release Spark 3.2.0 (RC1)
Hi Gengliang,

Yay! Thank you! Java 11 with the following MAVEN_OPTS worked fine:

$ echo $MAVEN_OPTS
-Xss64m -Xmx4g -XX:ReservedCodeCacheSize=1g

$ ./build/mvn \
    -Pyarn,kubernetes,hadoop-cloud,hive,hive-thriftserver \
    -DskipTests \
    clean install
...
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 22:02 min
[INFO] Finished at: 2021-08-22T13:09:25+02:00
[INFO] ------------------------------------------------------------------------

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

On Sun, Aug 22, 2021 at 12:45 PM Jacek Laskowski wrote:

> Hi Gengliang,
>
> With Java 8 the build worked fine. No other changes. I'm going to give
> Java 11 a try with the options you mentioned.
>
> $ java -version
> openjdk version "1.8.0_292"
> OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_292-b10)
> OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.292-b10, mixed mode)
>
> BTW, shouldn't the page [1] be updated to reflect this? This is what I
> followed.
>
> [1]
> https://spark.apache.org/docs/latest/building-spark.html#setting-up-mavens-memory-usage
>
> Thanks
> Pozdrawiam,
> Jacek Laskowski
>
> https://about.me/JacekLaskowski
> "The Internals Of" Online Books <https://books.japila.pl/>
> Follow me on https://twitter.com/jaceklaskowski
>
> On Sun, Aug 22, 2021 at 8:29 AM Gengliang Wang wrote:
>
>> Hi Jacek,
>>
>> The current GitHub Actions CI for Spark contains a Java 11 build. The
>> build is successful with the options "-Xss64m -Xmx2g
>> -XX:ReservedCodeCacheSize=1g":
>>
>> https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L506
>> The default Java stack size is small and we have to raise it for the
>> Spark build with the option "-Xss64m".
>> >> On Sat, Aug 21, 2021 at 9:33 PM Jacek Laskowski wrote: >> >>> Hi, >>> >>> I've been building the tag and I'm facing the >>> following StackOverflowError: >>> >>> Exception in thread "main" java.lang.StackOverflowError >>> at >>> scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:275) >>> at >>> scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:133) >>> at >>> scala.reflect.api.Trees$Transformer.$anonfun$transformStats$1(Trees.scala:2597) >>> at scala.reflect.api.Trees$Transformer.transformStats(Trees.scala:2595) >>> at >>> scala.tools.nsc.transform.ExtensionMethods$Extender.transformStats(ExtensionMethods.scala:280) >>> at >>> scala.tools.nsc.transform.ExtensionMethods$Extender.transformStats(ExtensionMethods.scala:133) >>> at scala.reflect.internal.Trees.itransform(Trees.scala:1430) >>> at scala.reflect.internal.Trees.itransform$(Trees.scala:1400) >>> at scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28) >>> at scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28) >>> at scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563) >>> at >>> scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:57) >>> at >>> scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:275) >>> at >>> scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:133) >>> at scala.reflect.internal.Trees.itransform(Trees.scala:1409) >>> at scala.reflect.internal.Trees.itransform$(Trees.scala:1400) >>> at scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28) >>> at scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28) >>> at scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563) >>> at >>> scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:57) >>> at >>> 
scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:275) >>> at >>> scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:133) >>> ... >>> >>> The command I use: >>> >>> ./build/mvn \ >>> -Pyarn,kubernetes,hadoop-cloud,hive,hive-thriftserver \ >>>
Re: [VOTE] Release Spark 3.2.0 (RC1)
Hi Gengliang,

With Java 8 the build worked fine. No other changes. I'm going to give Java 11 a try with the options you mentioned.

$ java -version
openjdk version "1.8.0_292"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_292-b10)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.292-b10, mixed mode)

BTW, shouldn't the page [1] be updated to reflect this? This is what I followed.

[1] https://spark.apache.org/docs/latest/building-spark.html#setting-up-mavens-memory-usage

Thanks
Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

<https://twitter.com/jaceklaskowski>

On Sun, Aug 22, 2021 at 8:29 AM Gengliang Wang wrote:

> Hi Jacek,
>
> The current GitHub Actions CI for Spark contains a Java 11 build. The build
> is successful with the options "-Xss64m -Xmx2g -XX:ReservedCodeCacheSize=1g":
>
> https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L506
> The default Java stack size is small and we have to raise it for the Spark
> build with the option "-Xss64m".
> > On Sat, Aug 21, 2021 at 9:33 PM Jacek Laskowski wrote: > >> Hi, >> >> I've been building the tag and I'm facing the >> following StackOverflowError: >> >> Exception in thread "main" java.lang.StackOverflowError >> at >> scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:275) >> at >> scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:133) >> at >> scala.reflect.api.Trees$Transformer.$anonfun$transformStats$1(Trees.scala:2597) >> at scala.reflect.api.Trees$Transformer.transformStats(Trees.scala:2595) >> at >> scala.tools.nsc.transform.ExtensionMethods$Extender.transformStats(ExtensionMethods.scala:280) >> at >> scala.tools.nsc.transform.ExtensionMethods$Extender.transformStats(ExtensionMethods.scala:133) >> at scala.reflect.internal.Trees.itransform(Trees.scala:1430) >> at scala.reflect.internal.Trees.itransform$(Trees.scala:1400) >> at scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28) >> at scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28) >> at scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563) >> at >> scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:57) >> at >> scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:275) >> at >> scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:133) >> at scala.reflect.internal.Trees.itransform(Trees.scala:1409) >> at scala.reflect.internal.Trees.itransform$(Trees.scala:1400) >> at scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28) >> at scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28) >> at scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563) >> at >> scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:57) >> at >> scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:275) 
>> at >> scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:133) >> ... >> >> The command I use: >> >> ./build/mvn \ >> -Pyarn,kubernetes,hadoop-cloud,hive,hive-thriftserver \ >> -DskipTests \ >> clean install >> >> $ java --version >> openjdk 11.0.11 2021-04-20 >> OpenJDK Runtime Environment AdoptOpenJDK-11.0.11+9 (build 11.0.11+9) >> OpenJDK 64-Bit Server VM AdoptOpenJDK-11.0.11+9 (build 11.0.11+9, mixed >> mode) >> >> $ ./build/mvn -v >> Using `mvn` from path: /usr/local/bin/mvn >> Apache Maven 3.8.1 (05c21c65bdfed0f71a2f2ada8b84da59348c4c5d) >> Maven home: /usr/local/Cellar/maven/3.8.1/libexec >> Java version: 11.0.11, vendor: AdoptOpenJDK, runtime: >> /Users/jacek/.sdkman/candidates/java/11.0.11.hs-adpt >> Default locale: en_PL, platform encoding: UTF-8 >> OS name: "mac os x", version: "11.5", arch: "x86_64", family: "mac" >> >> $ echo $MAVEN_OPTS >> -Xmx8g -XX:ReservedCodeCacheSize=1g >> >> Pozdrawiam, >> Jacek Laskowski >> >> https://about.me/JacekLaskowski >> "The Internals Of" Online Books <https://books.japila.pl/> >> Follow me on https://twitter.com/jaceklaskowski >> >> <https://twitter.com/jaceklaskowski> >> >> >> On Fri, Aug 20, 2021 at 7:05 PM Gengliang
Re: [VOTE] Release Spark 3.2.0 (RC1)
Hi,

I've been building the tag and I'm facing the following StackOverflowError:

Exception in thread "main" java.lang.StackOverflowError
at scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:275)
at scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:133)
at scala.reflect.api.Trees$Transformer.$anonfun$transformStats$1(Trees.scala:2597)
at scala.reflect.api.Trees$Transformer.transformStats(Trees.scala:2595)
at scala.tools.nsc.transform.ExtensionMethods$Extender.transformStats(ExtensionMethods.scala:280)
at scala.tools.nsc.transform.ExtensionMethods$Extender.transformStats(ExtensionMethods.scala:133)
at scala.reflect.internal.Trees.itransform(Trees.scala:1430)
at scala.reflect.internal.Trees.itransform$(Trees.scala:1400)
at scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
at scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
at scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563)
at scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:57)
at scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:275)
at scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:133)
at scala.reflect.internal.Trees.itransform(Trees.scala:1409)
at scala.reflect.internal.Trees.itransform$(Trees.scala:1400)
at scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
at scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
at scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563)
at scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:57)
at scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:275)
at scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:133)
...
The command I use:

./build/mvn \
-Pyarn,kubernetes,hadoop-cloud,hive,hive-thriftserver \
-DskipTests \
clean install

$ java --version
openjdk 11.0.11 2021-04-20
OpenJDK Runtime Environment AdoptOpenJDK-11.0.11+9 (build 11.0.11+9)
OpenJDK 64-Bit Server VM AdoptOpenJDK-11.0.11+9 (build 11.0.11+9, mixed mode)

$ ./build/mvn -v
Using `mvn` from path: /usr/local/bin/mvn
Apache Maven 3.8.1 (05c21c65bdfed0f71a2f2ada8b84da59348c4c5d)
Maven home: /usr/local/Cellar/maven/3.8.1/libexec
Java version: 11.0.11, vendor: AdoptOpenJDK, runtime: /Users/jacek/.sdkman/candidates/java/11.0.11.hs-adpt
Default locale: en_PL, platform encoding: UTF-8
OS name: "mac os x", version: "11.5", arch: "x86_64", family: "mac"

$ echo $MAVEN_OPTS
-Xmx8g -XX:ReservedCodeCacheSize=1g

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

<https://twitter.com/jaceklaskowski>

On Fri, Aug 20, 2021 at 7:05 PM Gengliang Wang wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 3.2.0.
>
> The vote is open until 11:59pm Pacific time Aug 25 and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.2.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.2.0-rc1 (commit
> 6bb3523d8e838bd2082fb90d7f3741339245c044):
> https://github.com/apache/spark/tree/v3.2.0-rc1
>
> The release files, including signatures, digests, etc.
can be found at: > https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-bin/ > > Signatures used for Spark RCs can be found in this file: > https://dist.apache.org/repos/dist/dev/spark/KEYS > > The staging repository for this release can be found at: > https://repository.apache.org/content/repositories/orgapachespark-1388 > > The documentation corresponding to this release can be found at: > https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-docs/ > > The list of bug fixes going into 3.2.0 can be found at the following URL: > https://issues.apache.org/jira/projects/SPARK/versions/12349407 > > This release is using the release script of the tag v3.2.0-rc1. > > > FAQ > > = > How can I help test this release? > = > If you are a Spark user, you can help us test this release by taking > an existing Spark workload and running on this release candidate, then > reporting any regressions. > > If you're working in PySpark you can set up a virtual env and install > the current RC and see if anything important breaks, in the Java/Scala > you can add the staging repository to your projects resolvers and test > with the RC (make
TreeNode.exists?
Hi,

I've run into code like the following a couple of times already ([1]):

val isCommand = plan.find {
  case _: Command | _: ParsedStatement | _: InsertIntoDir => true
  case _ => false
}.isDefined

I think this and the other such places beg (scream?) for a TreeNode.exists that could do the simplest thing possible: find(f).isDefined, or even collectFirst(...).isDefined. It'd surely help a lot for people like myself reading the code. WDYT?

[1] https://github.com/apache/spark/pull/33671/files#diff-4d16a733f8741de9a4b839ee7c356c3e9b439b4facc70018f5741da1e930c6a8R51-R54

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

<https://twitter.com/jaceklaskowski>
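For illustration, here is a minimal, self-contained sketch of the idea. TreeNodeLike and Node are my own toy stand-ins for Catalyst's TreeNode (not the actual Spark classes); the point is only that an exists helper can delegate directly to find:

```scala
// Toy stand-in for Catalyst's TreeNode, just enough to demonstrate
// the proposed helper: exists(f) == find(f).isDefined.
trait TreeNodeLike[T <: TreeNodeLike[T]] { self: T =>
  def children: Seq[T]

  // Pre-order search for the first node satisfying the predicate.
  def find(f: T => Boolean): Option[T] =
    if (f(this)) Some(this)
    else children.foldLeft(Option.empty[T])((found, c) => found.orElse(c.find(f)))

  // The proposed intent-revealing wrapper.
  def exists(f: T => Boolean): Boolean = find(f).isDefined
}

final case class Node(label: String, children: Seq[Node] = Nil)
  extends TreeNodeLike[Node]

val plan = Node("Project", Seq(Node("Filter", Seq(Node("Relation")))))
val hasFilter  = plan.exists(_.label == "Filter")
val hasCommand = plan.exists(_.label == "Command")
```

With such a helper, the snippet above would collapse to plan.exists { case _: Command | _: ParsedStatement | _: InsertIntoDir => true; case _ => false }, with no .isDefined noise at the call site.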
Re: [ANNOUNCE] Apache Spark 3.1.2 released
Big shout-out to you, Dongjoon! Thank you. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Wed, Jun 2, 2021 at 2:59 AM Dongjoon Hyun wrote: > We are happy to announce the availability of Spark 3.1.2! > > Spark 3.1.2 is a maintenance release containing stability fixes. This > release is based on the branch-3.1 maintenance branch of Spark. We strongly > recommend all 3.1 users to upgrade to this stable release. > > To download Spark 3.1.2, head over to the download page: > https://spark.apache.org/downloads.html > > To view the release notes: > https://spark.apache.org/releases/spark-release-3-1-2.html > > We would like to acknowledge all community members for contributing to this > release. This release would not have been possible without you. > > Dongjoon Hyun >
Should AggregationIterator.initializeBuffer be moved down to SortBasedAggregationIterator?
Hi,

Just found out that the only purpose of AggregationIterator.initializeBuffer is to keep SortBasedAggregationIterator happy [1]. Shouldn't it be moved down to SortBasedAggregationIterator to make things clear(er)?

[1] https://github.com/apache/spark/search?q=initializeBuffer

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

<https://twitter.com/jaceklaskowski>
Purpose of OffsetHolder as a LeafNode?
Hi,

Just stumbled upon OffsetHolder [1] and I'm curious why it's a LeafNode. What logical plan could it be part of?

[1] https://github.com/apache/spark/blob/1a6708918b32e821bff26a00d2d8b7236b29515f/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L633

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

<https://twitter.com/jaceklaskowski>
Re: Welcoming six new Apache Spark committers
Hi, Congrats to all of you committers! Wishing you all the best (commits)! Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Fri, Mar 26, 2021 at 9:21 PM Matei Zaharia wrote: > Hi all, > > The Spark PMC recently voted to add several new committers. Please join me > in welcoming them to their new role! Our new committers are: > > - Maciej Szymkiewicz (contributor to PySpark) > - Max Gekk (contributor to Spark SQL) > - Kent Yao (contributor to Spark SQL) > - Attila Zsolt Piros (contributor to decommissioning and Spark on > Kubernetes) > - Yi Wu (contributor to Spark Core and SQL) > - Gabor Somogyi (contributor to Streaming and security) > > All six of them contributed to Spark 3.1 and we’re very excited to have > them join as committers. > > Matei and the Spark PMC > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > >
Re: [VOTE] Release Spark 3.1.1 (RC2)
Hi Hyukjin, FYI: cloud-fan commented 3 hours ago: thanks, merging to master/3.1! https://github.com/apache/spark/pull/31550#issuecomment-781977920 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Sat, Feb 13, 2021 at 3:46 AM Hyukjin Kwon wrote: > I have received a ping about a new blocker, a regression on a temporary > function in CTE - worked before but now it's broken ( > https://github.com/apache/spark/pull/31550). Thank you @Peter Toth > > I tend to treat this as a legitimate blocker. I will cut another RC right > after this fix if we're all good with it. > > 2021년 2월 11일 (목) 오전 9:20, Takeshi Yamamuro 님이 작성: > >> +1 >> >> I looked around the jira tickets and I think there is no explicit blocker >> issue on the Spark SQL component. >> Also, I ran the tests on AWS envs and I couldn't find any issue there, >> too. >> >> Bests, >> Takeshi >> >> On Thu, Feb 11, 2021 at 7:37 AM Mridul Muralidharan >> wrote: >> >>> >>> Signatures, digests, etc check out fine. >>> Checked out tag and build/tested with -Pyarn -Phadoop-2.7 -Phive >>> -Phive-thriftserver -Pmesos -Pkubernetes >>> >>> I keep getting test failures >>> with org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite: removing this >>> suite gets the build through though - does anyone have suggestions on how >>> to fix it ? >>> Perhaps a local problem at my end ? >>> >>> >>> Regards, >>> Mridul >>> >>> >>> >>> On Mon, Feb 8, 2021 at 6:24 PM Hyukjin Kwon wrote: >>> >>>> Please vote on releasing the following candidate as Apache Spark >>>> version 3.1.1. >>>> >>>> The vote is open until February 15th 5PM PST and passes if a majority >>>> +1 PMC votes are cast, with a minimum of 3 +1 votes. 
>>>> >>>> Note that it is 7 days this time because it is a holiday season in >>>> several countries including South Korea (where I live), China etc., and I >>>> would like to make sure people do not miss it because it is a holiday >>>> season. >>>> >>>> [ ] +1 Release this package as Apache Spark 3.1.1 >>>> [ ] -1 Do not release this package because ... >>>> >>>> To learn more about Apache Spark, please see http://spark.apache.org/ >>>> >>>> The tag to be voted on is v3.1.1-rc2 (commit >>>> cf0115ac2d60070399af481b14566f33d22ec45e): >>>> https://github.com/apache/spark/tree/v3.1.1-rc2 >>>> >>>> The release files, including signatures, digests, etc. can be found at: >>>> <https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc1-bin/> >>>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc2-bin/ >>>> >>>> Signatures used for Spark RCs can be found in this file: >>>> https://dist.apache.org/repos/dist/dev/spark/KEYS >>>> >>>> The staging repository for this release can be found at: >>>> https://repository.apache.org/content/repositories/orgapachespark-1365 >>>> >>>> The documentation corresponding to this release can be found at: >>>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc2-docs/ >>>> >>>> The list of bug fixes going into 3.1.1 can be found at the following >>>> URL: >>>> https://s.apache.org/41kf2 >>>> >>>> This release is using the release script of the tag v3.1.1-rc2. >>>> >>>> FAQ >>>> >>>> === >>>> What happened to 3.1.0? >>>> === >>>> >>>> There was a technical issue during Apache Spark 3.1.0 preparation, and >>>> it was discussed and decided to skip 3.1.0. >>>> Please see >>>> https://spark.apache.org/news/next-official-release-spark-3.1.1.html for >>>> more details. >>>> >>>> = >>>> How can I help test this release? >>>> = >>>> >>>> If you are a Spark user, you can help us test this release by taking >>>> an existing Spark workload and running on this release candidate, then >>>> reporting any regressions. 
>>>> >>>> If you're working in PySpark you
Re: [DISCUSS] Add RocksDB StateStore
Hi,

I'm "okay to add RocksDB StateStore as external module". See no reason not to.

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

<https://twitter.com/jaceklaskowski>

On Tue, Feb 2, 2021 at 9:32 AM Liang-Chi Hsieh wrote:

> Hi devs,
>
> In Spark structured streaming, we need a state store for state management
> in stateful operators such as streaming aggregates, joins, etc. We have
> one and only one state store implementation now. It is an in-memory
> hashmap backed up in an HDFS-compliant file system at the end of every
> micro-batch.
>
> As it basically uses an in-memory map to store states, memory consumption
> is a serious issue and state store size is limited by the size of the
> executor memory. Moreover, a state store using more memory means it may
> impact the performance of task execution that requires memory too.
>
> Internally we see more streaming applications that require large state in
> stateful operations. For such requirements, we need a StateStore that
> does not rely on memory to store states.
>
> This seems to be also true externally, as several other major streaming
> frameworks already use RocksDB for state management. RocksDB is an
> embedded DB and streaming engines can use it to store state instead of
> memory storage.
>
> So it seems to me it is proven to be a good choice for large state usage.
> But Spark SS still lacks a built-in state store for the requirement.
>
> Previously there was one attempt, SPARK-28120, to add a RocksDB
> StateStore into Spark SS. IIUC, it was pushed back due to two concerns:
> extra code maintenance cost and the RocksDB dependency it introduces.
>
> For the first concern, as more users require the feature, it should be
> highly used code in SS and more developers will look at it. For the
> second one, we propose (SPARK-34198) to add it as an external module to
> relieve the dependency concern.
>
> Because it was pushed back previously, I'm going to raise this discussion
> to know what people think about it now, in advance of submitting any
> code.
>
> I think there might be some possible opinions:
>
> 1. okay to add RocksDB StateStore into sql core module
> 2. not okay for 1, but okay to add RocksDB StateStore as external module
> 3. either 1 or 2 is okay
> 4. not okay to add RocksDB StateStore, no matter into sql core or as
> external module
>
> Please let us know if you have some thoughts.
>
> Thank you.
>
> Liang-Chi Hsieh
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
Re: How to contribute the code
Hi,

http://spark.apache.org/contributing.html ?

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

<https://twitter.com/jaceklaskowski>

On Sat, Jan 30, 2021 at 3:55 PM hammerCS <1624067...@qq.com> wrote:

> I have been learning Spark for one year. Now I want to get the hang of
> Spark by contributing code, but the codebase is so massive that I don't
> know where to start.
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
[K8S] ExecutorPodsWatchSnapshotSource with no spark-exec-inactive label in 3.1?
Hi,

The spark-exec-inactive label is not used when watching executor pods in ExecutorPodsWatchSnapshotSource [1], unlike in PollRunnable [2]. Any reasons? Am I correct to consider it a bug?

[1] https://github.com/apache/spark/blob/270a9a3cac25f3e799460320d0fc94ccd7ecfaea/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsWatchSnapshotSource.scala#L39-L42
[2] https://github.com/apache/spark/blob/ee624821a903a263c844ed849d6833df8e9ad43e/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsPollingSnapshotSource.scala#L62

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

<https://twitter.com/jaceklaskowski>
Re: Should 3.1.0 config props be 3.1.1 (as s.k.executor.missingPodDetectDelta)?
Hi Hyukjin, Agreed. I asked to see if I'm not missing anything. Thank you. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Sat, Jan 23, 2021 at 1:01 PM Hyukjin Kwon wrote: > I think we can just leave it as is. We have the unofficial 3.1.0 release > with its corresponding git tag so 3.1.0 mark isn't completely useless. > > Also changing means we should go through JIRAs and change the version > properties, fixing conflict, etc. which I don't think is worthwhile. > > On Sat, 23 Jan 2021, 20:43 Jacek Laskowski, wrote: > >> Hi, >> >> Just noticed a couple of configuration properties marked >> as version("3.1.0") in Spark on Kubernetes' Config. There's one >> spark.kubernetes.executor.missingPodDetectDelta property marked as 3.1.1. >> >> That triggered this question in my mind whether all together should be >> 3.1.1? >> >> (We could leave it as is as an "easter egg"-like thing too) >> >> Pozdrawiam, >> Jacek Laskowski >> >> https://about.me/JacekLaskowski >> "The Internals Of" Online Books <https://books.japila.pl/> >> Follow me on https://twitter.com/jaceklaskowski >> >> <https://twitter.com/jaceklaskowski> >> >
Should 3.1.0 config props be 3.1.1 (as s.k.executor.missingPodDetectDelta)?
Hi,

Just noticed a couple of configuration properties marked as version("3.1.0") in Spark on Kubernetes' Config. There's one property, spark.kubernetes.executor.missingPodDetectDelta, marked as 3.1.1.

That triggered the question whether they should all be 3.1.1.

(We could leave it as is as an "easter egg"-like thing too.)

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

<https://twitter.com/jaceklaskowski>
Re: [VOTE] Release Spark 3.1.1 (RC1)
Hi,

+1 (non-binding)

1. Built locally using AdoptOpenJDK (build 11.0.9+11) with -Pyarn,kubernetes,hive-thriftserver,scala-2.12 -DskipTests
2. Ran batch and streaming demos using Spark on Kubernetes (minikube) using spark-shell (client deploy mode) and spark-submit --deploy-mode cluster

I reported a non-blocking issue with "the only developer Matei" (https://issues.apache.org/jira/browse/SPARK-34158).

Found a minor non-blocking (but annoying) issue in Spark on k8s that's different from 3.0.1. It should really be silenced like the other debug messages in ExecutorPodsAllocator:

21/01/19 12:23:26 DEBUG ExecutorPodsAllocator: ResourceProfile Id: 0 pod allocation status: 2 running, 0 pending. 0 unacknowledged.
21/01/19 12:23:27 DEBUG ExecutorPodsAllocator: ResourceProfile Id: 0 pod allocation status: 2 running, 0 pending. 0 unacknowledged.
21/01/19 12:23:28 DEBUG ExecutorPodsAllocator: ResourceProfile Id: 0 pod allocation status: 2 running, 0 pending. 0 unacknowledged.
21/01/19 12:23:29 DEBUG ExecutorPodsAllocator: ResourceProfile Id: 0 pod allocation status: 2 running, 0 pending. 0 unacknowledged.

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

<https://twitter.com/jaceklaskowski>

On Mon, Jan 18, 2021 at 1:06 PM Hyukjin Kwon wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 3.1.1.
>
> The vote is open until January 22nd 4PM PST and passes if a majority +1
> PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.1.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.1.1-rc1 (commit
> 53fe365edb948d0e05a5ccb62f349cd9fcb4bb5d):
> https://github.com/apache/spark/tree/v3.1.1-rc1
>
> The release files, including signatures, digests, etc.
can be found at: > https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc1-bin/ > > Signatures used for Spark RCs can be found in this file: > https://dist.apache.org/repos/dist/dev/spark/KEYS > > The staging repository for this release can be found at: > https://repository.apache.org/content/repositories/orgapachespark-1364 > > The documentation corresponding to this release can be found at: > https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc1-docs/ > > The list of bug fixes going into 3.1.1 can be found at the following URL: > https://s.apache.org/41kf2 > > This release is using the release script of the tag v3.1.1-rc1. > > FAQ > > === > What happened to 3.1.0? > === > > There was a technical issue during Apache Spark 3.1.0 preparation, and it > was discussed and decided to skip 3.1.0. > Please see > https://spark.apache.org/news/next-official-release-spark-3.1.1.html for > more details. > > = > How can I help test this release? > = > > If you are a Spark user, you can help us test this release by taking > an existing Spark workload and running on this release candidate, then > reporting any regressions. > > If you're working in PySpark you can set up a virtual env and install > the current RC via "pip install > https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc1-bin/pyspark-3.1.1.tar.gz > " > and see if anything important breaks. > In the Java/Scala, you can add the staging repository to your projects > resolvers and test > with the RC (make sure to clean up the artifact cache before/after so > you don't end up building with an out of date RC going forward). > > === > What should happen to JIRA tickets still targeting 3.1.1? > === > > The current list of open tickets targeted at 3.1.1 can be found at: > https://issues.apache.org/jira/projects/SPARK and search for "Target > Version/s" = 3.1.1 > > Committers should look at those and triage. 
Extremely important bug > fixes, documentation, and API tweaks that impact compatibility should > be worked on immediately. Everything else please retarget to an > appropriate release. > > == > But my bug isn't fixed? > == > > In order to make timely releases, we will typically not hold the > release unless the bug in question is a regression from the previous > release. That being said, if there is something which is a regression > that has not been correctly targeted please ping me or a committer to > help target the issue. > >
[K8S] KUBERNETES_EXECUTOR_REQUEST_CORES
Hi,

I'm curious why this line [1] uses sparkConf to find KUBERNETES_EXECUTOR_REQUEST_CORES while the next [2] goes through kubernetesConf.

(I'm not going to mention its getOrElse case here, and will wait until I get an OK to change this due to the above "misuse".)

[1] https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala#L71
[2] https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala#L72

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

<https://twitter.com/jaceklaskowski>
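To make the asymmetry concrete, here is a toy sketch. SparkConfLike and KubernetesConfLike are hypothetical stand-ins of my own (not the actual SparkConf/KubernetesConf classes); the point is that when the wrapper merely delegates to the wrapped conf, both read paths resolve the same value, so mixing the two styles on adjacent lines looks accidental rather than intentional:

```scala
// Toy model of the two read paths described above.
final case class SparkConfLike(settings: Map[String, String]) {
  def get(key: String): Option[String] = settings.get(key)
}
final case class KubernetesConfLike(sparkConf: SparkConfLike) {
  // The wrapper ultimately resolves entries from the wrapped conf.
  def get(key: String): Option[String] = sparkConf.get(key)
}

val conf = KubernetesConfLike(SparkConfLike(Map(
  "spark.kubernetes.executor.request.cores" -> "2")))

// Style [1]: reach through to the raw conf.
val viaSparkConf = conf.sparkConf.get("spark.kubernetes.executor.request.cores")
// Style [2]: go through the wrapper.
val viaWrapper = conf.get("spark.kubernetes.executor.request.cores")
```

Since both styles return the same value, consistency would suggest picking one of them throughout BasicExecutorFeatureStep.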
Re: [VOTE] Release Spark 3.1.0 (RC1)
Hi Sean, +1 to leave it. Makes so much more sense (as that's what really happened and the history of Apache Spark is...irreversible). Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Thu, Jan 7, 2021 at 3:44 PM Sean Owen wrote: > While we can delete the tag, maybe just leave it. As a general rule we > would not remove anything pushed to the main git repo. > > On Thu, Jan 7, 2021 at 8:31 AM Jacek Laskowski wrote: > >> Hi, >> >> BTW, wondering aloud. Since it was agreed to skip 3.1.0 and go ahead with >> 3.1.1, what's gonna happen with v3.1.0 tag [1]? Is it going away and we'll >> see 3.1.1-rc1? >> >> [1] https://github.com/apache/spark/tree/v3.1.0-rc1 >> >> Pozdrawiam, >> Jacek Laskowski >> >> https://about.me/JacekLaskowski >> "The Internals Of" Online Books <https://books.japila.pl/> >> Follow me on https://twitter.com/jaceklaskowski >> >> <https://twitter.com/jaceklaskowski> >> >>>
Re: [VOTE] Release Spark 3.1.0 (RC1)
Hi, BTW, wondering aloud. Since it was agreed to skip 3.1.0 and go ahead with 3.1.1, what's gonna happen with v3.1.0 tag [1]? Is it going away and we'll see 3.1.1-rc1? [1] https://github.com/apache/spark/tree/v3.1.0-rc1 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Thu, Jan 7, 2021 at 3:12 PM Jacek Laskowski wrote: > Hi, > > I'm just reading this now. I'm for 3.1.1 with no 3.1.0 but the news that > we're skipping that particular release. Gonna be more fun! :) > > Pozdrawiam, > Jacek Laskowski > > https://about.me/JacekLaskowski > "The Internals Of" Online Books <https://books.japila.pl/> > Follow me on https://twitter.com/jaceklaskowski > > <https://twitter.com/jaceklaskowski> > > > On Thu, Jan 7, 2021 at 6:13 AM Hyukjin Kwon wrote: > >> Thank you Holden and Wenchen! >> >> Let me: >> - prepare a PR for news in spark-website first about 3.1.0 accident late >> tonight (in KST) >> - and start to prepare 3.1.1 probably in few more days like next monday >> in case other people have different thoughts >> >> >> >> 2021년 1월 7일 (목) 오후 2:04, Holden Karau 님이 작성: >> >>> I think that posting the 3.1.0 maven release was an accident and we're >>> going to 3.1.1 RCs is the right step forward. >>> I'd ask for maybe a day before cutting the 3.1.1 release, I think >>> https://issues.apache.org/jira/browse/SPARK-34018 is also a blocker (at >>> first I thought it was just a test issue, but Dongjoon pointed out the NPE >>> happens in prod too). >>> >>> I'd also like to echo the: it's totally ok we all make mistakes >>> especially in partially manual & partially automated environments, I've >>> created a bunch of RCs labels without recognizing they were getting pushed >>> automatically. 
>>> >>> On Wed, Jan 6, 2021 at 8:57 PM Wenchen Fan wrote: >>> >>>> I agree with Jungtaek that people are likely to be biased when testing >>>> 3.1.0. At least this will not be the same community-blessed release as >>>> previous ones, because the voting is already affected by the fact that >>>> 3.1.0 is already in maven central. Skipping 3.1.0 sounds better to me. >>>> >>>> On Thu, Jan 7, 2021 at 12:54 PM Hyukjin Kwon >>>> wrote: >>>> >>>>> Okay, let me just start to prepare 3.1.1. I think that will address >>>>> all concerns except that 3.1.0 will remain in Maven as incomplete. >>>>> By right, removal in the Maven repo is disallowed. Overwrite is >>>>> possible as far as I know but other mirrors that maintain cache will get >>>>> affected. >>>>> Maven is one of the downstream publish channels, and we haven't >>>>> officially announced and published it to Apache repo anyway. >>>>> I will prepare to upload news in spark-website to explain that 3.1.0 >>>>> is incompletely published because there was something wrong during the >>>>> release process, and we go to 3.1.1 right away. >>>>> Are we all good with this? >>>>> >>>>> >>>>> >>>>> 2021년 1월 7일 (목) 오후 1:11, Hyukjin Kwon 님이 작성: >>>>> >>>>>> I think that It would be great though if we have a clear blocker that >>>>>> makes the release pointless if we want to drop this RC practically given >>>>>> that we will schedule 3.1.1 faster - non-regression bug fixes will be >>>>>> delivered to end users relatively fast. >>>>>> That would make it clear which option we should take. I personally >>>>>> don't mind dropping 3.1.0 as well; we'll have to wait for the INFRA >>>>>> team's >>>>>> response anyway. >>>>>> >>>>>> >>>>>> 2021년 1월 7일 (목) 오후 1:03, Sean Owen 님이 작성: >>>>>> >>>>>>> I don't agree the first two are blockers for reasons I gave earlier. >>>>>>> Those two do look like important issues - are they regressions from >>>>>>> 3.0.1? 
>>>>>>> I do agree we'd probably cut a new RC for those in any event, so >>>>>>> agree with the plan to drop 3.1.0 (if the Maven release can't be >>>>>>>
Re: [VOTE] Release Spark 3.1.0 (RC1)
Hi, I'm just reading this now. I'm for 3.1.1 with no 3.1.0 but the news that we're skipping that particular release. Gonna be more fun! :) Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Thu, Jan 7, 2021 at 6:13 AM Hyukjin Kwon wrote: > Thank you Holden and Wenchen! > > Let me: > - prepare a PR for news in spark-website first about 3.1.0 accident late > tonight (in KST) > - and start to prepare 3.1.1 probably in few more days like next monday in > case other people have different thoughts > > > > 2021년 1월 7일 (목) 오후 2:04, Holden Karau 님이 작성: > >> I think that posting the 3.1.0 maven release was an accident and we're >> going to 3.1.1 RCs is the right step forward. >> I'd ask for maybe a day before cutting the 3.1.1 release, I think >> https://issues.apache.org/jira/browse/SPARK-34018 is also a blocker (at >> first I thought it was just a test issue, but Dongjoon pointed out the NPE >> happens in prod too). >> >> I'd also like to echo the: it's totally ok we all make mistakes >> especially in partially manual & partially automated environments, I've >> created a bunch of RCs labels without recognizing they were getting pushed >> automatically. >> >> On Wed, Jan 6, 2021 at 8:57 PM Wenchen Fan wrote: >> >>> I agree with Jungtaek that people are likely to be biased when testing >>> 3.1.0. At least this will not be the same community-blessed release as >>> previous ones, because the voting is already affected by the fact that >>> 3.1.0 is already in maven central. Skipping 3.1.0 sounds better to me. >>> >>> On Thu, Jan 7, 2021 at 12:54 PM Hyukjin Kwon >>> wrote: >>> >>>> Okay, let me just start to prepare 3.1.1. I think that will address all >>>> concerns except that 3.1.0 will remain in Maven as incomplete. >>>> By right, removal in the Maven repo is disallowed. 
Overwrite is >>>> possible as far as I know but other mirrors that maintain cache will get >>>> affected. >>>> Maven is one of the downstream publish channels, and we haven't >>>> officially announced and published it to Apache repo anyway. >>>> I will prepare to upload news in spark-website to explain that 3.1.0 is >>>> incompletely published because there was something wrong during the release >>>> process, and we go to 3.1.1 right away. >>>> Are we all good with this? >>>> >>>> >>>> >>>> 2021년 1월 7일 (목) 오후 1:11, Hyukjin Kwon 님이 작성: >>>> >>>>> I think that It would be great though if we have a clear blocker that >>>>> makes the release pointless if we want to drop this RC practically given >>>>> that we will schedule 3.1.1 faster - non-regression bug fixes will be >>>>> delivered to end users relatively fast. >>>>> That would make it clear which option we should take. I personally >>>>> don't mind dropping 3.1.0 as well; we'll have to wait for the INFRA team's >>>>> response anyway. >>>>> >>>>> >>>>> 2021년 1월 7일 (목) 오후 1:03, Sean Owen 님이 작성: >>>>> >>>>>> I don't agree the first two are blockers for reasons I gave earlier. >>>>>> Those two do look like important issues - are they regressions from >>>>>> 3.0.1? >>>>>> I do agree we'd probably cut a new RC for those in any event, so >>>>>> agree with the plan to drop 3.1.0 (if the Maven release can't be >>>>>> overwritten) >>>>>> >>>>>> On Wed, Jan 6, 2021 at 9:38 PM Dongjoon Hyun >>>>>> wrote: >>>>>> >>>>>>> Before we discover the pre-uploaded artifacts, both Jungtaek and >>>>>>> Hyukjin already made two blockers shared here. >>>>>>> IIUC, it meant implicitly RC1 failure at that time. >>>>>>> >>>>>>> In addition to that, there are two correctness issues. So, I made up >>>>>>> my mind to cast -1 for this RC1 before joining this thread. >>>>>>> >>>>>>> SPARK-34011 ALTER TABLE .. RENAME TO PARTITION doesn't refresh cache >>>>>>> (committed after tagging) >>>>>>> SPARK-34027 ALTER TABLE .. 
RECOVER PARTITIONS doesn't refresh cache >>>>>>> (PR is under review) >>>>>>> >>>>>>> Although the above issues are not regression, those are enough for >>>>>>> me to give -1 for 3.1.0 RC1. >>>>>>> >>>>>>> On Wed, Jan 6, 2021 at 3:52 PM Sean Owen wrote: >>>>>>> >>>>>>>> I just don't see a reason to believe there's a rush? just test it >>>>>>>> as normal? I did, you can too, etc. >>>>>>>> Or specifically what blocks the current RC? >>>>>>>> >>>>>>> >> >> -- >> Twitter: https://twitter.com/holdenkarau >> Books (Learning Spark, High Performance Spark, etc.): >> https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9> >> YouTube Live Streams: https://www.youtube.com/user/holdenkarau >> >
Re: [VOTE] Release Spark 3.1.0 (RC1)
Hi, I'm curious why Spark 3.1.0 is already available in repo1.maven.org? https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.12/3.1.0/ Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Wed, Jan 6, 2021 at 1:01 AM Hyukjin Kwon wrote: > Please vote on releasing the following candidate as Apache Spark version > 3.1.0. > > The vote is open until January 8th 4PM PST and passes if a majority +1 PMC > votes are cast, with a minimum of 3 +1 votes. > > [ ] +1 Release this package as Apache Spark 3.1.0 > [ ] -1 Do not release this package because ... > > To learn more about Apache Spark, please see http://spark.apache.org/ > > The tag to be voted on is v3.1.0-rc1 (commit > 97340c1e34cfd84de445b6b7545cfa466a1baaf6): > https://github.com/apache/spark/tree/v3.1.0-rc1 > > The release files, including signatures, digests, etc. can be found at: > https://dist.apache.org/repos/dist/dev/spark/v3.1.0-rc1-bin/ > > Signatures used for Spark RCs can be found in this file: > https://dist.apache.org/repos/dist/dev/spark/KEYS > > The staging repository for this release can be found at: > https://repository.apache.org/content/repositories/orgapachespark-1363/ > > The documentation corresponding to this release can be found at: > https://dist.apache.org/repos/dist/dev/spark/v3.1.0-rc1-docs/ > > The list of bug fixes going into 3.1.0 can be found at the following URL: > https://s.apache.org/ldzzl > > This release is using the release script of the tag v3.1.0-rc1. > > FAQ > > > = > How can I help test this release? > = > > If you are a Spark user, you can help us test this release by taking > an existing Spark workload and running on this release candidate, then > reporting any regressions. 
> > If you're working in PySpark you can set up a virtual env and install > the current RC via "pip install > https://dist.apache.org/repos/dist/dev/spark/v3.1.0-rc1-bin/pyspark-3.1.0.tar.gz > " > and see if anything important breaks. > In the Java/Scala, you can add the staging repository to your projects > resolvers and test > with the RC (make sure to clean up the artifact cache before/after so > you don't end up building with an out of date RC going forward). > > === > What should happen to JIRA tickets still targeting 3.1.0? > === > > The current list of open tickets targeted at 3.1.0 can be found at: > https://issues.apache.org/jira/projects/SPARK and search for "Target > Version/s" = 3.1.0 > > Committers should look at those and triage. Extremely important bug > fixes, documentation, and API tweaks that impact compatibility should > be worked on immediately. Everything else please retarget to an > appropriate release. > > == > But my bug isn't fixed? > == > > In order to make timely releases, we will typically not hold the > release unless the bug in question is a regression from the previous > release. That being said, if there is something which is a regression > that has not been correctly targeted please ping me or a committer to > help target the issue. > >
Re: [3.0.1] ExecutorMonitor.onJobStart and StageInfo.shuffleDepId that's never used?
Hi, Sorry. A false alarm. Got mistaken with what IDEA calls "unused" may not really be unused. It is (re)assigned in StageInfo.fromStage for a ShuffleMapStage [1] and then caught in ExecutorMonitor [2] (since it's a SparkListener). [1] https://github.com/apache/spark/blob/094563384478a402c36415edf04ee7b884a34fc9/core/src/main/scala/org/apache/spark/scheduler/StageInfo.scala#L108 [2] https://github.com/apache/spark/blob/78df2caec8c94c31e5c9ddc30ed8acb424084181/core/src/main/scala/org/apache/spark/scheduler/dynalloc/ExecutorMonitor.scala#L179 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Wed, Dec 30, 2020 at 3:34 PM Jacek Laskowski wrote: > Hi, > > It's been a while. Glad to be back Sparkians! > > I've been exploring ExecutorMonitor.onJobStart in 3.0.1 and noticed that > it uses StageInfo.shuffleDepId [1] that is None by default and moreover > never "written to" according to IntelliJ IDEA. > > Is this the case and intentional? > > I'm wondering how much IDEA knows about codegen and that's where it's used > (?) > > I've just stumbled upon it and before I spend more time on this I thought > I'd ask (perhaps it's going to change in 3.1?). Help appreciated. > > [1] > https://github.com/apache/spark/blob/78df2caec8c94c31e5c9ddc30ed8acb424084181/core/src/main/scala/org/apache/spark/scheduler/dynalloc/ExecutorMonitor.scala#L179 > > Pozdrawiam, > Jacek Laskowski > > https://about.me/JacekLaskowski > "The Internals Of" Online Books <https://books.japila.pl/> > Follow me on https://twitter.com/jaceklaskowski > > <https://twitter.com/jaceklaskowski> >
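The "false alarm" pattern above — a field that an IDE flags as never written because the only write happens inside a factory method, keyed on the concrete stage type — can be sketched in plain Scala. All names below are simplified stand-ins for the real `StageInfo`/`ShuffleMapStage` classes, not Spark's actual code:

```scala
// Hypothetical model of the StageInfo.shuffleDepId pattern: the field looks
// "unused" at its declaration because it is only assigned inside the
// companion's factory method, and only for one stage type.
sealed trait Stage
case class ShuffleMapStage(shuffleDepId: Int) extends Stage
case object ResultStage extends Stage

class StageInfo(val name: String) {
  // None by default; only StageInfo.fromStage ever writes to it.
  var shuffleDepId: Option[Int] = None
}

object StageInfo {
  def fromStage(stage: Stage): StageInfo = {
    val info = new StageInfo("stage")
    stage match {
      case ShuffleMapStage(depId) => info.shuffleDepId = Some(depId) // the "hidden" write
      case ResultStage            => () // stays None
    }
    info
  }
}
```

A listener (here, `ExecutorMonitor`) then reads the field off the `StageInfo` it receives, which is why the write is invisible at the declaration site.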
[3.0.1] ExecutorMonitor.onJobStart and StageInfo.shuffleDepId that's never used?
Hi, It's been a while. Glad to be back Sparkians! I've been exploring ExecutorMonitor.onJobStart in 3.0.1 and noticed that it uses StageInfo.shuffleDepId [1] that is None by default and moreover never "written to" according to IntelliJ IDEA. Is this the case and intentional? I'm wondering how much IDEA knows about codegen and that's where it's used (?) I've just stumbled upon it and before I spend more time on this I thought I'd ask (perhaps it's going to change in 3.1?). Help appreciated. [1] https://github.com/apache/spark/blob/78df2caec8c94c31e5c9ddc30ed8acb424084181/core/src/main/scala/org/apache/spark/scheduler/dynalloc/ExecutorMonitor.scala#L179 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski>
Re: Incorrect Scala version for Spark 2.4.x releases in the docs?
Thanks Sean for such a quick response! Let me propose a fix for the docs. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Thu, Sep 17, 2020 at 4:16 PM Sean Owen wrote: > The pre-built binary distros should use 2.11 in 2.4.x. Artifacts for > both Scala versions are available, yes. > Yeah I think it should really say you can use 2.11 or 2.12. > > On Thu, Sep 17, 2020 at 9:12 AM Jacek Laskowski wrote: > > > > Hi, > > > > Just found this paragraph in > http://spark.apache.org/docs/2.4.6/index.html#downloading: > > > > "Spark runs on Java 8, Python 2.7+/3.4+ and R 3.1+. For the Scala API, > Spark 2.4.6 uses Scala 2.12. You will need to use a compatible Scala > version (2.12.x)." > > > > That seems to contradict the version of Scala in the pom.xml [1] which > is 2.11.12. I think this says that Spark 2.4.6 uses Scala 2.12 by default > which is incorrect to me. Am I missing something? > > > > My question is what's the official Scala version of Spark 2.4.6 (and > others in 2.4.x release line)? > > > > (I do know that Spark 2.4.x could be compiled with Scala 2.12, but that > requires scala-2.12 profile [2] to be enabled) > > > > [1] https://github.com/apache/spark/blob/v2.4.6/pom.xml#L158 > > [2] https://github.com/apache/spark/blob/v2.4.6/pom.xml#L2830 > > > > Pozdrawiam, > > Jacek Laskowski > > > > https://about.me/JacekLaskowski > > "The Internals Of" Online Books > > Follow me on https://twitter.com/jaceklaskowski > > >
Incorrect Scala version for Spark 2.4.x releases in the docs?
Hi, Just found this paragraph in http://spark.apache.org/docs/2.4.6/index.html#downloading: "Spark runs on Java 8, Python 2.7+/3.4+ and R 3.1+. For the Scala API, Spark 2.4.6 uses Scala 2.12. You will need to use a compatible Scala version (2.12.x)." That seems to contradict the version of Scala in the pom.xml [1] which is 2.11.12. I think this says that Spark 2.4.6 uses Scala 2.12 by default which is incorrect to me. Am I missing something? My question is what's the official Scala version of Spark 2.4.6 (and others in 2.4.x release line)? (I do know that Spark 2.4.x could be compiled with Scala 2.12, but that requires scala-2.12 profile [2] to be enabled) [1] https://github.com/apache/spark/blob/v2.4.6/pom.xml#L158 [2] https://github.com/apache/spark/blob/v2.4.6/pom.xml#L2830 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski>
Why is V2SessionCatalog not a CatalogExtension?
Hi, Just started exploring Catalog Plugin API and noticed these two classes: CatalogExtension and V2SessionCatalog. Why is V2SessionCatalog not a CatalogExtension? - V2SessionCatalog extends TableCatalog with SupportsNamespaces [1] - CatalogExtension extends TableCatalog, SupportsNamespaces [2] [1] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala#L41 [2] https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/CatalogExtension.java#L33 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski>
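One way to see the distinction the question raises — sketched here with simplified stand-in traits, not the real interfaces — is that `CatalogExtension` exists for user catalogs that *wrap* the session catalog (hence its extra delegate hook), while `V2SessionCatalog` plays the role of the delegate itself, so it implements the same two parents directly:

```scala
// Minimal stand-ins for the Catalog Plugin API shapes (names mirror Spark's,
// but these are simplified sketches for illustration only).
trait CatalogPlugin { def name: String }
trait TableCatalog extends CatalogPlugin
trait SupportsNamespaces extends CatalogPlugin

// CatalogExtension adds a hook to receive the catalog it wraps.
trait CatalogExtension extends TableCatalog with SupportsNamespaces {
  def setDelegateCatalog(delegate: CatalogPlugin): Unit
}

// The session catalog is what gets wrapped; it has nothing to delegate to,
// so it extends the two parents directly rather than CatalogExtension.
class V2SessionCatalog extends TableCatalog with SupportsNamespaces {
  def name: String = "spark_catalog"
}

// A hypothetical user catalog that wraps the session catalog.
class MyCatalog extends CatalogExtension {
  private var delegate: Option[CatalogPlugin] = None
  def name: String = "my_catalog"
  def setDelegateCatalog(d: CatalogPlugin): Unit = delegate = Some(d)
  def delegateName: Option[String] = delegate.map(_.name)
}
```

Under that reading, making `V2SessionCatalog` a `CatalogExtension` would be circular: the delegate would itself expect a delegate.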
Why time difference while registering a new BlockManager (using BlockManagerMasterEndpoint)?
Hi, Just noticed an inconsistency between times when a BlockManager is about to be registered [1][2] and the time listeners are going to be informed [3], and got curious whether it's intentional or not. Why is the `time` value not used for SparkListenerBlockManagerAdded message? [1] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala#L453 [2] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala#L478 [3] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala#L481 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski>
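The inconsistency being asked about boils down to whether a timestamp is captured once and threaded through, or read from the clock a second time when the listener event is posted. A toy model of the two patterns (all names are hypothetical, not Spark's actual code):

```scala
// Toy model of the timing question around BlockManager registration.
final case class BlockManagerAdded(time: Long, id: String)

class Bus {
  var posted: List[BlockManagerAdded] = Nil
  def post(e: BlockManagerAdded): Unit = posted ::= e
}

// Consistent: listeners see exactly the registration timestamp.
def registerConsistent(bus: Bus, id: String, clock: () => Long): Long = {
  val time = clock()                     // captured once...
  bus.post(BlockManagerAdded(time, id))  // ...and reused for the event
  time
}

// Inconsistent (the pattern the mail points at): two separate clock reads
// can disagree.
def registerInconsistent(bus: Bus, id: String, clock: () => Long): Long = {
  val time = clock()
  bus.post(BlockManagerAdded(clock(), id)) // second read, may differ
  time
}
```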
Re: BlockManager and ShuffleManager = can getLocalBytes be ever used for shuffle blocks?
Thanks Yi Wu! Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Sat, Apr 18, 2020 at 12:17 PM wuyi wrote: > Hi Jacek, > > code in link [2] is the out of date. The commit > > https://github.com/apache/spark/commit/32ec528e63cb768f85644282978040157c3c2fb7 > has already removed unreachable branch. > > > Best, > Yi Wu > > > > -- > Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > >
ShuffleMapStage and pendingPartitions vs isAvailable or findMissingPartitions?
Hi, I found that ShuffleMapStage has this (apparently superfluous) pendingPartitions registry [1] for DAGScheduler and the description says: " /** * Partitions that either haven't yet been computed, or that were computed on an executor * that has since been lost, so should be re-computed. This variable is used by the * DAGScheduler to determine when a stage has completed. Task successes in both the active * attempt for the stage or in earlier attempts for this stage can cause paritition ids to get * removed from pendingPartitions. As a result, this variable may be inconsistent with the pending * tasks in the TaskSetManager for the active attempt for the stage (the partitions stored here * will always be a subset of the partitions that the TaskSetManager thinks are pending). */ " I'm curious why there is a need for this pendingPartitions since isAvailable or findMissingPartitions (using MapOutputTrackerMaster) know it already and I think are even more up-to-date. Why is there this extra registry? [1] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/ShuffleMapStage.scala#L60 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski>
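One reason the two views can coexist — sketched as a toy model, not Spark's actual bookkeeping — is that `pendingPartitions` answers "has the current attempt finished?", while the map-output view answers "which partitions still need (re)computing?". After an executor loss invalidates an output, the attempt can be complete while outputs are missing:

```scala
// Toy model: two different progress views of a shuffle map stage.
class ShuffleMapStageModel(numPartitions: Int) {
  // DAGScheduler-style view: tasks of the active attempt still outstanding.
  val pendingPartitions = scala.collection.mutable.Set.from(0 until numPartitions)
  // MapOutputTracker-style view: partition -> executor holding its output.
  private val outputs = scala.collection.mutable.Map.empty[Int, String]

  def taskSucceeded(partition: Int, executor: String): Unit = {
    pendingPartitions -= partition
    outputs(partition) = executor
  }

  def executorLost(executor: String): Unit =
    outputs.filterInPlace((_, e) => e != executor)

  def attemptFinished: Boolean = pendingPartitions.isEmpty
  def findMissingPartitions: Seq[Int] =
    (0 until numPartitions).filterNot(outputs.contains)
}
```

In this model the two registries legitimately disagree, which may be why the scheduler keeps both.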
BlockManager and ShuffleManager = can getLocalBytes be ever used for shuffle blocks?
Hi, While trying to understand the relationship of BlockManager and ShuffleManager I found that ShuffleManager is used for shuffle block data [1] (and that makes sense). What I found quite surprising is that BlockManager can call getLocalBytes for non-shuffle blocks that in turn does...fetching shuffle blocks (!) [2]. That begs the question, who could want this? In other words, what other components would want to call BlockManager.getLocalBytes for shuffle blocks? The only caller is TorrentBroadcast which is not for shuffle blocks, is it? In other words, getLocalBytes should NOT be bothering itself with fetching local shuffle blocks as that's a responsibility of someone else, i.e. ShuffleManager. Is this correct? I'd appreciate any help you can provide so I can understand the Spark core better. Thanks! [1] https://github.com/apache/spark/blob/31734399d57f3c128e66b0f97ef83eb4c9165978/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L382 [2] https://github.com/apache/spark/blob/31734399d57f3c128e66b0f97ef83eb4c9165978/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L637 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski>
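The dispatch being questioned can be reduced to a few lines — a "get local bytes" entry point that special-cases shuffle blocks and hands them to a shuffle resolver instead of the ordinary block store. The classes below are simplified stand-ins, not the real `BlockManager` API:

```scala
// Toy model of the getLocalBytes dispatch described in the mail.
sealed trait BlockId { def isShuffle: Boolean }
case class ShuffleBlockId(mapId: Int) extends BlockId { val isShuffle = true }
case class BroadcastBlockId(id: Int) extends BlockId { val isShuffle = false }

class BlockManagerModel(
    shuffleResolver: BlockId => String, // stands in for ShuffleManager's resolver
    blockStore: BlockId => String) {    // stands in for the local disk/memory store

  def getLocalBytes(blockId: BlockId): String =
    if (blockId.isShuffle) shuffleResolver(blockId) // the surprising branch
    else blockStore(blockId)
}
```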
Re: InferFiltersFromConstraints logical optimization rule and Optimizer.defaultBatches?
Hi Jungtaek, Thanks a lot for your answer. What you're saying reflects my understanding perfectly. It's a small change, but it makes understanding where rules are used much simpler (= less confusing). I'll propose a PR and see where it goes from there. Thanks! Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Wed, Apr 15, 2020 at 7:55 AM Jungtaek Lim wrote: > Please correct me if I'm missing something. At a glance, your statements > look correct if I understand correctly. I guess it might be simply missed, > but it sounds as pretty trivial one as only a line can be removed safely > which won't affect anything. (filterNot should be retained even we remove > the line to prevent extended rules to break this) > > On Sun, Apr 12, 2020 at 9:54 PM Jacek Laskowski wrote: > >> Hi, >> >> I'm curious why there is a need to include InferFiltersFromConstraints >> logical optimization in operatorOptimizationRuleSet value [1] that seems to >> be only to exclude it later [2]? >> >> In other words, I think that simply removing InferFiltersFromConstraints >> from operatorOptimizationRuleSet value [1] would make no change (except >> removing a "dead code"). >> >> Does this make sense? Could I be missing something? >> >> [1] >> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L80 >> [2] >> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L115 >> >> Pozdrawiam, >> Jacek Laskowski >> >> https://about.me/JacekLaskowski >> "The Internals Of" Online Books <https://books.japila.pl/> >> Follow me on https://twitter.com/jaceklaskowski >> >> <https://twitter.com/jaceklaskowski> >> >
InferFiltersFromConstraints logical optimization rule and Optimizer.defaultBatches?
Hi, I'm curious why there is a need to include InferFiltersFromConstraints logical optimization in operatorOptimizationRuleSet value [1] that seems to be only to exclude it later [2]? In other words, I think that simply removing InferFiltersFromConstraints from operatorOptimizationRuleSet value [1] would make no change (except removing a "dead code"). Does this make sense? Could I be missing something? [1] https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L80 [2] https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L115 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski>
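The observation reduces to plain collections: a rule is added to the base rule set and then immediately excluded via `filterNot`, so (assuming nothing else reads the base set) dropping it from the base set up front is behavior-preserving. The rule names below are illustrative, not the full optimizer list:

```scala
// The pattern in question, reduced to a Seq of rule names.
val inferFilters = "InferFiltersFromConstraints"
val operatorOptimizationRuleSet =
  Seq("PushDownPredicates", inferFilters, "CollapseProject")

// What the optimizer batch effectively runs after the exclusion:
val batchRules = operatorOptimizationRuleSet.filterNot(_ == inferFilters)

// Removing the rule from the base set up front yields the same batch
// (the filterNot stays, as Jungtaek notes, to guard extended rule sets):
val trimmedRuleSet = Seq("PushDownPredicates", "CollapseProject")
```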
Re: [DOCS] Spark SQL Upgrading Guide
Hi, Never mind. Found this [1]: > This config is deprecated and it will be removed in 3.0.0. And so it has :) Thanks and sorry for the trouble. [1] https://github.com/apache/spark/blob/830a4ec59b86253f18eb7dfd6ed0bbe0d7920e5b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L1306-L1307 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Sat, Feb 15, 2020 at 7:44 PM Jacek Laskowski wrote: > Hi, > > Just noticed that > http://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html (Spark > 2.4.5) has formatting issues in "Upgrading from Spark SQL 2.4.3 to 2.4.4" > [1] which got fixed in master [2]. That's OK. > > What made me wonder was the other change to the section "Upgrading from > Spark SQL 2.4 to 2.4.5" [3] that had the following item included: > > "Starting from 2.4.5, SQL configurations are effective also when a Dataset > is converted to an RDD and its plan is executed due to action on the > derived RDD. The previous behavior can be restored setting > spark.sql.legacy.rdd.applyConf to false: in this case, SQL configurations > are ignored for operations performed on a RDD derived from a Dataset." > > Why was this removed in master [4]? It was mentioned in "Notable changes" > of Spark Release 2.4.5 [5]. 
> > [1] > http://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html#upgrading-from-spark-sql-243-to-244 > [2] > https://github.com/apache/spark/blob/master/docs/sql-migration-guide.md#upgrading-from-spark-sql-243-to-244 > [3] > http://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html#upgrading-from-spark-sql-24-to-245 > [4] > https://github.com/apache/spark/blob/master/docs/sql-migration-guide.md#upgrading-from-spark-sql-244-to-245 > [5] http://spark.apache.org/releases/spark-release-2-4-5.html > > Pozdrawiam, > Jacek Laskowski > > https://about.me/JacekLaskowski > "The Internals Of" Online Books <https://books.japila.pl/> > Follow me on https://twitter.com/jaceklaskowski > > <https://twitter.com/jaceklaskowski> >
[DOCS] Spark SQL Upgrading Guide
Hi, Just noticed that http://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html (Spark 2.4.5) has formatting issues in "Upgrading from Spark SQL 2.4.3 to 2.4.4" [1] which got fixed in master [2]. That's OK. What made me wonder was the other change to the section "Upgrading from Spark SQL 2.4 to 2.4.5" [3] that had the following item included: "Starting from 2.4.5, SQL configurations are effective also when a Dataset is converted to an RDD and its plan is executed due to action on the derived RDD. The previous behavior can be restored setting spark.sql.legacy.rdd.applyConf to false: in this case, SQL configurations are ignored for operations performed on a RDD derived from a Dataset." Why was this removed in master [4]? It was mentioned in "Notable changes" of Spark Release 2.4.5 [5]. [1] http://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html#upgrading-from-spark-sql-243-to-244 [2] https://github.com/apache/spark/blob/master/docs/sql-migration-guide.md#upgrading-from-spark-sql-243-to-244 [3] http://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html#upgrading-from-spark-sql-24-to-245 [4] https://github.com/apache/spark/blob/master/docs/sql-migration-guide.md#upgrading-from-spark-sql-244-to-245 [5] http://spark.apache.org/releases/spark-release-2-4-5.html Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski>
Does StreamingSymmetricHashJoinExec work with watermark? I don't think so
Hi, I think watermark does not work for StreamingSymmetricHashJoinExec because of the following: 1. leftKeys and rightKeys have no spark.watermarkDelayMs metadata entry at planning [1] 2. Since the left and right keys had no watermark delay at planning the code [2] won't find it at execution Is my understanding correct? If not, can you point me at examples with watermark on 1) join keys and 2) values ? Merci beaucoup. [1] https://github.com/apache/spark/blob/v3.0.0-preview/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L477-L478 [2] https://github.com/apache/spark/blob/v3.0.0-preview/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinHelper.scala#L156-L164 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski The Internals of Spark SQL https://bit.ly/spark-sql-internals The Internals of Spark Structured Streaming https://bit.ly/spark-structured-streaming The Internals of Apache Kafka https://bit.ly/apache-kafka-internals Follow me at https://twitter.com/jaceklaskowski
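The check being pointed at can be sketched abstractly: the planner looks for a watermark-delay entry in the *key* attributes' metadata, so a watermark declared only on a value column would be invisible to it. The classes below are a toy model (the metadata key string is the one referenced in the mail; everything else is hypothetical):

```scala
// Toy model of the planner-side lookup for watermark delay on join keys.
case class Attribute(name: String, metadata: Map[String, Long])

val delayKey = "spark.watermarkDelayMs" // the metadata entry named in the mail

def watermarkDelayOnKeys(keys: Seq[Attribute]): Option[Long] =
  keys.flatMap(_.metadata.get(delayKey)).headOption

val keyWithoutDelay = Attribute("id", Map.empty)
val valueWithDelay  = Attribute("eventTime", Map(delayKey -> 10000L))
```

If the keys never carry the entry, the lookup comes back empty regardless of what the value columns declare, which matches the reading in the mail.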
Re: [SS][2.4.4] Confused with "WatermarkTracker: Event time watermark didn't move"?
Hi HeartSaVioR, > It might be due to empty batch Yeah...that's my understanding too. It's for a streaming aggregation in Append output mode so that's possible. I'll have a closer look at it. Thanks much for keeping up with this and the other questions. Much appreciated! Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski The Internals of Spark SQL https://bit.ly/spark-sql-internals The Internals of Spark Structured Streaming https://bit.ly/spark-structured-streaming The Internals of Apache Kafka https://bit.ly/apache-kafka-internals Follow me at https://twitter.com/jaceklaskowski On Mon, Oct 14, 2019 at 12:42 AM Jungtaek Lim wrote: > It might be due to empty batch (activated when there're stateful > operator(s) and the previous batch advances watermark), which has no input > so no moving watermark. > > Did you only turn on DEBUG for WatermarkTracker? If you turn on DEBUG for > MicroBatchExecution as well, it would log "Completed batch " so if > I'm not missing, it should be logged between updating event-time watermark > and watermark didn't move. You can attach streaming query listener and get > more information about batches. > > Thanks, > Jungtaek Lim (HeartSaVioR) > > On Tue, Oct 8, 2019 at 6:12 PM Jacek Laskowski wrote: > >> Hi, >> >> I haven't spent much time on it, but the following DEBUG message >> from WatermarkTracker sparked my interest :) >> >> I ran a streaming aggregation in Append mode and got the messages: >> >> 19/10/08 10:48:56 DEBUG WatermarkTracker: Observed event time stats 0: >> EventTimeStats(15000,1000,8000.0,2) >> 19/10/08 10:48:56 INFO WatermarkTracker: Updating event-time watermark >> from 0 to 5000 ms >> 19/10/08 10:48:56 DEBUG WatermarkTracker: Event time watermark didn't >> move: 5000 < 5000 >> >> I think the DEBUG message "Event time watermark didn't move" seems >> incorrect given that the query has just started and "Observed event time >> stats". 
It's true that the event-time watermark didn't move if it was 5000 >> before, but it was not as it has just started from scratch (no checkpointed >> state). >> >> Can anyone shed some light on this? I'll be digging deeper in a bit, but >> am hoping to get some more info before. Thanks! >> >> Pozdrawiam, >> Jacek Laskowski >> >> https://about.me/JacekLaskowski >> The Internals of Spark SQL https://bit.ly/spark-sql-internals >> The Internals of Spark Structured Streaming >> https://bit.ly/spark-structured-streaming >> The Internals of Apache Kafka https://bit.ly/apache-kafka-internals >> Follow me at https://twitter.com/jaceklaskowski >> >>
Re: [SS] number of output rows metric for streaming aggregation (StateStoreSaveExec) in Append output mode not measured?
Hi, That was really quick! #impressed Thanks HeartSaVioR for such prompt response! I'm fine with the current state of the issue = no need to change anything. Whatever makes Spark more shiny WFM! Jacek On Sun, 13 Oct 2019, 09:19 Jungtaek Lim, wrote: > Filed SPARK-29450 [1] and raised a patch [2]. Please let me know if you > would like to be assigned as a reporter of SPARK-29450. > > 1. https://issues.apache.org/jira/browse/SPARK-29450 > 2. https://github.com/apache/spark/pull/26104 > > On Sun, Oct 13, 2019 at 4:06 PM Jungtaek Lim > wrote: > >> Thanks for reporting. That might be possible it could be intentionally >> excluded as it could cause some confusion before introducing empty batch >> (given output rows are irrelevant to the input rows in current batch), but >> given we have empty batch I'm not seeing the reason why we don't deal with >> it. I'll file and submit a patch. >> >> Btw, there's a metric bug with empty batch as well - see SPARK-29314 [1] >> which I've submitted a patch recently. >> >> Thanks, >> Jungtaek Lim (HeartSaVioR) >> >> 1. http://issues.apache.org/jira/browse/SPARK-29314 >> >> >> On Sun, Oct 13, 2019 at 1:12 AM Jacek Laskowski wrote: >> >>> Hi, >>> >>> I use Spark 2.4.4 >>> >>> I've just noticed that the number of output rows metric >>> of StateStoreSaveExec physical operator does not seem to be measured for >>> Append output mode. In other words, whatever happens before or >>> after StateStoreSaveExec operator the metric is always 0. >>> >>> It is measured for the other modes - Complete and Update. >>> >>> See >>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala#L329-L365 >>> >>> Is this intentional? Why? 
>>> >>> Pozdrawiam, >>> Jacek Laskowski >>> >>> https://about.me/JacekLaskowski >>> The Internals of Spark SQL https://bit.ly/spark-sql-internals >>> The Internals of Spark Structured Streaming >>> https://bit.ly/spark-structured-streaming >>> The Internals of Apache Kafka https://bit.ly/apache-kafka-internals >>> Follow me at https://twitter.com/jaceklaskowski >>> >>>
[SS] number of output rows metric for streaming aggregation (StateStoreSaveExec) in Append output mode not measured?
Hi, I use Spark 2.4.4 I've just noticed that the number of output rows metric of StateStoreSaveExec physical operator does not seem to be measured for Append output mode. In other words, whatever happens before or after StateStoreSaveExec operator the metric is always 0. It is measured for the other modes - Complete and Update. See https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala#L329-L365 Is this intentional? Why? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski The Internals of Spark SQL https://bit.ly/spark-sql-internals The Internals of Spark Structured Streaming https://bit.ly/spark-structured-streaming The Internals of Apache Kafka https://bit.ly/apache-kafka-internals Follow me at https://twitter.com/jaceklaskowski
Re: [SS] How to create a streaming DataFrame (for a custom Source in Spark 2.4.4 / MicroBatch / DSv1)?
Hi, Thanks so much for such a thorough conversation. Enjoyed it very much. > Source/Sink traits are in org.apache.spark.sql.execution and thus they are private. That would explain why I couldn't find scaladocs. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski The Internals of Spark SQL https://bit.ly/spark-sql-internals The Internals of Spark Structured Streaming https://bit.ly/spark-structured-streaming The Internals of Apache Kafka https://bit.ly/apache-kafka-internals Follow me at https://twitter.com/jaceklaskowski On Wed, Oct 9, 2019 at 7:46 AM Wenchen Fan wrote: > > Would you mind if I ask what the conditions are for being a public API? > > The module doesn't matter, but the package matters. We have many public > APIs in the catalyst module as well. (e.g. DataType) > > There are 3 packages in Spark SQL that are meant to be private: > 1. org.apache.spark.sql.catalyst > 2. org.apache.spark.sql.execution > 3. org.apache.spark.sql.internal > > You can check out the full list of private packages of Spark in > project/SparkBuild.scala#Unidoc#ignoreUndocumentedPackages > > Basically, classes/interfaces that don't appear in the official Spark API > doc are private. > > Source/Sink traits are in org.apache.spark.sql.execution and thus they are > private. > > On Tue, Oct 8, 2019 at 6:19 AM Jungtaek Lim > wrote: > >> Would you mind if I ask what the conditions are for being a public API? Source/Sink >> traits are not marked as @DeveloperApi, but they're defined as public and >> located in sql-core, so they're not even semantically private (as with catalyst), which >> easily gives the signal that they're public APIs. >> >> Also, if I'm not missing something here, creating a streaming DataFrame via RDD[Row] >> is not available even as a private API. 
There are some other approaches >> using private APIs: 1) SQLContext.internalCreateDataFrame - as it requires >> RDD[InternalRow], they would also depend on catalyst and have to deal with >> InternalRow, which the Spark community seems to want to change >> eventually; 2) Dataset.ofRows - it requires a LogicalPlan, which is also in >> catalyst. So they not only need to apply the "package hack" but also need to >> depend on catalyst. >> >> >> On Mon, Oct 7, 2019 at 9:45 PM Wenchen Fan wrote: >> >>> AFAIK there is no public streaming data source API before DS v2. The >>> Source and Sink API is private and is only for builtin streaming sources. >>> Advanced users can still implement custom stream sources with private Spark >>> APIs (you can put your classes under the org.apache.spark.sql package to >>> access the private methods). >>> >>> That said, DS v2 is the first public streaming data source API. It's >>> really hard to design a stable, efficient and flexible data source API that >>> is unified between batch and streaming. DS v2 has evolved a lot in the >>> master branch and hopefully there will be no big breaking changes anymore. >>> >>> >>> On Sat, Oct 5, 2019 at 12:24 PM Jungtaek Lim < >>> kabhwan.opensou...@gmail.com> wrote: >>> >>>> I remembered an actual case from a developer who implements a custom data >>>> source. >>>> >>>> >>>> https://lists.apache.org/thread.html/c1a210510b48bb1fea89828c8e2f5db8c27eba635e0079a97b0c7faf@%3Cdev.spark.apache.org%3E >>>> >>>> Quoting here: >>>> We started implementing DSv2 in the 2.4 branch, but quickly discovered >>>> that the DSv2 in 3.0 was a complete breaking change (to the point where it >>>> could have been named DSv3 and it wouldn’t have come as a surprise). Since >>>> the DSv2 in 3.0 has a compatibility layer for DSv1 datasources, we decided >>>> to fall back to DSv1 in order to ease the future transition to Spark 3. 
>>>> >>>> Given that DSv2 for Spark 2.x and 3.x have diverged a lot, a realistic solution >>>> for dealing with the DSv2 breaking change is to keep DSv1 as a temporary solution, >>>> even once DSv2 for 3.x is available. They need some time to make the >>>> transition. >>>> >>>> I'll file an issue to support streaming data sources on DSv1 and >>>> submit a patch unless someone objects. >>>> >>>> >>>> On Wed, Oct 2, 2019 at 4:08 PM Jacek Laskowski wrote: >>>> >>>>> Hi Jungtaek, >>>>> >>>>> Thanks a lot for your very prompt response! >>>>> >>>>> > Looks like it's missing, or intended to force custom streaming >>>>> source implemented as DSv2.
[SS][2.4.4] Confused with "WatermarkTracker: Event time watermark didn't move"?
Hi, I haven't spent much time on it, but the following DEBUG message from WatermarkTracker sparked my interest :) I ran a streaming aggregation in Append mode and got the following messages:

19/10/08 10:48:56 DEBUG WatermarkTracker: Observed event time stats 0: EventTimeStats(15000,1000,8000.0,2)
19/10/08 10:48:56 INFO WatermarkTracker: Updating event-time watermark from 0 to 5000 ms
19/10/08 10:48:56 DEBUG WatermarkTracker: Event time watermark didn't move: 5000 < 5000

I think the DEBUG message "Event time watermark didn't move" is incorrect given that the query has just started and "Observed event time stats". It's true that the event-time watermark didn't move if it was 5000 before, but it was not, as the query has just started from scratch (no checkpointed state). Can anyone shed some light on this? I'll be digging deeper in a bit, but I'm hoping to get some more info beforehand. Thanks! Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski The Internals of Spark SQL https://bit.ly/spark-sql-internals The Internals of Spark Structured Streaming https://bit.ly/spark-structured-streaming The Internals of Apache Kafka https://bit.ly/apache-kafka-internals Follow me at https://twitter.com/jaceklaskowski
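[Editor's illustration] One way to see how such a message can appear right after an update is a minimal Python sketch of watermark bookkeeping. This is an assumption-laden stand-in, not Spark source: with the observed max of 15000 ms and a 10-second delay, the candidate watermark is 5000 ms; if the comparison is re-run in the same batch after the update, the "didn't move" branch fires with equal values, printing the confusing "5000 < 5000".

```python
# Hedged sketch (not Spark source) of watermark-update bookkeeping.
# Class name and structure are illustrative assumptions.

class WatermarkTracker:
    def __init__(self, delay_ms):
        self.delay_ms = delay_ms
        self.watermark_ms = 0  # fresh query, no checkpointed state

    def update(self, observed_max_ms):
        candidate = max(observed_max_ms - self.delay_ms, 0)
        if candidate > self.watermark_ms:
            print(f"Updating event-time watermark from "
                  f"{self.watermark_ms} to {candidate} ms")
            self.watermark_ms = candidate
        else:
            # Fires even when candidate == watermark, yielding the
            # odd-looking "5000 < 5000" wording from the report.
            print(f"Event time watermark didn't move: "
                  f"{candidate} < {self.watermark_ms}")

t = WatermarkTracker(delay_ms=10000)
t.update(15000)  # 0 -> 5000
t.update(15000)  # same stats re-checked: "didn't move: 5000 < 5000"
```

If Spark's real code path is similar, the message text (a strict "<") is misleading rather than the watermark logic itself being wrong.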
Re: [SS] How to create a streaming DataFrame (for a custom Source in Spark 2.4.4 / MicroBatch / DSv1)?
Hi Jungtaek, Thanks a lot for your very prompt response! > Looks like it's missing, or intended to force custom streaming source implemented as DSv2. That's exactly my understanding = no more DSv1 data sources. That, however, is not consistent with the official message, is it? Spark 2.4.4 does not actually say "we're abandoning DSv1", and people may not really want to jump on DSv2 since it's not recommended (unless I missed that). I love surprises (as that's where people pay more for consulting :)), but not necessarily before public talks (with one at SparkAISummit in two weeks!). It's going to be challenging! I hope I won't spread the wrong word. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski The Internals of Spark SQL https://bit.ly/spark-sql-internals The Internals of Spark Structured Streaming https://bit.ly/spark-structured-streaming The Internals of Apache Kafka https://bit.ly/apache-kafka-internals Follow me at https://twitter.com/jaceklaskowski On Wed, Oct 2, 2019 at 6:16 AM Jungtaek Lim wrote: > Looks like it's missing, or intended to force custom streaming source > implemented as DSv2. > > I'm not sure the Spark community wants to expand the DSv1 API: I could propose the > change if we get some support here. > > To the Spark community: given we're bringing major changes to DSv2, someone might > want to rely on DSv1 while the transition from the old DSv2 to the new DSv2 happens and > the new DSv2 gets stabilized. Would we like to provide the necessary changes to > DSv1? > > Thanks, > Jungtaek Lim (HeartSaVioR) > > On Wed, Oct 2, 2019 at 4:27 AM Jacek Laskowski wrote: > >> Hi, >> >> I think I've got stuck and without your help I won't move any further. >> Please help. >> >> I'm with Spark 2.4.4 and am developing a streaming Source (DSv1, >> MicroBatch) and in the getBatch phase, when requested for a DataFrame, there is >> this assert [1] I can't seem to get past with any DataFrame I managed to >> create, as it's not streaming. 
>> >> assert(batch.isStreaming, >> s"DataFrame returned by getBatch from $source did not have >> isStreaming=true\n" + >> s"${batch.queryExecution.logical}") >> >> [1] >> https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L439-L441 >> >> All I could find is private[sql], >> e.g. SQLContext.internalCreateDataFrame(..., isStreaming = true) [2] or [3] >> >> [2] >> https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L422-L428 >> [3] >> https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L62-L81 >> >> Pozdrawiam, >> Jacek Laskowski >> >> https://about.me/JacekLaskowski >> The Internals of Spark SQL https://bit.ly/spark-sql-internals >> The Internals of Spark Structured Streaming >> https://bit.ly/spark-structured-streaming >> The Internals of Apache Kafka https://bit.ly/apache-kafka-internals >> Follow me at https://twitter.com/jaceklaskowski >> >>
[SS] How to create a streaming DataFrame (for a custom Source in Spark 2.4.4 / MicroBatch / DSv1)?
Hi, I think I've got stuck and without your help I won't move any further. Please help. I'm on Spark 2.4.4 and am developing a streaming Source (DSv1, MicroBatch), and in the getBatch phase, when requested for a DataFrame, there is this assert [1] I can't seem to get past with any DataFrame I managed to create, as it's not streaming:

assert(batch.isStreaming,
  s"DataFrame returned by getBatch from $source did not have isStreaming=true\n" +
  s"${batch.queryExecution.logical}")

[1] https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L439-L441

All I could find is private[sql], e.g. SQLContext.internalCreateDataFrame(..., isStreaming = true) [2] or [3]

[2] https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L422-L428
[3] https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L62-L81

Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski The Internals of Spark SQL https://bit.ly/spark-sql-internals The Internals of Spark Structured Streaming https://bit.ly/spark-structured-streaming The Internals of Apache Kafka https://bit.ly/apache-kafka-internals Follow me at https://twitter.com/jaceklaskowski
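[Editor's illustration] A Spark-free Python miniature of the guard being discussed may help: the DataFrame a Source returns must carry a streaming flag on its logical plan, and frames built via the public creation paths don't have it set. The classes here are illustrative stand-ins, not Spark API.

```python
# Hedged sketch of the isStreaming guard in MicroBatchExecution.
# LogicalPlan/DataFrame here are toy stand-ins, not Spark classes.

class LogicalPlan:
    def __init__(self, is_streaming):
        self.is_streaming = is_streaming

class DataFrame:
    def __init__(self, plan):
        self.plan = plan

def run_batch(source_name, batch):
    # Mirrors the assert quoted in the email above.
    assert batch.plan.is_streaming, (
        f"DataFrame returned by getBatch from {source_name} "
        "did not have isStreaming=true"
    )
    return "ok"

# A frame built the ordinary (batch) way is not streaming, so the guard
# trips -- which is why only the private
# SQLContext.internalCreateDataFrame(..., isStreaming = true) path works.
batch = DataFrame(LogicalPlan(is_streaming=False))
try:
    run_batch("MySource", batch)
except AssertionError as e:
    print("guard tripped:", e)
```

The flag lives on the plan, not the data, so no amount of transforming a batch DataFrame flips it — the plan has to be created streaming in the first place.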
Re: Welcoming some new committers and PMC members
Hi, What great news! Congrats to all those awarded, and to the community for voting them in! p.s. I think it should go to the user mailing list too. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski The Internals of Spark SQL https://bit.ly/spark-sql-internals The Internals of Spark Structured Streaming https://bit.ly/spark-structured-streaming The Internals of Apache Kafka https://bit.ly/apache-kafka-internals Follow me at https://twitter.com/jaceklaskowski On Tue, Sep 10, 2019 at 2:32 AM Matei Zaharia wrote: > Hi all, > > The Spark PMC recently voted to add several new committers and one PMC > member. Join me in welcoming them to their new roles! > > New PMC member: Dongjoon Hyun > > New committers: Ryan Blue, Liang-Chi Hsieh, Gengliang Wang, Yuming Wang, > Weichen Xu, Ruifeng Zheng > > The new committers cover lots of important areas including ML, SQL, and > data sources, so it’s great to have them here. All the best, > > Matei and the Spark PMC > > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > >
Re: Why two netty libs?
Hi, Thanks much for the answers. Learning Spark every day! Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski The Internals of Spark SQL https://bit.ly/spark-sql-internals The Internals of Spark Structured Streaming https://bit.ly/spark-structured-streaming The Internals of Apache Kafka https://bit.ly/apache-kafka-internals Follow me at https://twitter.com/jaceklaskowski On Wed, Sep 4, 2019 at 3:15 PM Sean Owen wrote: > Yes that's right. I don't think Spark's usage of ZK needs any ZK > server, so it's safe to exclude in Spark (at least, so far so good!) > > On Wed, Sep 4, 2019 at 8:06 AM Steve Loughran > wrote: > > > > Zookeeper client is/was netty 3, AFAIK, so if you want to use it for > anything, it ends up on the CP > > > > On Tue, Sep 3, 2019 at 5:18 PM Shixiong(Ryan) Zhu < > shixi...@databricks.com> wrote: > >> > >> Yep, historical reasons. And Netty 4 is under another namespace, so we > can use Netty 3 and Netty 4 in the same JVM. > >> > >> On Tue, Sep 3, 2019 at 6:15 AM Sean Owen wrote: > >>> > >>> It was for historical reasons; some other transitive dependencies > needed it. > >>> I actually was just able to exclude Netty 3 last week from master. > >>> Spark uses Netty 4. > >>> > >>> On Tue, Sep 3, 2019 at 6:59 AM Jacek Laskowski > wrote: > >>> > > >>> > Hi, > >>> > > >>> > Just noticed that Spark 2.4.x uses two netty deps of different > versions. Why? > >>> > > >>> > jars/netty-all-4.1.17.Final.jar > >>> > jars/netty-3.9.9.Final.jar > >>> > > >>> > Shouldn't one be excluded or perhaps shaded? 
> >>> > > >>> > Pozdrawiam, > >>> > Jacek Laskowski > >>> > > >>> > https://about.me/JacekLaskowski > >>> > The Internals of Spark SQL https://bit.ly/spark-sql-internals > >>> > The Internals of Spark Structured Streaming > https://bit.ly/spark-structured-streaming > >>> > The Internals of Apache Kafka https://bit.ly/apache-kafka-internals > >>> > Follow me at https://twitter.com/jaceklaskowski > >>> > > >>> > >>> - > >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > >>> > >> -- > >> > >> Best Regards, > >> > >> Ryan > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > >
Why two netty libs?
Hi, Just noticed that Spark 2.4.x uses two netty deps of different versions. Why? jars/netty-all-4.1.17.Final.jar jars/netty-3.9.9.Final.jar Shouldn't one be excluded or perhaps shaded? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski The Internals of Spark SQL https://bit.ly/spark-sql-internals The Internals of Spark Structured Streaming https://bit.ly/spark-structured-streaming The Internals of Apache Kafka https://bit.ly/apache-kafka-internals Follow me at https://twitter.com/jaceklaskowski
Re: [SS] KafkaSource doesn't use KafkaSourceInitialOffsetWriter for initial offsets?
Hi Devs, Thanks all for a very prompt response! That was insanely quick. Merci beaucoup! :) Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski The Internals of Spark SQL https://bit.ly/spark-sql-internals The Internals of Spark Structured Streaming https://bit.ly/spark-structured-streaming The Internals of Apache Kafka https://bit.ly/apache-kafka-internals Follow me at https://twitter.com/jaceklaskowski On Mon, Aug 26, 2019 at 4:05 PM Jungtaek Lim wrote: > Thanks! The patch is here: https://github.com/apache/spark/pull/25583 > > On Mon, Aug 26, 2019 at 11:02 PM Gabor Somogyi > wrote: > >> Just checked this and it's a copy-paste :) It works properly when >> KafkaSourceInitialOffsetWriter is used. Pull me in if a review is needed. >> >> BR, >> G >> >> >> On Mon, Aug 26, 2019 at 3:57 PM Jungtaek Lim wrote: >> >>> Nice finding! I don't see any reason not to use >>> KafkaSourceInitialOffsetWriter from KafkaSource, as they're identical. I >>> guess it was copied and pasted some time ago and not addressed yet. >>> As you haven't submitted a patch, I'll submit one shortly, >>> mentioning you for credit. I'd close mine and wait for your patch if you plan to do >>> it. Please let me know. >>> >>> Thanks, >>> Jungtaek Lim (HeartSaVioR) >>> >>> >>> On Mon, Aug 26, 2019 at 8:03 PM Jacek Laskowski wrote: >>> >>>> Hi, >>>> >>>> Just found out that KafkaSource [1] does not >>>> use KafkaSourceInitialOffsetWriter (of KafkaMicroBatchStream) [2] for >>>> initial offsets. >>>> >>>> Any reason for that? Should I report an issue? Just checking as I'm >>>> on 2.4.3 exclusively and have no idea what's coming in 3.0. 
>>>> >>>> [1] >>>> https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala#L102 >>>> >>>> [2] >>>> https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchStream.scala#L281 >>>> >>>> Pozdrawiam, >>>> Jacek Laskowski >>>> >>>> https://about.me/JacekLaskowski >>>> The Internals of Spark SQL https://bit.ly/spark-sql-internals >>>> The Internals of Spark Structured Streaming >>>> https://bit.ly/spark-structured-streaming >>>> The Internals of Apache Kafka https://bit.ly/apache-kafka-internals >>>> Follow me at https://twitter.com/jaceklaskowski >>>> >>>> >>> >>> -- >>> Name : Jungtaek Lim >>> Blog : http://medium.com/@heartsavior >>> Twitter : http://twitter.com/heartsavior >>> LinkedIn : http://www.linkedin.com/in/heartsavior >>> >> > > -- > Name : Jungtaek Lim > Blog : http://medium.com/@heartsavior > Twitter : http://twitter.com/heartsavior > LinkedIn : http://www.linkedin.com/in/heartsavior >
[SS] KafkaSource doesn't use KafkaSourceInitialOffsetWriter for initial offsets?
Hi, Just found out that KafkaSource [1] does not use KafkaSourceInitialOffsetWriter (of KafkaMicroBatchStream) [2] for initial offsets. Any reason for that? Should I report an issue? Just checking out as I'm with 2.4.3 exclusively and have no idea what's coming for 3.0. [1] https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala#L102 [2] https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchStream.scala#L281 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski The Internals of Spark SQL https://bit.ly/spark-sql-internals The Internals of Spark Structured Streaming https://bit.ly/spark-structured-streaming The Internals of Apache Kafka https://bit.ly/apache-kafka-internals Follow me at https://twitter.com/jaceklaskowski
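[Editor's illustration] For readers unfamiliar with these metadata logs, here is a small, hedged Python sketch of the idea behind a versioned initial-offsets log — the kind of helper the thread says was copy-pasted in KafkaSource instead of reusing KafkaSourceInitialOffsetWriter. The file layout (a "v1" header followed by JSON) and the function names are assumptions for illustration only, not Spark's actual format.

```python
import json
import io

# Hedged sketch of a versioned initial-offsets log: a version header guards
# against reading a log written by an incompatible implementation.
VERSION = "v1"

def write_initial_offsets(f, offsets):
    f.write(VERSION + "\n")
    f.write(json.dumps(offsets))

def read_initial_offsets(f):
    header = f.readline().strip()
    if header != VERSION:
        raise ValueError(f"unsupported log version: {header}")
    return json.loads(f.read())

buf = io.StringIO()
write_initial_offsets(buf, {"topic-0": 42, "topic-1": 7})
buf.seek(0)
print(read_initial_offsets(buf))
```

Having two hand-rolled copies of such read/write logic is exactly the kind of duplication that drifts apart, which is why reusing one writer class is the obvious fix.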
Re: Release Apache Spark 2.4.4 before 3.0.0
Hi, Thanks Dongjoon Hyun for stepping up as a release manager! Much appreciated. If there's a volunteer to cut a release, I'm always ready to support it. In addition, the more frequent the releases, the better for end users, so they have a choice to upgrade and have all the latest fixes, or wait. It's their call, not ours (if we kept them waiting). My big 2 yes'es for the release! Jacek On Tue, 9 Jul 2019, 18:15 Dongjoon Hyun, wrote: > Hi, All. > > Spark 2.4.3 was released two months ago (8th May). > > As of today (9th July), there exist 45 fixes in `branch-2.4` including the > following correctness or blocker issues. > > - SPARK-26038 Decimal toScalaBigInt/toJavaBigInteger not work for > decimals not fitting in long > - SPARK-26045 Error in the spark 2.4 release package with the > spark-avro_2.11 dependency > - SPARK-27798 from_avro can modify variables in other rows in local > mode > - SPARK-27907 HiveUDAF should return NULL in case of 0 rows > - SPARK-28157 Make SHS clear KVStore LogInfo for the blacklist entries > - SPARK-28308 CalendarInterval sub-second part should be padded before > parsing > > It would be great if we can have Spark 2.4.4 before we are going to get > busier with 3.0.0. > If it's okay, I'd like to volunteer as the 2.4.4 release manager and roll > it next Monday (15th July). > What do you think? > > Bests, > Dongjoon. >
Re: [SS] Why EventTimeStatsAccum for event-time watermark not a named accumulator?
Hi, After thinking about it some more, I may have found the reason why EventTimeStatsAccum is not exposed as a named accumulator. The reason is that it's an internal part of how the event-time watermark works and should not be exposed via the web UI as if it were part of a Spark app (which is what the web UI is meant for). With that being said, I'm wondering why EventTimeStatsAccum is not a SQL metric then? With that, it'd be in the web UI, but just in the physical plan of a streaming query. WDYT? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski The Internals of Spark SQL https://bit.ly/spark-sql-internals The Internals of Spark Structured Streaming https://bit.ly/spark-structured-streaming The Internals of Apache Kafka https://bit.ly/apache-kafka-internals Follow me at https://twitter.com/jaceklaskowski On Mon, Jun 10, 2019 at 8:59 PM Jacek Laskowski wrote: > Hi, > > I'm curious why EventTimeStatsAccum is not a named accumulator (not to > mention SQLMetric) so the event-time watermark could be monitored in the web UI? > > I've changed the code of the EventTimeWatermarkExec physical operator to > register EventTimeStatsAccum as a named accumulator, and the values are > properly propagated back to the driver and the web UI. It seems to be > working fine (and it's just one day of coding). > > It was fairly easy to get a very initial prototype, so I'm wondering why > it hasn't been included. Has this been considered? Should I file a > JIRA ticket and send a pull request for review? Please guide me, as I found it > very helpful (and surprisingly easy to implement, so I'm worried I'm missing > something important). Thanks. > > Pozdrawiam, > Jacek Laskowski > > https://about.me/JacekLaskowski > The Internals of Spark SQL https://bit.ly/spark-sql-internals > The Internals of Spark Structured Streaming > https://bit.ly/spark-structured-streaming > The Internals of Apache Kafka https://bit.ly/apache-kafka-internals > Follow me at https://twitter.com/jaceklaskowski > >
[SS] Why EventTimeStatsAccum for event-time watermark not a named accumulator?
Hi, I'm curious why EventTimeStatsAccum is not a named accumulator (not to mention SQLMetric) so the event-time watermark could be monitored in the web UI? I've changed the code of the EventTimeWatermarkExec physical operator to register EventTimeStatsAccum as a named accumulator, and the values are properly propagated back to the driver and the web UI. It seems to be working fine (and it's just one day of coding). It was fairly easy to get a very initial prototype, so I'm wondering why it hasn't been included. Has this been considered? Should I file a JIRA ticket and send a pull request for review? Please guide me, as I found it very helpful (and surprisingly easy to implement, so I'm worried I'm missing something important). Thanks. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski The Internals of Spark SQL https://bit.ly/spark-sql-internals The Internals of Spark Structured Streaming https://bit.ly/spark-structured-streaming The Internals of Apache Kafka https://bit.ly/apache-kafka-internals Follow me at https://twitter.com/jaceklaskowski
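[Editor's illustration] For context, the statistics such an accumulator carries are simple per-batch aggregates. Below is a hedged Python stand-in (not Spark source) for the (max, min, avg, count) bookkeeping; the field order mirrors the EventTimeStats(max, min, avg, count) rendering that shows up in Spark logs, and the values chosen reproduce a sample like EventTimeStats(15000,1000,8000.0,2).

```python
# Hedged sketch of event-time stats bookkeeping; the class is an
# illustrative stand-in for Spark's EventTimeStats/EventTimeStatsAccum.

class EventTimeStats:
    def __init__(self):
        self.max = float("-inf")
        self.min = float("inf")
        self.avg = 0.0
        self.count = 0

    def add(self, event_time_ms):
        self.max = max(self.max, event_time_ms)
        self.min = min(self.min, event_time_ms)
        self.count += 1
        # Running mean, updated incrementally per observed event time.
        self.avg += (event_time_ms - self.avg) / self.count

s = EventTimeStats()
for t in (15000, 1000):
    s.add(t)
print(s.max, s.min, s.avg, s.count)  # 15000 1000 8000.0 2
```

Since these four numbers are all that is needed, surfacing them either as a named accumulator or as a SQL metric seems mechanically straightforward, which matches the observation that the prototype took only a day.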
[SS] ContinuousExecution.commit and excessive JSON serialization?
Hi, Why does ContinuousExecution.commit serialize an offset to JSON format just before deserializing it back (from JSON to an offset)? [1] val offset = sources(0).deserializeOffset(offsetLog.get(epoch).get.offsets(0).get.json) [1] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala#L341 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski The Internals of Spark SQL https://bit.ly/spark-sql-internals The Internals of Spark Structured Streaming https://bit.ly/spark-structured-streaming The Internals of Apache Kafka https://bit.ly/apache-kafka-internals Follow me at https://twitter.com/jaceklaskowski
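[Editor's illustration] The pattern being questioned, in miniature and hedged Python (not Spark source): an offset object already in memory is rendered to JSON only to be parsed right back. One reason the round trip can still make sense is that a metadata log may hold a generic serialized wrapper, so going through the source's own deserializeOffset guarantees the source-specific type; the classes below are assumptions for illustration.

```python
import json

# Hedged sketch of the serialize-then-immediately-deserialize pattern.
class KafkaOffset:
    def __init__(self, partitions):
        self.partitions = partitions

    @property
    def json(self):
        # Render the offset to its JSON form, as Spark offsets do.
        return json.dumps(self.partitions, sort_keys=True)

def deserialize_offset(s):
    # Stand-in for a source's deserializeOffset.
    return KafkaOffset(json.loads(s))

stored = KafkaOffset({"t-0": 10})
# The round trip in ContinuousExecution.commit, in miniature:
offset = deserialize_offset(stored.json)
print(offset.partitions)
```

If the stored object were always already the source-specific type, the round trip would indeed be redundant — which appears to be the crux of the question above.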
FileSourceScanExec.doExecute - when is this executed if ever?
Hi, I may have asked this question before, but it seems I forgot or can't find the answer. When is FileSourceScanExec.doExecute executed, if ever? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Kafka Streams https://bit.ly/mastering-kafka-streams Follow me at https://twitter.com/jaceklaskowski
Re: Is there a way to read a Parquet File as ColumnarBatch?
Hi Priyanka, I've been exploring this part of Spark SQL and could help a little bit. > but for some reason it never hit the breakpoints I placed in these classes. Was this for local[*]? I ran "SPARK_SUBMIT_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005" ./bin/spark-shell" and attached IDEA to debug the code. I used Spark 2.4.1 (with Scala 2.12) and it worked fine for the following queries: spark.range(5).write.save("hello") spark.read.parquet("hello").show > has anyone tried returning all the data as ColumnarBatch? Is there any reading material you can point me to? You may find some information in the internals book at https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-ColumnarBatch.html It's WIP. Let me know what part to explore in more detail. I'd do this momentarily (as I'm exploring parquet data source in more detail as we speak). Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Kafka Streams https://bit.ly/mastering-kafka-streams Follow me at https://twitter.com/jaceklaskowski On Mon, Apr 22, 2019 at 10:29 AM Priyanka Gomatam wrote: > Hi, > > I am new to Spark and have been playing around with the Parquet reader > code. I have two questions: > >1. I saw the code that starts at DataSourceScanExec class, and moves >on to the ParquetFileFormat class and does a VectorizedParquetRecordReader. >I tried doing a spark.read.parquet(…) and debugged through the code, but >for some reason it never hit the breakpoints I placed in these classes. >Perhaps I am doing something wrong, but is there a certain versioning for >parquet readers that I am missing out on? How do I make the code take the >DataSourceScanExec -> … -> ParquetReader … -> VectorizedParqeutRecordRead … >route? >2. 
If I do manage to make it take the above path, I see there is a >point at which the data is filled into ColumnarBatch objects, has anyone >tried returning all the data as ColumnarBatch? Is there any reading >material you can point me to? > > Thanks in advance, this will be super helpful for me! >
Re: Sort order in bucketing in a custom datasource
Hi, I don't think so. I can't think of an interface (trait) that would give that information to the Catalyst optimizer. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Kafka Streams https://bit.ly/mastering-kafka-streams Follow me at https://twitter.com/jaceklaskowski On Tue, Apr 16, 2019 at 6:31 PM Long, Andrew wrote: > Hey Friends, > > > > Is it possible to specify the sort order or bucketing in a way that can be > used by the optimizer in spark? > > > > Cheers Andrew >
Re: Static functions
Hi Jean, I thought the functions have already been tagged? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Kafka Streams https://bit.ly/mastering-kafka-streams Follow me at https://twitter.com/jaceklaskowski On Sun, Feb 10, 2019 at 11:48 PM Jean Georges Perrin wrote: > Hey guys, > > We have 381 static functions now (including the deprecated). I am trying > to sort them out by group/tag them. > > So far, I have: > >- Array >- Conversion >- Date >- Math > - Trigo (sub group of maths) >- Security >- Streaming >- String >- Technical > > Do you see more categories? Tags? > > Thanks! > > jg > > — > Jean Georges Perrin / @jgperrin > >
Re: [SS] FlatMapGroupsWithStateExec with no commitTimeMs metric?
Thanks Jungtaek Lim! Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Kafka Streams https://bit.ly/mastering-kafka-streams Follow me at https://twitter.com/jaceklaskowski On Mon, Nov 26, 2018 at 7:26 AM Jungtaek Lim wrote: > Looks like just a kind of missing spot. Just crafted the patch and now in > progress of testing. > > Once done with testing I'll file a new issue and submit a patch. > > Thanks, > Jungtaek Lim (HeartSaVioR) > > 2018년 11월 26일 (월) 오전 4:49, Burak Yavuz 님이 작성: > >> Probably just oversight. Anyone is welcome to add it :) >> >> On Sun, Nov 25, 2018 at 8:55 AM Jacek Laskowski wrote: >> >>> Hi, >>> >>> Why is FlatMapGroupsWithStateExec not measuring the time taken on state >>> commit [1](like StreamingDeduplicateExec [2] and StreamingGlobalLimitExec >>> [3])? Is this on purpose? Why? Thanks. >>> >>> [1] >>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FlatMapGroupsWithStateExec.scala#L135 >>> >>> [2] >>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala#L482 >>> >>> [3] >>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingGlobalLimitExec.scala#L87 >>> >>> Pozdrawiam, >>> Jacek Laskowski >>> >>> https://about.me/JacekLaskowski >>> Mastering Spark SQL https://bit.ly/mastering-spark-sql >>> Spark Structured Streaming https://bit.ly/spark-structured-streaming >>> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams >>> Follow me at https://twitter.com/jaceklaskowski >>> >>
[SS] FlatMapGroupsWithStateExec with no commitTimeMs metric?
Hi, Why is FlatMapGroupsWithStateExec not measuring the time taken on state commit [1](like StreamingDeduplicateExec [2] and StreamingGlobalLimitExec [3])? Is this on purpose? Why? Thanks. [1] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FlatMapGroupsWithStateExec.scala#L135 [2] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala#L482 [3] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingGlobalLimitExec.scala#L87 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Kafka Streams https://bit.ly/mastering-kafka-streams Follow me at https://twitter.com/jaceklaskowski
Re: Is spark.sql.codegen.factoryMode property really for tests only?
Hi Marco, Many thanks for such a quick response. With that, I'll direct my curiosity into a different direction. Thanks! Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Kafka Streams https://bit.ly/mastering-kafka-streams Follow me at https://twitter.com/jaceklaskowski On Fri, Nov 16, 2018 at 1:44 PM Marco Gaido wrote: > Hi Jacek, > > I do believe it is correct. Please check the method you mentioned > (CodeGeneratorWithInterpretedFallback.createObject): the value is relevant > only if Utils.isTesting. > > Thanks, > Marco > > Il giorno ven 16 nov 2018 alle ore 13:28 Jacek Laskowski > ha scritto: > >> Hi, >> >> While reviewing the changes in 2.4 I stumbled >> upon spark.sql.codegen.factoryMode internal configuration property [1]. The >> doc says: >> >> > Note that this config works only for tests. >> >> Is that correct? I've got some doubts. >> >> I found that it's used in UnsafeProjection.create [2] (through >> CodeGeneratorWithInterpretedFallback.createObject) which is used outside >> the tests and so made me think if "this config works only for tests" part >> is correct. >> >> Are my doubts correct? If not, what am I missing? Thanks. >> >> [1] >> https://github.com/apache/spark/blob/v2.4.0/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L758-L767 >> >> [2] >> https://github.com/apache/spark/blob/v2.4.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Projection.scala#L159 >> >> Pozdrawiam, >> Jacek Laskowski >> >> https://about.me/JacekLaskowski >> Mastering Spark SQL https://bit.ly/mastering-spark-sql >> Spark Structured Streaming https://bit.ly/spark-structured-streaming >> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams >> Follow me at https://twitter.com/jaceklaskowski >> >
Is spark.sql.codegen.factoryMode property really for tests only?
Hi,

While reviewing the changes in 2.4 I stumbled upon the spark.sql.codegen.factoryMode internal configuration property [1]. The doc says:

> Note that this config works only for tests.

Is that correct? I've got some doubts.

I found that it's used in UnsafeProjection.create [2] (through CodeGeneratorWithInterpretedFallback.createObject), which is used outside of tests, and so made me wonder whether the "this config works only for tests" part is correct.

Are my doubts correct? If not, what am I missing? Thanks.

[1] https://github.com/apache/spark/blob/v2.4.0/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L758-L767

[2] https://github.com/apache/spark/blob/v2.4.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Projection.scala#L159

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski
How to know all the issues resolved for 2.4.0?
Hi,

I've been trying to find out the issues that are part of 2.4.0 and used the following query:

project = SPARK AND resolution in (Resolved, Done, Fixed) and "Target Version/s" = "2.4.0"

I got 202 issues. Is that correct?

What's the difference between the Resolution statuses Resolved, Done, and Fixed? When is an issue marked as each of them?

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski
Why does spark.range(1).write.mode("overwrite").saveAsTable("t1") throw an Exception?
Hi,

Just ran into it today and wonder whether it's a bug or something I may have missed before.

scala> spark.version
res21: String = 2.3.2

// that's OK
scala> spark.range(1).write.saveAsTable("t1")
org.apache.spark.sql.AnalysisException: Table `t1` already exists.;
  at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:408)
  at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:393)
  ... 51 elided

// Let's overwrite it then
// An exception?! Why?!
scala> spark.range(1).write.mode("overwrite").saveAsTable("t1")
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:208)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:208)
  at scala.Option.getOrElse(Option.scala:121)
  ...

// If the above works properly, why does the following work fine (and not throw an exception)?
scala> spark.range(1).write.saveAsTable("t10")

p.s. I was not sure whether I should be sending the question to dev or users, so please accept my apologies if it went to the wrong mailing list.

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski
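[Editor's note] For what it's worth, a hedged workaround sketch (untested against 2.3.2): recreating the table instead of overwriting it sidesteps the overwrite path that apparently tries to infer the Parquet schema from the existing table location:

```scala
// Untested workaround sketch: drop and recreate the table so saveAsTable
// takes the same code path as the initial write, avoiding schema inference
// against the existing location.
spark.sql("DROP TABLE IF EXISTS t1")
spark.range(1).write.saveAsTable("t1")
```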
Re: welcome a new batch of committers
Wow! That's a nice bunch of contributors. Congrats to all new committers. I've had a tough time following all the contributions already, and with this crew it's gonna be nearly impossible.

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Wed, Oct 3, 2018 at 10:59 AM Reynold Xin wrote:
> Hi all,
>
> The Apache Spark PMC has recently voted to add several new committers to
> the project, for their contributions:
>
> - Shane Knapp (contributor to infra)
> - Dongjoon Hyun (contributor to ORC support and other parts of Spark)
> - Kazuaki Ishizaki (contributor to Spark SQL)
> - Xingbo Jiang (contributor to Spark Core and SQL)
> - Yinan Li (contributor to Spark on Kubernetes)
> - Takeshi Yamamuro (contributor to Spark SQL)
>
> Please join me in welcoming them!
Re: saveAsTable in 2.3.2 throws IOException while 2.3.1 works fine?
Hi,

OK. Sorry for the noise. I don't know why it started working, but I cannot reproduce it anymore. Sorry for a false alarm (but I could promise it didn't work and I changed nothing). Back to work...

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Mon, Oct 1, 2018 at 8:45 AM Jyoti Ranjan Mahapatra wrote:
> Hi Jacek,
>
> The issue might not be very widespread. I couldn’t reproduce it. Can you
> see if I am doing anything incorrect in the below queries?
>
> scala> spark.range(10).write.saveAsTable("t1")
>
> scala> spark.sql("describe formatted t1").show(100, false)
> +----------------------------+-----------------------------------+-------+
> |col_name                    |data_type                          |comment|
> +----------------------------+-----------------------------------+-------+
> |id                          |bigint                             |null   |
> |                            |                                   |       |
> |# Detailed Table Information|                                   |       |
> |Database                    |default                            |       |
> |Table                       |t1                                 |       |
> |Owner                       |jyotima                            |       |
> |Created Time                |Sun Sep 30 23:40:46 PDT 2018       |       |
> |Last Access                 |Wed Dec 31 16:00:00 PST 1969       |       |
> |Created By                  |Spark 2.3.2                        |       |
> |Type                        |MANAGED                            |       |
> |Provider                    |parquet                            |       |
> |Table Properties            |[transient_lastDdlTime=1538376046] |       |
> |Statistics                  |3008 bytes                         |       |
> |Location                    |file:/home/jyotima/repo/tmp/spark2.3.2/spark-2.3.2-bin-hadoop2.7/spark-warehouse/t1| |
> |Serde Library               |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe| |
> |InputFormat                 |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat| |
> |OutputFormat                |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat| |
> |Storage Properties          |[serialization.format=1]           |       |
> +----------------------------+-----------------------------------+-------+
>
> scala> spark.version
> res4: String = 2.3.2
>
> Thanks,
> Jyoti
>
> *From:* Jacek Laskowski
> *Sent:* Sunday, September 30, 2018 11:28 PM
> *To:* Sean Owen
> *Cc:*
dev
> *Subject:* Re: saveAsTable in 2.3.2 throws IOException while 2.3.1 works fine?
>
> Hi Sean,
>
> Thanks again for helping me to remain sane and that the issue is not
> imaginary :)
>
> I'd expect to be spark-warehouse in the directory where spark-shell is
> executed (which is what has always been used for the metastore).
>
> I'm reviewing all the changes between 2.3.1..2.3.2 to find anything
> relevant. I'm surprised nobody's reported it before. That worries me (or
> simply says that all the enterprise deployments simply use YARN with Hive?)
>
> Pozdrawiam,
> Jacek Laskowski
>
> https://about.me/JacekLaskowski
> Mastering Spark SQL https://bit.ly/mastering-spark-sql
> Spark Structured Streaming https://bit.ly/spark-structured-streaming
Re: saveAsTable in 2.3.2 throws IOException while 2.3.1 works fine?
Hi Sean,

Thanks again for helping me remain sane and confirming that the issue is not imaginary :)

I'd expect spark-warehouse to be in the directory where spark-shell is executed (which is what has always been used for the metastore).

I'm reviewing all the changes between 2.3.1 and 2.3.2 to find anything relevant. I'm surprised nobody's reported it before. That worries me (or simply says that all the enterprise deployments use YARN with Hive?)

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Sun, Sep 30, 2018 at 10:25 PM Sean Owen wrote:
> Hm, changes in the behavior of the default warehouse dir sound
> familiar, but anything I could find was resolved well before 2.3.1
> even. I don't know of a change here. What location are you expecting?
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315420&version=12343289
>
> On Sun, Sep 30, 2018 at 1:38 PM Jacek Laskowski wrote:
> >
> > Hi Sean,
> >
> > I thought so too, but the path "file:/user/hive/warehouse/" should not
> > have been used in the first place, should it? I'm running it in spark-shell
> > 2.3.2. Why would there be any changes between 2.3.1 and 2.3.2 that I just
> > downloaded and one worked fine while the other did not? I had to downgrade
> > to 2.3.1 because of this (and do want to figure out why 2.3.2 behaves in a
> > different way).
> >
> > The part of the stack trace is below.
> >
> > ➜  spark-2.3.2-bin-hadoop2.7 ./bin/spark-shell
> > 2018-09-30 17:43:49 WARN NativeCodeLoader:62 - Unable to load
> > native-hadoop library for your platform... using builtin-java classes
> > where applicable
> > Setting default log level to "WARN".
> > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
> > setLogLevel(newLevel).
> > Spark context Web UI available at http://192.168.0.186:4040
> > Spark context available as 'sc' (master = local[*], app id = local-1538322235135).
> > Spark session available as 'spark'.
> > Welcome to
> >       ____              __
> >      / __/__  ___ _____/ /__
> >     _\ \/ _ \/ _ `/ __/ '_/
> >    /___/ .__/\_,_/_/ /_/\_\   version 2.3.2
> >       /_/
> >
> > Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_171)
> > Type in expressions to have them evaluated.
> > Type :help for more information.
> >
> > scala> spark.version
> > res0: String = 2.3.2
> >
> > scala> spark.range(1).write.saveAsTable("demo")
> > 2018-09-30 17:44:27 WARN ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException
> > 2018-09-30 17:44:28 ERROR FileOutputCommitter:314 - Mkdirs failed to create file:/user/hive/warehouse/demo/_temporary/0
> > 2018-09-30 17:44:28 ERROR Utils:91 - Aborting task
> > java.io.IOException: Mkdirs failed to create file:/user/hive/warehouse/demo/_temporary/0/_temporary/attempt_20180930174428__m_07_0 (exists=false, cwd=file:/Users/jacek/dev/apps/spark-2.3.2-bin-hadoop2.7)
> >   at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:455)
> >   at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:440)
> >   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)
> >   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:892)
> >   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:789)
> >   at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:241)
> >   at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:342)
> >   at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:302)
> >   at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
> >   at
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:151)
> >   at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:367)
> >   at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:378)
> >   at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:26
Re: saveAsTable in 2.3.2 throws IOException while 2.3.1 works fine?
Hi Sean,

I thought so too, but the path "file:/user/hive/warehouse/" should not have been used in the first place, should it? I'm running it in spark-shell 2.3.2. Why would there be any change between 2.3.1 and 2.3.2, which I just downloaded, such that one worked fine while the other did not? I had to downgrade to 2.3.1 because of this (and do want to figure out why 2.3.2 behaves differently).

The relevant part of the stack trace is below.

➜  spark-2.3.2-bin-hadoop2.7 ./bin/spark-shell
2018-09-30 17:43:49 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://192.168.0.186:4040
Spark context available as 'sc' (master = local[*], app id = local-1538322235135).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.2
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_171)
Type in expressions to have them evaluated.
Type :help for more information.
scala> spark.version
res0: String = 2.3.2

scala> spark.range(1).write.saveAsTable("demo")
2018-09-30 17:44:27 WARN ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException
2018-09-30 17:44:28 ERROR FileOutputCommitter:314 - Mkdirs failed to create file:/user/hive/warehouse/demo/_temporary/0
2018-09-30 17:44:28 ERROR Utils:91 - Aborting task
java.io.IOException: Mkdirs failed to create file:/user/hive/warehouse/demo/_temporary/0/_temporary/attempt_20180930174428__m_07_0 (exists=false, cwd=file:/Users/jacek/dev/apps/spark-2.3.2-bin-hadoop2.7)
  at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:455)
  at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:440)
  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)
  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:892)
  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:789)
  at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:241)
  at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:342)
  at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:302)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:151)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:367)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:378)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
  at
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
  at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1415)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:109)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Sat, Sep 29, 2018 at 9:50 PM Sean Owen wrote:
> Looks like a permission issue? Are you sure that isn't the difference,
> first?
>
> On Sat, Sep 29, 2018, 1:54 PM Jacek Laskowski wrote:
>
>> Hi,
>>
>> The following query fails in 2.3.2:
>>
>> scala> spark.range(10).write.saveAsTable("t1")
>> ...
>> 2018-09-29 20:48:06 ERROR FileOutputCommitter:314 - Mkdirs fai
saveAsTable in 2.3.2 throws IOException while 2.3.1 works fine?
Hi,

The following query fails in 2.3.2:

scala> spark.range(10).write.saveAsTable("t1")
...
2018-09-29 20:48:06 ERROR FileOutputCommitter:314 - Mkdirs failed to create file:/user/hive/warehouse/bucketed/_temporary/0
2018-09-29 20:48:07 ERROR Utils:91 - Aborting task
java.io.IOException: Mkdirs failed to create file:/user/hive/warehouse/bucketed/_temporary/0/_temporary/attempt_20180929204807__m_03_0 (exists=false, cwd=file:/Users/jacek/dev/apps/spark-2.3.2-bin-hadoop2.7)
  at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:455)
  at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:440)
  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)

While it works fine in 2.3.1. Could anybody explain the change in behaviour in 2.3.2? The commit / the JIRA issue would be even nicer. Thanks.

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski
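[Editor's note] One hedged way to narrow this down (a diagnostic sketch, untested, assuming the default warehouse location is what changed between the two releases): pin spark.sql.warehouse.dir explicitly and check whether both versions then behave the same:

```scala
import org.apache.spark.sql.SparkSession

// Diagnostic sketch (untested): pin the warehouse directory so the write
// no longer depends on whatever default 2.3.1 vs 2.3.2 resolves.
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "spark-warehouse") // relative to the cwd
  .getOrCreate()

spark.range(10).write.saveAsTable("t1")
```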
Re: Need help with HashAggregateExec, TungstenAggregationIterator and UnsafeFixedWidthAggregationMap
Hi Herman,

Right. No @deprecated, but something that would tell people who review the code "be extra careful since you're reading code that is no longer in use" for SparkPlans that do support WSCG. That would help a lot, as I've been tricked a few times already while trying to understand something I should not have been bothered with. Thanks Russ and Herman for helping me get my thinking right. That will also help my Spark clients, esp. during Spark SQL workshops!

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Sat, Sep 8, 2018 at 3:53 PM Herman van Hovell wrote:
> ...pressed send too early...
>
> Moreover, we can't always use whole-stage code generation. In that case
> we fall back to Volcano-style execution, and chain together doExecute()
> calls.
>
> On Sat, Sep 8, 2018 at 3:51 PM Herman van Hovell wrote:
>
>> SparkPlan.doExecute() is the only way you can execute a physical SQL
>> plan, so it should *not* be marked as deprecated. Whole-stage code
>> generation collapses a subtree of SparkPlans (that support whole-stage
>> code generation) into a single WholeStageCodegenExec physical plan.
>> During execution we call doExecute() on the WholeStageCodegenExec node.
>>
>> On Sat, Sep 8, 2018 at 11:55 AM Jacek Laskowski wrote:
>>
>>> Thanks Russ! That helps a lot.
>>>
>>> On the other hand makes reviewing the codebase of Spark SQL slightly
>>> harder since Java code generation is so much about string concatenation :(
>>>
>>> p.s. Should all the code in doExecute be considered and marked
>>> @deprecated?
>>> >>> Pozdrawiam, >>> Jacek Laskowski >>> >>> https://about.me/JacekLaskowski >>> Mastering Spark SQL https://bit.ly/mastering-spark-sql >>> Spark Structured Streaming https://bit.ly/spark-structured-streaming >>> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams >>> Follow me at https://twitter.com/jaceklaskowski >>> >>> >>> On Fri, Sep 7, 2018 at 10:05 PM Russell Spitzer < >>> russell.spit...@gmail.com> wrote: >>> >>>> That's my understanding :) doExecute is for non-codegen while doProduce >>>> and Consume are for generating code >>>> >>>> On Fri, Sep 7, 2018 at 2:59 PM Jacek Laskowski wrote: >>>> >>>>> Hi Devs, >>>>> >>>>> Sorry for bothering you with my questions (and concerns), but I really >>>>> need to understand this piece of code (= my personal challenge :)) >>>>> >>>>> Is this true that SparkPlan.doExecute (to "execute" a physical >>>>> operator) is only used when whole-stage code gen is disabled (which is not >>>>> by default)? May I call this execution path traditional (even >>>>> "old-fashioned")? >>>>> >>>>> Is this true that these days SparkPlan.doProduce and >>>>> SparkPlan.doConsume (and others) are used for "executing" a physical >>>>> operator (i.e. to generate the Java source code) since whole-stage code >>>>> generation is enabled and is currently the proper execution path? >>>>> >>>>> p.s. This SparkPlan.doExecute is used to trigger whole-stage code gen >>>>> by WholeStageCodegenExec (and InputAdapter), but that's all the code that >>>>> is to be executed by doExecute, isn't it? 
>>>>> >>>>> Pozdrawiam, >>>>> Jacek Laskowski >>>>> >>>>> https://about.me/JacekLaskowski >>>>> Mastering Spark SQL https://bit.ly/mastering-spark-sql >>>>> Spark Structured Streaming https://bit.ly/spark-structured-streaming >>>>> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams >>>>> Follow me at https://twitter.com/jaceklaskowski >>>>> >>>>> >>>>> On Fri, Sep 7, 2018 at 7:24 PM Jacek Laskowski >>>>> wrote: >>>>> >>>>>> Hi Spark Devs, >>>>>> >>>>>> I really need your help understanding the relationship >>>>>> between HashAggregateExec, TungstenAggregationIterator and >>>>>> UnsafeFixedWidthAggregationMap. >>>>>> >>>>>> W
Re: Need help with HashAggregateExec, TungstenAggregationIterator and UnsafeFixedWidthAggregationMap
Thanks Russ! That helps a lot.

On the other hand, it makes reviewing the Spark SQL codebase slightly harder, since Java code generation is so much about string concatenation :(

p.s. Should all the code in doExecute be considered and marked @deprecated?

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Fri, Sep 7, 2018 at 10:05 PM Russell Spitzer wrote:
> That's my understanding :) doExecute is for non-codegen while doProduce
> and doConsume are for generating code
>
> On Fri, Sep 7, 2018 at 2:59 PM Jacek Laskowski wrote:
>
>> Hi Devs,
>>
>> Sorry for bothering you with my questions (and concerns), but I really
>> need to understand this piece of code (= my personal challenge :))
>>
>> Is this true that SparkPlan.doExecute (to "execute" a physical operator)
>> is only used when whole-stage code gen is disabled (which is not by
>> default)? May I call this execution path traditional (even "old-fashioned")?
>>
>> Is this true that these days SparkPlan.doProduce and SparkPlan.doConsume
>> (and others) are used for "executing" a physical operator (i.e. to generate
>> the Java source code) since whole-stage code generation is enabled and is
>> currently the proper execution path?
>>
>> p.s. This SparkPlan.doExecute is used to trigger whole-stage code gen
>> by WholeStageCodegenExec (and InputAdapter), but that's all the code that
>> is to be executed by doExecute, isn't it?
>> >> Pozdrawiam, >> Jacek Laskowski >> >> https://about.me/JacekLaskowski >> Mastering Spark SQL https://bit.ly/mastering-spark-sql >> Spark Structured Streaming https://bit.ly/spark-structured-streaming >> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams >> Follow me at https://twitter.com/jaceklaskowski >> >> >> On Fri, Sep 7, 2018 at 7:24 PM Jacek Laskowski wrote: >> >>> Hi Spark Devs, >>> >>> I really need your help understanding the relationship >>> between HashAggregateExec, TungstenAggregationIterator and >>> UnsafeFixedWidthAggregationMap. >>> >>> While exploring UnsafeFixedWidthAggregationMap and how it's used I've >>> noticed that it's for HashAggregateExec and TungstenAggregationIterator >>> exclusively. And given that TungstenAggregationIterator is used exclusively >>> in HashAggregateExec and the use of UnsafeFixedWidthAggregationMap in both >>> seems to be almost the same (if not the same), I've got a question I cannot >>> seem to answer myself. >>> >>> Since HashAggregateExec supports Whole-Stage Codegen >>> HashAggregateExec.doExecute won't be used at all, but doConsume and >>> doProduce (unless codegen is disabled). Is that correct? >>> >>> If so, TungstenAggregationIterator is not used at all, but >>> UnsafeFixedWidthAggregationMap is used directly instead (in the Java code >>> that uses createHashMap or finishAggregate). Is that correct? >>> >>> Pozdrawiam, >>> Jacek Laskowski >>> >>> https://about.me/JacekLaskowski >>> Mastering Spark SQL https://bit.ly/mastering-spark-sql >>> Spark Structured Streaming https://bit.ly/spark-structured-streaming >>> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams >>> Follow me at https://twitter.com/jaceklaskowski >>> >>
Re: Need help with HashAggregateExec, TungstenAggregationIterator and UnsafeFixedWidthAggregationMap
Hi Devs,

Sorry for bothering you with my questions (and concerns), but I really need to understand this piece of code (= my personal challenge :))

Is it true that SparkPlan.doExecute (to "execute" a physical operator) is only used when whole-stage code gen is disabled (which it is not by default)? May I call this execution path traditional (even "old-fashioned")?

Is it true that these days SparkPlan.doProduce and SparkPlan.doConsume (and others) are used for "executing" a physical operator (i.e. to generate the Java source code), since whole-stage code generation is enabled and is currently the proper execution path?

p.s. This SparkPlan.doExecute is used to trigger whole-stage code gen by WholeStageCodegenExec (and InputAdapter), but that's all the code that is to be executed by doExecute, isn't it?

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Fri, Sep 7, 2018 at 7:24 PM Jacek Laskowski wrote:
> Hi Spark Devs,
>
> I really need your help understanding the relationship
> between HashAggregateExec, TungstenAggregationIterator and
> UnsafeFixedWidthAggregationMap.
>
> While exploring UnsafeFixedWidthAggregationMap and how it's used I've
> noticed that it's for HashAggregateExec and TungstenAggregationIterator
> exclusively. And given that TungstenAggregationIterator is used exclusively
> in HashAggregateExec and the use of UnsafeFixedWidthAggregationMap in both
> seems to be almost the same (if not the same), I've got a question I cannot
> seem to answer myself.
>
> Since HashAggregateExec supports Whole-Stage Codegen,
> HashAggregateExec.doExecute won't be used at all, but doConsume and
> doProduce (unless codegen is disabled). Is that correct?
> > If so, TungstenAggregationIterator is not used at all, but > UnsafeFixedWidthAggregationMap is used directly instead (in the Java code > that uses createHashMap or finishAggregate). Is that correct? > > Pozdrawiam, > Jacek Laskowski > > https://about.me/JacekLaskowski > Mastering Spark SQL https://bit.ly/mastering-spark-sql > Spark Structured Streaming https://bit.ly/spark-structured-streaming > Mastering Kafka Streams https://bit.ly/mastering-kafka-streams > Follow me at https://twitter.com/jaceklaskowski >
Need help with HashAggregateExec, TungstenAggregationIterator and UnsafeFixedWidthAggregationMap
Hi Spark Devs,

I really need your help understanding the relationship between HashAggregateExec, TungstenAggregationIterator and UnsafeFixedWidthAggregationMap.

While exploring UnsafeFixedWidthAggregationMap and how it's used, I've noticed that it's used by HashAggregateExec and TungstenAggregationIterator exclusively. And given that TungstenAggregationIterator is used exclusively in HashAggregateExec, and the use of UnsafeFixedWidthAggregationMap in both seems to be almost the same (if not the same), I've got a question I cannot seem to answer myself.

Since HashAggregateExec supports Whole-Stage Codegen, HashAggregateExec.doExecute won't be used at all, but doConsume and doProduce will (unless codegen is disabled). Is that correct?

If so, TungstenAggregationIterator is not used at all, but UnsafeFixedWidthAggregationMap is used directly instead (in the Java code that uses createHashMap or finishAggregate). Is that correct?

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski
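[Editor's note] A quick way to observe both execution paths discussed in this thread is to flip spark.sql.codegen.wholeStage and compare the plans (a sketch; in explain output, operators collapsed into WholeStageCodegenExec are marked with a '*' prefix):

```scala
// With whole-stage codegen enabled (the default), HashAggregateExec is
// collapsed into WholeStageCodegenExec, so doProduce/doConsume generate
// the Java code that drives the aggregation.
spark.conf.set("spark.sql.codegen.wholeStage", true)
spark.range(10).groupBy().count().explain()

// With it disabled, the doExecute() path runs instead (per the thread,
// backed by TungstenAggregationIterator for HashAggregateExec).
spark.conf.set("spark.sql.codegen.wholeStage", false)
spark.range(10).groupBy().count().explain()
```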
Why is View logical operator not a UnaryNode explicitly?
Hi,

I've just come across the View logical operator, which does not explicitly extend UnaryNode (i.e. there is no "extends UnaryNode"). Why?

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala?utf8=%E2%9C%93#L460-L463

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski
Same code in DataFrameWriter.runCommand and Dataset.withAction?
Hi, I'm curious why Spark SQL uses two different methods for the seemingly very same code? * DataFrameWriter.runCommand --> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L663 * Dataset.withAction --> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L3317 It looks like the relationship is as follows: DataFrameWriter.runCommand == Dataset.withAction(_.execute) Should one be removed for the other? I'd first change runCommand to use withAction(_.execute) or even remove runCommand altogether. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Kafka Streams https://bit.ly/mastering-kafka-streams Follow me at https://twitter.com/jaceklaskowski
Re: Why is SQLImplicits an abstract class rather than a trait?
Hi Assaf,

No idea (and I don't remember ever wondering about it before), but why not do this (untested):

trait MySparkTestTrait {
  lazy val spark: SparkSession = SparkSession.builder().getOrCreate() // <-- you sure you don't need master?
  import spark.implicits._
}

Wouldn't that import work?

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Sun, Aug 5, 2018 at 5:34 PM, assaf.mendelson wrote:
> Hi all,
>
> I have been playing a bit with SQLImplicits and noticed that it is an
> abstract class. I was wondering why is that? It has no constructor.
>
> Because it is an abstract class, a test trait cannot extend it and still
> be a trait.
>
> Consider the following:
>
> trait MySparkTestTrait extends SQLImplicits {
>   lazy val spark: SparkSession = SparkSession.builder().getOrCreate()
>   protected override def _sqlContext: SQLContext = spark.sqlContext
> }
>
> This would mean that I could do something like this:
>
> class MyTestClass extends FunSuite with MySparkTestTrait {
>   test("SomeTest") {
>     // use spark implicits without needing to do import spark.implicits._
>   }
> }
>
> Is there a reason for this being an abstract class?
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
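[Editor's note] One pattern that works around the abstract-class limitation (an untested sketch, modeled on how Spark's own test utilities expose implicits) is to keep the trait but let an inner object extend SQLImplicits, since an object can extend an abstract class even though a trait cannot:

```scala
import org.apache.spark.sql.{SQLContext, SQLImplicits, SparkSession}

trait MySparkTestTrait {
  lazy val spark: SparkSession =
    SparkSession.builder().master("local[*]").getOrCreate()

  // An inner object can extend the abstract class, so the trait
  // remains a trait and tests can still mix it in.
  object testImplicits extends SQLImplicits {
    protected override def _sqlContext: SQLContext = spark.sqlContext
  }
}
```

A test mixing in the trait then does `import testImplicits._` inside the test class body.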
Re: Am I crazy, or does the binary distro not have Kafka integration?
Hi Sean,

For years, I'd say, you've had to specify --packages to get the Kafka-related jars on the classpath. I simply got used to this annoyance (as did others). Could it be that it's an external package (although an integral part of Spark)?!

I'm very glad you've brought it up, since I think the Kafka data source is so important that it should be included in spark-shell and spark-submit by default. THANKS!

Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Sat, Aug 4, 2018 at 9:56 PM, Sean Owen wrote:
> Let's take this to https://issues.apache.org/jira/browse/SPARK-25026 -- I
> provisionally marked this a Blocker, as if it's correct, then the release
> is missing an important piece and we'll want to remedy that ASAP. I still
> have this feeling I am missing something. The classes really aren't there
> in the release but ... *nobody* noticed all this time? I guess maybe
> Spark-Kafka users may be using a vendor distro that does package these bits.
>
> On Sat, Aug 4, 2018 at 10:48 AM Sean Owen wrote:
>
>> I was debugging why a Kafka-based streaming app doesn't seem to find
>> Kafka-related integration classes when run standalone from our latest 2.3.1
>> release, and noticed that there doesn't seem to be any Kafka-related jars
>> from Spark in the distro.
>> In jars/, I see:
>>
>> spark-catalyst_2.11-2.3.1.jar
>> spark-core_2.11-2.3.1.jar
>> spark-graphx_2.11-2.3.1.jar
>> spark-hive-thriftserver_2.11-2.3.1.jar
>> spark-hive_2.11-2.3.1.jar
>> spark-kubernetes_2.11-2.3.1.jar
>> spark-kvstore_2.11-2.3.1.jar
>> spark-launcher_2.11-2.3.1.jar
>> spark-mesos_2.11-2.3.1.jar
>> spark-mllib-local_2.11-2.3.1.jar
>> spark-mllib_2.11-2.3.1.jar
>> spark-network-common_2.11-2.3.1.jar
>> spark-network-shuffle_2.11-2.3.1.jar
>> spark-repl_2.11-2.3.1.jar
>> spark-sketch_2.11-2.3.1.jar
>> spark-sql_2.11-2.3.1.jar
>> spark-streaming_2.11-2.3.1.jar
>> spark-tags_2.11-2.3.1.jar
>> spark-unsafe_2.11-2.3.1.jar
>> spark-yarn_2.11-2.3.1.jar
>>
>> I checked make-distribution.sh, and it copies a bunch of JARs into the
>> distro, but does not seem to touch the kafka modules.
>>
>> Am I crazy or missing something obvious -- those should be in the
>> release, right?
Qs on Dataset API -- groups of createXXXTempViews and XXXcheckpoint methods
Hi,

I'd appreciate your help with the following two questions about the Dataset API:

1. Why do the Dataset methods createTempView, createOrReplaceTempView, createGlobalTempView and createOrReplaceGlobalTempView not return a DataFrame? They seem to be neither actions nor transformations (which is probably the reason for the separate "basic" group). I wonder why they don't return a DataFrame. Where's the catch?

2. Why are localCheckpoint and checkpoint in the "basic" group? They would fit among the untyped transformations, since they do transform a Dataset into a DataFrame.

Again, a bit of help would be much appreciated.

Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski
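[Editor's note: a small, untested sketch making the asymmetry in the two questions concrete; `spark` is assumed to be an active SparkSession.]

```scala
val df = spark.range(5).toDF("id")

// createOrReplaceTempView registers the DataFrame in the catalog as a side
// effect and returns Unit, so it cannot be chained like a transformation:
df.createOrReplaceTempView("ids")
val viaView = spark.table("ids") // the view has to be read back explicitly

// localCheckpoint, by contrast, does return a Dataset and chains naturally:
val truncated = df.localCheckpoint().filter("id > 2")
```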
Re: JDBC Data Source and customSchema option but DataFrameReader.assertNoSpecifiedSchema?
Hi Joseph,

Thanks for your explanation. It makes a lot of sense, and I found http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases which gives more detail.

With that, and after I reviewed the code, the customSchema option simply overrides the data types of the fields in a relation schema [1][2]. I think the option's name should include the word "override" to convey the exact meaning, shouldn't it?

With that said, I think the description of the customSchema option may be slightly incorrect. For example, it says: "The custom schema to use for reading data from JDBC connectors", and although it is used for reading, it merely overrides the data types and may not match the fields at all. Is that correct? The word "type" only appears in the following sentence: "You can also specify partial fields, and the others use the default type mapping." But that begs another question: what is "the default type mapping"? That was one of my questions when I first found the option.

What do you think about the following description of the customSchema option? You're welcome to make further changes if needed.

customSchema - Specifies the custom data types of the read schema (that is used at load time). customSchema is a comma-separated list of field definitions with column names and their data types in a canonical SQL representation, e.g. id DECIMAL(38, 0), name STRING. customSchema defines the data types of the columns that override the data types inferred from the table schema and follows this pattern:

colTypeList
    : colType (',' colType)*
    ;

colType
    : identifier dataType (COMMENT STRING)?
    ;

dataType
    : complex=ARRAY '<' dataType '>'                             #complexDataType
    | complex=MAP '<' dataType ',' dataType '>'                  #complexDataType
    | complex=STRUCT ('<' complexColTypeList? '>' | NEQ)         #complexDataType
    | identifier ('(' INTEGER_VALUE (',' INTEGER_VALUE)* ')')?   #primitiveDataType
    ;

Should I file a JIRA task for this?

[1] https://github.com/apache/spark/blob/v2.3.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala?utf8=%E2%9C%93#L116-L118
[2] https://github.com/apache/spark/blob/v2.3.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L785-L788

Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Mon, Jul 16, 2018 at 4:27 PM, Joseph Torres wrote:
> I guess the question is partly about the semantics of
> DataFrameReader.schema. If it's supposed to mean "the loaded dataframe will
> definitely have exactly this schema", that doesn't quite match the behavior
> of the customSchema option. If it's only meant to be an arbitrary schema
> input which the source can interpret however it wants, it'd be fine.
>
> The second semantic is IMO more useful, so I'm in favor here.
>
> On Mon, Jul 16, 2018 at 3:43 AM, Jacek Laskowski wrote:
>
>> Hi,
>>
>> I think there is a sort of inconsistency in how DataFrameReader.jdbc
>> deals with a user-defined schema, as it makes sure that there's no
>> user-specified schema [1][2] yet allows for setting one via the
>> customSchema option [3]. Why is that so? Has this been merely overlooked?
>>
>> I think assertNoSpecifiedSchema should be removed from
>> DataFrameReader.jdbc and support for DataFrameReader.schema for jdbc
>> should be added (with the customSchema option marked as deprecated, to be
>> removed in 2.4 or 3.0).
>>
>> Should I file an issue in Spark JIRA and do the changes? WDYT?
>>
>> [1] https://github.com/apache/spark/blob/v2.3.1/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala?utf8=%E2%9C%93#L249
>> [2] https://github.com/apache/spark/blob/v2.3.1/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala?utf8=%E2%9C%93#L320
>> [3] https://github.com/apache/spark/blob/v2.3.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala#L167
>>
>> Pozdrawiam,
>> Jacek Laskowski
>>
>> https://about.me/JacekLaskowski
>> Mastering Spark SQL https://bit.ly/mastering-spark-sql
>> Spark Structured Streaming https://bit.ly/spark-structured-streaming
>> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
>> Follow me at https://twitter.com/jaceklaskowski
JDBC Data Source and customSchema option but DataFrameReader.assertNoSpecifiedSchema?
Hi,

I think there is a sort of inconsistency in how DataFrameReader.jdbc deals with a user-defined schema, as it makes sure that there's no user-specified schema [1][2] yet allows for setting one via the customSchema option [3]. Why is that so? Has this been merely overlooked?

I think assertNoSpecifiedSchema should be removed from DataFrameReader.jdbc and support for DataFrameReader.schema for jdbc should be added (with the customSchema option marked as deprecated, to be removed in 2.4 or 3.0).

Should I file an issue in Spark JIRA and do the changes? WDYT?

[1] https://github.com/apache/spark/blob/v2.3.1/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala?utf8=%E2%9C%93#L249
[2] https://github.com/apache/spark/blob/v2.3.1/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala?utf8=%E2%9C%93#L320
[3] https://github.com/apache/spark/blob/v2.3.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala#L167

Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski
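[Editor's note: a minimal, untested sketch of how the customSchema option under discussion is used; the JDBC URL and table name are placeholders.]

```scala
// customSchema does not replace the schema wholesale -- it overrides the
// data types that would otherwise be inferred from the JDBC metadata,
// which is why it coexists awkwardly with DataFrameReader.schema.
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/test") // placeholder
  .option("dbtable", "people")                       // placeholder
  .option("customSchema", "id DECIMAL(38, 0), name STRING")
  .load()
```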
Re: [ANNOUNCE] Announcing Apache Spark 2.3.1
Hi Marcelo,

How do we announce it on twitter at https://twitter.com/apachespark? And how do we make that part of the release process?

Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Mon, Jun 11, 2018 at 9:47 PM, Marcelo Vanzin wrote:
> We are happy to announce the availability of Spark 2.3.1!
>
> Apache Spark 2.3.1 is a maintenance release, based on the branch-2.3
> maintenance branch of Spark. We strongly recommend all 2.3.x users to
> upgrade to this stable release.
>
> To download Spark 2.3.1, head over to the download page:
> http://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-2-3-1.html
>
> We would like to acknowledge all community members for contributing to
> this release. This release would not have been possible without you.
>
> --
> Marcelo
>
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Re: [SQL] Purpose of RuntimeReplaceable unevaluable unary expressions?
Yay! That's right!!! Thanks, Reynold. Such a short answer with so much information.

Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Wed, May 30, 2018 at 8:10 PM, Reynold Xin wrote:
> SQL expressions?
>
> On Wed, May 30, 2018 at 11:09 AM Jacek Laskowski wrote:
>
>> Hi,
>>
>> I've been exploring RuntimeReplaceable expressions [1] and have been
>> wondering what their purpose is.
>>
>> Quoting the scaladoc [2]:
>>
>> > An expression that gets replaced at runtime (currently by the
>> optimizer) into a different expression for evaluation. This is mainly used
>> to provide compatibility with other databases.
>>
>> For example, the ParseToTimestamp expression is a RuntimeReplaceable
>> expression and is replaced by Cast(left, TimestampType) or
>> Cast(UnixTimestamp(left, format), TimestampType) per the to_timestamp
>> function (there are two variants).
>>
>> My question is: why is this RuntimeReplaceable better than simply using
>> the Casts as the implementation of the to_timestamp functions?
>>
>> def to_timestamp(s: Column, fmt: String): Column = withExpr {
>>   // pseudocode
>>   Cast(UnixTimestamp(left, format), TimestampType)
>> }
>>
>> What's wrong with the above implementation compared to the current one?
>>
>> [1] https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L275
>> [2] https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L266-L267
>>
>> Pozdrawiam,
>> Jacek Laskowski
>>
>> https://about.me/JacekLaskowski
>> Mastering Spark SQL https://bit.ly/mastering-spark-sql
>> Spark Structured Streaming https://bit.ly/spark-structured-streaming
>> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
>> Follow me at https://twitter.com/jaceklaskowski
[SQL] Purpose of RuntimeReplaceable unevaluable unary expressions?
Hi, I've been exploring RuntimeReplaceable expressions [1] and have been wondering what their purpose is. Quoting the scaladoc [2]: > An expression that gets replaced at runtime (currently by the optimizer) into a different expression for evaluation. This is mainly used to provide compatibility with other databases. For example, ParseToTimestamp expression is a RuntimeReplaceable expression and it is replaced by Cast(left, TimestampType) or Cast(UnixTimestamp(left, format), TimestampType) per to_timestamp function (there are two variants). My question is why is this RuntimeReplaceable better than simply using the Casts as the implementation of to_timestamp functions? def to_timestamp(s: Column, fmt: String): Column = withExpr { // pseudocode Cast(UnixTimestamp(left, format), TimestampType) } What's wrong with the above implementation compared to the current one? [1] https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L275 [2] https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L266-L267 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Kafka Streams https://bit.ly/mastering-kafka-streams Follow me at https://twitter.com/jaceklaskowski
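[Editor's note: the alternative implementation asked about, spelled out slightly more fully. This is pseudocode-level and untested; `withExpr` is assumed to be the private helper used throughout functions.scala, and the summary of Reynold's eventual reply is an interpretation, not his exact words.]

```scala
// Pseudocode: implement to_timestamp directly with Cast/UnixTimestamp,
// bypassing the RuntimeReplaceable ParseToTimestamp node entirely.
def to_timestamp(s: Column, fmt: String): Column = withExpr {
  Cast(UnixTimestamp(s.expr, Literal(fmt)), TimestampType)
}
// One reading of the "SQL expressions?" answer: a RuntimeReplaceable node
// registers the function as a named expression for the SQL parser/analyzer,
// so SQL queries (not just the Scala DSL) can call to_timestamp, and plans
// can show the friendly name before the optimizer rewrites it into Casts.
```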
Re: [SQL] Understanding RewriteCorrelatedScalarSubquery optimization (and TreeNode.transform)
Thanks, Herman! You've helped me a lot (and I'm going to quote your fine explanation in my gitbook when needed! :))

p.s. I also found this today, which helped me too --> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2728434780191932/1483312212640900/6987336228780374/latest.html

What about tests? Don't you think the method should have tests? I'm writing small code snippets anyway while exploring it, and have been wondering how to contribute them back to the Spark project given that the method is private. It looks like I should instead focus on the methods of Expression or even QueryPlan (as that's what triggered my question). Thanks.

Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Mon, May 28, 2018 at 11:33 AM, Herman van Hövell tot Westerflier <her...@databricks.com> wrote:
> Hi Jacek,
>
> RewriteCorrelatedScalarSubquery rewrites a plan containing a scalar
> subquery into a left join and a projection/filter/aggregate.
>
> For example:
>
> SELECT a_id,
>        (SELECT MAX(value)
>         FROM tbl_b
>         WHERE tbl_b.b_id = tbl_a.a_id) AS max_value
> FROM tbl_a
>
> will be rewritten into something like this:
>
> SELECT a_id,
>        agg.max_value
> FROM tbl_a
> JOIN (SELECT b_id,
>              MAX(value) AS max_value
>       FROM tbl_b
>       GROUP BY b_id) agg
>   ON agg.b_id = tbl_a.a_id
>
> The particular function you refer to extracts all correlated scalar
> subqueries by adding them to the subqueries buffer and rewrites the
> expression to use the output of the subquery (the s.plan.output.head bit).
> This returned expression then replaces the original expression in the
> operator (Aggregate, Project or Filter) it came from, completing the
> rewrite for that operator.
> The extracted subqueries are planned as LEFT OUTER JOINs below the operator.
>
> I hope this makes sense.
>
> Herman
>
> On Sun, May 27, 2018 at 9:43 PM Jacek Laskowski wrote:
>
>> Hi,
>>
>> I'm trying to understand the RewriteCorrelatedScalarSubquery optimization
>> and how extractCorrelatedScalarSubqueries [1] works. I don't understand
>> how "The expression is rewritten and returned." is done. How is the
>> expression rewritten?
>>
>> Since it's private, it's not even possible to write tests, and that got me
>> thinking: how do you go about code like this? How do you know whether it
>> works fine or not? Any help would be appreciated.
>>
>> [1] https://github.com/apache/spark/blob/branch-2.3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala?utf8=%E2%9C%93#L290-L299
>>
>> Pozdrawiam,
>> Jacek Laskowski
>>
>> https://about.me/JacekLaskowski
>> Mastering Spark SQL https://bit.ly/mastering-spark-sql
>> Spark Structured Streaming https://bit.ly/spark-structured-streaming
>> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
>> Follow me at https://twitter.com/jaceklaskowski
[SQL] Two ScalarSubquery expressions?! Could we have ScalarSubqueryExec instead?
Hi,

I've just been bitten by finding out that there are two ScalarSubquery expressions: one is a SubqueryExpression [1] and the other is an ExecSubqueryExpression [2]. Could we have the ExecSubqueryExpression subclass called ScalarSubqueryExec, to follow the naming convention of logical and physical plans? That'd help a lot (and I doubt I'm the only one confused). Thanks.

[1] https://github.com/apache/spark/blob/branch-2.3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala?utf8=%E2%9C%93#L246
[2] https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala?utf8=%E2%9C%93#L46

Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski
[SQL] Understanding RewriteCorrelatedScalarSubquery optimization (and TreeNode.transform)
Hi, I'm trying to understand RewriteCorrelatedScalarSubquery optimization and how extractCorrelatedScalarSubqueries [1] works. I don't understand how "The expression is rewritten and returned." is done. How is the expression rewritten? Since it's private it's not even possible to write tests and that got me thinking how you go about code like this? How do you know whether it works fine or not? Any help? I'd appreciate. [1] https://github.com/apache/spark/blob/branch-2.3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala?utf8=%E2%9C%93#L290-L299 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Kafka Streams https://bit.ly/mastering-kafka-streams Follow me at https://twitter.com/jaceklaskowski
Re: Spark version for Mesos 0.27.0
Hi, Mesos 0.27.0?! That's been a while. I'd search for the changes to pom.xml and see when the mesos dependency version changed. That'd give you the most precise answer. I think it could've been 1.5 or older. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Kafka Streams https://bit.ly/mastering-kafka-streams Follow me at https://twitter.com/jaceklaskowski On Fri, May 25, 2018 at 1:29 PM, Thodoris Zois wrote: > Hello, > > Could you please tell me which version of Spark works with Apache Mesos > version 0.27.0? (I cannot find anything on docs at github) > > Thank you very much, > Thodoris Zois > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > >
Repeated FileSourceScanExec.metrics from ColumnarBatchScan.metrics
Hi,

I'm wondering why the metrics are repeated in FileSourceScanExec.metrics [1], since it is a ColumnarBatchScan [2] and so inherits the two metrics numOutputRows and scanTime from ColumnarBatchScan.metrics [3]. Shouldn't FileSourceScanExec.metrics then be as follows:

override lazy val metrics = super.metrics ++ Map(
  "numFiles" -> SQLMetrics.createMetric(sparkContext, "number of files"),
  "metadataTime" -> SQLMetrics.createMetric(sparkContext, "metadata time (ms)"))

I'd like to send a pull request with a fix if no one objects. Anyone?

[1] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L315-L319
[2] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L164
[3] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ColumnarBatchScan.scala#L38-L40

Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski
InMemoryTableScanExec.inputRDD and buffers (RDD[CachedBatch])
Hi, Is there any reason why InMemoryTableScanExec.inputRDD does not use the buffers local value [1] for the non-batch case [2]? Just curious as I ran into it and thought I'd do a tiny refactoring. [1] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala?utf8=%E2%9C%93#L105 [2] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala?utf8=%E2%9C%93#L125 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Kafka Streams https://bit.ly/mastering-kafka-streams Follow me at https://twitter.com/jaceklaskowski