Re: [VOTE] Release Spark 3.2.2 (RC1)

2022-07-11 Thread Yang,Jie(INF)
Does this happen when running all UTs? I ran this suite several times alone
using OpenJDK (Zulu) 8u322-b06 on my Mac, but no similar error occurred.

From: Sean Owen 
Date: Tuesday, July 12, 2022, 10:45
To: Dongjoon Hyun 
Cc: dev 
Subject: Re: [VOTE] Release Spark 3.2.2 (RC1)

Is anyone seeing this error? I'm on OpenJDK 8 on a Mac:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x000101ca8ace, pid=11962, tid=0x1603
#
# JRE version: OpenJDK Runtime Environment (8.0_322) (build 1.8.0_322-bre_2022_02_28_15_01-b00)
# Java VM: OpenJDK 64-Bit Server VM (25.322-b00 mixed mode bsd-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.dylib+0x549ace]
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /private/tmp/spark-3.2.2/sql/core/hs_err_pid11962.log
ColumnVectorSuite:
- boolean
- byte
Compiled method (nm)  885897 75403 n 0   sun.misc.Unsafe::putShort (native)
 total in heap  [0x000102fdaa10,0x000102fdad48] = 824
 relocation [0x000102fdab38,0x000102fdab78] = 64
 main code  [0x000102fdab80,0x000102fdad48] = 456
Compiled method (nm)  885897 75403 n 0   sun.misc.Unsafe::putShort (native)
 total in heap  [0x000102fdaa10,0x000102fdad48] = 824
 relocation [0x000102fdab38,0x000102fdab78] = 64
 main code  [0x000102fdab80,0x000102fdad48] = 456

On Mon, Jul 11, 2022 at 4:58 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
Please vote on releasing the following candidate as Apache Spark version 3.2.2.

The vote is open until July 15th 1AM (PST) and passes if a majority +1 PMC 
votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 3.2.2
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see 
https://spark.apache.org/

The tag to be voted on is v3.2.2-rc1 (commit 
78a5825fe266c0884d2dd18cbca9625fa258d7f7):
https://github.com/apache/spark/tree/v3.2.2-rc1

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.2.2-rc1-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1409/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.2.2-rc1-docs/

The list of bug fixes going into 3.2.2 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12351232

This release is using the release script of the tag v3.2.2-rc1.

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks. In Java/Scala,
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).
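
For example, a minimal sbt sketch for resolving the RC artifacts from the
staging repository (the spark-sql module and Scala version below are
assumptions; adjust to your own build):

// build.sbt - a minimal sketch, not a complete build definition
ThisBuild / scalaVersion := "2.12.15" // Spark 3.2.x default Scala version

resolvers += "Apache Spark 3.2.2 RC1 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1409/"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.2.2"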

===
What should happen to JIRA tickets still targeting 3.2.2?
===

The current list of open tickets targeted at 3.2.2 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" = 3.2.2

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.


Re: [VOTE] Release Spark 3.2.2 (RC1)

2022-07-11 Thread Sean Owen
Is anyone seeing this error? I'm on OpenJDK 8 on a Mac:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x000101ca8ace, pid=11962, tid=0x1603
#
# JRE version: OpenJDK Runtime Environment (8.0_322) (build 1.8.0_322-bre_2022_02_28_15_01-b00)
# Java VM: OpenJDK 64-Bit Server VM (25.322-b00 mixed mode bsd-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.dylib+0x549ace]
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /private/tmp/spark-3.2.2/sql/core/hs_err_pid11962.log
ColumnVectorSuite:
- boolean
- byte
Compiled method (nm)  885897 75403 n 0   sun.misc.Unsafe::putShort (native)
 total in heap  [0x000102fdaa10,0x000102fdad48] = 824
 relocation [0x000102fdab38,0x000102fdab78] = 64
 main code  [0x000102fdab80,0x000102fdad48] = 456
Compiled method (nm)  885897 75403 n 0   sun.misc.Unsafe::putShort (native)
 total in heap  [0x000102fdaa10,0x000102fdad48] = 824
 relocation [0x000102fdab38,0x000102fdab78] = 64
 main code  [0x000102fdab80,0x000102fdad48] = 456

On Mon, Jul 11, 2022 at 4:58 PM Dongjoon Hyun 
wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 3.2.2.
>
> The vote is open until July 15th 1AM (PST) and passes if a majority +1 PMC
> votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.2.2
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v3.2.2-rc1 (commit
> 78a5825fe266c0884d2dd18cbca9625fa258d7f7):
> https://github.com/apache/spark/tree/v3.2.2-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.2.2-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1409/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.2.2-rc1-docs/
>
> The list of bug fixes going into 3.2.2 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12351232
>
> This release is using the release script of the tag v3.2.2-rc1.
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks. In Java/Scala,
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.2.2?
> ===
>
> The current list of open tickets targeted at 3.2.2 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.2.2
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>
> Dongjoon
>


Re: [VOTE] Release Spark 3.2.2 (RC1)

2022-07-11 Thread L. C. Hsieh
+1

On Mon, Jul 11, 2022 at 4:50 PM Hyukjin Kwon  wrote:
>
> +1
>
> On Tue, 12 Jul 2022 at 06:58, Dongjoon Hyun  wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version 
>> 3.2.2.
>>
>> The vote is open until July 15th 1AM (PST) and passes if a majority +1 PMC 
>> votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.2.2
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see https://spark.apache.org/
>>
>> The tag to be voted on is v3.2.2-rc1 (commit 
>> 78a5825fe266c0884d2dd18cbca9625fa258d7f7):
>> https://github.com/apache/spark/tree/v3.2.2-rc1
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.2.2-rc1-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1409/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.2.2-rc1-docs/
>>
>> The list of bug fixes going into 3.2.2 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12351232
>>
>> This release is using the release script of the tag v3.2.2-rc1.
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks. In Java/Scala,
>> you can add the staging repository to your project's resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out-of-date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 3.2.2?
>> ===
>>
>> The current list of open tickets targeted at 3.2.2 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target 
>> Version/s" = 3.2.2
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>>
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>>
>> Dongjoon




Re: [DISCUSS] Deprecate Trigger.Once and promote Trigger.AvailableNow

2022-07-11 Thread Jungtaek Lim
Final reminder. I'll leave this thread open for a couple of days to see if there
are further voices, and go forward if there are no outstanding comments.

On Sat, Jul 9, 2022 at 9:54 PM Jungtaek Lim 
wrote:

> It sounds like none of the approaches perfectly solve the issue of
> backfill.
>
> 1. Trigger.Once: scale issue
> 2. Trigger.AvailableNow: watermark advancement issue (data getting dropped
> due to watermark) depending on the order of data
> 3. Manual batch: state is not built from processing backfill
>
> Handling huge data (a backfill) in a single microbatch without advancing
> the watermark also requires thinking about "backfill-specific" situations -
> state can grow unexpectedly since there is no way to purge it without
> watermark advancement. There doesn't seem to be a good approach that
> solves all of these issues smoothly. One easier option as of now is to use the
> RocksDB state store provider to tolerate the huge state size while we
> enforce that the watermark does not advance, but the ideal approach still
> really depends on the data source and the volume of the data to backfill.
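>
> As a rough sketch of that workaround in Scala (the config key and provider
> class are the ones shipped in Spark 3.2+; the app name is a placeholder):
>
> import org.apache.spark.sql.SparkSession
>
> // Switch the streaming state store to RocksDB so large state can be
> // tolerated while the watermark is held back.
> val spark = SparkSession.builder()
>   .appName("backfill") // placeholder
>   .config("spark.sql.streaming.stateStore.providerClass",
>     "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
>   .getOrCreate()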
>
> Btw, don't worry if you get the feeling the deprecated API may be removed
> too soon! Removing the API would require another serious discussion, and the
> Spark community is generally not in favor of removing existing APIs.
>
> On Fri, Jul 8, 2022 at 11:21 PM Adam Binford  wrote:
>
> Dang I was hoping it was the second one. In our case the data is too large
>> to run the whole backfill for the aggregation in a single batch (the
>> shuffle is too big). We currently resort to manually batching (i.e. not
>> streaming) the backlog (anything older than the watermark) when we need to
>> reprocess, because we can't really know for sure our batches are processed
>> in the correct event time order when starting from scratch.
>>
>> I'm not against deprecating Trigger.Once, just wanted to chime in that
>> someone was using it! I'm itching to upgrade and try out the new stuff.
>>
>> Adam
>>
>> On Fri, Jul 8, 2022 at 9:16 AM Jungtaek Lim 
>> wrote:
>>
>>> Thanks for the input, Adam! Replying inline.
>>>
>>> On Fri, Jul 8, 2022 at 8:48 PM Adam Binford  wrote:
>>>
 We use Trigger.Once a lot, usually for backfilling data for new
 streams. I feel like I could see a continuing use case for "ignore trigger
 limits for this batch" (ignoring the whole issue with re-running the last
 failed batch vs a new batch), but we haven't actually been able to upgrade
 yet and try out Trigger.AvailableNow, so that could end up replacing all
 our use cases.

 One question I did have is how it does (or is supposed to) handle
 watermarking. Is the watermark determined for each batch independently like
 a normal stream, or is it kept constant for all batches in a single
 AvailableNow run? For example, we have a stateful job that we need to rerun
 occasionally, and it takes ~6 batches to backfill all the data before
 catching up to live data. With a Trigger.Once we know we won't accidentally
 drop any data due to the watermark when backfilling, because it's a single
 batch with no watermark yet. Would the same hold true if we backfill with
 Trigger.AvailableNow instead?

>>>
>>> The behavior is the former: each batch advances the watermark, and it's
>>> immediately reflected in the next batch.
>>>
>>> The number of batches Trigger.AvailableNow will execute depends on the
>>> data source and its source options. For example, if you use the Kafka data
>>> source with Trigger.AvailableNow without specifying any source option that
>>> limits the batch size, Trigger.AvailableNow will process all newly available
>>> data as a single microbatch. It may still not be literally a single microbatch -
>>> it would first handle any batch already logged in the WAL, and it would also
>>> run a no-data batch after all the microbatches complete. But I guess these
>>> additional batches wouldn't hurt your case.
>>>
>>> If the data source doesn't allow processing all available data within a
>>> single microbatch (depending on the implementation of its default read limit),
>>> you could probably either 1) set the source options that limit batch size to
>>> an unrealistically large value to enforce a single batch, or 2) set the
>>> watermark delay to an unrealistically large value. Both workarounds require
>>> you to use different source options/watermark configurations for backfill vs.
>>> a normal run - I agree it wouldn't be a smooth experience.
>>>
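>>> For illustration, a minimal sketch of a backfill run with
>>> Trigger.AvailableNow (available since Spark 3.3); the broker address, topic,
>>> output paths, and the commented-out maxOffsetsPerTrigger value are assumptions:
>>>
>>> import org.apache.spark.sql.SparkSession
>>> import org.apache.spark.sql.streaming.Trigger
>>>
>>> val spark = SparkSession.builder().appName("availableNowBackfill").getOrCreate()
>>>
>>> val events = spark.readStream
>>>   .format("kafka")
>>>   .option("kafka.bootstrap.servers", "localhost:9092") // assumption
>>>   .option("subscribe", "events")                       // assumption
>>>   // Omit maxOffsetsPerTrigger to let AvailableNow take all available data
>>>   // at once; set it to bound each microbatch (the watermark then advances
>>>   // batch by batch, as discussed above).
>>>   // .option("maxOffsetsPerTrigger", "1000000")
>>>   .load()
>>>
>>> val query = events.writeStream
>>>   .format("parquet")
>>>   .option("path", "/tmp/backfill/out")               // assumption
>>>   .option("checkpointLocation", "/tmp/backfill/chk") // assumption
>>>   .trigger(Trigger.AvailableNow()) // runs bounded microbatches, then stops
>>>   .start()
>>>
>>> query.awaitTermination()
>>>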
>>> This proposal does not aim to remove Trigger.Once in the near future. Once
>>> we deprecate Trigger.Once, we would gather reports of use cases where
>>> Trigger.Once works better (like yours) over several minor releases, and then
>>> we can really decide. (IMHO, handling backfill with Trigger.Once sounds to me
>>> like a workaround. Backfill may warrant its own design.)
>>>
>>>

 Adam

 On Fri, Jul 8, 2022 at 3:24 AM Jungtaek Lim <
 kabhwan.opensou...@gmail.com> wrote:

> Bump 

Re: [VOTE] Release Spark 3.2.2 (RC1)

2022-07-11 Thread Hyukjin Kwon
+1

On Tue, 12 Jul 2022 at 06:58, Dongjoon Hyun  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 3.2.2.
>
> The vote is open until July 15th 1AM (PST) and passes if a majority +1 PMC
> votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.2.2
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v3.2.2-rc1 (commit
> 78a5825fe266c0884d2dd18cbca9625fa258d7f7):
> https://github.com/apache/spark/tree/v3.2.2-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.2.2-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1409/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.2.2-rc1-docs/
>
> The list of bug fixes going into 3.2.2 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12351232
>
> This release is using the release script of the tag v3.2.2-rc1.
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks. In Java/Scala,
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.2.2?
> ===
>
> The current list of open tickets targeted at 3.2.2 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.2.2
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>
> Dongjoon
>


[VOTE] Release Spark 3.2.2 (RC1)

2022-07-11 Thread Dongjoon Hyun
Please vote on releasing the following candidate as Apache Spark version
3.2.2.

The vote is open until July 15th 1AM (PST) and passes if a majority +1 PMC
votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 3.2.2
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see https://spark.apache.org/

The tag to be voted on is v3.2.2-rc1 (commit
78a5825fe266c0884d2dd18cbca9625fa258d7f7):
https://github.com/apache/spark/tree/v3.2.2-rc1

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.2.2-rc1-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1409/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.2.2-rc1-docs/

The list of bug fixes going into 3.2.2 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12351232

This release is using the release script of the tag v3.2.2-rc1.

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks. In Java/Scala,
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).

===
What should happen to JIRA tickets still targeting 3.2.2?
===

The current list of open tickets targeted at 3.2.2 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target
Version/s" = 3.2.2

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.


Dongjoon