structured streaming join of streaming dataframe with static dataframe performance

2022-07-17 Thread Koert Kuipers
i was surprised to find out that when a streaming dataframe is joined with a
static dataframe, the static dataframe is re-shuffled for every
microbatch, which adds considerable overhead.

wouldn't it make more sense to re-use the shuffle files?

or, if that is not possible, load the static dataframe into the
state store? this would turn the join into a lookup (in rocksdb).
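
to make the lookup idea concrete, here is a minimal plain-python sketch (not
spark code, and not how spark implements joins): the static side is indexed
once by the join key, and each microbatch only probes that index. the names
build_lookup and join_microbatch and the sample rows are made up for
illustration.

```python
from collections import defaultdict


def build_lookup(static_rows, key):
    """Index the static side once, keyed by the join column."""
    index = defaultdict(list)
    for row in static_rows:
        index[row[key]].append(row)
    return index


def join_microbatch(batch_rows, lookup, key):
    """Inner-join one micro-batch against the prebuilt index."""
    joined = []
    for row in batch_rows:
        # .get() avoids inserting empty entries for keys with no match
        for match in lookup.get(row[key], []):
            joined.append({**match, **row})
    return joined


# Hypothetical static table and micro-batches.
static_rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
lookup = build_lookup(static_rows, "id")  # built once, outside the batch loop

microbatches = [
    [{"id": 1, "value": 10}, {"id": 3, "value": 30}],
    [{"id": 2, "value": 20}],
]
results = [join_microbatch(batch, lookup, "id") for batch in microbatches]
```

the point of the sketch is only that the cost of preparing the static side is
paid once instead of once per microbatch.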


CVE-2022-33891: Apache Spark shell command injection vulnerability via Spark UI

2022-07-17 Thread Sean Owen
Severity: important

Description:

The Apache Spark UI offers the possibility to enable ACLs via the
configuration option spark.acls.enable. With an authentication filter, this
checks whether a user has access permissions to view or modify the
application. If ACLs are enabled, a code path in HttpSecurityFilter can
allow someone to perform impersonation by providing an arbitrary user name.
A malicious user might then be able to reach a permission check function
that will ultimately build a Unix shell command based on their input, and
execute it. This will result in arbitrary shell command execution as the
user Spark is currently running as. This affects Apache Spark versions
3.0.3 and earlier, versions 3.1.1 to 3.1.2, and versions 3.2.0 to 3.2.1.
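
The class of bug described above can be illustrated with a short Python
sketch. This is not Spark's code; the `id -Gn` command, the function names,
and the sample input are assumptions chosen only to show why concatenating
user input into a shell command string is dangerous, and how quoting or an
argument list keeps the input as a single argument.

```python
import shlex


def unsafe_command(username):
    # Vulnerable pattern: user input concatenated into a shell string.
    return "id -Gn " + username


def quoted_command(username):
    # Safer when a shell string is unavoidable: quote the input.
    return "id -Gn " + shlex.quote(username)


def argv_command(username):
    # Safest: an argument list that is never interpreted by a shell.
    return ["id", "-Gn", username]


malicious = "alice; touch /tmp/pwned"

# shlex.split only approximates shell word splitting (it does not treat ';'
# as an operator), but it shows the input leaking into extra words. A real
# shell would end the first command at ';' and run the rest as a second one.
unsafe_tokens = shlex.split(unsafe_command(malicious))
quoted_tokens = shlex.split(quoted_command(malicious))
```

With the unquoted string the input spreads across several words, while the
quoted string and the argument list both keep it as one argument to the
command. Nothing in the sketch executes a command.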

This issue is being tracked as SPARK-38992.

Mitigation:

Upgrade to a supported Apache Spark maintenance release: 3.1.3, 3.2.2, or
3.3.0 or later.

Credit:

Kostya Kortchinsky (Databricks)


[ANNOUNCE] Apache Spark 3.2.2 released

2022-07-17 Thread Dongjoon Hyun
We are happy to announce the availability of Apache Spark 3.2.2!

Spark 3.2.2 is a maintenance release containing stability fixes. This
release is based on the branch-3.2 maintenance branch of Spark. We strongly
recommend that all 3.2 users upgrade to this stable release.

To download Spark 3.2.2, head over to the download page:
https://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-3-2-2.html

We would like to acknowledge all community members for contributing to this
release. This release would not have been possible without you.

Dongjoon Hyun