structured streaming join of streaming dataframe with static dataframe performance
I was surprised to find that when a streaming DataFrame is joined with a static DataFrame, the static DataFrame is re-shuffled for every micro-batch, which adds considerable overhead. Wouldn't it make more sense to reuse the shuffle files? Or, if that is not possible, to load the static DataFrame into the state store? That would turn the join into a lookup (in RocksDB).
CVE-2022-33891: Apache Spark shell command injection vulnerability via Spark UI
Severity: important

Description: The Apache Spark UI offers the possibility to enable ACLs via the configuration option spark.acls.enable. With an authentication filter, this checks whether a user has access permissions to view or modify the application. If ACLs are enabled, a code path in HttpSecurityFilter can allow someone to perform impersonation by providing an arbitrary user name. A malicious user might then be able to reach a permission check function that will ultimately build a Unix shell command based on their input, and execute it. This will result in arbitrary shell command execution as the user Spark is currently running as. This affects Apache Spark versions 3.0.3 and earlier, versions 3.1.1 to 3.1.2, and versions 3.2.0 to 3.2.1. This issue is being tracked as SPARK-38992.

Mitigation: Upgrade to supported Apache Spark maintenance release 3.1.3, 3.2.2, or 3.3.0 or later.

Credit: Kostya Kortchinsky (Databricks)
[ANNOUNCE] Apache Spark 3.2.2 released
We are happy to announce the availability of Apache Spark 3.2.2!

Spark 3.2.2 is a maintenance release containing stability fixes. This release is based on the branch-3.2 maintenance branch of Spark. We strongly recommend that all 3.2 users upgrade to this stable release.

To download Spark 3.2.2, head over to the download page:
https://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-3-2-2.html

We would like to acknowledge all community members for contributing to this release. This release would not have been possible without you.

Dongjoon Hyun