Re: Bugs with joins and SQL in Structured Streaming

2024-02-27 Thread Andrzej Zera
Hi, Yes, I tested all of them on Spark 3.5. Regards, Andrzej Mon, 26 Feb 2024 at 23:24 Mich Talebzadeh wrote: > Hi, > > These are all on spark 3.5, correct? > > Mich Talebzadeh, > Dad | Technologist | Solutions Architect | Engineer > London > United Kingdom > > >view my Linkedin

Re: Bugs with joins and SQL in Structured Streaming

2024-02-26 Thread Mich Talebzadeh
Hi, These are all on spark 3.5, correct? Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* The

Bugs with joins and SQL in Structured Streaming

2024-02-26 Thread Andrzej Zera
Hey all, I've been using Structured Streaming in production for almost a year already and I want to share the bugs I found in this time. I created a test for each of the issues and put them all here: https://github.com/andrzejzera/spark-bugs/tree/main/spark-3.5/src/test/scala I split the issues

Re: Bintray replacement for spark-packages.org

2024-02-25 Thread Richard Eggert
I've been trying to obtain clarification on the terms of use regarding repo.spark-packages.org. I emailed feedb...@spark-packages.org two weeks ago, but have not heard back. Whom should I contact? On Mon, Apr 26, 2021 at 8:13 AM Bo Zhang wrote: > Hi Apache Spark users, > > As you might know,

Issue of spark with antlr version

2024-02-25 Thread Chawla, Parul
Hi Spark Team, Our application is currently using Spring Framework 5.3.31. To upgrade it to 6.x, as per application dependency we must upgrade the Spark and Hibernate jars as well. With the Hibernate-compatible upgrade, the dependent Antlr4 jar version has been upgraded to 4.10.1 but there's no

unsubscribe

2024-02-24 Thread Ameet Kini

Re: job uuid not unique

2024-02-24 Thread Xin Zhang
unsubscribe On Sat, Feb 17, 2024 at 3:04 AM Рамик И wrote: > > Hi > I'm using Spark Streaming to read from Kafka and write to S3. Sometimes I > get errors when writing org.apache.hadoop.fs.FileAlreadyExistsException. > > Spark version: 3.5.0 > scala version : 2.13.8 > Cluster: k8s > >

Re: AQE coalesce 60G shuffle data into a single partition

2024-02-24 Thread Enrico Minack
Hi Shay, maybe this is related to the small number of output rows (1,250) of the last exchange step that consumes those 60GB of shuffle data. Looks like your outer transformation is something like df.groupBy($"id").agg(collect_list($"prop_name")) Have you tried adding a repartition as an attempt
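A minimal PySpark sketch of the repartition suggestion above; the column names follow the snippet in the message, while the input path, output path and partition count are illustrative assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_list

spark = SparkSession.builder.appName("repartition-before-agg").getOrCreate()

# Hypothetical input; in the thread this is a ~60GB shuffle feeding ~1,250 output rows.
df = spark.read.parquet("/path/to/input")

# Repartition on the grouping key first so the final aggregation does not
# end up squeezed into a single post-shuffle partition.
result = (
    df.repartition(200, "id")  # 200 is an illustrative partition count
      .groupBy("id")
      .agg(collect_list("prop_name").alias("prop_names"))
)
result.write.mode("overwrite").parquet("/path/to/output")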

Re: [Beginner Debug]: Executor OutOfMemoryError

2024-02-23 Thread Mich Talebzadeh
Seems like you are having memory issues. Examine your settings. 1. It appears that your driver memory setting is too high. It should be a fraction of the total memory provided by YARN 2. Use the Spark UI to monitor the job's memory consumption. Check the Storage tab to see how memory is
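A hedged sketch of where such memory settings live; the values are placeholders rather than recommendations, and driver memory in particular normally has to be set at submit time (spark-submit --driver-memory) rather than inside an already-running application:

from pyspark.sql import SparkSession

# Illustrative values only; size these against what YARN actually grants the containers.
spark = (
    SparkSession.builder
        .appName("memory-tuning-sketch")
        .config("spark.executor.memory", "4g")          # per-executor heap
        .config("spark.executor.memoryOverhead", "1g")  # off-heap/overhead per executor
        .config("spark.driver.memory", "2g")            # only effective when set before the driver JVM starts
        .getOrCreate()
)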

[Beginner Debug]: Executor OutOfMemoryError

2024-02-22 Thread Shawn Ligocki
Hi I'm new to Spark and I'm running into a lot of OOM issues while trying to scale up my first Spark application. I am running into these issues with only 1% of the final expected data size. Can anyone help me understand how to properly configure Spark to use limited memory or how to debug which

Re: unsubscribe

2024-02-21 Thread Xin Zhang
unsubscribe On Tue, Feb 20, 2024 at 9:44 PM kritika jain wrote: > Unsubscribe > > On Tue, 20 Feb 2024, 3:18 pm Крюков Виталий Семенович, > wrote: > >> >> unsubscribe >> >> >> -- Zhang Xin(张欣) Email:josseph.zh...@gmail.com

Re: Spark 4.0 Query Analyzer Bug Report

2024-02-21 Thread Mich Talebzadeh
Indeed, valid points raised, including the potential typo in the new Spark version. In the meantime, I suggest you look at so-called alternative debugging methods: - Simpler explain(): try basic explain() or explain("extended"). This might provide a less detailed, but
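For reference, the explain() variants mentioned above, in PySpark (query_result being the DataFrame from the failing job):

# Physical plan only
query_result.explain()

# Parsed, analyzed, optimized and physical plans
query_result.explain(extended=True)

# Equivalent using the mode argument (Spark 3.0+); other modes include "formatted" and "cost"
query_result.explain(mode="extended")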

Kafka-based Spark Streaming and Vertex AI for Sentiment Analysis

2024-02-21 Thread Mich Talebzadeh
I am working on a pet project to implement a real-time sentiment analysis system for analyzing customer reviews. It leverages Kafka for data ingestion, Spark Structured Streaming (SSS) for real-time processing, and Vertex AI for sentiment analysis and potential action triggers. *Features* -

[ANNOUNCE] Apache Kyuubi 1.8.1 is available

2024-02-20 Thread Cheng Pan
Hi all, The Apache Kyuubi community is pleased to announce that Apache Kyuubi 1.8.1 has been released! Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses. Kyuubi provides a pure SQL gateway through Thrift JDBC/ODBC and RESTful

Re: Spark 3.3 Query Analyzer Bug Report

2024-02-20 Thread Sharma, Anup
Apologies. The issue is seen after we upgraded from Spark 3.1 to Spark 3.3. The same query runs fine on Spark 3.1. Please disregard the Spark version mentioned in the email subject earlier. Anup Error trace: query_result.explain(extended=True)\n File \"…/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py\"

Re: Spark 4.0 Query Analyzer Bug Report

2024-02-20 Thread Holden Karau
Do you mean Spark 3.4? 4.0 is very much not released yet. Also it would help if you could share your query & more of the logs leading up to the error. On Tue, Feb 20, 2024 at 3:07 PM Sharma, Anup wrote: > Hi Spark team, > > > > We ran into a dataframe issue after upgrading from spark 3.1 to 4.

Spark 4.0 Query Analyzer Bug Report

2024-02-20 Thread Sharma, Anup
Hi Spark team, We ran into a dataframe issue after upgrading from spark 3.1 to 4. query_result.explain(extended=True)\n File \"…/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py\" raise Py4JJavaError(\npy4j.protocol.Py4JJavaError: An error occurred while calling

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-20 Thread Manoj Kumar
Dear @Chao Sun, I trust you're doing well. Having worked extensively with Spark Nvidia Rapids, Velox, and Gluten, I'm now contemplating Comet's potential advantages over Velox in terms of performance and unique features. While Rapids leverages GPUs effectively, Gazelle's Intel AVX512 intrinsics

Re: unsubscribe

2024-02-20 Thread kritika jain
Unsubscribe On Tue, 20 Feb 2024, 3:18 pm Крюков Виталий Семенович, wrote: > > unsubscribe > > >

unsubscribe

2024-02-20 Thread Крюков Виталий Семенович
unsubscribe

Community Over Code Asia 2024 Travel Assistance Applications now open!

2024-02-20 Thread Gavin McDonald
Hello to all users, contributors and Committers! The Travel Assistance Committee (TAC) are pleased to announce that travel assistance applications for Community over Code Asia 2024 are now open! We will be supporting Community over Code Asia, Hangzhou, China July 26th - 28th, 2024. TAC exists

Re: [Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Mich Talebzadeh
Thanks for your kind words Sri. Well, it is true that as yet Spark on Kubernetes is not on par with Spark on YARN in maturity, and essentially Spark on Kubernetes is still work in progress. So in the first place IMO one needs to think about why executors are failing. What causes this behaviour? Is it the

Re: [Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Cheng Pan
Spark has supported the window-based executor failure-tracking mechanism for YARN for a long time, SPARK-41210[1][2] (included in 3.5.0) extended this feature to K8s. [1] https://issues.apache.org/jira/browse/SPARK-41210 [2] https://github.com/apache/spark/pull/38732 Thanks, Cheng Pan > On
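A sketch of turning the window-based tracking on once on 3.5.0 or later. The two configuration names below are my reading of SPARK-41210 and its pull request, so treat them as assumptions and verify against the official configuration docs:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
        .appName("executor-failure-tracking-sketch")
        .config("spark.executor.maxNumFailures", "10")            # fail the app after 10 executor failures
        .config("spark.executor.failuresValidityInterval", "1h")  # only count failures within a sliding 1-hour window
        .getOrCreate()
)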

Re: [Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Sri Potluri
Dear Mich, Thank you for your detailed response and the suggested approach to handling retry logic. I appreciate you taking the time to outline the method of embedding custom retry mechanisms directly into the application code. While the solution of wrapping the main logic of the Spark job in a

Re: [Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Mich Talebzadeh
Went through your issue with the code running on k8s When an executor of a Spark application fails, the system attempts to maintain the desired level of parallelism by automatically recreating a new executor to replace the failed one. While this behavior is beneficial for transient errors,

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-19 Thread Mich Talebzadeh
Ok thanks for your clarifications Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* The information

Re: [Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Mich Talebzadeh
I am not aware of any configuration parameter in Spark classic to limit executor creation. Because of fault tolerance, Spark will try to recreate failed executors. Not really that familiar with the Spark operator for k8s. There may be something there. Have you considered custom monitoring and

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-19 Thread Chao Sun
Hi Mich, > Also have you got some benchmark results from your tests that you can possibly share? We only have some partial benchmark results internally so far. Once shuffle and better memory management have been introduced, we plan to publish the benchmark results (at least TPC-H) in the repo.

[Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Sri Potluri
Hello Spark Community, I am currently leveraging Spark on Kubernetes, managed by the Spark Operator, for running various Spark applications. While the system generally works well, I've encountered a challenge related to how Spark applications handle executor failures, specifically in scenarios

Re: Regarding Spark on Kubernetes(EKS)

2024-02-19 Thread Jagannath Majhi
Yes, I have gone through it. So please help me with the setup. More context: my jar file is written in Java. On Mon, Feb 19, 2024, 8:53 PM Mich Talebzadeh wrote: > Sure but first it would be beneficial to understand the way Spark works on > Kubernetes and the concepts. > > Have a look at this article

Re: Regarding Spark on Kubernetes(EKS)

2024-02-19 Thread Jagannath Majhi
I am not using any private Docker image. I am only running the jar file on EMR using the spark-submit command, and now I want to run this jar file on EKS, so can you please tell me how I can set this up? On Mon, Feb 19, 2024, 8:06 PM Jagannath Majhi < jagannath.ma...@cloud.cbnits.com> wrote: >

Re: Regarding Spark on Kubernetes(EKS)

2024-02-19 Thread Mich Talebzadeh
Sure, but first it would be beneficial to understand the way Spark works on Kubernetes and the concepts. Have a look at this article of mine: Spark on Kubernetes, A Practitioner’s Guide

Re: Regarding Spark on Kubernetes(EKS)

2024-02-19 Thread Mich Talebzadeh
OK, you have a jar file that you want to run with Spark on k8s (EKS) as the execution engine, as opposed to YARN on EMR? Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile

Re: Regarding Spark on Kubernetes(EKS)

2024-02-19 Thread Mich Talebzadeh
Where is your docker file? In the ECR container registry. If you are going to use EKS, then it needs to be accessible to all nodes of the cluster. When you build your docker image, put your jar under the $SPARK_HOME directory. Then add a line to your docker build file as below. Here I am accessing Google

Re: Regarding Spark on Kubernetes(EKS)

2024-02-19 Thread Richard Smith
I run my Spark jobs in GCP with Google Dataproc using GCS buckets. I've not used AWS, but its EMR product offers similar functionality to Dataproc. The title of your post implies your Spark cluster runs on EKS. You might be better off using EMR, see links below: EMR

Regarding Spark on Kubernetes(EKS)

2024-02-19 Thread Jagannath Majhi
Dear Spark Community, I hope this email finds you well. I am reaching out to seek assistance and guidance regarding a task I'm currently working on involving Apache Spark. I have developed a JAR file that contains some Spark applications and functionality, and I need to run this JAR file within

Re: Re-create SparkContext of SparkSession inside long-lived Spark app

2024-02-19 Thread Mich Talebzadeh
OK, got it. Someone asked a similar, though not shuffle-related, question in the Spark Slack channel. This is a simple Python code that creates shuffle files in shuffle_directory = "/tmp/spark_shuffles" and simulates working examples using a loop and periodically cleans up shuffle files older than 1
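The snippet itself is cut off in the preview; a rough sketch of that kind of loop-plus-cleanup simulation might look as follows, with the shuffle directory taken from the message and the one-hour age threshold assumed:

import os
import time
from pyspark.sql import SparkSession
from pyspark.sql.functions import rand

shuffle_directory = "/tmp/spark_shuffles"  # from the message preview
max_age_seconds = 3600                     # "older than 1 ..." is truncated; one hour assumed

spark = (
    SparkSession.builder
        .appName("shuffle-cleanup-sketch")
        .config("spark.local.dir", shuffle_directory)
        .getOrCreate()
)

def cleanup_old_files(directory, max_age):
    # Delete files under `directory` that are older than `max_age` seconds.
    now = time.time()
    for root, _, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            if now - os.path.getmtime(path) > max_age:
                os.remove(path)

for i in range(10):
    # Some shuffle-producing work per iteration
    df = spark.range(0, 1_000_000).withColumn("key", (rand() * 100).cast("int"))
    df.groupBy("key").count().collect()
    cleanup_old_files(shuffle_directory, max_age_seconds)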

Re: Re-create SparkContext of SparkSession inside long-lived Spark app

2024-02-19 Thread Saha, Daniel
Thanks for the suggestions Mich, Jörn, and Adam. The rationale for long-lived app with loop versus submitting multiple yarn applications is mainly for simplicity. Plan to run app on an multi-tenant EMR cluster alongside other yarn apps. Implementing the loop outside the Spark app will work but

Re: Re-create SparkContext of SparkSession inside long-lived Spark app

2024-02-18 Thread Mich Talebzadeh
Hi, What do you propose, or what do you think will help, when these Spark jobs are independent of each other --> So once a job/iterator is complete, there is no need to retain these shuffle files. You have a number of options to consider, starting from Spark configuration parameters and so forth

Re: Re-create SparkContext of SparkSession inside long-lived Spark app

2024-02-17 Thread Jörn Franke
You can try to shuffle to s3 using the cloud shuffle plugin for s3 (https://aws.amazon.com/blogs/big-data/introducing-the-cloud-shuffle-storage-plugin-for-apache-spark/) - the performance of the new plugin is for many spark jobs sufficient (it works also on EMR). Then you can use s3 lifecycle

Re: Re-create SparkContext of SparkSession inside long-lived Spark app

2024-02-17 Thread Adam Binford
If you're using dynamic allocation it could be caused by executors with shuffle data being deallocated before the shuffle is cleaned up. These shuffle files will never get cleaned up once that happens until the Yarn application ends. This was a big issue for us so I added support for deleting

Re: job uuid not unique

2024-02-16 Thread Mich Talebzadeh
As a bare minimum you will need to add some error trapping and exception handling! scala> import org.apache.hadoop.fs.FileAlreadyExistsException import org.apache.hadoop.fs.FileAlreadyExistsException and try your code try { df .coalesce(1) .write
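The Scala snippet above is cut off in the preview; a PySpark rendering of the same error-trapping idea, with a hypothetical output path and write format, might look like this:

from py4j.protocol import Py4JJavaError

output_path = "s3a://my-bucket/output/"  # hypothetical path

try:
    # df is the batch DataFrame being written in the job from the thread
    df.coalesce(1).write.mode("error").json(output_path)
except Py4JJavaError as e:
    # The Hadoop FileAlreadyExistsException surfaces wrapped in a Py4J error on the Python side
    if "FileAlreadyExistsException" in str(e.java_exception):
        print(f"Output already exists, skipping: {output_path}")
    else:
        raise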

Effectively append the dataset to avro directory

2024-02-16 Thread Rushikesh Kavar
Hello Community, I checked the below issue on various platforms but I could not get a satisfactory answer. I am using Spark with Java. I have a large data cluster. My application is making more than 10 API calls. Each call returns a Java list. Each list item is of the same structure (i.e. the same Java class)

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-16 Thread Mich Talebzadeh
Hi Chao, As a cool feature - Compared to standard Spark, what kind of performance gains can be expected with Comet? - Can one use Comet on k8s in conjunction with something like a Volcano addon? HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-15 Thread Mich Talebzadeh
Hi, I gather from the replies that the plugin is not currently available in the form expected, although I am aware of the shell script. Also, have you got some benchmark results from your tests that you can possibly share? Thanks, Mich Talebzadeh, Dad | Technologist | Solutions Architect |

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-14 Thread Chao Sun
Hi Praveen, We will add a "Getting Started" section in the README soon, but basically comet-spark-shell in the repo should provide a basic tool to build Comet and launch a Spark shell with it. Note that we haven't

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-14 Thread praveen sinha
Hi Chao, Is there any example app/gist/repo which can help me use this plugin. I wanted to try out some realtime aggregate performance on top of parquet and spark dataframes. Thanks and Regards Praveen On Wed, Feb 14, 2024 at 9:20 AM Chao Sun wrote: > > Out of interest what are the

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-14 Thread Chao Sun
> Out of interest what are the differences in the approach between this and > Gluten? Overall they are similar, although Gluten supports multiple backends including Velox and Clickhouse. One major difference is (obviously) Comet is based on DataFusion and Arrow, and written in Rust, while

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-13 Thread John Zhuge
Congratulations! Excellent work! On Tue, Feb 13, 2024 at 8:04 PM Yufei Gu wrote: > Absolutely thrilled to see the project going open-source! Huge congrats to > Chao and the entire team on this milestone! > > Yufei > > > On Tue, Feb 13, 2024 at 12:43 PM Chao Sun wrote: > >> Hi all, >> >> We are

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-13 Thread Yufei Gu
Absolutely thrilled to see the project going open-source! Huge congrats to Chao and the entire team on this milestone! Yufei On Tue, Feb 13, 2024 at 12:43 PM Chao Sun wrote: > Hi all, > > We are very happy to announce that Project Comet, a plugin to > accelerate Spark query execution via

Re: Facing Error org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for s3ablock-0001-

2024-02-13 Thread Mich Talebzadeh
You are getting a DiskChecker$DiskErrorException error when no new records are published to Kafka for a few days. The error indicates that the Spark application could not find a valid local directory to create temporary files for data processing. This might be due to any of these - if no records

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-13 Thread Holden Karau
This looks really cool :) Out of interest what are the differences in the approach between this and Gluten? On Tue, Feb 13, 2024 at 12:42 PM Chao Sun wrote: > Hi all, > > We are very happy to announce that Project Comet, a plugin to > accelerate Spark query execution via leveraging DataFusion

Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-13 Thread Chao Sun
Hi all, We are very happy to announce that Project Comet, a plugin to accelerate Spark query execution via leveraging DataFusion and Arrow, has now been open sourced under the Apache Arrow umbrella. Please check the project repo https://github.com/apache/arrow-datafusion-comet for more details if

Re: Facing Error org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for s3ablock-0001-

2024-02-13 Thread Bjørn Jørgensen
DiskChecker$DiskErrorException: Could not find any valid local directory for s3ablock-0001- out of space? Tue, 13 Feb 2024 at 21:24, Abhishek Singla < abhisheksingla...@gmail.com> wrote: > Hi Team, > > Could someone provide some insights into this issue? > > Regards, > Abhishek Singla > > On

Re: Facing Error org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for s3ablock-0001-

2024-02-13 Thread Abhishek Singla
Hi Team, Could someone provide some insights into this issue? Regards, Abhishek Singla On Wed, Jan 17, 2024 at 11:45 PM Abhishek Singla < abhisheksingla...@gmail.com> wrote: > Hi Team, > > Version: 3.2.2 > Java Version: 1.8.0_211 > Scala Version: 2.12.15 > Cluster: Standalone > > I am using

Re: Null pointer exception while replaying WAL

2024-02-12 Thread Mich Talebzadeh
OK. Getting a Null pointer exception while replaying the WAL! One possible reason is that the messages RDD might contain null elements, and attempting to read JSON from null values can result in an NPE. To handle this, you can add a filter before processing the RDD to remove null elements.
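A minimal sketch of that null filter on the recovered batches; the DStream name and JSON parsing step are assumptions, since the thread's actual code is not shown here:

import json

def process_rdd(rdd):
    # Drop null or empty elements before parsing JSON, otherwise a replayed
    # WAL batch containing nulls can raise an NPE downstream.
    cleaned = rdd.filter(lambda record: record is not None and len(record) > 0)
    parsed = cleaned.map(json.loads)
    # ... continue with the normal per-batch processing on `parsed`

# messages is the DStream recovered from the checkpoint/WAL in the thread:
# messages.foreachRDD(process_rdd)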

Re: Null pointer exception while replaying WAL

2024-02-12 Thread nayan sharma
Please find below code def main(args: Array[String]): Unit = { val config: Config = ConfigFactory.load() val streamC = StreamingContext.getOrCreate( checkpointDirectory, () => functionToCreateContext(config, checkpointDirectory) ) streamC.start()

Re: Null pointer exception while replaying WAL

2024-02-11 Thread Mich Talebzadeh
Hi, It is challenging to make a recommendation without further details. I am guessing you are trying to build a fault-tolerant spark application (spark structured streaming) that consumes messages from Solace? To address *NullPointerException* in the context of the provided information, you need

Null pointer exception while replaying WAL

2024-02-09 Thread nayan sharma
Hi Users, I am trying to build a fault-tolerant Spark Solace consumer. Issue: we have to restart the job due to multiple issues; load average is one of them. At that time, whatever Spark is processing or batches in the queue are lost. We can't replay it because we had already sent the ack while

Re: Building an Event-Driven Real-Time Data Processor with Spark Structured Streaming and API Integration

2024-02-09 Thread Mich Talebzadeh
The full code is available from the link below https://github.com/michTalebzadeh/Event_Driven_Real_Time_data_processor_with_SSS_and_API_integration Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile

Building an Event-Driven Real-Time Data Processor with Spark Structured Streaming and API Integration

2024-02-09 Thread Mich Talebzadeh
Appreciate your thoughts on this. Personally I think Spark Structured Streaming can be used effectively in an Event Driven Architecture as well as continuous streaming. From the link here

performance of union vs insert into

2024-02-08 Thread Manish Mehra
Hello, I have an observation wherein performance of 'union' is lower when compared to multiple 'insert into' statements. Is this in line with Spark best practice? Regards Manish Mehra

[ANNOUNCE] Apache Celeborn(incubating) 0.4.0 available

2024-02-06 Thread Fu Chen
Hi all, Apache Celeborn(Incubating) community is glad to announce the new release of Apache Celeborn(Incubating) 0.4.0. Celeborn is dedicated to improving the efficiency and elasticity of different map-reduce engines and provides an elastic, high-efficient service for intermediate data including

Community over Code EU 2024 Travel Assistance Applications now open!

2024-02-03 Thread Gavin McDonald
Hello to all users, contributors and Committers! The Travel Assistance Committee (TAC) are pleased to announce that travel assistance applications for Community over Code EU 2024 are now open! We will be supporting Community over Code EU, Bratislava, Slovakia, June 3rd - 5th, 2024. TAC exists

[no subject]

2024-02-03 Thread Gavin McDonald
Hello to all users, contributors and Committers! The Travel Assistance Committee (TAC) are pleased to announce that travel assistance applications for Community over Code EU 2024 are now open! We will be supporting Community over Code EU, Bratislava, Slovakia, June 3rd - 5th, 2024. TAC exists

Re: Issue in Creating Temp_view in databricks and using spark.sql().

2024-01-31 Thread Mich Talebzadeh
I agree with what is stated. This is the gist of my understanding having tested it. When working with Spark Structured Streaming, each streaming query runs in its own separate Spark session to ensure isolation and avoid conflicts between different queries. So here I have: def process_data(self,

Re: Issue in Creating Temp_view in databricks and using spark.sql().

2024-01-31 Thread Mich Talebzadeh
hm. In your logic here def process_micro_batch(micro_batch_df, batchId) : micro_batch_df.createOrReplaceTempView("temp_view") df = spark.sql(f"select * from temp_view") return df Is this function called and if so do you check if micro_batch_df contains rows -> if

deploy spark as cluster

2024-01-31 Thread ali sharifi
Hi everyone! I followed this guide https://dev.to/mvillarrealb/creating-a-spark-standalone-cluster-with-docker-and-docker-compose-2021-update-6l4 to create a Spark cluster on an Ubuntu server with Docker. However, when I try to submit my PySpark code to the master, the jobs are registered in the

Create Custom Logs

2024-01-31 Thread PRASHANT L
Hi, I just wanted to check if there is a way to create custom logs in Spark. I want to write selective/custom log messages to S3, running spark-submit on EMR. I would not want all the Spark-generated logs ... I would just need the log messages that are logged as part of the Spark Application

Re: Issue in Creating Temp_view in databricks and using spark.sql().

2024-01-31 Thread Jungtaek Lim
Hi, Streaming query clones the spark session - when you create a temp view from DataFrame, the temp view is created under the cloned session. You will need to use micro_batch_df.sparkSession to access the cloned session. Thanks, Jungtaek Lim (HeartSaVioR) On Wed, Jan 31, 2024 at 3:29 PM
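A short PySpark sketch of that point, registering and querying the temp view through the cloned session attached to the batch DataFrame; the sink table and checkpoint location are placeholders:

def process_micro_batch(micro_batch_df, batch_id):
    # Use the cloned session attached to the micro-batch DataFrame,
    # not the outer `spark` session, to see the temp view.
    batch_session = micro_batch_df.sparkSession
    micro_batch_df.createOrReplaceTempView("temp_view")
    result = batch_session.sql("SELECT * FROM temp_view")
    result.write.mode("append").saveAsTable("target_table")  # placeholder sink

# df is the streaming DataFrame built earlier in the job
query = (
    df.writeStream
      .foreachBatch(process_micro_batch)
      .option("checkpointLocation", "/tmp/checkpoints/temp_view_demo")
      .start()
)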

randomsplit has issue?

2024-01-31 Thread second_co...@yahoo.com.INVALID
Based on this blog post https://sergei-ivanov.medium.com/why-you-should-not-use-randomsplit-in-pyspark-to-split-data-into-train-and-test-58576d539a36 , I noticed a recommendation against using randomSplit for data splitting due to data sorting. Is the information provided in the blog accurate?
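One commonly suggested workaround, if the blog's concern applies to your pipeline, is to derive the split from a seeded random column instead of randomSplit, so that both subsets come from one deterministic pass; this is a sketch under that assumption, not a summary of the blog:

from pyspark.sql import functions as F

# Caching first helps keep the random column stable across the two filters.
df_with_rand = df.withColumn("_rand", F.rand(seed=42)).cache()
train = df_with_rand.filter(F.col("_rand") < 0.8).drop("_rand")
test = df_with_rand.filter(F.col("_rand") >= 0.8).drop("_rand")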

Issue in Creating Temp_view in databricks and using spark.sql().

2024-01-30 Thread Karthick Nk
Hi Team, I am using structured streaming in PySpark in Azure Databricks; in that I am creating a temp_view from a dataframe (df.createOrReplaceTempView('temp_view')) for performing Spark SQL query transformation. In that I am facing the issue that the temp_view is not found, so as a workaround I have

[Spark SQL]: Crash when attempting to select PostgreSQL bpchar without length specifier in Spark 3.5.0

2024-01-29 Thread Lily Hahn
Hi, I’m currently migrating an ETL project to Spark 3.5.0 from 3.2.1 and ran into an issue with some of our queries that read from PostgreSQL databases. Any attempt to run a Spark SQL query that selects a bpchar without a length specifier from the source DB seems to crash:

Re: startTimestamp doesn't work when using rate-micro-batch format

2024-01-29 Thread Mich Talebzadeh
As I stated earlier on, there are alternatives that you might explore, such as socket sources for testing purposes. from pyspark.sql import SparkSession from pyspark.sql.functions import expr, when from pyspark.sql.types import StructType, StructField, LongType spark = SparkSession.builder \

Re: startTimestamp doesn't work when using rate-micro-batch format

2024-01-29 Thread Perfect Stranger
Yes, there's definitely an issue, can someone fix it? I'm not familiar with apache jira, do I need to make a bug report or what? On Mon, Jan 29, 2024 at 2:57 AM Mich Talebzadeh wrote: > OK > > This is the equivalent Python code > > from pyspark.sql import SparkSession > from

Re: startTimestamp doesn't work when using rate-micro-batch format

2024-01-28 Thread Mich Talebzadeh
OK This is the equivalent Python code from pyspark.sql import SparkSession from pyspark.sql.functions import expr, when from pyspark.sql.types import StructType, StructField, LongType from datetime import datetime spark = SparkSession.builder \ .master("local[*]") \
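The preview cuts off before the readStream call; a minimal rate-micro-batch reader using startTimestamp, roughly matching what this thread is exercising, is sketched below with illustrative values:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
        .master("local[*]")
        .appName("rate-micro-batch-startTimestamp")
        .getOrCreate()
)

stream = (
    spark.readStream
         .format("rate-micro-batch")
         .option("rowsPerBatch", 5)
         # startTimestamp is expressed in epoch milliseconds for this source
         .option("startTimestamp", 1704067200000)  # 2024-01-01 00:00:00 UTC, illustrative
         .option("advanceMillisPerBatch", 1000)
         .load()
)

query = (
    stream.writeStream
          .format("console")
          .option("truncate", "false")
          .start()
)
query.awaitTermination()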

startTimestamp doesn't work when using rate-micro-batch format

2024-01-28 Thread Perfect Stranger
I described the issue here: https://stackoverflow.com/questions/77893939/how-does-starttimestamp-option-work-for-the-rate-micro-batch-format Could someone please respond? The rate-micro-batch format doesn't seem to respect the startTimestamp option. Thanks.

subscribe

2024-01-26 Thread Sahib Aulakh
subscribe

subscribe

2024-01-26 Thread Sahib Aulakh

Re: [Structured Streaming] Avoid one microbatch delay with multiple stateful operations

2024-01-24 Thread Andrzej Zera
Hi, I'm sorry but I got confused about the inner workings of late events watermark. You're completely right. Thanks for clarifying. Regards, Andrzej Thu, 11 Jan 2024 at 13:02 Jungtaek Lim wrote: > Hi, > > The time window is closed and evicted as long as "eviction watermark" > passes the

Some optimization questions about our beloved engine Spark

2024-01-23 Thread Aissam Chia
Hi, I hope this email finds you well. Currently, I'm working on spark SQL and I have two main questions that I've been struggling with for 2 weeks now. I'm running spark on AWS EMR : 1. I'm running 30 spark applications in the same cluster. My applications are basically some SQL

Facing Error org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for s3ablock-0001-

2024-01-17 Thread Abhishek Singla
Hi Team, Version: 3.2.2 Java Version: 1.8.0_211 Scala Version: 2.12.15 Cluster: Standalone I am using Spark Streaming to read from Kafka and write to S3. The job fails with below error if there are no records published to Kafka for a few days and then there are some records published. Could

Re: unsubscribe

2024-01-17 Thread Крюков Виталий Семенович
unsubscribe From: Leandro Martelli Sent: 17 January 2024 01:53:04 To: user@spark.apache.org Subject: unsubscribe unsubscribe

unsubscribe

2024-01-16 Thread Leandro Martelli
unsubscribe

Unsubscribe

2024-01-13 Thread Andrew Redd
Unsubscribe

Re: [spark.local.dir] comma separated list does not work

2024-01-12 Thread Andrew Petersen
Actually, that did work, thanks. What I previously tried that did not work was #BSUB -env "all,SPARK_LOCAL_DIRS=/tmp,/share/,SPARK_PID_DIR=..." However, I am still getting "No space left on device" errors. It seems that I need hierarchical directories, and round robin distribution is not good

Re: [spark.local.dir] comma separated list does not work

2024-01-12 Thread Andrew Petersen
Without spaces was the first thing I tried. The information in the pdf file inspired me to try the space. On Fri, Jan 12, 2024 at 10:23 PM Koert Kuipers wrote: > try it without spaces? > export SPARK_LOCAL_DIRS="/tmp,/share/" > > On Fri, Jan 12, 2024 at 5:00 PM Andrew Petersen > wrote: >

Re: [spark.local.dir] comma separated list does not work

2024-01-12 Thread Koert Kuipers
try it without spaces? export SPARK_LOCAL_DIRS="/tmp,/share/" On Fri, Jan 12, 2024 at 5:00 PM Andrew Petersen wrote: > Hello Spark community > > SPARK_LOCAL_DIRS or > spark.local.dir > is supposed to accept a list. > > I want to list one local (fast) drive, followed by a gpfs network drive,
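For reference, the same setting expressed as a Spark conf rather than an environment variable; note that SPARK_LOCAL_DIRS (or LOCAL_DIRS on YARN), when set by the cluster manager, takes precedence over this conf:

from pyspark.sql import SparkSession

# Comma-separated list, no spaces around the comma.
spark = (
    SparkSession.builder
        .appName("local-dirs-sketch")
        .config("spark.local.dir", "/tmp,/share")
        .getOrCreate()
)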

[spark.local.dir] comma separated list does not work

2024-01-12 Thread Andrew Petersen
Hello Spark community SPARK_LOCAL_DIRS or spark.local.dir is supposed to accept a list. I want to list one local (fast) drive, followed by a gpfs network drive, similar to what is done here: https://cug.org/proceedings/cug2016_proceedings/includes/files/pap129s2-file1.pdf "Thus it is preferable

[GraphFrames Spark Package]: Why is there not a distribution for Spark 3.3?

2024-01-12 Thread Boileau, Brad
Hello, I was hoping to use a distribution of GraphFrames for AWS Glue 4 which has spark 3.3, but there is no found distribution for Spark 3.3 at this location: https://spark-packages.org/package/graphframes/graphframes Do you have any advice on the best compatible version to use for Spark 3.3?

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-11 Thread Mich Talebzadeh
Catching up a bit late on this. I mentioned optimising RocksDB as below in my earlier thread, specifically # Add RocksDB configurations here spark.conf.set("spark.sql.streaming.stateStore.providerClass",

Re: Structured Streaming Process Each Records Individually

2024-01-11 Thread Mich Talebzadeh
Hi, Let us visit the approach as some fellow members correctly highlighted the use case for spark structured streaming and two key concepts that I will mention - foreach: A method for applying custom write logic to each individual row in a streaming DataFrame or Dataset. -

Best option to process single kafka stream in parallel: PySpark Vs Dask

2024-01-11 Thread lab22
I am creating a setup to process packets from a single Kafka topic in parallel. For example, I have 3 containers (let's take 4 cores) on one VM, and from 1 Kafka topic stream I create 10 jobs depending on packet source. These packets have a small workload. 1. I can install dask in each

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-11 Thread Jungtaek Lim
If you use RocksDB state store provider, you can turn on changelog checkpoint to put the single changelog file per partition per batch. With disabling changelog checkpoint, Spark uploads newly created SST files and some log files. If compaction had happened, most SST files have to be re-uploaded.
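A sketch of enabling the RocksDB provider with changelog checkpointing as described above; changelog checkpointing is available from Spark 3.4 onward, so check the option name against your version's docs:

# spark: an existing SparkSession
spark.conf.set(
    "spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider",
)
# Upload a small changelog file per partition per batch instead of whole SST files
spark.conf.set(
    "spark.sql.streaming.stateStore.rocksdb.changelogCheckpointing.enabled",
    "true",
)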

Re: [Structured Streaming] Avoid one microbatch delay with multiple stateful operations

2024-01-11 Thread Jungtaek Lim
Hi, The time window is closed and evicted as long as "eviction watermark" passes the end of the window. Late events watermark only deals with discarding late events from "inputs". We did not introduce additional delay on the work of multiple stateful operators. We just allowed more late events to

Re: Okio Vulnerability in Spark 3.4.1

2024-01-11 Thread Bjørn Jørgensen
[SPARK-46662][K8S][BUILD] Upgrade kubernetes-client to 6.10.0 A new version of kubernetes-client with okio version 1.17.6 is now merged to master and will be in Spark 4.0. Tue, 14 Nov 2023 at 15:21, Bjørn Jørgensen wrote: > FYI > I have

Re: [Structured Streaming] Avoid one microbatch delay with multiple stateful operations

2024-01-10 Thread Ant Kutschera
Hi, Do we have any option to make streaming queries with multiple stateful operations output data without waiting for this extra iteration? One of my ideas was to force an empty microbatch to run and propagate the late events watermark without any new data. While this conceptually works, I didn't find a

Re: Structured Streaming Process Each Records Individually

2024-01-10 Thread Ant Kutschera
It might be good to first split the stream up into smaller streams, one per type. If ordering of the Kafka records is important, then you could partition them at the source based on the type, but be careful how you configure Spark to read from Kafka as that could also influence ordering. kdf

Re: Structured Streaming Process Each Records Individually

2024-01-10 Thread Mich Talebzadeh
Use an intermediate work table to land the streaming JSON data there in the first place, and then, according to the tag, store the data in the correct table. HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile

Re: Structured Streaming Process Each Records Individually

2024-01-10 Thread Khalid Mammadov
Use foreachBatch or foreach methods: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach-and-foreachbatch On Wed, 10 Jan 2024, 17:42 PRASHANT L, wrote: > Hi > I have a use case where I need to process json payloads coming from Kafka > using structured
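A compact foreachBatch sketch combining the suggestions in this thread, routing each micro-batch by a tag/type column to a per-type table; the column name, type values and table names are made up for illustration:

from pyspark.sql import functions as F

def route_by_type(batch_df, batch_id):
    batch_df.persist()
    try:
        # One destination table per record type; "type" is a hypothetical tag column
        for record_type in ["orders", "payments", "refunds"]:
            (batch_df
                .filter(F.col("type") == record_type)
                .write.mode("append")
                .saveAsTable(f"landing_{record_type}"))
    finally:
        batch_df.unpersist()

# parsed_stream is the JSON-parsed Kafka stream from the thread
query = (
    parsed_stream.writeStream
        .foreachBatch(route_by_type)
        .option("checkpointLocation", "/tmp/checkpoints/route_by_type")
        .start()
)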
