Dark mode logo

2024-03-06 Thread Mike Drob
Hi Spark Community, I see that y'all have a logo uploaded to https://www.apache.org/logos/#spark but it has black text. Is there an official, alternate logo with lighter text that would look good on a dark background? Thanks, Mike

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Pan,Bingkun
Okay, let me double-check it carefully. Thank you very much for your help! From: Jungtaek Lim Sent: March 5, 2024, 21:56:41 To: Pan,Bingkun Cc: Dongjoon Hyun; dev; user Subject: Re: [ANNOUNCE] Apache Spark 3.5.1 released Yeah the approach seems OK to me - please double

Re: It seems --py-files only takes the first two arguments. Can someone please confirm?

2024-03-05 Thread Mich Talebzadeh
Sorry, I forgot. The example below is catered for YARN mode. If your application code primarily consists of Python files and does not require a separate virtual environment with specific dependencies, you can use the --py-files argument in spark-submit: spark-submit --verbose \ --master yarn \

S3 committer for dynamic partitioning

2024-03-05 Thread Nikhil Goyal
Hi folks, We have been following this doc for writing data from Spark Job to S3. However it fails writing to dynamic partitions. Any suggestions on what config should be used to avoid the cost of renaming in S3?
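
The thread does not say which committer is in use; as one hedged option, the Hadoop S3A "partitioned" committer avoids the rename-based commit that makes dynamic-partition writes to S3 slow. A minimal sketch, assuming the hadoop-aws and spark-hadoop-cloud jars are on the classpath (bucket and column names are placeholders):

    # Sketch only: S3A "partitioned" committer, which writes task output directly
    # to the destination partitions instead of renaming after the job.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("s3a-partitioned-committer-sketch")
        .config("spark.sql.sources.commitProtocolClass",
                "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
        .config("spark.sql.parquet.output.committer.class",
                "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
        .config("spark.hadoop.fs.s3a.committer.name", "partitioned")
        # how existing data in a partition is handled: fail | append | replace
        .config("spark.hadoop.fs.s3a.committer.staging.conflict-mode", "replace")
        .getOrCreate()
    )

    # df.write.partitionBy("dt").parquet("s3a://my-bucket/output/")  # placeholder path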

Re: It seems --py-files only takes the first two arguments. Can someone please confirm?

2024-03-05 Thread Mich Talebzadeh
I use a zip file personally and pass the application name (in your case main.py) as the last input line, like below. APPLICATION is your main.py. It does not need to be called main.py; it could be anything, like testpython.py. CODE_DIRECTORY_CLOUD="gs://spark-on-k8s/codes" ## replace gs with s3 #

It seems --py-files only takes the first two arguments. Can someone please confirm?

2024-03-05 Thread Pedro, Chuck
Hi all, I am working in Databricks. When I submit a Spark job with the --py-files argument, it seems the first two are read in but the third is ignored. "--py-files", "s3://some_path/appl_src.py", "s3://some_path/main.py", "s3://a_different_path/common.py", I can see the first two acknowledged
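
For context on the behaviour reported above: with plain spark-submit, --py-files takes a single comma-separated list, and the first bare path after the options is treated as the application itself, which would explain why a third file passed as a separate argument appears to be ignored (Databricks job submission may behave differently). A hedged sketch of a programmatic alternative, SparkContext.addPyFile, with paths taken from the message as placeholders:

    # Sketch: ship extra Python dependencies from inside the application instead of
    # relying on --py-files.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("py-files-sketch").getOrCreate()
    spark.sparkContext.addPyFile("s3://some_path/appl_src.py")
    spark.sparkContext.addPyFile("s3://a_different_path/common.py")

    # import appl_src  # files added via addPyFile become importable on the executors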

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Jungtaek Lim
Yeah the approach seems OK to me - please double check that the doc generation in Spark repo won't fail after the move of the js file. Other than that, it would be probably just a matter of updating the release process. On Tue, Mar 5, 2024 at 7:24 PM Pan,Bingkun wrote: > Okay, I see. > >

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Pan,Bingkun
Okay, I see. Perhaps we can solve this confusion by sharing the same file `version.json` across `all versions` in the `Spark website repo`? Make each version of the documentation display the `same` data in the dropdown menu. From: Jungtaek Lim Sent: March 5, 2024

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Jungtaek Lim
Let me be more specific. We have two active release version lines, 3.4.x and 3.5.x. We just released Spark 3.5.1, having a dropdown as 3.5.1 and 3.4.2 given the fact the last version of 3.4.x is 3.4.2. After a month we released Spark 3.4.3. In the dropdown of Spark 3.4.3, there will be 3.5.1 and

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Pan,Bingkun
Based on my understanding, we should not update versions that have already been released, such as the situation you mentioned: `But what about dropout of version D? Should we add E in the dropdown?` We only need to record the latest `version.json` file that has already been published at the

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Jungtaek Lim
But this does not answer my question about updating the dropdown for the doc of "already released versions", right? Let's say we just released version D, and the dropdown has version A, B, C. We have another release tomorrow as version E, and it's probably easy to add A, B, C, D in the dropdown

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Pan,Bingkun
According to my understanding, the original intention of this feature is that when a user has landed on the PySpark documentation and finds that the version they are currently viewing is not the one they want, they can easily jump to the desired version by clicking the drop-down box. Additionally, in

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-04 Thread yangjie01
That sounds like a great suggestion. From: Jungtaek Lim Date: Tuesday, March 5, 2024, 10:46 To: Hyukjin Kwon Cc: yangjie01, Dongjoon Hyun, dev, user Subject: Re: [ANNOUNCE] Apache Spark 3.5.1 released Yes, it's relevant to that PR. I wonder, if we want to expose version switcher, it should be in

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-04 Thread Jungtaek Lim
Yes, it's relevant to that PR. I wonder, if we want to expose version switcher, it should be in versionless doc (spark-website) rather than the doc being pinned to a specific version. On Tue, Mar 5, 2024 at 11:18 AM Hyukjin Kwon wrote: > Is this related to

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-04 Thread Hyukjin Kwon
Is this related to https://github.com/apache/spark/pull/42428? cc @Yang,Jie(INF) On Mon, 4 Mar 2024 at 22:21, Jungtaek Lim wrote: > Shall we revisit this functionality? The API doc is built with individual > versions, and for each individual version we depend on other released > versions.

Working with a text file that is both compressed by bz2 followed by zip in PySpark

2024-03-04 Thread Mich Talebzadeh
I have downloaded Amazon reviews for sentiment analysis from here. The file is not particularly large (just over 500MB) but comes in the following format: test.ft.txt.bz2.zip. So it is a text file that is compressed by bz2 followed by zip. Now I would like to do all these operations in PySpark. In
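
Spark and Hadoop read bzip2-compressed text natively, but they do not look inside a zip container, so one approach is to unpack the outer zip first and then point Spark at the inner .bz2. A minimal sketch, assuming the archive has been downloaded locally and the inner member is named test.ft.txt.bz2 (an assumption):

    # Sketch: extract the outer .zip on the driver, then read the inner .bz2 directly.
    import zipfile
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bz2-in-zip-sketch").getOrCreate()

    with zipfile.ZipFile("test.ft.txt.bz2.zip") as zf:
        zf.extract("test.ft.txt.bz2", path="/tmp")   # local paths are placeholders

    df = spark.read.text("file:///tmp/test.ft.txt.bz2")
    df.show(5, truncate=False)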

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-03 Thread Jungtaek Lim
Shall we revisit this functionality? The API doc is built with individual versions, and for each individual version we depend on other released versions. This does not seem to be right to me. Also, the functionality is only in PySpark API doc which does not seem to be consistent as well. I don't

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread Peter Toth
Congratulations and thanks Jungtaek for driving this! Xinrong Meng wrote (Fri, Mar 1, 2024, 5:24): > Congratulations! > > Thanks, > Xinrong > > On Thu, Feb 29, 2024 at 11:16 AM Dongjoon Hyun > wrote: > >> Congratulations! >> >> Bests, >> Dongjoon. >> >> On Wed, Feb 28, 2024 at

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread Jungtaek Lim
Thanks for reporting - this is odd - the dropdown did not exist in other recent releases. https://spark.apache.org/docs/3.5.0/api/python/index.html https://spark.apache.org/docs/3.4.2/api/python/index.html https://spark.apache.org/docs/3.3.4/api/python/index.html Looks like the dropdown feature

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread Dongjoon Hyun
BTW, Jungtaek. PySpark document seems to show a wrong branch. At this time, `master`. https://spark.apache.org/docs/3.5.1/api/python/index.html PySpark Overview Date: Feb 24, 2024 Version: master

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread John Zhuge
Excellent work, congratulations! On Wed, Feb 28, 2024 at 10:12 PM Dongjoon Hyun wrote: > Congratulations! > > Bests, > Dongjoon. > > On Wed, Feb 28, 2024 at 11:43 AM beliefer wrote: > >> Congratulations! >> >> >> >> At 2024-02-28 17:43:25, "Jungtaek Lim" >> wrote: >> >> Hi everyone, >> >> We

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread Prem Sahoo
Congratulations Sent from my iPhoneOn Feb 29, 2024, at 4:54 PM, Xinrong Meng wrote:Congratulations!Thanks,XinrongOn Thu, Feb 29, 2024 at 11:16 AM Dongjoon Hyun wrote:Congratulations!Bests,Dongjoon.On Wed, Feb 28, 2024 at 11:43 AM beliefer

Re: pyspark dataframe join with two different data type

2024-02-29 Thread Mich Talebzadeh
This is what you want, how to join two DFs with a string column in one and an array of strings in the other, keeping only rows where the string is present in the array. from pyspark.sql import SparkSession from pyspark.sql import Row from pyspark.sql.functions import expr spark =
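
Mich's code is truncated in the archive; a minimal sketch of the same idea using an expr-based join condition (column names are made up for illustration):

    # Sketch: inner-join a string column against an array-of-strings column, keeping
    # only rows where the string appears in the array.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import expr

    spark = SparkSession.builder.appName("string-vs-array-join-sketch").getOrCreate()

    df_str = spark.createDataFrame([("a",), ("b",), ("x",)], ["key"])
    df_arr = spark.createDataFrame([(["a", "c"], 1), (["b"], 2)], ["keys", "payload"])

    joined = df_str.join(df_arr, expr("array_contains(keys, key)"), "inner")
    joined.show(truncate=False)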

pyspark dataframe join with two different data type

2024-02-29 Thread Karthick Nk
Hi All, I have two dataframes with the structure below, and I have to join them. The scenario: the join column is a string in one dataframe, while in the other it is an array of strings, so we have to inner join the two dataframes and keep the rows where the string value is present in the array of strings

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread Xinrong Meng
Congratulations! Thanks, Xinrong On Thu, Feb 29, 2024 at 11:16 AM Dongjoon Hyun wrote: > Congratulations! > > Bests, > Dongjoon. > > On Wed, Feb 28, 2024 at 11:43 AM beliefer wrote: > >> Congratulations! >> >> >> >> At 2024-02-28 17:43:25, "Jungtaek Lim" >> wrote: >> >> Hi everyone, >> >> We

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-28 Thread Dongjoon Hyun
Congratulations! Bests, Dongjoon. On Wed, Feb 28, 2024 at 11:43 AM beliefer wrote: > Congratulations! > > > > At 2024-02-28 17:43:25, "Jungtaek Lim" > wrote: > > Hi everyone, > > We are happy to announce the availability of Spark 3.5.1! > > Spark 3.5.1 is a maintenance release containing

Re: [External] Re: Issue of spark with antlr version

2024-02-28 Thread Chawla, Parul
Hi, can we get the Spark version on which this is resolved? From: Bjørn Jørgensen Sent: Tuesday, February 27, 2024 7:05:36 PM To: Sahni, Ashima Cc: Chawla, Parul ; user@spark.apache.org ; Misra Parashar, Jyoti ; Mekala, Rajesh ; Grandhi, Venkatesh ; George,

Re: [External] Re: Issue of spark with antlr version

2024-02-28 Thread Bjørn Jørgensen
[image: image.png] On Wed, Feb 28, 2024 at 11:28, Chawla, Parul wrote: > > Hi, > Can we get the Spark version on which this is resolved? > -- > *From:* Bjørn Jørgensen > *Sent:* Tuesday, February 27, 2024 7:05:36 PM > *To:* Sahni, Ashima > *Cc:* Chawla, Parul ;

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-28 Thread beliefer
Congratulations! At 2024-02-28 17:43:25, "Jungtaek Lim" wrote: Hi everyone, We are happy to announce the availability of Spark 3.5.1! Spark 3.5.1 is a maintenance release containing stability fixes. This release is based on the branch-3.5 maintenance branch of Spark. We strongly

[ANNOUNCE] Apache Spark 3.5.1 released

2024-02-28 Thread Jungtaek Lim
Hi everyone, We are happy to announce the availability of Spark 3.5.1! Spark 3.5.1 is a maintenance release containing stability fixes. This release is based on the branch-3.5 maintenance branch of Spark. We strongly recommend all 3.5 users to upgrade to this stable release. To download Spark

Re: [Spark Core] Potential bug in JavaRDD#countByValue

2024-02-27 Thread Mich Talebzadeh
Hi, Quick observations from what you have provided - The observed discrepancy between rdd.count() and rdd.map(Item::getType).countByValue() in distributed mode suggests a potential aggregation issue with countByValue(). The correct results in local mode give credence to this theory. - Workarounds
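
The list of workarounds is truncated in the archive. One manual cross-check that avoids countByValue's driver-side map is a map + reduceByKey aggregation; a PySpark-flavoured sketch (the thread's code is Java, and the sample data here is invented):

    # Sketch: manual equivalent of countByValue(), useful for cross-checking counts
    # in distributed mode against rdd.count().
    from operator import add
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("countByValue-check-sketch").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize(["ads", "ads", "clicks", "views"])  # stand-in for Item::getType values

    counts_by_type = rdd.map(lambda t: (t, 1)).reduceByKey(add).collectAsMap()
    assert sum(counts_by_type.values()) == rdd.count()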

[Spark Core] Potential bug in JavaRDD#countByValue

2024-02-27 Thread Stuart Fehr
Hello, I recently encountered a bug with the results from JavaRDD#countByValue that does not reproduce when running locally. For background, we are running a Spark 3.5.0 job on AWS EMR 7.0.0. The code in question is something like this: JavaRDD rdd = // ... > rdd.count(); // 75187 // Get the

Re: Issue of spark with antlr version

2024-02-27 Thread Bjørn Jørgensen
[SPARK-44366][BUILD] Upgrade antlr4 to 4.13.1 On Tue, Feb 27, 2024 at 13:25, Sahni, Ashima wrote: > Hi Team, > > > > Can you please let us know the update on below. > > > > Thanks, > > Ashima > > > > *From:* Chawla, Parul > *Sent:* Sunday, February

Re: Issue of spark with antlr version

2024-02-27 Thread Mich Talebzadeh
Hi, You have provided little information about where Spark fits in here. So I am guessing :) Data Source (JSON, XML, log file, etc.) --> Preprocessing (Spark jobs for filtering, cleaning, etc.)? --> Antlr Parser (Generated tool) --> Extracted Data (Mapped to model) --> Spring Data Model (Java

RE: Issue of spark with antlr version

2024-02-27 Thread Sahni, Ashima
Hi Team, Can you please let us know the update on below. Thanks, Ashima From: Chawla, Parul Sent: Sunday, February 25, 2024 11:57 PM To: user@spark.apache.org Cc: Sahni, Ashima ; Misra Parashar, Jyoti Subject: Issue of spark with antlr version Hi Spark Team, Our application is currently

Unsubscribe

2024-02-27 Thread benson fang
Unsubscribe Regards

Re: Bugs with joins and SQL in Structured Streaming

2024-02-27 Thread Andrzej Zera
Hi, Yes, I tested all of them on Spark 3.5. Regards, Andrzej On Mon, Feb 26, 2024 at 23:24 Mich Talebzadeh wrote: > Hi, > > These are all on spark 3.5, correct? > > Mich Talebzadeh, > Dad | Technologist | Solutions Architect | Engineer > London > United Kingdom > > >view my Linkedin

Re: Bugs with joins and SQL in Structured Streaming

2024-02-26 Thread Mich Talebzadeh
Hi, These are all on spark 3.5, correct? Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* The

Bugs with joins and SQL in Structured Streaming

2024-02-26 Thread Andrzej Zera
Hey all, I've been using Structured Streaming in production for almost a year already and I want to share the bugs I found in this time. I created a test for each of the issues and put them all here: https://github.com/andrzejzera/spark-bugs/tree/main/spark-3.5/src/test/scala I split the issues

Re: Bintray replacement for spark-packages.org

2024-02-25 Thread Richard Eggert
I've been trying to obtain clarification on the terms of use regarding repo.spark-packages.org. I emailed feedb...@spark-packages.org two weeks ago, but have not heard back. Whom should I contact? On Mon, Apr 26, 2021 at 8:13 AM Bo Zhang wrote: > Hi Apache Spark users, > > As you might know,

Issue of spark with antlr version

2024-02-25 Thread Chawla, Parul
Hi Spark Team, Our application is currently using Spring Framework 5.3.31. To upgrade it to 6.x, as per application dependency we must upgrade the Spark and Hibernate jars as well. With the Hibernate-compatible upgrade, the dependent Antlr4 jar version has been upgraded to 4.10.1, but there's no

unsubscribe

2024-02-24 Thread Ameet Kini

Re: job uuid not unique

2024-02-24 Thread Xin Zhang
unsubscribe On Sat, Feb 17, 2024 at 3:04 AM Рамик И wrote: > > Hi > I'm using Spark Streaming to read from Kafka and write to S3. Sometimes I > get errors when writing org.apache.hadoop.fs.FileAlreadyExistsException. > > Spark version: 3.5.0 > scala version : 2.13.8 > Cluster: k8s > >

Re: AQE coalesce 60G shuffle data into a single partition

2024-02-24 Thread Enrico Minack
Hi Shay, maybe this is related to the small number of output rows (1,250) of the last exchange step that consume those 60GB shuffle data. Looks like your outer transformation is something like df.groupBy($"id").agg(collect_list($"prop_name")) Have you tried adding a repartition as an attempt
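
Enrico's suggestion is cut off in the archive; a hedged sketch of what an explicit repartition before the aggregation might look like (column names and the partition count are placeholders, not from the thread):

    # Sketch: force a wider exchange before the aggregation so the shuffle is not
    # handled in a single partition.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import collect_list

    spark = SparkSession.builder.appName("repartition-before-agg-sketch").getOrCreate()
    df = spark.createDataFrame([(1, "a"), (1, "b"), (2, "c")], ["id", "prop_name"])

    result = (
        df.repartition(400, "id")          # the partition count is only illustrative
          .groupBy("id")
          .agg(collect_list("prop_name").alias("prop_names"))
    )
    result.show(truncate=False)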

Re: [Beginner Debug]: Executor OutOfMemoryError

2024-02-23 Thread Mich Talebzadeh
Seems like you are having memory issues. Examine your settings. 1. It appears that your driver memory setting is too high. It should be a fraction of the total memory provided by YARN 2. Use the Spark UI to monitor the job's memory consumption. Check the Storage tab to see how memory is
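
As a starting point for experimenting with the settings Mich mentions, a hedged sketch with placeholder values (not recommendations; on YARN the requested memory plus overhead must fit within the container sizes YARN can allocate):

    # Sketch: illustrative memory settings. Note that spark.driver.memory usually has
    # to be set at submit time (spark-submit --driver-memory); setting it from inside
    # the application may have no effect once the driver JVM is already running.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("memory-tuning-sketch")
        .config("spark.executor.memory", "8g")
        .config("spark.executor.memoryOverhead", "1g")
        .config("spark.sql.shuffle.partitions", "400")   # more, smaller shuffle tasks
        .getOrCreate()
    )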

[Beginner Debug]: Executor OutOfMemoryError

2024-02-22 Thread Shawn Ligocki
Hi I'm new to Spark and I'm running into a lot of OOM issues while trying to scale up my first Spark application. I am running into these issues with only 1% of the final expected data size. Can anyone help me understand how to properly configure Spark to use limited memory or how to debug which

Re: unsubscribe

2024-02-21 Thread Xin Zhang
unsubscribe On Tue, Feb 20, 2024 at 9:44 PM kritika jain wrote: > Unsubscribe > > On Tue, 20 Feb 2024, 3:18 pm Крюков Виталий Семенович, > wrote: > >> >> unsubscribe >> >> >> -- Zhang Xin(张欣) Email:josseph.zh...@gmail.com

Re: Spark 4.0 Query Analyzer Bug Report

2024-02-21 Thread Mich Talebzadeh
Indeed, valid points were raised, including the potential typo in the Spark version. I suggest, in the meantime, you look at so-called alternative debugging methods: - Simpler explain(): try basic explain() or explain("extended"). This might provide a less detailed, but

Kafka-based Spark Streaming and Vertex AI for Sentiment Analysis

2024-02-21 Thread Mich Talebzadeh
I am working on a pet project to implement a real-time sentiment analysis system for analyzing customer reviews. It leverages Kafka for data ingestion, Spark Structured Streaming (SSS) for real-time processing, and Vertex AI for sentiment analysis and potential action triggers. *Features* -

[ANNOUNCE] Apache Kyuubi 1.8.1 is available

2024-02-20 Thread Cheng Pan
Hi all, The Apache Kyuubi community is pleased to announce that Apache Kyuubi 1.8.1 has been released! Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses. Kyuubi provides a pure SQL gateway through Thrift JDBC/ODBC and RESTful

Re: Spark 3.3 Query Analyzer Bug Report

2024-02-20 Thread Sharma, Anup
Apologies. Issue is seen after we upgraded from Spark 3.1 to Spark 3.3. The same query runs fine on Spark 3.1. Omit the Spark version mentioned in email subject earlier. Anup Error trace: query_result.explain(extended=True)\n File \"…/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py\"

Re: Spark 4.0 Query Analyzer Bug Report

2024-02-20 Thread Holden Karau
Do you mean Spark 3.4? 4.0 is very much not released yet. Also it would help if you could share your query & more of the logs leading up to the error. On Tue, Feb 20, 2024 at 3:07 PM Sharma, Anup wrote: > Hi Spark team, > > > > We ran into a dataframe issue after upgrading from spark 3.1 to 4.

Spark 4.0 Query Analyzer Bug Report

2024-02-20 Thread Sharma, Anup
Hi Spark team, We ran into a dataframe issue after upgrading from spark 3.1 to 4. query_result.explain(extended=True)\n File \"…/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py\" raise Py4JJavaError(\npy4j.protocol.Py4JJavaError: An error occurred while calling

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-20 Thread Manoj Kumar
Dear @Chao Sun, I trust you're doing well. Having worked extensively with Spark Nvidia Rapids, Velox, and Gluten, I'm now contemplating Comet's potential advantages over Velox in terms of performance and unique features. While Rapids leverages GPUs effectively, Gazelle's Intel AVX512 intrinsics

Re: unsubscribe

2024-02-20 Thread kritika jain
Unsubscribe On Tue, 20 Feb 2024, 3:18 pm Крюков Виталий Семенович, wrote: > > unsubscribe > > >

unsubscribe

2024-02-20 Thread Крюков Виталий Семенович
unsubscribe

Community Over Code Asia 2024 Travel Assistance Applications now open!

2024-02-20 Thread Gavin McDonald
Hello to all users, contributors and Committers! The Travel Assistance Committee (TAC) are pleased to announce that travel assistance applications for Community over Code Asia 2024 are now open! We will be supporting Community over Code Asia, Hangzhou, China July 26th - 28th, 2024. TAC exists

Re: [Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Mich Talebzadeh
Thanks for your kind words, Sri. Well, it is true that as yet Spark on Kubernetes is not on par with Spark on YARN in maturity, and essentially Spark on Kubernetes is still a work in progress. So in the first place, IMO one needs to think about why executors are failing. What causes this behaviour? Is it the

Re: [Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Cheng Pan
Spark has supported the window-based executor failure-tracking mechanism for YARN for a long time, SPARK-41210[1][2] (included in 3.5.0) extended this feature to K8s. [1] https://issues.apache.org/jira/browse/SPARK-41210 [2] https://github.com/apache/spark/pull/38732 Thanks, Cheng Pan > On
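
As I understand SPARK-41210, the window-based tracking is driven by two configs that became engine-generic in 3.5.0; a hedged sketch of enabling them (the values are placeholders):

    # Sketch: fail the application after N executor failures within a sliding window,
    # instead of recreating executors indefinitely (per SPARK-41210, Spark 3.5.0+).
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("executor-failure-tracking-sketch")
        .config("spark.executor.maxNumFailures", "10")
        .config("spark.executor.failuresValidityInterval", "5m")
        .getOrCreate()
    )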

Re: [Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Sri Potluri
Dear Mich, Thank you for your detailed response and the suggested approach to handling retry logic. I appreciate you taking the time to outline the method of embedding custom retry mechanisms directly into the application code. While the solution of wrapping the main logic of the Spark job in a

Re: [Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Mich Talebzadeh
Went through your issue with the code running on k8s. When an executor of a Spark application fails, the system attempts to maintain the desired level of parallelism by automatically recreating a new executor to replace the failed one. While this behavior is beneficial for transient errors,

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-19 Thread Mich Talebzadeh
Ok thanks for your clarifications Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* The information

Re: [Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Mich Talebzadeh
Not that I am aware of any configuration parameter in Spark classic to limit executor creation. Because of fault tolerance Spark will try to recreate failed executors. Not really that familiar with the Spark operator for k8s. There may be something there. Have you considered custom monitoring and

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-19 Thread Chao Sun
Hi Mich, > Also have you got some benchmark results from your tests that you can possibly share? We only have some partial benchmark results internally so far. Once shuffle and better memory management have been introduced, we plan to publish the benchmark results (at least TPC-H) in the repo.

[Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Sri Potluri
Hello Spark Community, I am currently leveraging Spark on Kubernetes, managed by the Spark Operator, for running various Spark applications. While the system generally works well, I've encountered a challenge related to how Spark applications handle executor failures, specifically in scenarios

Re: Regarding Spark on Kubernetes(EKS)

2024-02-19 Thread Jagannath Majhi
Yes, I have gone through it. So please help me with the setup. More context - my jar file is written in Java. On Mon, Feb 19, 2024, 8:53 PM Mich Talebzadeh wrote: > Sure, but first it would be beneficial to understand the way Spark works on > Kubernetes and the concepts > > Have a look at this article

Re: Regarding Spark on Kubernetes(EKS)

2024-02-19 Thread Jagannath Majhi
I am not using any private Docker image. I am only running the jar file in EMR using the spark-submit command, and now I want to run this jar file in EKS, so can you please tell me how I can set this up? On Mon, Feb 19, 2024, 8:06 PM Jagannath Majhi < jagannath.ma...@cloud.cbnits.com> wrote: >

Re: Regarding Spark on Kubernetes(EKS)

2024-02-19 Thread Mich Talebzadeh
Sure, but first it would be beneficial to understand the way Spark works on Kubernetes and the concepts. Have a look at this article of mine: Spark on Kubernetes, A Practitioner’s Guide

Re: Regarding Spark on Kubernetes(EKS)

2024-02-19 Thread Mich Talebzadeh
OK you have a jar file that you want to work with when running using Spark on k8s as the execution engine (EKS) as opposed to YARN on EMR as the execution engine? Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile

Re: Regarding Spark on Kubernetes(EKS)

2024-02-19 Thread Mich Talebzadeh
Where is your Docker file? In the ECR container registry? If you are going to use EKS, then it needs to be accessible to all nodes of the cluster. When you build your Docker image, put your jar under the $SPARK_HOME directory. Then add a line to your Docker build file as below. Here I am accessing Google

Re: Regarding Spark on Kubernetes(EKS)

2024-02-19 Thread Richard Smith
I run my Spark jobs in GCP with Google Dataproc using GCS buckets. I've not used AWS, but its EMR product offers similar functionality to Dataproc. The title of your post implies your Spark cluster runs on EKS. You might be better off using EMR, see links below: EMR

Regarding Spark on Kubernetes(EKS)

2024-02-19 Thread Jagannath Majhi
Dear Spark Community, I hope this email finds you well. I am reaching out to seek assistance and guidance regarding a task I'm currently working on involving Apache Spark. I have developed a JAR file that contains some Spark applications and functionality, and I need to run this JAR file within

Re: Re-create SparkContext of SparkSession inside long-lived Spark app

2024-02-19 Thread Mich Talebzadeh
OK, got it. Someone asked a similar, though not shuffle-related, question in the Spark Slack channel. This is simple Python code that creates shuffle files in shuffle_directory = "/tmp/spark_shuffles", simulates working examples using a loop, and periodically cleans up shuffle files older than 1
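
The Python snippet itself is not included in the archived message; a hedged sketch of what such a periodic cleanup loop might look like, assuming a one-hour age threshold and the /tmp/spark_shuffles directory named above:

    # Sketch: each loop iteration stands in for one independent job; old shuffle files
    # are removed once they exceed the age threshold.
    import os
    import time

    shuffle_directory = "/tmp/spark_shuffles"
    max_age_seconds = 3600

    def clean_old_shuffle_files(directory, max_age):
        now = time.time()
        for root, _, files in os.walk(directory):
            for name in files:
                path = os.path.join(root, name)
                if now - os.path.getmtime(path) > max_age:
                    os.remove(path)

    while True:
        # ... run one self-contained Spark job here ...
        clean_old_shuffle_files(shuffle_directory, max_age_seconds)
        time.sleep(600)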

Re: Re-create SparkContext of SparkSession inside long-lived Spark app

2024-02-19 Thread Saha, Daniel
Thanks for the suggestions Mich, Jörn, and Adam. The rationale for long-lived app with loop versus submitting multiple yarn applications is mainly for simplicity. Plan to run app on an multi-tenant EMR cluster alongside other yarn apps. Implementing the loop outside the Spark app will work but

Re: Re-create SparkContext of SparkSession inside long-lived Spark app

2024-02-18 Thread Mich Talebzadeh
Hi, What do you propose, or what do you think will help, when these Spark jobs are independent of each other --> so once a job/iteration is complete, there is no need to retain these shuffle files? You have a number of options to consider, starting from Spark configuration parameters and so forth

Re: Re-create SparkContext of SparkSession inside long-lived Spark app

2024-02-17 Thread Jörn Franke
You can try to shuffle to s3 using the cloud shuffle plugin for s3 (https://aws.amazon.com/blogs/big-data/introducing-the-cloud-shuffle-storage-plugin-for-apache-spark/) - the performance of the new plugin is for many spark jobs sufficient (it works also on EMR). Then you can use s3 lifecycle

Re: Re-create SparkContext of SparkSession inside long-lived Spark app

2024-02-17 Thread Adam Binford
If you're using dynamic allocation it could be caused by executors with shuffle data being deallocated before the shuffle is cleaned up. These shuffle files will never get cleaned up once that happens until the Yarn application ends. This was a big issue for us so I added support for deleting

Re: job uuid not unique

2024-02-16 Thread Mich Talebzadeh
As a bare minimum you will need to add some error trapping and exception handling! scala> import org.apache.hadoop.fs.FileAlreadyExistsException import org.apache.hadoop.fs.FileAlreadyExistsException and try your code try { df .coalesce(1) .write

Effectively append the dataset to avro directory

2024-02-16 Thread Rushikesh Kavar
Hello Community, I looked into the issue below on various platforms but could not get a satisfactory answer. I am using Spark with Java on a large data cluster. My application makes more than 10 API calls. Each call returns a Java list. Each list item is of the same structure (i.e. the same Java class)

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-16 Thread Mich Talebzadeh
Hi Chao, As a cool feature - Compared to standard Spark, what kind of performance gains can be expected with Comet? - Can one use Comet on k8s in conjunction with something like a Volcano addon? HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-15 Thread Mich Talebzadeh
Hi, I gather from the replies that the plugin is not currently available in the form expected, although I am aware of the shell script. Also, have you got some benchmark results from your tests that you can possibly share? Thanks, Mich Talebzadeh, Dad | Technologist | Solutions Architect |

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-14 Thread Chao Sun
Hi Praveen, We will add a "Getting Started" section in the README soon, but basically comet-spark-shell in the repo should provide a basic tool to build Comet and launch a Spark shell with it. Note that we haven't

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-14 Thread praveen sinha
Hi Chao, Is there any example app/gist/repo which can help me use this plugin? I wanted to try out some real-time aggregate performance on top of Parquet and Spark dataframes. Thanks and Regards Praveen On Wed, Feb 14, 2024 at 9:20 AM Chao Sun wrote: > > Out of interest what are the

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-14 Thread Chao Sun
> Out of interest what are the differences in the approach between this and > Glutten? Overall they are similar, although Gluten supports multiple backends including Velox and Clickhouse. One major difference is (obviously) Comet is based on DataFusion and Arrow, and written in Rust, while

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-13 Thread John Zhuge
Congratulations! Excellent work! On Tue, Feb 13, 2024 at 8:04 PM Yufei Gu wrote: > Absolutely thrilled to see the project going open-source! Huge congrats to > Chao and the entire team on this milestone! > > Yufei > > > On Tue, Feb 13, 2024 at 12:43 PM Chao Sun wrote: > >> Hi all, >> >> We are

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-13 Thread Yufei Gu
Absolutely thrilled to see the project going open-source! Huge congrats to Chao and the entire team on this milestone! Yufei On Tue, Feb 13, 2024 at 12:43 PM Chao Sun wrote: > Hi all, > > We are very happy to announce that Project Comet, a plugin to > accelerate Spark query execution via

Re: Facing Error org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for s3ablock-0001-

2024-02-13 Thread Mich Talebzadeh
You are getting the DiskChecker$DiskErrorException error when no new records are published to Kafka for a few days. The error indicates that the Spark application could not find a valid local directory in which to create temporary files for data processing. This might be due to any of these - if no records
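
The s3ablock-* files in the error are the S3A upload buffer; pointing that buffer and Spark's scratch space at a volume with free space is one thing to check. A hedged sketch with placeholder paths:

    # Sketch: relocate the S3A block buffer (source of the s3ablock-0001- temp files)
    # and Spark's local scratch directory to a disk with enough free space.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("s3ablock-buffer-dir-sketch")
        .config("spark.local.dir", "/data/spark-tmp")
        .config("spark.hadoop.fs.s3a.buffer.dir", "/data/s3a-buffer")
        # buffering uploads in memory instead of on disk is another option to weigh:
        # .config("spark.hadoop.fs.s3a.fast.upload.buffer", "bytebuffer")
        .getOrCreate()
    )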

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-13 Thread Holden Karau
This looks really cool :) Out of interest what are the differences in the approach between this and Glutten? On Tue, Feb 13, 2024 at 12:42 PM Chao Sun wrote: > Hi all, > > We are very happy to announce that Project Comet, a plugin to > accelerate Spark query execution via leveraging DataFusion

Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-13 Thread Chao Sun
Hi all, We are very happy to announce that Project Comet, a plugin to accelerate Spark query execution via leveraging DataFusion and Arrow, has now been open sourced under the Apache Arrow umbrella. Please check the project repo https://github.com/apache/arrow-datafusion-comet for more details if

Re: Facing Error org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for s3ablock-0001-

2024-02-13 Thread Bjørn Jørgensen
DiskChecker$DiskErrorException: Could not find any valid local directory for s3ablock-0001- out of space? On Tue, Feb 13, 2024 at 21:24, Abhishek Singla < abhisheksingla...@gmail.com> wrote: > Hi Team, > > Could someone provide some insights into this issue? > > Regards, > Abhishek Singla > > On

Re: Facing Error org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for s3ablock-0001-

2024-02-13 Thread Abhishek Singla
Hi Team, Could someone provide some insights into this issue? Regards, Abhishek Singla On Wed, Jan 17, 2024 at 11:45 PM Abhishek Singla < abhisheksingla...@gmail.com> wrote: > Hi Team, > > Version: 3.2.2 > Java Version: 1.8.0_211 > Scala Version: 2.12.15 > Cluster: Standalone > > I am using

Re: Null pointer exception while replaying WAL

2024-02-12 Thread Mich Talebzadeh
OK. Getting a null pointer exception while replaying the WAL. One possible reason is that the messages RDD might contain null elements, and attempting to read JSON from null values can result in an NPE. To handle this, you can add a filter before processing the RDD to remove null elements.
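
A minimal sketch of the null-filter idea (the thread's consumer is Scala; this is the PySpark equivalent, and the names are hypothetical):

    # Sketch: drop null/empty messages before JSON parsing to avoid the NPE.
    import json

    def process(rdd):
        cleaned = rdd.filter(lambda msg: msg is not None and msg.strip() != "")
        parsed = cleaned.map(json.loads)
        # ... downstream processing on `parsed` ...

    # messages.foreachRDD(lambda _, rdd: process(rdd))  # messages: a DStream of strings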

Re: Null pointer exception while replaying WAL

2024-02-12 Thread nayan sharma
Please find below code def main(args: Array[String]): Unit = { val config: Config = ConfigFactory.load() val streamC = StreamingContext.getOrCreate( checkpointDirectory, () => functionToCreateContext(config, checkpointDirectory) ) streamC.start()

Re: Null pointer exception while replaying WAL

2024-02-11 Thread Mich Talebzadeh
Hi, It is challenging to make a recommendation without further details. I am guessing you are trying to build a fault-tolerant spark application (spark structured streaming) that consumes messages from Solace? To address *NullPointerException* in the context of the provided information, you need

Null pointer exception while replaying WAL

2024-02-09 Thread nayan sharma
Hi Users, I am trying to build a fault-tolerant Spark Solace consumer. Issue: we have to restart the job due to multiple issues, load average being one of them. At that time, whatever Spark is processing, or batches in the queue, is lost. We can't replay it because we had already sent the ack while

Re: Building an Event-Driven Real-Time Data Processor with Spark Structured Streaming and API Integration

2024-02-09 Thread Mich Talebzadeh
The full code is available from the link below https://github.com/michTalebzadeh/Event_Driven_Real_Time_data_processor_with_SSS_and_API_integration Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile

Building an Event-Driven Real-Time Data Processor with Spark Structured Streaming and API Integration

2024-02-09 Thread Mich Talebzadeh
I would appreciate your thoughts on this. Personally, I think Spark Structured Streaming can be used effectively in an Event-Driven Architecture (as well as in continuous streaming). From the link here

performance of union vs insert into

2024-02-08 Thread Manish Mehra
Hello, I have an observation wherein performance of 'union' is lower when compared to multiple 'insert into' statements. Is this in line with Spark best practice? Regards Manish Mehra

[ANNOUNCE] Apache Celeborn(incubating) 0.4.0 available

2024-02-06 Thread Fu Chen
Hi all, Apache Celeborn(Incubating) community is glad to announce the new release of Apache Celeborn(Incubating) 0.4.0. Celeborn is dedicated to improving the efficiency and elasticity of different map-reduce engines and provides an elastic, high-efficient service for intermediate data including

Community over Code EU 2024 Travel Assistance Applications now open!

2024-02-03 Thread Gavin McDonald
Hello to all users, contributors and Committers! The Travel Assistance Committee (TAC) are pleased to announce that travel assistance applications for Community over Code EU 2024 are now open! We will be supporting Community over Code EU, Bratislava, Slovakia, June 3th - 5th, 2024. TAC exists
