Re: External Spark shuffle service for k8s

2024-04-08 Thread Vakaris Baškirov
I see that both Uniffle and Celeborn support S3/HDFS backends, which is great. If someone is using S3/HDFS, I wonder what the advantages would be of using Celeborn or Uniffle vs the IBM shuffle service plugin or the Cloud Shuffle Storage Plugin from AWS.

Re: External Spark shuffle service for k8s

2024-04-08 Thread roryqi
Apache Uniffle (incubating) may be another solution. You can see https://github.com/apache/incubator-uniffle and https://uniffle.apache.org/blog/2023/07/21/Uniffle%20-%20New%20chapter%20for%20the%20shuffle%20in%20the%20cloud%20native%20era Mich Talebzadeh wrote on Mon, 8 Apr 2024 at 07:15: > Splendid > > The

Re: External Spark shuffle service for k8s

2024-04-07 Thread Enrico Minack
There is the Apache incubator project Uniffle: https://github.com/apache/incubator-uniffle It stores shuffle data on remote servers in memory, on local disk, or in HDFS. Cheers, Enrico. On 06.04.24 at 15:41, Mich Talebzadeh wrote: I have seen some older references for shuffle service for k8s,

Re: Idiomatic way to rate-limit streaming sources to avoid OutOfMemoryError?

2024-04-07 Thread Mich Talebzadeh
OK. This is a common issue in Spark Structured Streaming (SSS), where the source generates data faster than Spark can process it. SSS doesn't have a built-in mechanism for directly rate-limiting the incoming data stream itself. However, consider the following: - Limit the rate at which data
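
A minimal PySpark sketch of that first option, assuming a Kafka source (broker and topic names are placeholders); maxOffsetsPerTrigger caps how many offsets each micro-batch may consume:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rate-limited-read").getOrCreate()

# Cap each micro-batch at 10,000 Kafka offsets so the source cannot
# outpace the processing rate.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")   # placeholder
      .option("subscribe", "events")                      # placeholder
      .option("maxOffsetsPerTrigger", 10000)
      .load())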

Re: External Spark shuffle service for k8s

2024-04-07 Thread Mich Talebzadeh
Thanks Cheng for the heads up. I will have a look. Cheers Mich Talebzadeh, Technologist | Solutions Architect | Data Engineer | Generative AI London United Kingdom view my Linkedin profile

Re: External Spark shuffle service for k8s

2024-04-07 Thread Vakaris Baškirov
There is an IBM shuffle service plugin that supports S3: https://github.com/IBM/spark-s3-shuffle Though I would think a feature like this could be part of the main Spark repo. Trino already has out-of-the-box support for S3 exchange (shuffle), and it's very useful. Vakaris On Sun, Apr 7, 2024 at

Re: External Spark shuffle service for k8s

2024-04-07 Thread Cheng Pan
Instead of the External Shuffle Service, Apache Celeborn might be a good option as a Remote Shuffle Service for Spark on K8s. There are some useful resources you might be interested in. [1] https://celeborn.apache.org/ [2] https://www.youtube.com/watch?v=s5xOtG6Venw [3]

Re: External Spark shuffle service for k8s

2024-04-07 Thread Mich Talebzadeh
Splendid. The configurations below can be used with k8s deployments of Spark. Spark applications running on k8s can use these configurations to seamlessly access data stored in Google Cloud Storage (GCS) and Amazon S3. For Google GCS we may have spark_config_gcs = {
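
The dict is cut off above; a sketch of the kind of configuration meant, with key names from the Hadoop GCS and S3A connectors (keyfile path and credentials are placeholders):

from pyspark.sql import SparkSession

spark_config_gcs = {
    "spark.hadoop.fs.gs.impl":
        "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
    "spark.hadoop.google.cloud.auth.service.account.enable": "true",
    "spark.hadoop.google.cloud.auth.service.account.json.keyfile":
        "/path/to/key.json",  # placeholder
}
spark_config_s3 = {
    "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "spark.hadoop.fs.s3a.access.key": "...",  # or rely on an instance profile / IRSA
    "spark.hadoop.fs.s3a.secret.key": "...",
}

builder = SparkSession.builder.appName("gcs-and-s3")
for k, v in {**spark_config_gcs, **spark_config_s3}.items():
    builder = builder.config(k, v)
spark = builder.getOrCreate()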

Re: External Spark shuffle service for k8s

2024-04-06 Thread Mich Talebzadeh
Thanks for your suggestion; I take it as a workaround. Whilst this workaround can potentially address storage allocation issues, I was more interested in exploring solutions that offer more seamless integration with large distributed file systems like HDFS, GCS, or S3. This would ensure

Re: External Spark shuffle service for k8s

2024-04-06 Thread Bjørn Jørgensen
You can make a PVC on K8s, call it 300gb, and make a folder in your Dockerfile: WORKDIR /opt/spark/work-dir RUN chmod g+w /opt/spark/work-dir Then start Spark adding this: .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.options.claimName", "300gb") \
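
The claimName option quoted above is one of a family of spark.kubernetes volume settings; a sketch of the full set for a claim named 300gb (the mount path mirrors the Dockerfile WORKDIR, and the executor settings are assumed to mirror the driver's):

from pyspark.sql import SparkSession

builder = SparkSession.builder.appName("pvc-work-dir")
for role in ("driver", "executor"):
    prefix = f"spark.kubernetes.{role}.volumes.persistentVolumeClaim.300gb"
    builder = (builder
               .config(f"{prefix}.options.claimName", "300gb")
               .config(f"{prefix}.mount.path", "/opt/spark/work-dir")  # matches WORKDIR
               .config(f"{prefix}.mount.readOnly", "false"))
spark = builder.getOrCreate()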

Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-06 Thread 刘唯
This indeed looks like a bug. I will take some time to look into it. Mich Talebzadeh wrote on Wed, 3 Apr 2024 at 01:55: > hm. you are getting below > AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark; > The

Re: [External] Re: Issue of spark with antlr version

2024-04-06 Thread Bjørn Jørgensen
To: Bjørn Jørgensen; user@spark.apache.org > Cc: Sahni, Ashima; user@spark.apache.org; Misra Parashar, Jyoti <jyoti.misra.paras...@accenture.com> > Subject: Re: [External] Re: Issue of spark with antlr version > Hi Team, > Any

Re: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix

2024-04-03 Thread Oxlade, Dan
Kind regards, Dan From: Oxlade, Dan Sent: 03 April 2024 15:49 To: Oxlade, Dan; Aaron Grubb; user@spark.apache.org Subject: Re: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix Swapping out the iceberg-aws-bundle for the very latest aws provided sdk ('software.amaz

Re: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix

2024-04-03 Thread Oxlade, Dan
(ParquetFileFormat.scala:429) From: Oxlade, Dan Sent: 03 April 2024 14:33 To: Aaron Grubb ; user@spark.apache.org Subject: Re: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix [sorry; replying all this time] With hadoop-*-3.3.6 in place of the 3.4.0

Re: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix

2024-04-03 Thread Oxlade, Dan
April 2024 13:52 To: user@spark.apache.org Subject: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix Downgrade to hadoop-*:3.3.x, Hadoop 3.4.x is based on the AWS SDK v2 and should probably be considered as breaking for tools that build on < 3.4.0 while using

Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix

2024-04-03 Thread Aaron Grubb
Downgrade to hadoop-*:3.3.x; Hadoop 3.4.x is based on the AWS SDK v2 and should probably be considered breaking for tools that build on < 3.4.0 while using AWS. From: Oxlade, Dan Sent: Wednesday, April 3, 2024 2:41:11 PM To: user@spark.apache.org Subject:

Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Mich Talebzadeh
Hm, you are getting the below: AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark. The problem seems to be that you are using the append output mode when writing the streaming query results to Kafka. This mode
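
For reference, the pattern under discussion (watermark attached via withWatermark before the temp view is registered, then an append-mode aggregation through spark.sql) looks roughly like this sketch; the rate source and window are stand-ins, and per the rest of this thread even this path may hit the reported bug:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-watermark").getOrCreate()

# The rate source provides a streaming DataFrame with a timestamp column.
events = spark.readStream.format("rate").load()

# Attach the watermark BEFORE registering the view; a view created from an
# un-watermarked stream cannot support append-mode aggregation.
events.withWatermark("timestamp", "10 minutes").createOrReplaceTempView("events")

agg = spark.sql("""
    SELECT window(timestamp, '5 minutes') AS win, count(*) AS cnt
    FROM events
    GROUP BY window(timestamp, '5 minutes')
""")

query = agg.writeStream.outputMode("append").format("console").start()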

RE: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Chloe He
Hi Mich, Thank you so much for your response. I really appreciate your help! You mentioned "defining the watermark using the withWatermark function on the streaming_df before creating the temporary view" - I believe this is what I'm doing and it's not working for me. Here is the exact code

Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Mich Talebzadeh
OK, let us take it for a test. The original code of mine:

def fetch_data(self):
    self.sc.setLogLevel("ERROR")
    schema = StructType() \
        .add("rowkey", StringType()) \
        .add("timestamp", TimestampType()) \
        .add("temperature", IntegerType())

Re: [External] Re: Issue of spark with antlr version

2024-04-01 Thread Chawla, Parul
; user@spark.apache.org; Misra Parashar, Jyoti; Mekala, Rajesh; Grandhi, Venkatesh; George, Rejish; Tayal, Aayushi Subject: Re: [External] Re: Issue of spark with antlr version [image.png] On Wed, 28 Feb 2024 at 11:28, Chawla, Parul <parul.cha...@accenture.com> wrote: Hi, Can

Re: [DISCUSS] MySQL version support policy

2024-03-25 Thread Dongjoon Hyun
Hi, Cheng. Thank you for the suggestion. Your suggestion seems to have at least two themes. A. Adding a new Apache Spark community policy (contract) to guarantee MySQL LTS Versions Support. B. Dropping support for non-LTS versions (MySQL 8.3/8.2/8.1). And it brings me three questions.

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-23 Thread Winston Lai
+1 -- Thank You & Best Regards Winston Lai From: Jay Han Date: Sunday, 24 March 2024 at 08:39 To: Kiran Kumar Dusi Cc: Farshid Ashouri , Matei Zaharia , Mich Talebzadeh , Spark dev list , user @spark Subject: Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Communit

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-23 Thread Jay Han
+1. It sounds awesome! Kiran Kumar Dusi wrote on Thu, 21 Mar 2024 at 14:16: > +1 > On Thu, 21 Mar 2024 at 7:46 AM, Farshid Ashouri <farsheed.asho...@gmail.com> wrote: >> +1 >> On Mon, 18 Mar 2024 at 11:00, Mich Talebzadeh wrote: >>> Some of you may be aware that Databricks community Home |

Re: Feature article: Leveraging Generative AI with Apache Spark: Transforming Data Engineering

2024-03-22 Thread Mich Talebzadeh
Sorry, from this link: Leveraging Generative AI with Apache Spark: Transforming Data Engineering | LinkedIn. Mich Talebzadeh, Technologist | Data | Generative AI |

Re:

2024-03-21 Thread Mich Talebzadeh
You can try this:

val kafkaReadStream = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", broker)
  .option("subscribe", topicName)
  .option("startingOffsets", startingOffsetsMode)
  .option("maxOffsetsPerTrigger", maxOffsetsPerTrigger)
  .load()

kafkaReadStream

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-20 Thread Kiran Kumar Dusi
+1 On Thu, 21 Mar 2024 at 7:46 AM, Farshid Ashouri wrote: > +1 > > On Mon, 18 Mar 2024, 11:00 Mich Talebzadeh, > wrote: > >> Some of you may be aware that Databricks community Home | Databricks >> have just launched a knowledge sharing hub. I thought it would be a >> good idea for the Apache

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-20 Thread Farshid Ashouri
+1 On Mon, 18 Mar 2024, 11:00 Mich Talebzadeh, wrote: > Some of you may be aware that Databricks community Home | Databricks > have just launched a knowledge sharing hub. I thought it would be a > good idea for the Apache Spark user group to have the same, especially > for repeat questions on

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-19 Thread Mich Talebzadeh
>>> Good idea. Will be useful >>> +1 >>> From: ashok34...@yahoo.com.INVALID >>> Date: Monday, March 18, 2024 at 6:36 AM >>> To: user @s

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-19 Thread Joris Billen
com.INVALID Date: Monday, March 18, 2024 at 6:36 AM To: user @spark <user@spark.apache.org>, Spark dev list <d...@spark.apache.org>, Mich Talebzadeh <mich.talebza...@gmail.com> Cc: Matei Zaharia <matei.zaha...@gmail.com> Subject: Re: A pro

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-19 Thread Varun Shah
Hi @Mich Talebzadeh, community, Where can I find such insights on the Spark architecture? I found a few sites below which cover internals: 1. https://github.com/JerryLead/SparkInternals 2. https://books.japila.pl/apache-spark-internals/overview/ 3.

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-19 Thread Varun Shah
+1 Great initiative. QQ: Stack Overflow has a similar feature called "Collectives", but I am not sure of the cost of creating one for Apache Spark. With SO being widely used (at least before ChatGPT became the norm for searching questions), it already has a lot of questions asked and answered

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Deepak Sharma
>> From: ashok34...@yahoo.com.INVALID >> Date: Monday, March 18, 2024 at 6:36 AM >> To: user @spark, Spark dev list <d...@spark.apache.org>, Mich Talebzadeh >> Cc: Matei Zaharia >> Subject: R

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Hyukjin Kwon
org/wiki/Wernher_von_Braun>)". > On Mon, 18 Mar 2024 at 16:23, Parsian, Mahmoud wrote: >> Good idea. Will be useful >> +1 >> From: ashok34...@yahoo.com.INVALI

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Mich Talebzadeh
;>> >>> man. 18. mars 2024 kl. 17:26 skrev Parsian, Mahmoud < >>> mpars...@illumina.com.invalid>: >>> >>>> Good idea. Will be useful >>>> >>>> >>>> >>>> +1 >>>> >>&

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Reynold Xin
;>> >>> >>> +1 >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> *From:* ashok34668@ yahoo. com. INVALID ( ashok3

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Mich Talebzadeh
On Mon, 18 Mar 2024 at 17:26, Parsian, Mahmoud wrote: >> Good idea. Will be useful >> +1 >> From: ashok34...@yahoo.com.INVALID >> Date: Monday, March 18, 2024 at 6:36 AM >

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Bjørn Jørgensen
Monday, March 18, 2024 at 6:36 AM > To: user @spark, Spark dev list <d...@spark.apache.org>, Mich Talebzadeh > Cc: Matei Zaharia > Subject: Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community > External message, be mindful when

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Code Tutelage
Spark dev list <d...@spark.apache.org>, Mich Talebzadeh > Cc: Matei Zaharia > Subject: Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community > External message, be mindful when clicking links or attachments > Good idea. Will be us

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-18 Thread Sreyan Chakravarty
On Mon, Mar 18, 2024 at 1:16 PM Mich Talebzadeh wrote: > "I may need something like that for synthetic data for testing. Any way to do that?" > Have a look at this. > https://github.com/joke2k/faker > No, I was not actually referring to data that can be faked. I want data to actually

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Mich Talebzadeh
dev list <d...@spark.apache.org>, Mich Talebzadeh > Cc: Matei Zaharia > Subject: Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community > External message, be mindful when clicking links or attachments > Good idea. Will be useful >

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Parsian, Mahmoud
Good idea. Will be useful +1 From: ashok34...@yahoo.com.INVALID Date: Monday, March 18, 2024 at 6:36 AM To: user @spark , Spark dev list , Mich Talebzadeh Cc: Matei Zaharia Subject: Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community External message, be mindful

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread ashok34...@yahoo.com.INVALID
Good idea. Will be useful +1 On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh wrote: Some of you may be aware that Databricks community Home | Databricks have just launched a knowledge sharing hub. I thought it would be a good idea for the Apache Spark user group to have the

Re: [GraphX]: Prevent recomputation of DAG

2024-03-18 Thread Mich Talebzadeh
Hi, I must admit I don't know much about this Fruchterman-Reingold (call it FR) visualization using GraphX and Kubernetes. But you are suggesting this slowdown issue starts after the second iteration, and caching/persisting the graph after each iteration does not help. FR involves many

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-18 Thread Mich Talebzadeh
Yes, transformations are indeed executed on the worker nodes, but they are only performed when necessary, usually when an action is called. This lazy evaluation helps optimize the execution of Spark jobs by allowing Spark to refine the execution plan and apply optimizations such as

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-18 Thread Sreyan Chakravarty
On Fri, Mar 15, 2024 at 3:10 AM Mich Talebzadeh wrote: > > No Data Transfer During Creation: --> Data transfer occurs only when an > action is triggered. > Distributed Processing: --> DataFrames are distributed for parallel > execution, not stored entirely on the driver node. > Lazy Evaluation

Re: [External] Re: [GraphFrames Spark Package]: Why is there not a distribution for Spark 3.3?

2024-03-17 Thread Ofir Manor
ases> Ofir From: Russell Jurney Sent: Friday, March 15, 2024 11:43 PM To: brad.boil...@fcc-fac.ca.invalid Cc: user@spark.apache.org Subject: [External] Re: [GraphFrames Spark Package]: Why is there not a distribution for Spark 3.3? There is an implementa

Re: [GraphFrames Spark Package]: Why is there not a distribution for Spark 3.3?

2024-03-15 Thread Russell Jurney
There is an implementation for Spark 3, but GraphFrames isn't released often enough to match every point version. It supports Spark 3.4. Try it - it will probably work. https://spark-packages.org/package/graphframes/graphframes Thanks, Russell Jurney @rjurney

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-14 Thread Mich Talebzadeh
Hi, When you create a DataFrame from Python objects using spark.createDataFrame, here is what happens: *Initial Local Creation:* The DataFrame is initially created in the memory of the driver node. The data is not yet distributed to executors at this point. *The Role of Lazy Evaluation:* Spark
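
A small sketch of that lifecycle: the rows live on the driver until an action forces distribution and computation.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("driver-side-creation").getOrCreate()

rows = [("a", 1), ("b", 2)]                    # plain Python objects on the driver
df = spark.createDataFrame(rows, ["k", "v"])   # builds a plan; nothing shipped yet

doubled = df.withColumn("v2", df.v * 2)        # transformation: still lazy

doubled.show()  # action: data is distributed to executors and computed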

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-12 Thread Mich Talebzadeh
Thanks for the clarification. That makes sense. In the code below, we can see:

def onQueryProgress(self, event):
    print("onQueryProgress")
    # Access micro-batch data
    microbatch_data = event.progress
    #print("microbatch_data received")
    # Check if data is received

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-11 Thread 刘唯
Oh I see why the confusion. microbatch_data = event.progress means that microbatch_data is a StreamingQueryProgress instance, not a dictionary, so you should use `microbatch_data.processedRowsPerSecond` instead of the `get` method, which is used for dictionaries. But weirdly, for
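
A minimal sketch of the attribute-style access being suggested, wrapped in a full listener (assumes an active SparkSession named spark):

from pyspark.sql.streaming import StreamingQueryListener

class ProgressListener(StreamingQueryListener):
    def onQueryStarted(self, event):
        pass

    def onQueryProgress(self, event):
        progress = event.progress  # a StreamingQueryProgress object, not a dict
        print(f"rows/sec: {progress.processedRowsPerSecond}")  # attribute access

    def onQueryTerminated(self, event):
        pass

spark.streams.addListener(ProgressListener())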

Re: Bugs with joins and SQL in Structured Streaming

2024-03-11 Thread Andrzej Zera
Hi, Do you think there is any chance for this issue to get resolved? Should I create another bug report? As mentioned in my message, there is one open already: https://issues.apache.org/jira/browse/SPARK-45637 but it covers only one of the problems. Andrzej. On Tue, 27 Feb 2024 at 09:58, Andrzej Zera

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-11 Thread Mich Talebzadeh
Hi, Thank you for your advice. This is the amended code:

def onQueryProgress(self, event):
    print("onQueryProgress")
    # Access micro-batch data
    microbatch_data = event.progress
    #print("microbatch_data received")
    # Check if data is received

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-10 Thread 刘唯
*now -> not. 刘唯 wrote on Sun, 10 Mar 2024 at 22:04: > Have you tried using microbatch_data.get("processedRowsPerSecond")? > Camel case now snake case > Mich Talebzadeh wrote on Sun, 10 Mar 2024 at 11:46: >> >> There is a paper from Databricks on this subject >>

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-10 Thread 刘唯
Have you tried using microbatch_data.get("processedRowsPerSecond")? Camel case now snake case. Mich Talebzadeh wrote on Sun, 10 Mar 2024 at 11:46: > There is a paper from Databricks on this subject > https://www.databricks.com/blog/2022/05/27/how-to-monitor-streaming-queries-in-pyspark.html > But

Re: Creating remote tables using PySpark

2024-03-08 Thread Mich Talebzadeh
The error message shows a mismatch between the configured warehouse directory and the actual location accessible by the Spark application running in the container. You have configured the SparkSession with spark.sql.warehouse.dir="file:/data/hive/warehouse". This tells Spark where to store
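
A sketch of the alignment being suggested; the shared mount path is an assumption, and the point is that the configured warehouse path must resolve to the same reachable location wherever the session runs:

from pyspark.sql import SparkSession

# Point the warehouse at a path that exists at the same location inside
# every container, e.g. a shared volume; /shared/hive/warehouse is a placeholder.
spark = (SparkSession.builder
         .config("spark.sql.warehouse.dir", "file:///shared/hive/warehouse")
         .enableHiveSupport()   # requires Hive support on the classpath
         .getOrCreate())

spark.sql("CREATE TABLE IF NOT EXISTS demo_tbl (id INT) USING parquet")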

Re: Creating remote tables using PySpark

2024-03-07 Thread Tom Barber
Okay, that was some caching issue. Now there is a shared mount point between the place where the pyspark code is executed and the spark nodes where it runs. Hrmph, I was hoping that wouldn't be the case. Fair enough! On Thu, Mar 7, 2024 at 11:23 PM Tom Barber wrote: > Okay interesting, maybe my assumption

Re: Creating remote tables using PySpark

2024-03-07 Thread Tom Barber
Okay interesting, maybe my assumption was incorrect, although I'm still confused. I tried to mount a central mount point that would be the same on my local machine and the container. Same error, although I moved the path to /tmp/hive/data/hive/; but when I rerun the test code to save a table,

Re: It seems --py-files only takes the first two arguments. Can someone please confirm?

2024-03-05 Thread Mich Talebzadeh
Sorry, I forgot. The below caters for YARN mode. If your application code primarily consists of Python files and does not require a separate virtual environment with specific dependencies, you can use the --py-files argument in spark-submit:

spark-submit --verbose \
  --master yarn \

Re: It seems --py-files only takes the first two arguments. Can someone please confirm?

2024-03-05 Thread Mich Talebzadeh
I personally use a zip file and pass the application name (in your case main.py) as the last input line, like below. APPLICATION is your main.py; it does not need to be called main.py, it could be anything like testpython.py. CODE_DIRECTORY_CLOUD="gs://spark-on-k8s/codes" ## replace gs with s3 #
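
Complementary to the spark-submit flags discussed here: a zipped dependency archive can also be attached from inside the job itself via addPyFile. A sketch, reusing the bucket placeholder above (a configured GCS connector is assumed):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("zipped-deps").getOrCreate()

# Ship a zip of Python modules to the driver and every executor; similar in
# effect to passing the archive with --py-files at submit time.
spark.sparkContext.addPyFile("gs://spark-on-k8s/codes/spark_code.zip")  # placeholder path

# import my_package  # hypothetical module inside the zip, importable after addPyFile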

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Jungtaek Lim
5 Mar 2024 17:09:07 > To: Pan,Bingkun > Cc: Dongjoon Hyun; dev; user > Subject: Re: [ANNOUNCE] Apache Spark 3.5.1 released > Let me be more specific. > We have two active release version lines, 3.4.x and 3.5.x. We just released Spark 3.5.1, having a dropdown as 3.5.1 and

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Jungtaek Lim
possible. > Only by sharing the same version.json file in each version. > -- > From: Jungtaek Lim > Sent: 5 Mar 2024 16:47:30 > To: Pan,Bingkun > Cc: Dongjoon Hyun; dev; user > Subject: Re: [ANNOUNCE] Apache Spark 3.5.1 released

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Jungtaek Lim
https://github.com/apache/spark/pull/42881 > So, we need to manually update this file. I can manually submit an update > first to get this feature working. > -- > From: Jungtaek Lim > Sent: 4 Mar 2024 6:34:42 > To: Dongjoon Hyun > Cc: dev;

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-04 Thread yangjie01
That sounds like a great suggestion. From: Jungtaek Lim Date: Tuesday, 5 Mar 2024 10:46 To: Hyukjin Kwon Cc: yangjie01, Dongjoon Hyun, dev, user Subject: Re: [ANNOUNCE] Apache Spark 3.5.1 released Yes, it's relevant to that PR. I wonder, if we want to expose a version switcher, it should

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-04 Thread Jungtaek Lim
Yes, it's relevant to that PR. I wonder, if we want to expose a version switcher, whether it should be in the versionless docs (spark-website) rather than docs pinned to a specific version. On Tue, Mar 5, 2024 at 11:18 AM Hyukjin Kwon wrote: > Is this related to

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-04 Thread Hyukjin Kwon
Is this related to https://github.com/apache/spark/pull/42428? cc @Yang,Jie(INF) On Mon, 4 Mar 2024 at 22:21, Jungtaek Lim wrote: > Shall we revisit this functionality? The API doc is built with individual > versions, and for each individual version we depend on other released > versions.

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-03 Thread Jungtaek Lim
Shall we revisit this functionality? The API doc is built with individual versions, and for each individual version we depend on other released versions. This does not seem right to me. Also, the functionality is only in the PySpark API doc, which does not seem consistent either. I don't

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread Peter Toth
Congratulations, and thanks Jungtaek for driving this! Xinrong Meng wrote (on Fri, 1 Mar 2024 at 5:24): > Congratulations! > Thanks, > Xinrong > On Thu, Feb 29, 2024 at 11:16 AM Dongjoon Hyun wrote: >> Congratulations! >> Bests, >> Dongjoon. >> On Wed, Feb 28, 2024 at

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread Jungtaek Lim
Thanks for reporting - this is odd - the dropdown did not exist in other recent releases. https://spark.apache.org/docs/3.5.0/api/python/index.html https://spark.apache.org/docs/3.4.2/api/python/index.html https://spark.apache.org/docs/3.3.4/api/python/index.html Looks like the dropdown feature

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread Dongjoon Hyun
BTW, Jungtaek, the PySpark documentation seems to show the wrong branch; at this time, `master`. https://spark.apache.org/docs/3.5.1/api/python/index.html PySpark Overview Date: Feb 24, 2024 Version: master

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread John Zhuge
Excellent work, congratulations! On Wed, Feb 28, 2024 at 10:12 PM Dongjoon Hyun wrote: > Congratulations! > > Bests, > Dongjoon. > > On Wed, Feb 28, 2024 at 11:43 AM beliefer wrote: > >> Congratulations! >> >> >> >> At 2024-02-28 17:43:25, "Jungtaek Lim" >> wrote: >> >> Hi everyone, >> >> We

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread Prem Sahoo
Congratulations Sent from my iPhoneOn Feb 29, 2024, at 4:54 PM, Xinrong Meng wrote:Congratulations!Thanks,XinrongOn Thu, Feb 29, 2024 at 11:16 AM Dongjoon Hyun wrote:Congratulations!Bests,Dongjoon.On Wed, Feb 28, 2024 at 11:43 AM beliefer

Re: pyspark dataframe join with two different data type

2024-02-29 Thread Mich Talebzadeh
This is what you want: how to join two DFs with a string column in one and an array of strings in the other, keeping only rows where the string is present in the array.

from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.functions import expr

spark =
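
Since the excerpt is cut off above, a minimal self-contained sketch of the pattern (data and names are placeholders):

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.appName("string-in-array-join").getOrCreate()

df_names = spark.createDataFrame([("a",), ("b",)], ["name"])
df_groups = spark.createDataFrame(
    [(1, ["a", "c"]), (2, ["b"]), (3, ["c"])], ["id", "members"])

# Keep only rows where the string column appears in the array column.
joined = df_names.join(df_groups, expr("array_contains(members, name)"))
joined.show()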

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread Xinrong Meng
Congratulations! Thanks, Xinrong On Thu, Feb 29, 2024 at 11:16 AM Dongjoon Hyun wrote: > Congratulations! > > Bests, > Dongjoon. > > On Wed, Feb 28, 2024 at 11:43 AM beliefer wrote: > >> Congratulations! >> >> >> >> At 2024-02-28 17:43:25, "Jungtaek Lim" >> wrote: >> >> Hi everyone, >> >> We

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-28 Thread Dongjoon Hyun
Congratulations! Bests, Dongjoon. On Wed, Feb 28, 2024 at 11:43 AM beliefer wrote: > Congratulations! > > > > At 2024-02-28 17:43:25, "Jungtaek Lim" > wrote: > > Hi everyone, > > We are happy to announce the availability of Spark 3.5.1! > > Spark 3.5.1 is a maintenance release containing

Re: [External] Re: Issue of spark with antlr version

2024-02-28 Thread Chawla, Parul
, Rejish; Tayal, Aayushi Subject: [External] Re: Issue of spark with antlr version CAUTION: External email. Be cautious with links and attachments. [SPARK-44366][BUILD] Upgrade antlr4 to 4.13.1 <https://github.com/apache/spark/pull/43075>

Re: [External] Re: Issue of spark with antlr version

2024-02-28 Thread Bjørn Jørgensen
Cc: Chawla, Parul; user@spark.apache.org; Misra Parashar, Jyoti <jyoti.misra.paras...@accenture.com>; Mekala, Rajesh <r.mek...@accenture.com>; Grandhi, Venkatesh <venkatesh.a.gran...@accenture.com>; George, Rejish <rejish

Re:[ANNOUNCE] Apache Spark 3.5.1 released

2024-02-28 Thread beliefer
Congratulations! At 2024-02-28 17:43:25, "Jungtaek Lim" wrote: Hi everyone, We are happy to announce the availability of Spark 3.5.1! Spark 3.5.1 is a maintenance release containing stability fixes. This release is based on the branch-3.5 maintenance branch of Spark. We strongly

Re: [Spark Core] Potential bug in JavaRDD#countByValue

2024-02-27 Thread Mich Talebzadeh
Hi, Quick observations from what you have provided: - The observed discrepancy between rdd.count() and rdd.map(Item::getType).countByValue() in distributed mode suggests a potential aggregation issue with countByValue(). The correct results in local mode give credence to this theory. - Workarounds
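
One workaround along those lines: replace countByValue() with an explicit map/reduceByKey aggregation whose total can be cross-checked against rdd.count(). The thread's code is Java, so this PySpark sketch is illustrative only (assumes an active SparkSession named spark):

from operator import add

rdd = spark.sparkContext.parallelize(["a", "b", "a", "c", "a"])

# Same result as rdd.countByValue(), but computed as an explicit distributed
# aggregation that is easy to verify.
counts = rdd.map(lambda v: (v, 1)).reduceByKey(add).collectAsMap()

assert sum(counts.values()) == rdd.count()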

Re: Issue of spark with antlr version

2024-02-27 Thread Bjørn Jørgensen
[SPARK-44366][BUILD] Upgrade antlr4 to 4.13.1 On Tue, 27 Feb 2024 at 13:25, Sahni, Ashima wrote: > Hi Team, > Can you please let us know the update on the below. > Thanks, > Ashima > From: Chawla, Parul > Sent: Sunday, February

Re: Issue of spark with antlr version

2024-02-27 Thread Mich Talebzadeh
Hi, You have provided little information about where Spark fits in here. So I am guessing :) Data Source (JSON, XML, log file, etc.) --> Preprocessing (Spark jobs for filtering, cleaning, etc.)? --> Antlr Parser (Generated tool) --> Extracted Data (Mapped to model) --> Spring Data Model (Java

RE: Issue of spark with antlr version

2024-02-27 Thread Sahni, Ashima
Hi Team, Can you please let us know the update on below. Thanks, Ashima From: Chawla, Parul Sent: Sunday, February 25, 2024 11:57 PM To: user@spark.apache.org Cc: Sahni, Ashima ; Misra Parashar, Jyoti Subject: Issue of spark with antlr version Hi Spark Team, Our application is currently

Re: Bugs with joins and SQL in Structured Streaming

2024-02-27 Thread Andrzej Zera
Hi, Yes, I tested all of them on Spark 3.5. Regards, Andrzej. On Mon, 26 Feb 2024 at 23:24, Mich Talebzadeh wrote: > Hi, > These are all on spark 3.5, correct? > Mich Talebzadeh, > Dad | Technologist | Solutions Architect | Engineer > London > United Kingdom > > view my Linkedin

Re: Bugs with joins and SQL in Structured Streaming

2024-02-26 Thread Mich Talebzadeh
Hi, These are all on spark 3.5, correct? Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* The

Re: Bintray replacement for spark-packages.org

2024-02-25 Thread Richard Eggert
I've been trying to obtain clarification on the terms of use regarding repo.spark-packages.org. I emailed feedb...@spark-packages.org two weeks ago, but have not heard back. Whom should I contact? On Mon, Apr 26, 2021 at 8:13 AM Bo Zhang wrote: > Hi Apache Spark users, > > As you might know,

Re: job uuid not unique

2024-02-24 Thread Xin Zhang
unsubscribe On Sat, Feb 17, 2024 at 3:04 AM Рамик И wrote: > > Hi > I'm using Spark Streaming to read from Kafka and write to S3. Sometimes I > get errors when writing org.apache.hadoop.fs.FileAlreadyExistsException. > > Spark version: 3.5.0 > scala version : 2.13.8 > Cluster: k8s > >

Re: AQE coalesce 60G shuffle data into a single partition

2024-02-24 Thread Enrico Minack
Hi Shay, maybe this is related to the small number of output rows (1,250) of the last exchange step that consumes those 60GB of shuffle data. Looks like your outer transformation is something like df.groupBy($"id").agg(collect_list($"prop_name")) Have you tried adding a repartition as an attempt
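
What the suggested repartition might look like in PySpark; the column names are taken from the guess above, df stands for the upstream DataFrame, and since the message is truncated, the placement (before the aggregation) and the partition count are assumptions:

from pyspark.sql.functions import collect_list

# Spread the shuffle over an explicit number of partitions instead of
# letting AQE coalesce the output into one.
result = (df.repartition(200, "id")
            .groupBy("id")
            .agg(collect_list("prop_name")))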

Re: [Beginner Debug]: Executor OutOfMemoryError

2024-02-23 Thread Mich Talebzadeh
Seems like you are having memory issues. Examine your settings. 1. It appears that your driver memory setting is too high. It should be a fraction of the total memory provided by YARN. 2. Use the Spark UI to monitor the job's memory consumption. Check the Storage tab to see how memory is

Re: unsubscribe

2024-02-21 Thread Xin Zhang
unsubscribe On Tue, Feb 20, 2024 at 9:44 PM kritika jain wrote: > Unsubscribe > > On Tue, 20 Feb 2024, 3:18 pm Крюков Виталий Семенович, > wrote: > >> >> unsubscribe >> >> >> -- Zhang Xin(张欣) Email:josseph.zh...@gmail.com

Re: Spark 4.0 Query Analyzer Bug Report

2024-02-21 Thread Mich Talebzadeh
Indeed, valid points raised, including the potential typo in the new Spark version. I suggest, in the meantime, you look at the so-called alternative debugging methods: - Simpler explain(): try basic explain() or explain("extended"). This might provide a less detailed, but
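
For reference, the calls mentioned, using the DataFrame name from the trace in this thread (both spellings work in PySpark):

# Physical plan only:
query_result.explain()

# Parsed, analyzed, optimized and physical plans:
query_result.explain(extended=True)
query_result.explain("extended")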

Re: Spark 3.3 Query Analyzer Bug Report

2024-02-20 Thread Sharma, Anup
Apologies, the issue is seen after we upgraded from Spark 3.1 to Spark 3.3; the same query runs fine on Spark 3.1. Please disregard the Spark version mentioned in the email subject earlier. Anup. Error trace: query_result.explain(extended=True)\n File \"…/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py\"

Re: Spark 4.0 Query Analyzer Bug Report

2024-02-20 Thread Holden Karau
Do you mean Spark 3.4? 4.0 is very much not released yet. Also it would help if you could share your query & more of the logs leading up to the error. On Tue, Feb 20, 2024 at 3:07 PM Sharma, Anup wrote: > Hi Spark team, > > > > We ran into a dataframe issue after upgrading from spark 3.1 to 4.

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-20 Thread Manoj Kumar
Dear @Chao Sun, I trust you're doing well. Having worked extensively with Spark Nvidia Rapids, Velox, and Gluten, I'm now contemplating Comet's potential advantages over Velox in terms of performance and unique features. While Rapids leverages GPUs effectively, Gazelle's Intel AVX512 intrinsics

Re: unsubscribe

2024-02-20 Thread kritika jain
Unsubscribe On Tue, 20 Feb 2024, 3:18 pm Крюков Виталий Семенович, wrote: > > unsubscribe > > >

Re: [Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Mich Talebzadeh
Thanks for your kind words Sri. Well, it is true that as yet Spark on Kubernetes is not on par with Spark on YARN in maturity; essentially, Spark on Kubernetes is still a work in progress. So in the first place, IMO, one needs to think about why executors are failing. What causes this behaviour? Is it the

Re: [Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Cheng Pan
Spark has long supported the window-based executor failure-tracking mechanism for YARN; SPARK-41210 [1][2] (included in 3.5.0) extended this feature to K8s. [1] https://issues.apache.org/jira/browse/SPARK-41210 [2] https://github.com/apache/spark/pull/38732 Thanks, Cheng Pan > On
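
A sketch of what enabling this might look like; the config names below are my reading of SPARK-41210 and should be verified against the linked JIRA and PR:

from pyspark.sql import SparkSession

# Assumed names per SPARK-41210 (Spark 3.5.0+): fail the application after
# N executor failures, counting only failures within a sliding window.
spark = (SparkSession.builder
         .config("spark.executor.maxNumFailures", "10")
         .config("spark.executor.failuresValidityInterval", "5m")
         .getOrCreate())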

Re: [Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Sri Potluri
Dear Mich, Thank you for your detailed response and the suggested approach to handling retry logic. I appreciate you taking the time to outline the method of embedding custom retry mechanisms directly into the application code. While the solution of wrapping the main logic of the Spark job in a

Re: [Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Mich Talebzadeh
Went through your issue with the code running on k8s When an executor of a Spark application fails, the system attempts to maintain the desired level of parallelism by automatically recreating a new executor to replace the failed one. While this behavior is beneficial for transient errors,

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-19 Thread Mich Talebzadeh
Ok thanks for your clarifications Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* The information

Re: [Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Mich Talebzadeh
I am not aware of any configuration parameter in classic Spark to limit executor creation. Because of fault tolerance, Spark will try to recreate failed executors. I am not really that familiar with the Spark operator for k8s; there may be something there. Have you considered custom monitoring and

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-19 Thread Chao Sun
Hi Mich, > Also have you got some benchmark results from your tests that you can possibly share? We only have some partial benchmark results internally so far. Once shuffle and better memory management have been introduced, we plan to publish the benchmark results (at least TPC-H) in the repo.
