Re: [External] Re: [GraphFrames Spark Package]: Why is there not a distribution for Spark 3.3?

2024-03-17 Thread Ofir Manor
Just to add - the latest version is 0.8.3, and it seems to support 3.3: "Support Spark 3.3 / Scala 2.12, Spark 3.4 / Scala 2.12 and Scala 2.13, Spark 3.5 / Scala 2.12 and Scala 2.13" - Releases · graphframes/graphframes (github.com). Ofir

Python library that generates fake data using Faker

2024-03-16 Thread Mich Talebzadeh
I came across this a few weeks ago. In a nutshell, you can use it for generating test data and other scenarios where you need realistic-looking but not necessarily real data. With so many regulations, copyrights, etc., it is a viable alternative. I used it to generate 1000 lines of mixed true and

Re: [GraphFrames Spark Package]: Why is there not a distribution for Spark 3.3?

2024-03-15 Thread Russell Jurney
There is an implementation for Spark 3, but GraphFrames isn't released often enough to match every point version. It supports Spark 3.4. Try it - it will probably work. https://spark-packages.org/package/graphframes/graphframes Thanks, Russell Jurney @rjurney

Requesting further assistance with Spark Scala code coverage

2024-03-14 Thread 里昂
I have sent out an email regarding Spark coverage, but haven't received any response. I'm hoping someone could tell me whether there are currently any code coverage statistics available for Scala code in Spark. https://lists.apache.org/thread/hob7x42gk3q244t9b0d8phwjtxjk2plt

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-14 Thread Mich Talebzadeh
Hi, When you create a DataFrame from Python objects using spark.createDataFrame, here it goes: *Initial Local Creation:* The DataFrame is initially created in the memory of the driver node. The data is not yet distributed to executors at this point. *The role of lazy Evaluation:* Spark
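A minimal PySpark sketch of the point being made, assuming an illustrative list of Row objects (the names here are not from the thread):

    from pyspark.sql import SparkSession, Row

    spark = SparkSession.builder.appName("local-objects-demo").getOrCreate()

    # The Python list lives entirely in the driver process.
    courses = [Row(course_id=1, course_title="Spark"),
               Row(course_id=2, course_title="Kafka")]

    # createDataFrame only wraps the driver-local data in a logical plan;
    # nothing is sent to executors yet (lazy evaluation).
    df = spark.createDataFrame(courses)

    # An action triggers the job: the data is split into partitions and
    # shipped to executors for processing.
    print(df.count())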

pyspark - Where are Dataframes created from Python objects stored?

2024-03-14 Thread Sreyan Chakravarty
I am trying to understand Spark Architecture. For DataFrames that are created from Python objects, i.e. created in memory, where are they stored? Take the following example: from pyspark.sql import Row import datetime courses = [ { 'course_id': 1, 'course_title':

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-12 Thread Mich Talebzadeh
Thanks for the clarification. That makes sense. In the code below, we can see def onQueryProgress(self, event): print("onQueryProgress") # Access micro-batch data microbatch_data = event.progress #print("microbatch_data received") # Check if data is received

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-11 Thread 刘唯
Oh, I see why the confusion. microbatch_data = event.progress means that microbatch_data is a StreamingQueryProgress instance, not a dictionary, so you should use `microbatch_data.processedRowsPerSecond` instead of the `get` method, which is used for dictionaries. But weirdly, for
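A minimal sketch of the attribute-style access being described, using PySpark's StreamingQueryListener (available in PySpark 3.4+); the rate-source query below is only an illustration:

    from pyspark.sql import SparkSession
    from pyspark.sql.streaming import StreamingQueryListener

    class ProgressListener(StreamingQueryListener):
        def onQueryStarted(self, event):
            pass

        def onQueryProgress(self, event):
            # event.progress is a StreamingQueryProgress object, not a dict,
            # so use attribute access instead of .get()
            print("processedRowsPerSecond:", event.progress.processedRowsPerSecond)

        def onQueryIdle(self, event):
            pass

        def onQueryTerminated(self, event):
            pass

    spark = SparkSession.builder.appName("listener-demo").getOrCreate()
    spark.streams.addListener(ProgressListener())

    query = (spark.readStream.format("rate").option("rowsPerSecond", 10).load()
             .writeStream.format("console").start())
    query.awaitTermination(30)  # run for ~30 seconds, then return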

Data ingestion into elastic failing using pyspark

2024-03-11 Thread Karthick Nk
Hi @all, I am using pyspark program to write the data into elastic index by using upsert operation (sample code snippet below). def writeDataToES(final_df): write_options = { "es.nodes": elastic_host, "es.net.ssl": "false", "es.nodes.wan.only": "true",
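For reference, a hedged sketch of how such an upsert write is typically wired with the elasticsearch-hadoop connector (the host, index name and id column below are placeholders, and the connector jar is assumed to be on the classpath):

    def write_data_to_es(final_df, elastic_host, index_name):
        write_options = {
            "es.nodes": elastic_host,
            "es.port": "9200",
            "es.net.ssl": "false",
            "es.nodes.wan.only": "true",
            "es.write.operation": "upsert",  # update existing docs, insert new ones
            "es.mapping.id": "id",           # DataFrame column used as the document _id
        }
        (final_df.write
            .format("org.elasticsearch.spark.sql")
            .options(**write_options)
            .mode("append")
            .save(index_name))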

Re: Bugs with joins and SQL in Structured Streaming

2024-03-11 Thread Andrzej Zera
Hi, Do you think there is any chance for this issue to get resolved? Should I create another bug report? As mentioned in my message, there is one open already: https://issues.apache.org/jira/browse/SPARK-45637 but it covers only one of the problems. Andrzej On Tue, Feb 27, 2024 at 09:58, Andrzej Zera

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-11 Thread Mich Talebzadeh
Hi, Thank you for your advice This is the amended code def onQueryProgress(self, event): print("onQueryProgress") # Access micro-batch data microbatch_data = event.progress #print("microbatch_data received") # Check if data is received

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-10 Thread 刘唯
*now -> not 刘唯 wrote on Sun, Mar 10, 2024 at 22:04: > Have you tried using microbatch_data.get("processedRowsPerSecond")? > Camel case now snake case > > Mich Talebzadeh wrote on Sun, Mar 10, 2024 at 11:46: > >> >> There is a paper from Databricks on this subject >> >> >>

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-10 Thread 刘唯
Have you tried using microbatch_data.get("processedRowsPerSecond")? Camel case now snake case Mich Talebzadeh wrote on Sun, Mar 10, 2024 at 11:46: > > There is a paper from Databricks on this subject > > > https://www.databricks.com/blog/2022/05/27/how-to-monitor-streaming-queries-in-pyspark.html > > But

Bug in How to Monitor Streaming Queries in PySpark

2024-03-10 Thread Mich Talebzadeh
There is a paper from Databricks on this subject https://www.databricks.com/blog/2022/05/27/how-to-monitor-streaming-queries-in-pyspark.html But having tested it, there seems to be a bug there that I reported to Databricks forum as well (in answer to a user question) I have come to a conclusion

Spark on Kubernetes, executing dataset.show raises exceptions

2024-03-09 Thread BODY NO
Hi, I encountered a strange issue. I run spark-shell in client mode on Kubernetes, as below: val data=spark.read.parquet("datapath") When I run "data.show", it may raise exceptions; the stacktrace is like below: DEBUG BlockManagerMasterEndpoint: Updating block info on master taskresult_3

Spark-UI stages and other tabs not accessible in standalone mode when reverse-proxy is enabled

2024-03-08 Thread sharad mishra
Hi Team, We're encountering an issue with the Spark UI. When reverse proxy is enabled in the master and worker configOptions, we're not able to access the different tabs available in the Spark UI, e.g. stages, environment, storage, etc. We're deploying Spark through the Bitnami Helm chart:

Re: Creating remote tables using PySpark

2024-03-08 Thread Mich Talebzadeh
The error message shows a mismatch between the configured warehouse directory and the actual location accessible by the Spark application running in the container. You have configured the SparkSession with spark.sql.warehouse.dir="file:/data/hive/warehouse". This tells Spark where to store
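A minimal sketch of making the warehouse location consistent, assuming a path that resolves to the same storage for both the driver and the Spark nodes (the /shared/... path is a placeholder):

    from pyspark.sql import SparkSession

    # The warehouse directory must point at storage that every component can
    # reach (a shared mount, HDFS, or an object store); otherwise the driver
    # and the cluster will look for table data in different places.
    spark = (SparkSession.builder
             .appName("remote-tables")
             .config("spark.sql.warehouse.dir", "file:/shared/hive/warehouse")
             .enableHiveSupport()
             .getOrCreate())

    spark.sql("CREATE TABLE IF NOT EXISTS demo_tbl (id INT) USING parquet")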

Re: Creating remote tables using PySpark

2024-03-07 Thread Tom Barber
Okay, that was some caching issue. Now there is a shared mount point between the place where the PySpark code is executed and the Spark nodes it runs on. Hrmph, I was hoping that wouldn't be the case. Fair enough! On Thu, Mar 7, 2024 at 11:23 PM Tom Barber wrote: > Okay interesting, maybe my assumption

Re: Creating remote tables using PySpark

2024-03-07 Thread Tom Barber
Okay interesting, maybe my assumption was incorrect, although I'm still confused. I tried to mount a central mount point that would be the same on my local machine and the container. Same error although I moved the path to /tmp/hive/data/hive/ but when I rerun the test code to save a table,

Creating remote tables using PySpark

2024-03-07 Thread Tom Barber
Wonder if anyone can just sort my brain out here as to what's possible or not. I have a container running Spark, with Hive and a ThriftServer. I want to run code against it remotely. If I take something simple like this from pyspark.sql import SparkSession from pyspark.sql.types import

Dark mode logo

2024-03-06 Thread Mike Drob
Hi Spark Community, I see that y'all have a logo uploaded to https://www.apache.org/logos/#spark but it has black text. Is there an official, alternate logo with lighter text that would look good on a dark background? Thanks, Mike

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Pan,Bingkun
Okay, let me double-check it carefully. Thank you very much for your help! From: Jungtaek Lim Sent: March 5, 2024 21:56:41 To: Pan,Bingkun Cc: Dongjoon Hyun; dev; user Subject: Re: [ANNOUNCE] Apache Spark 3.5.1 released Yeah the approach seems OK to me - please double

Re: It seems --py-files only takes the first two arguments. Can someone please confirm?

2024-03-05 Thread Mich Talebzadeh
Sorry, I forgot. The below is catered for YARN mode. If your application code primarily consists of Python files and does not require a separate virtual environment with specific dependencies, you can use the --py-files argument in spark-submit: spark-submit --verbose \ --master yarn \

S3 committer for dynamic partitioning

2024-03-05 Thread Nikhil Goyal
Hi folks, We have been following this doc for writing data from a Spark job to S3. However, it fails when writing to dynamic partitions. Any suggestions on what config should be used to avoid the cost of renaming in S3?

Re: It seems --py-files only takes the first two arguments. Can someone please confirm?

2024-03-05 Thread Mich Talebzadeh
I personally use a zip file and pass the application name (in your case main.py) as the last input line, like below. APPLICATION is your main.py; it does not need to be called main.py. It could be anything, like testpython.py. CODE_DIRECTORY_CLOUD="gs://spark-on-k8s/codes" ## replace gs with s3 #

It seems --py-files only takes the first two arguments. Can someone please confirm?

2024-03-05 Thread Pedro, Chuck
Hi all, I am working in Databricks. When I submit a spark job with the --py-files argument, it seems the first two are read in but the third is ignored. "--py-files", "s3://some_path/appl_src.py", "s3://some_path/main.py", "s3://a_different_path/common.py", I can see the first two acknowledged

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Jungtaek Lim
Yeah the approach seems OK to me - please double check that the doc generation in Spark repo won't fail after the move of the js file. Other than that, it would be probably just a matter of updating the release process. On Tue, Mar 5, 2024 at 7:24 PM Pan,Bingkun wrote: > Okay, I see. > >

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Pan,Bingkun
Okay, I see. Perhaps we can solve this confusion by sharing the same file `version.json` across `all versions` in the `Spark website repo`? Make each version of the document display the `same` data in the dropdown menu. From: Jungtaek Lim Sent: March 5, 2024

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Jungtaek Lim
Let me be more specific. We have two active release version lines, 3.4.x and 3.5.x. We just released Spark 3.5.1, having a dropdown as 3.5.1 and 3.4.2 given the fact the last version of 3.4.x is 3.4.2. After a month we released Spark 3.4.3. In the dropdown of Spark 3.4.3, there will be 3.5.1 and

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Pan,Bingkun
Based on my understanding, we should not update versions that have already been released, such as the situation you mentioned: `But what about dropout of version D? Should we add E in the dropdown?` We only need to record the latest `version.json` file that has already been published at the

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Jungtaek Lim
But this does not answer my question about updating the dropdown for the doc of "already released versions", right? Let's say we just released version D, and the dropdown has version A, B, C. We have another release tomorrow as version E, and it's probably easy to add A, B, C, D in the dropdown

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Pan,Bingkun
According to my understanding, the original intention of this feature is that when a user has entered the PySpark documentation and finds that the version they are currently on is not the version they want, they can easily jump to the desired version by clicking on the drop-down box. Additionally, in

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-04 Thread yangjie01
That sounds like a great suggestion. From: Jungtaek Lim Date: Tuesday, March 5, 2024 10:46 To: Hyukjin Kwon Cc: yangjie01, Dongjoon Hyun, dev, user Subject: Re: [ANNOUNCE] Apache Spark 3.5.1 released Yes, it's relevant to that PR. I wonder, if we want to expose a version switcher, it should be in

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-04 Thread Jungtaek Lim
Yes, it's relevant to that PR. I wonder, if we want to expose version switcher, it should be in versionless doc (spark-website) rather than the doc being pinned to a specific version. On Tue, Mar 5, 2024 at 11:18 AM Hyukjin Kwon wrote: > Is this related to

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-04 Thread Hyukjin Kwon
Is this related to https://github.com/apache/spark/pull/42428? cc @Yang,Jie(INF) On Mon, 4 Mar 2024 at 22:21, Jungtaek Lim wrote: > Shall we revisit this functionality? The API doc is built with individual > versions, and for each individual version we depend on other released > versions.

Working with a text file that is both compressed by bz2 followed by zip in PySpark

2024-03-04 Thread Mich Talebzadeh
I have downloaded Amazon reviews for sentiment analysis from here. The file is not particularly large (just over 500MB) but comes in the following format: test.ft.txt.bz2.zip. So it is a text file that is compressed with bz2 and then zipped. Now I would like to do all these operations in PySpark. In
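One hedged way to approach this: Spark reads .bz2 text natively through Hadoop's codecs but has no built-in zip support, so the outer zip layer can be stripped first. The file and member names below are assumptions based on the post:

    import zipfile
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("amazon-reviews").getOrCreate()

    # Remove the outer zip layer locally, leaving the inner .bz2 member intact.
    with zipfile.ZipFile("test.ft.txt.bz2.zip") as zf:
        member = zf.namelist()[0]          # expected to be test.ft.txt.bz2
        zf.extract(member, path="/tmp")

    # Spark can then read the bz2-compressed text file directly.
    df = spark.read.text(f"file:///tmp/{member}")
    print(df.count())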

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-03 Thread Jungtaek Lim
Shall we revisit this functionality? The API doc is built with individual versions, and for each individual version we depend on other released versions. This does not seem to be right to me. Also, the functionality is only in PySpark API doc which does not seem to be consistent as well. I don't

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread Peter Toth
Congratulations and thanks Jungtaek for driving this! Xinrong Meng wrote (on Fri, Mar 1, 2024 at 5:24): > Congratulations! > > Thanks, > Xinrong > > On Thu, Feb 29, 2024 at 11:16 AM Dongjoon Hyun > wrote: > >> Congratulations! >> >> Bests, >> Dongjoon. >> >> On Wed, Feb 28, 2024 at

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread Jungtaek Lim
Thanks for reporting - this is odd - the dropdown did not exist in other recent releases. https://spark.apache.org/docs/3.5.0/api/python/index.html https://spark.apache.org/docs/3.4.2/api/python/index.html https://spark.apache.org/docs/3.3.4/api/python/index.html Looks like the dropdown feature

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread Dongjoon Hyun
BTW, Jungtaek. PySpark document seems to show a wrong branch. At this time, `master`. https://spark.apache.org/docs/3.5.1/api/python/index.html PySpark Overview Date: Feb 24, 2024 Version: master

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread John Zhuge
Excellent work, congratulations! On Wed, Feb 28, 2024 at 10:12 PM Dongjoon Hyun wrote: > Congratulations! > > Bests, > Dongjoon. > > On Wed, Feb 28, 2024 at 11:43 AM beliefer wrote: > >> Congratulations! >> >> >> >> At 2024-02-28 17:43:25, "Jungtaek Lim" >> wrote: >> >> Hi everyone, >> >> We

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread Prem Sahoo
Congratulations Sent from my iPhoneOn Feb 29, 2024, at 4:54 PM, Xinrong Meng wrote:Congratulations!Thanks,XinrongOn Thu, Feb 29, 2024 at 11:16 AM Dongjoon Hyun wrote:Congratulations!Bests,Dongjoon.On Wed, Feb 28, 2024 at 11:43 AM beliefer

Re: pyspark dataframe join with two different data type

2024-02-29 Thread Mich Talebzadeh
This is what you want: how to join two DFs with a string column in one and an array of strings in the other, keeping only rows where the string is present in the array. from pyspark.sql import SparkSession from pyspark.sql import Row from pyspark.sql.functions import expr spark =
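The reply above is truncated; a minimal sketch of the general approach with expr/array_contains (column names are illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import expr

    spark = SparkSession.builder.appName("string-array-join").getOrCreate()

    df1 = spark.createDataFrame([("a",), ("b",), ("x",)], ["key"])
    df2 = spark.createDataFrame([(1, ["a", "c"]), (2, ["b"]), (3, ["z"])],
                                ["id", "keys"])

    # Inner join keeps only rows where df1.key appears in the df2.keys array.
    joined = df1.join(df2, expr("array_contains(keys, key)"), "inner")
    joined.show()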

pyspark dataframe join with two different data type

2024-02-29 Thread Karthick Nk
Hi All, I have two DataFrames with the below structure, and I have to join them. The scenario: the join column is a string in one DataFrame and an array of strings in the other, so we have to inner join the two DataFrames and keep the data only if the string value is present in the array of strings

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread Xinrong Meng
Congratulations! Thanks, Xinrong On Thu, Feb 29, 2024 at 11:16 AM Dongjoon Hyun wrote: > Congratulations! > > Bests, > Dongjoon. > > On Wed, Feb 28, 2024 at 11:43 AM beliefer wrote: > >> Congratulations! >> >> >> >> At 2024-02-28 17:43:25, "Jungtaek Lim" >> wrote: >> >> Hi everyone, >> >> We

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-28 Thread Dongjoon Hyun
Congratulations! Bests, Dongjoon. On Wed, Feb 28, 2024 at 11:43 AM beliefer wrote: > Congratulations! > > > > At 2024-02-28 17:43:25, "Jungtaek Lim" > wrote: > > Hi everyone, > > We are happy to announce the availability of Spark 3.5.1! > > Spark 3.5.1 is a maintenance release containing

Re: [External] Re: Issue of spark with antlr version

2024-02-28 Thread Chawla, Parul
Hi, Can we get the Spark version on which this is resolved? From: Bjørn Jørgensen Sent: Tuesday, February 27, 2024 7:05:36 PM To: Sahni, Ashima Cc: Chawla, Parul ; user@spark.apache.org ; Misra Parashar, Jyoti ; Mekala, Rajesh ; Grandhi, Venkatesh ; George,

Re: [External] Re: Issue of spark with antlr version

2024-02-28 Thread Bjørn Jørgensen
[image: image.png] On Wed, Feb 28, 2024 at 11:28, Chawla, Parul wrote: > > Hi, > Can we get the Spark version on which this is resolved? > -- > *From:* Bjørn Jørgensen > *Sent:* Tuesday, February 27, 2024 7:05:36 PM > *To:* Sahni, Ashima > *Cc:* Chawla, Parul ;

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-28 Thread beliefer
Congratulations! At 2024-02-28 17:43:25, "Jungtaek Lim" wrote: Hi everyone, We are happy to announce the availability of Spark 3.5.1! Spark 3.5.1 is a maintenance release containing stability fixes. This release is based on the branch-3.5 maintenance branch of Spark. We strongly

[ANNOUNCE] Apache Spark 3.5.1 released

2024-02-28 Thread Jungtaek Lim
Hi everyone, We are happy to announce the availability of Spark 3.5.1! Spark 3.5.1 is a maintenance release containing stability fixes. This release is based on the branch-3.5 maintenance branch of Spark. We strongly recommend all 3.5 users to upgrade to this stable release. To download Spark

Re: [Spark Core] Potential bug in JavaRDD#countByValue

2024-02-27 Thread Mich Talebzadeh
Hi, Quick observations from what you have provided - The observed discrepancy between rdd.count() and rdd.map(Item::getType).countByValue() in distributed mode suggests a potential aggregation issue with countByValue(). The correct results in local mode give credence to this theory. - Workarounds

[Spark Core] Potential bug in JavaRDD#countByValue

2024-02-27 Thread Stuart Fehr
Hello, I recently encountered a bug with the results from JavaRDD#countByValue that does not reproduce when running locally. For background, we are running a Spark 3.5.0 job on AWS EMR 7.0.0. The code in question is something like this: JavaRDD rdd = // ... > rdd.count(); // 75187 // Get the

Re: Issue of spark with antlr version

2024-02-27 Thread Bjørn Jørgensen
[SPARK-44366][BUILD] Upgrade antlr4 to 4.13.1 On Tue, Feb 27, 2024 at 13:25, Sahni, Ashima wrote: > Hi Team, > > > > Can you please let us know the update on below. > > > > Thanks, > > Ashima > > > > *From:* Chawla, Parul > *Sent:* Sunday, February

Re: Issue of spark with antlr version

2024-02-27 Thread Mich Talebzadeh
Hi, You have provided little information about where Spark fits in here. So I am guessing :) Data Source (JSON, XML, log file, etc.) --> Preprocessing (Spark jobs for filtering, cleaning, etc.)? --> Antlr Parser (Generated tool) --> Extracted Data (Mapped to model) --> Spring Data Model (Java

RE: Issue of spark with antlr version

2024-02-27 Thread Sahni, Ashima
Hi Team, Can you please let us know the update on below. Thanks, Ashima From: Chawla, Parul Sent: Sunday, February 25, 2024 11:57 PM To: user@spark.apache.org Cc: Sahni, Ashima ; Misra Parashar, Jyoti Subject: Issue of spark with antlr version Hi Spark Team, Our application is currently

Unsubscribe

2024-02-27 Thread benson fang
Unsubscribe Regards

Re: Bugs with joins and SQL in Structured Streaming

2024-02-27 Thread Andrzej Zera
Hi, Yes, I tested all of them on Spark 3.5. Regards, Andrzej On Mon, Feb 26, 2024 at 23:24, Mich Talebzadeh wrote: > Hi, > > These are all on spark 3.5, correct? > > Mich Talebzadeh, > Dad | Technologist | Solutions Architect | Engineer > London > United Kingdom > > >view my Linkedin

Re: Bugs with joins and SQL in Structured Streaming

2024-02-26 Thread Mich Talebzadeh
Hi, These are all on spark 3.5, correct? Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* The

Bugs with joins and SQL in Structured Streaming

2024-02-26 Thread Andrzej Zera
Hey all, I've been using Structured Streaming in production for almost a year already and I want to share the bugs I found in this time. I created a test for each of the issues and put them all here: https://github.com/andrzejzera/spark-bugs/tree/main/spark-3.5/src/test/scala I split the issues

Re: Bintray replacement for spark-packages.org

2024-02-25 Thread Richard Eggert
I've been trying to obtain clarification on the terms of use regarding repo.spark-packages.org. I emailed feedb...@spark-packages.org two weeks ago, but have not heard back. Whom should I contact? On Mon, Apr 26, 2021 at 8:13 AM Bo Zhang wrote: > Hi Apache Spark users, > > As you might know,

Issue of spark with antlr version

2024-02-25 Thread Chawla, Parul
Hi Spark Team, Our application is currently using Spring Framework 5.3.31. To upgrade it to 6.x, as per application dependencies we must upgrade the Spark and Hibernate jars as well. With the Hibernate-compatible upgrade, the dependent Antlr4 jar version has been upgraded to 4.10.1, but there's no

unsubscribe

2024-02-24 Thread Ameet Kini

Re: job uuid not unique

2024-02-24 Thread Xin Zhang
unsubscribe On Sat, Feb 17, 2024 at 3:04 AM Рамик И wrote: > > Hi > I'm using Spark Streaming to read from Kafka and write to S3. Sometimes I > get errors when writing org.apache.hadoop.fs.FileAlreadyExistsException. > > Spark version: 3.5.0 > scala version : 2.13.8 > Cluster: k8s > >

Re: AQE coalesce 60G shuffle data into a single partition

2024-02-24 Thread Enrico Minack
Hi Shay, maybe this is related to the small number of output rows (1,250) of the last exchange step that consume those 60GB shuffle data. Looks like your outer transformation is something like df.groupBy($"id").agg(collect_list($"prop_name")) Have you tried adding a repartition as an attempt
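A hedged PySpark sketch of the suggestion (the config value, partition count and column names are illustrative only, not recommendations):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("aqe-coalesce-demo").getOrCreate()

    # Toy stand-in for the real (id, prop_name) input.
    df = spark.createDataFrame([(1, "a"), (1, "b"), (2, "c")], ["id", "prop_name"])

    # Option 1: lower the advisory partition size so AQE keeps more
    # post-shuffle partitions instead of collapsing them into one.
    spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "8m")

    # Option 2: repartition on the grouping key before aggregating, spreading
    # the shuffle read across many tasks.
    result = (df.repartition(400, "id")
                .groupBy("id")
                .agg(F.collect_list("prop_name").alias("prop_names")))
    result.show()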

Re: [Beginner Debug]: Executor OutOfMemoryError

2024-02-23 Thread Mich Talebzadeh
Seems like you are having memory issues. Examine your settings. 1. It appears that your driver memory setting is too high. It should be a fraction of the total memory provided by YARN. 2. Use the Spark UI to monitor the job's memory consumption. Check the Storage tab to see how memory is
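For illustration only, a sketch of where such memory settings live in PySpark (the sizes are placeholders, not recommendations):

    from pyspark.sql import SparkSession

    # Driver memory generally must be set before the driver JVM starts
    # (e.g. spark-submit --driver-memory 4g); executor settings can go here.
    spark = (SparkSession.builder
             .appName("memory-tuning-demo")
             .config("spark.executor.memory", "4g")          # JVM heap per executor
             .config("spark.executor.memoryOverhead", "1g")  # native/off-heap overhead
             .config("spark.executor.cores", "2")
             .getOrCreate())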

[Beginner Debug]: Executor OutOfMemoryError

2024-02-22 Thread Shawn Ligocki
Hi I'm new to Spark and I'm running into a lot of OOM issues while trying to scale up my first Spark application. I am running into these issues with only 1% of the final expected data size. Can anyone help me understand how to properly configure Spark to use limited memory or how to debug which

Re: unsubscribe

2024-02-21 Thread Xin Zhang
unsubscribe On Tue, Feb 20, 2024 at 9:44 PM kritika jain wrote: > Unsubscribe > > On Tue, 20 Feb 2024, 3:18 pm Крюков Виталий Семенович, > wrote: > >> >> unsubscribe >> >> >> -- Zhang Xin(张欣) Email:josseph.zh...@gmail.com

Re: Spark 4.0 Query Analyzer Bug Report

2024-02-21 Thread Mich Talebzadeh
Indeed, valid points raised, including the potential typo in the new Spark version. I suggest, in the meantime, you look at the so-called alternative debugging methods: - Simpler explain(): try basic explain() or explain("extended"). This might provide a less detailed, but

Kafka-based Spark Streaming and Vertex AI for Sentiment Analysis

2024-02-21 Thread Mich Talebzadeh
I am working on a pet project to implement a real-time sentiment analysis system for analyzing customer reviews. It leverages Kafka for data ingestion, Spark Structured Streaming (SSS) for real-time processing, and Vertex AI for sentiment analysis and potential action triggers. *Features* -

[ANNOUNCE] Apache Kyuubi 1.8.1 is available

2024-02-20 Thread Cheng Pan
Hi all, The Apache Kyuubi community is pleased to announce that Apache Kyuubi 1.8.1 has been released! Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses. Kyuubi provides a pure SQL gateway through Thrift JDBC/ODBC and RESTful

Re: Spark 3.3 Query Analyzer Bug Report

2024-02-20 Thread Sharma, Anup
Apologies. The issue is seen after we upgraded from Spark 3.1 to Spark 3.3. The same query runs fine on Spark 3.1. Please disregard the Spark version mentioned in the email subject earlier. Anup Error trace: query_result.explain(extended=True)\n File \"…/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py\"

Re: Spark 4.0 Query Analyzer Bug Report

2024-02-20 Thread Holden Karau
Do you mean Spark 3.4? 4.0 is very much not released yet. Also it would help if you could share your query & more of the logs leading up to the error. On Tue, Feb 20, 2024 at 3:07 PM Sharma, Anup wrote: > Hi Spark team, > > > > We ran into a dataframe issue after upgrading from spark 3.1 to 4.

Spark 4.0 Query Analyzer Bug Report

2024-02-20 Thread Sharma, Anup
Hi Spark team, We ran into a dataframe issue after upgrading from spark 3.1 to 4. query_result.explain(extended=True)\n File \"…/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py\" raise Py4JJavaError(\npy4j.protocol.Py4JJavaError: An error occurred while calling

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-20 Thread Manoj Kumar
Dear @Chao Sun, I trust you're doing well. Having worked extensively with Spark Nvidia Rapids, Velox, and Gluten, I'm now contemplating Comet's potential advantages over Velox in terms of performance and unique features. While Rapids leverages GPUs effectively, Gazelle's Intel AVX512 intrinsics

Re: unsubscribe

2024-02-20 Thread kritika jain
Unsubscribe On Tue, 20 Feb 2024, 3:18 pm Крюков Виталий Семенович, wrote: > > unsubscribe > > >

unsubscribe

2024-02-20 Thread Крюков Виталий Семенович
unsubscribe

Community Over Code Asia 2024 Travel Assistance Applications now open!

2024-02-20 Thread Gavin McDonald
Hello to all users, contributors and Committers! The Travel Assistance Committee (TAC) are pleased to announce that travel assistance applications for Community over Code Asia 2024 are now open! We will be supporting Community over Code Asia, Hangzhou, China July 26th - 28th, 2024. TAC exists

Re: [Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Mich Talebzadeh
Thanks for your kind words Sri. Well, it is true that as yet Spark on Kubernetes is not on par with Spark on YARN in maturity, and essentially Spark on Kubernetes is still a work in progress. So in the first place, IMO, one needs to think about why the executors are failing. What causes this behaviour? Is it the

Re: [Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Cheng Pan
Spark has supported the window-based executor failure-tracking mechanism for YARN for a long time, SPARK-41210[1][2] (included in 3.5.0) extended this feature to K8s. [1] https://issues.apache.org/jira/browse/SPARK-41210 [2] https://github.com/apache/spark/pull/38732 Thanks, Cheng Pan > On

Re: [Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Sri Potluri
Dear Mich, Thank you for your detailed response and the suggested approach to handling retry logic. I appreciate you taking the time to outline the method of embedding custom retry mechanisms directly into the application code. While the solution of wrapping the main logic of the Spark job in a

Re: [Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Mich Talebzadeh
Went through your issue with the code running on k8s When an executor of a Spark application fails, the system attempts to maintain the desired level of parallelism by automatically recreating a new executor to replace the failed one. While this behavior is beneficial for transient errors,

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-19 Thread Mich Talebzadeh
Ok thanks for your clarifications Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* The information

Re: [Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Mich Talebzadeh
Not that I am aware of any configuration parameter in Spark classic to limit executor creation. Because of fault tolerance Spark will try to recreate failed executors. Not really that familiar with the Spark operator for k8s. There may be something there. Have you considered custom monitoring and

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-19 Thread Chao Sun
Hi Mich, > Also have you got some benchmark results from your tests that you can possibly share? We only have some partial benchmark results internally so far. Once shuffle and better memory management have been introduced, we plan to publish the benchmark results (at least TPC-H) in the repo.

[Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Sri Potluri
Hello Spark Community, I am currently leveraging Spark on Kubernetes, managed by the Spark Operator, for running various Spark applications. While the system generally works well, I've encountered a challenge related to how Spark applications handle executor failures, specifically in scenarios

Re: Regarding Spark on Kubernetes(EKS)

2024-02-19 Thread Jagannath Majhi
Yes, I have gone through it. So please give me the setup. More context - my jar file is written in Java. On Mon, Feb 19, 2024, 8:53 PM Mich Talebzadeh wrote: > Sure but first it would be beneficial to understand the way Spark works on > Kubernetes and the concepts. > > Have a look at this article

Re: Regarding Spark on Kubernetes(EKS)

2024-02-19 Thread Jagannath Majhi
I am not using any private Docker image. I am only running the jar file in EMR using the spark-submit command, and now I want to run this jar file in EKS, so can you please tell me how I can set this up? On Mon, Feb 19, 2024, 8:06 PM Jagannath Majhi < jagannath.ma...@cloud.cbnits.com> wrote: >

Re: Regarding Spark on Kubernetes(EKS)

2024-02-19 Thread Mich Talebzadeh
Sure, but first it would be beneficial to understand the way Spark works on Kubernetes and the concepts. Have a look at this article of mine, Spark on Kubernetes, A Practitioner’s Guide

Re: Regarding Spark on Kubernetes(EKS)

2024-02-19 Thread Mich Talebzadeh
OK you have a jar file that you want to work with when running using Spark on k8s as the execution engine (EKS) as opposed to YARN on EMR as the execution engine? Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile

Re: Regarding Spark on Kubernetes(EKS)

2024-02-19 Thread Mich Talebzadeh
Where is your docker file? In the ECR container registry. If you are going to use EKS, it needs to be accessible to all nodes of the cluster. When you build your docker image, put your jar under the $SPARK_HOME directory. Then add a line to your docker build file as below. Here I am accessing Google

Re: Regarding Spark on Kubernetes(EKS)

2024-02-19 Thread Richard Smith
I run my Spark jobs in GCP with Google Dataproc using GCS buckets. I've not used AWS, but its EMR product offers similar functionality to Dataproc. The title of your post implies your Spark cluster runs on EKS. You might be better off using EMR, see links below: EMR

Regarding Spark on Kubernetes(EKS)

2024-02-19 Thread Jagannath Majhi
Dear Spark Community, I hope this email finds you well. I am reaching out to seek assistance and guidance regarding a task I'm currently working on involving Apache Spark. I have developed a JAR file that contains some Spark applications and functionality, and I need to run this JAR file within

Re: Re-create SparkContext of SparkSession inside long-lived Spark app

2024-02-19 Thread Mich Talebzadeh
OK, got it. Someone asked a similar question, though not related to shuffle, in the Spark Slack channel. This is a simple Python code that creates shuffle files in shuffle_directory = "/tmp/spark_shuffles" and simulates working examples using a loop, periodically cleaning up shuffle files older than 1
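The code referenced above is cut off; a rough sketch of that kind of age-based cleanup loop might look like this (the directory and the one-hour threshold are assumptions):

    import os
    import time

    shuffle_directory = "/tmp/spark_shuffles"
    max_age_seconds = 3600  # remove shuffle files older than one hour

    def cleanup_old_shuffle_files(directory, max_age):
        now = time.time()
        for root, _, files in os.walk(directory):
            for name in files:
                path = os.path.join(root, name)
                try:
                    if now - os.path.getmtime(path) > max_age:
                        os.remove(path)
                except OSError:
                    pass  # the file may already have been removed

    cleanup_old_shuffle_files(shuffle_directory, max_age_seconds)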

Re: Re-create SparkContext of SparkSession inside long-lived Spark app

2024-02-19 Thread Saha, Daniel
Thanks for the suggestions Mich, Jörn, and Adam. The rationale for a long-lived app with a loop, versus submitting multiple YARN applications, is mainly simplicity. The plan is to run the app on a multi-tenant EMR cluster alongside other YARN apps. Implementing the loop outside the Spark app will work but

Re: Re-create SparkContext of SparkSession inside long-lived Spark app

2024-02-18 Thread Mich Talebzadeh
Hi, What do you propose, or what do you think will help, when these Spark jobs are independent of each other? --> So once a job/iterator is complete, there is no need to retain these shuffle files. You have a number of options to consider, starting with Spark configuration parameters and so forth

Re: Re-create SparkContext of SparkSession inside long-lived Spark app

2024-02-17 Thread Jörn Franke
You can try to shuffle to s3 using the cloud shuffle plugin for s3 (https://aws.amazon.com/blogs/big-data/introducing-the-cloud-shuffle-storage-plugin-for-apache-spark/) - the performance of the new plugin is for many spark jobs sufficient (it works also on EMR). Then you can use s3 lifecycle

Re: Re-create SparkContext of SparkSession inside long-lived Spark app

2024-02-17 Thread Adam Binford
If you're using dynamic allocation it could be caused by executors with shuffle data being deallocated before the shuffle is cleaned up. These shuffle files will never get cleaned up once that happens until the Yarn application ends. This was a big issue for us so I added support for deleting

Re: job uuid not unique

2024-02-16 Thread Mich Talebzadeh
As a bare minimum you will need to add some error trapping and exception handling! scala> import org.apache.hadoop.fs.FileAlreadyExistsException import org.apache.hadoop.fs.FileAlreadyExistsException and try your code try { df .coalesce(1) .write
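The Scala snippet above is truncated; for comparison, a minimal PySpark sketch of the same error-trapping idea (the path and columns are placeholders):

    from pyspark.sql import SparkSession
    from py4j.protocol import Py4JJavaError

    spark = SparkSession.builder.appName("safe-write-demo").getOrCreate()
    df = spark.createDataFrame([(1, "a")], ["id", "value"])

    try:
        (df.coalesce(1)
           .write
           .mode("append")                       # or "overwrite", depending on intent
           .parquet("s3a://my-bucket/output/"))  # placeholder path
    except Py4JJavaError as err:
        # FileAlreadyExistsException and other Hadoop-side errors surface here.
        print(f"Write failed: {err.java_exception}")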

Effectively append the dataset to avro directory

2024-02-16 Thread Rushikesh Kavar
Hello Community, I checked the below issue on various platforms but could not get a satisfactory answer. I am using Spark with Java on a large data cluster. My application makes more than 10 API calls. Each call returns a Java list. Each list item has the same structure (i.e. the same Java class)
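The question is cut off above; as a rough sketch of the append pattern being asked about (shown in PySpark for brevity; the output path is a placeholder and the spark-avro module is assumed to be on the classpath):

    from functools import reduce
    from pyspark.sql import SparkSession, DataFrame

    spark = SparkSession.builder.appName("avro-append-demo").getOrCreate()

    # Stand-ins for the lists returned by the API calls (same schema each time).
    batches = [[(1, "a"), (2, "b")], [(3, "c")]]
    dfs = [spark.createDataFrame(rows, ["id", "value"]) for rows in batches]

    # Union all batches and write once...
    reduce(DataFrame.unionByName, dfs).write.format("avro") \
        .mode("append").save("/tmp/output_avro")

    # ...or append each batch to the same directory as it arrives:
    # df.write.format("avro").mode("append").save("/tmp/output_avro")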

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-16 Thread Mich Talebzadeh
Hi Chao, As a cool feature - Compared to standard Spark, what kind of performance gains can be expected with Comet? - Can one use Comet on k8s in conjunction with something like a Volcano addon? HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London
