Re: Spark File Output Committer algorithm for GCS

2023-07-18 Thread Dipayan Dev
it deletes and copies the partitions. My issue is something related to this - https://groups.google.com/g/cloud-dataproc-discuss/c/neMyhytlfyg?pli=1 With Best Regards, Dipayan Dev On Wed, Jul 19, 2023 at 12:06 AM Mich Talebzadeh wrote: > Spark has no role in creating that hive stag

Re: Spark File Output Committer algorithm for GCS

2023-07-18 Thread Mich Talebzadeh
Spark has no role in creating that hive staging directory. That directory belongs to Hive and Spark simply does ETL there, loading to the Hive managed table in your case, which ends up in the staging directory. I suggest that you review your design and use an external hive table with explicit location
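
A minimal sketch of the external-table suggestion above, assuming Hive support is enabled; the database, columns, and GCS location are hypothetical placeholders:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# External table over an explicit GCS path: the table's data lives at
# LOCATION, and dropping the table leaves the files in place.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS mydb.events (id BIGINT, payload STRING)
  PARTITIONED BY (load_date STRING)
  STORED AS ORC
  LOCATION 'gs://my-bucket/warehouse/events'
""")
```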

Re: Spark File Output Committer algorithm for GCS

2023-07-18 Thread Dipayan Dev
It does help performance but not significantly. I am just wondering: once Spark creates that staging directory along with the SUCCESS file, can we just do a gsutil rsync command and move these files to the original directory? Has anyone tried this approach, or does anyone foresee any concern? On Mon, 17 Jul 2023
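
A rough, untested sketch of the rsync idea being floated here, with hypothetical paths; whether this is safe to run once the SUCCESS file appears is exactly the open question of the thread:

```
import subprocess

# Hypothetical staging output and final table partition paths
staging = "gs://my-bucket/tmp/staging"
final = "gs://my-bucket/warehouse/my_table/load_date=2023-07-18"

# -m parallelizes transfers; rsync -r mirrors the staging tree into the
# final directory, avoiding per-object server-side renames.
subprocess.run(["gsutil", "-m", "rsync", "-r", staging, final], check=True)
```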

Re: Spark Scala SBT Local build fails

2023-07-17 Thread Varun Shah
++ DEV community On Mon, Jul 17, 2023 at 4:14 PM Varun Shah wrote: > Resending this message with a proper Subject line > > Hi Spark Community, > > I am trying to set up my forked apache/spark project locally for my 1st > Open Source Contribution, by building and creating a pa

Re: Spark Scala SBT Local build fails

2023-07-17 Thread Varun Shah
Hi Team, I am still looking for guidance here. Really appreciate anything that points me in the right direction. On Mon, Jul 17, 2023, 16:14 Varun Shah wrote: > Resending this message with a proper Subject line > > Hi Spark Community, > > I am trying to set up my forked apach

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Dipayan Dev
ll take a long time to perform this step. One workaround will be > to create smaller number of larger files if that is possible from Spark and > if this is not possible then those configurations allow for configuring the > threadpool which does the metadata copy. > > You can go thr

Re: Contributing to Spark MLLib

2023-07-17 Thread Gourav Sengupta
ote the > MLlib-specific contribution guidelines section in particular. > > https://spark.apache.org/contributing.html > > Since you are looking for something to start with, take a look at this > Jira query for starter issues. > > > https://issues.apache.org/jira/browse/S

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Jay
Fileoutputcommitter v2 is supported in GCS but the rename is a metadata copy and delete operation in GCS, and therefore if there are a large number of files it will take a long time to perform this step. One workaround will be to create a smaller number of larger files if that is possible from Spark and

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Mich Talebzadeh
restingly, it took only 10 minutes to write the output in the staging > directory and rest of the time it took to rename the objects. Thats the > concern. > > Looks like a known issue as spark behaves with GCS but not getting any > workaround for this. > > > On Mon, 17 Jul

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Dipayan Dev
It does support it - it doesn’t error out for me at least. But it took around 4 hours to finish the job. Interestingly, it took only 10 minutes to write the output in the staging directory and the rest of the time it took to rename the objects. That's the concern. Looks like a known issue as spark behaves

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Yeachan Park
er algorithms? > > I tried v2 algorithm but its not enhancing the runtime. What’s the best > practice in Dataproc for dynamic updates in Spark. > > > On Mon, 17 Jul 2023 at 7:05 PM, Jay wrote: > >> You can try increasing fs.gs.batch.threads and >> fs.gs.max.requests

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Dipayan Dev
Thanks Jay, I will try that option. Any insight on the file committer algorithms? I tried the v2 algorithm but it's not improving the runtime. What’s the best practice in Dataproc for dynamic updates in Spark? On Mon, 17 Jul 2023 at 7:05 PM, Jay wrote: > You can try increas

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Jay
You can try increasing fs.gs.batch.threads and fs.gs.max.requests.per.batch. The definitions for these flags are available here - https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md On Mon, 17 Jul 2023 at 14:59, Dipayan Dev wrote: > No, I am using Sp
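
A hedged sketch of wiring those connector flags into a Spark session; the values are illustrative only, and the linked CONFIGURATION.md is the authority on their semantics:

```
from pyspark.sql import SparkSession

# GCS connector (Hadoop) options are passed with the spark.hadoop. prefix.
spark = (SparkSession.builder
         .appName("gcs-rename-tuning")
         .config("spark.hadoop.fs.gs.batch.threads", "32")           # illustrative value
         .config("spark.hadoop.fs.gs.max.requests.per.batch", "30")  # illustrative value
         .getOrCreate())
```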

Spark Scala SBT Local build fails

2023-07-17 Thread Varun Shah
Resending this message with a proper Subject line Hi Spark Community, I am trying to set up my forked apache/spark project locally for my 1st Open Source Contribution, by building and creating a package as mentioned here under Running Individual Tests <https://spark.apache.org/develo

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Dipayan Dev
No, I am using Spark 2.4 to update the GCS partitions. I have a managed Hive table on top of this. [image: image.png] When I do a dynamic partition update from Spark, it creates the new file in a staging area as shown here. But the GCS blob renaming takes a lot of time. I have a partition based on

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Mich Talebzadeh
So you are using GCP and your Hive is installed on Dataproc which happens to run your Spark as well. Is that correct? What version of Hive are you using? HTH Mich Talebzadeh, Solutions Architect/Engineering Lead Palantir Technologies Limited London United Kingdom view my Linkedin profile

Spark File Output Committer algorithm for GCS

2023-07-17 Thread Dipayan Dev
Hi All, Of late, I have encountered the issue where I have to overwrite a lot of partitions of the Hive table through Spark. It looks like writing to hive_staging_directory takes 25% of the total time, whereas 75% or more of the time goes into moving the ORC files from the staging directory to the final

Re: Contributing to Spark MLLib

2023-07-16 Thread Brian Huynh
this Jira query for starter issues. https://issues.apache.org/jira/browse/SPARK-38719?jql=project%20%3D%20SPARK%20AND%20labels%20%3D%20%22starter%22%20AND%20status%20%3D%20Open Cheers, Brian On Sun, Jul 16, 2023 at 8:49 AM Dipayan Dev wrote: > Hi Spark Community, > > A very good morni

Contributing to Spark MLLib

2023-07-16 Thread Dipayan Dev
Hi Spark Community, A very good morning to you. I have been using Spark for the last few years now, and am new to the community. I am very much interested in becoming a contributor. I am looking to contribute to Spark MLLib. Can anyone please suggest how to start with contributing to any new MLLib feature? Is

[Spark RPC]: Yarn - Application Master / executors to Driver communication issue

2023-07-14 Thread Sunayan Saikia
Hey Spark Community, Our Jupyterhub/Jupyterlab (with spark client) runs behind two layers of HAProxy and the Yarn cluster runs remotely. We want to use deploy mode 'client' so that we can capture the output of any spark sql query in jupyterlab. I'm aware of other technologies like

Re: Unable to populate spark metrics using custom metrics API

2023-07-13 Thread Surya Soma
Gentle reminder on this. On Sat, Jul 8, 2023 at 7:59 PM Surya Soma wrote: > Hello, > > I am trying to publish custom metrics using Spark CustomMetric API as > supported since spark 3.2 https://github.com/apache/spark/pull/31476, > > > https://spark.apache.org/docs/3.2.

Re: Spark Not Connecting

2023-07-12 Thread Artemis User
Well, in that case, you may want to make sure your Spark server is running properly and that you can access the Spark UI using your browser. If you don't own the spark cluster, contact your spark admin. On 7/12/23 1:56 PM, timi ayoade wrote: I can't even connect to the spark UI O

Re: [EXTERNAL] Spark Not Connecting

2023-07-12 Thread Daniel Tavares de Santana
unsubscribe From: timi ayoade Sent: Wednesday, July 12, 2023 6:11 AM To: user@spark.apache.org Subject: [EXTERNAL] Spark Not Connecting Hi Apache spark community, I am a Data Engineer. I have been using Apache spark for some time now. I recently tried to use it

Spark Not Connecting

2023-07-12 Thread timi ayoade
Hi Apache spark community, I am a Data Engineer. I have been using Apache spark for some time now. I recently tried to use it but I have been getting some errors. I have tried debugging the error but to no avail; the screenshot is attached below. I would be glad for a response. Thanks

Re: Loading in custom Hive jars for spark

2023-07-11 Thread Mich Talebzadeh
Are you using Spark 3.4? Under directory $SPARK_HOME get a list of jar files for hive and hadoop. This one is for version 3.4.0 /opt/spark/jars> ltr *hive* *hadoop* -rw-r--r--. 1 hduser hadoop 717820 Apr 7 03:43 spark-hive_2.12-3.4.0.jar -rw-r--r--. 1 hduser hadoop 563632 Apr 7 03:43 sp

Loading in custom Hive jars for spark

2023-07-11 Thread Yeachan Park
Hi all, We made some changes to hive which require changes to the hive jars that Spark is bundled with. Since Spark 3.3.1 comes bundled with Hive 2.3.9 jars, we built our changes in Hive 2.3.9 and put the necessary jars under $SPARK_HOME/jars (replacing the original jars that were there

Unable to populate spark metrics using custom metrics API

2023-07-08 Thread Surya Soma
Hello, I am trying to publish custom metrics using Spark CustomMetric API as supported since spark 3.2 https://github.com/apache/spark/pull/31476, https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/connector/metric/CustomMetric.html I have created a custom metric implementing

Spark UI - Bug Executors tab when using proxy port

2023-07-06 Thread Bruno Pistone
Hello everyone, I’m really sorry to use this mailing list, but it seems impossible to report a strange behaviour that is happening with the Spark UI any other way. I’m also sending the link to the stackoverflow question here https://stackoverflow.com/questions/76632692/spark-ui-executors-tab-its-empty I’m

Performance Issue with Column Addition in Spark 3.4.x: Time Doubling with Increased Columns

2023-07-04 Thread KO Dukhyun
Dear spark users, I'm experiencing an unusual issue with Spark 3.4.x. When creating a new column as the sum of several existing columns, the time taken almost doubles as the number of columns increases. This operation doesn't require many resources, so I suspect there might be a pr

Re: Filtering JSON records when there isn't an exact schema match in Spark

2023-07-04 Thread Shashank Rao
ds = spark.read().schema(schema).option("mode", >>>> "PERMISSIVE").json("path").collect(); >>>> ds = ds.filter(functions.col("_corrupt_record").isNull()).collect(); >>>> >>>> However, I haven't figured out on how to

Re: Filtering JSON records when there isn't an exact schema match in Spark

2023-07-04 Thread Hill Liu
er, I haven't figured out on how to ignore record 5. >>> >>> Any help is appreciated. >>> >>> On Mon, 3 Jul 2023 at 19:24, Shashank Rao >>> wrote: >>> >>>> Hi all, >>>> I'm trying to read around 1,000,000 JSONL

Re: Filtering JSON records when there isn't an exact schema match in Spark

2023-07-04 Thread Shashank Rao
ctions.col("_corrupt_record").isNull()).collect(); >> >> However, I haven't figured out on how to ignore record 5. >> >> Any help is appreciated. >> >> On Mon, 3 Jul 2023 at 19:24, Shashank Rao >> wrote: >> >>> Hi all, >&g

Re: Filtering JSON records when there isn't an exact schema match in Spark

2023-07-03 Thread Vikas Kumar
> > On Mon, 3 Jul 2023 at 19:24, Shashank Rao wrote: > >> Hi all, >> I'm trying to read around 1,000,000 JSONL files present in S3 using >> Spark. Once read, I need to write them to BigQuery. >> I have a schema that may not be an exact match with all the

Re: Introducing English SDK for Apache Spark - Seeking Your Feedback and Contributions

2023-07-03 Thread Gavin Ray
Wow, really neat -- thanks for sharing! On Mon, Jul 3, 2023 at 8:12 PM Gengliang Wang wrote: > Dear Apache Spark community, > > We are delighted to announce the launch of a groundbreaking tool that aims > to make Apache Spark more user-friendly and accessible - the English

Re: Introducing English SDK for Apache Spark - Seeking Your Feedback and Contributions

2023-07-03 Thread Hyukjin Kwon
The demo was really amazing. On Tue, 4 Jul 2023 at 09:17, Farshid Ashouri wrote: > This is wonderful news! > > On Tue, 4 Jul 2023 at 01:14, Gengliang Wang wrote: > >> Dear Apache Spark community, >> >> We are delighted to announce the launch of a groundbreaking to

Re: Introducing English SDK for Apache Spark - Seeking Your Feedback and Contributions

2023-07-03 Thread Farshid Ashouri
This is wonderful news! On Tue, 4 Jul 2023 at 01:14, Gengliang Wang wrote: > Dear Apache Spark community, > > We are delighted to announce the launch of a groundbreaking tool that aims > to make Apache Spark more user-friendly and accessible - the English SDK > <

Introducing English SDK for Apache Spark - Seeking Your Feedback and Contributions

2023-07-03 Thread Gengliang Wang
Dear Apache Spark community, We are delighted to announce the launch of a groundbreaking tool that aims to make Apache Spark more user-friendly and accessible - the English SDK <https://github.com/databrickslabs/pyspark-ai/>. Powered by the application of Generative AI, the English SDK

Re: Filtering JSON records when there isn't an exact schema match in Spark

2023-07-03 Thread Shashank Rao
ds.filter(functions.col("_corrupt_record").isNull()).collect(); However, I haven't figured out on how to ignore record 5. Any help is appreciated. On Mon, 3 Jul 2023 at 19:24, Shashank Rao wrote: > Hi all, > I'm trying to read around 1,000,000 JSONL files present in S3 usin
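
A self-contained PySpark rendering of the approach quoted in this thread, with a hypothetical schema and path; as the thread notes, this catches malformed rows but does not by itself reject records that merely add or omit fields (record 5):

```
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("json-filter").getOrCreate()

# The schema must include the corrupt-record column for it to be populated.
schema = StructType([
    StructField("id", LongType()),
    StructField("name", StringType()),
    StructField("_corrupt_record", StringType()),
])

ds = (spark.read.schema(schema)
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .json("s3a://my-bucket/jsonl/"))  # hypothetical path

# Keep only rows that parsed cleanly against the schema.
clean = ds.filter(F.col("_corrupt_record").isNull()).drop("_corrupt_record")
```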

Re: [Spark SQL] Data objects from query history

2023-07-03 Thread Jack Wells
Hi Ruben, I’m not sure if this answers your question, but if you’re interested in exploring the underlying tables, you could always try something like the below in a Databricks notebook: display(spark.read.table('samples.nyctaxi.trips')) (For vanilla Spark users, it would be spark.read.table

Filtering JSON records when there isn't an exact schema match in Spark

2023-07-03 Thread Shashank Rao
Hi all, I'm trying to read around 1,000,000 JSONL files present in S3 using Spark. Once read, I need to write them to BigQuery. I have a schema that may not be an exact match with all the records. How can I filter records where there isn't an exact schema match: Eg: if my records wer

[Spark SQL] Data objects from query history

2023-06-30 Thread Ruben Mennes
Dear Apache Spark community, I hope this email finds you well. My name is Ruben, and I am an enthusiastic user of Apache Spark, specifically through the Databricks platform. I am reaching out to you today to seek your assistance and guidance regarding a specific use case. I have been

[PySpark] Intermittent Spark session initialization error on M1 Mac

2023-06-27 Thread BeoumSuk Kim
Hi, When I launch pyspark CLI on my M1 Macbook (standalone mode), I intermittently get the following error and the Spark session doesn't get initialized. 7~8 times out of 10, it doesn't have the issue, but it intermittently fails. And, this occurs only when I specify `spark.jars.packag

[Spark-SQL] Dataframe write saveAsTable failed

2023-06-26 Thread Anil Dasari
Hi, We recently upgraded Spark from 2.4.x to 3.3.1, and managed table creation while writing a dataframe via saveAsTable failed with the below error: Can not create the managed table(``) The associated location('hdfs:') already exists. At a high level, our code does the below before writing da

Re: Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Mich Talebzadeh
t just be the reality of trying to process a 240m record file > with 80+ columns, unless there's an obvious issue with my setup that > someone sees. The solution is likely going to involve increasing > parallelization. > > To that end, I extracted and re-zipped this file in bzip. Since

Re: Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Patrick Tucci
file with 80+ columns, unless there's an obvious issue with my setup that someone sees. The solution is likely going to involve increasing parallelization. To that end, I extracted and re-zipped this file in bzip. Since bzip is splittable and gzip is not, Spark can process the bzip file in par

Unable to populate spark metrics using custom metrics API

2023-06-26 Thread Surya Soma
Hello, I am trying to publish custom metrics using Spark CustomMetric API as supported since spark 3.2 https://github.com/apache/spark/pull/31476, https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/connector/metric/CustomMetric.html I have created a custom metric implementing

Re: Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Mich Talebzadeh
OK for now have you analyzed statistics in Hive external table spark-sql (default)> ANALYZE TABLE test.stg_t2 COMPUTE STATISTICS FOR ALL COLUMNS; spark-sql (default)> DESC EXTENDED test.stg_t2; Hive external tables have little optimization HTH Mich Talebzadeh, Solutions Architect/Engin

Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Patrick Tucci
Hello, I'm using Spark 3.4.0 in standalone mode with Hadoop 3.3.5. The master node has 2 cores and 8GB of RAM. There is a single worker node with 8 cores and 64GB of RAM. I'm trying to process a large pipe delimited file that has been compressed with gzip (9.2GB zipped, ~58GB unzip

Re: [Spark streaming]: Microbatch id in logs

2023-06-26 Thread Mich Talebzadeh
(df.take(1))) > 0: print(batchId) # do your logic else: print("DataFrame is empty") You should also have it in option('checkpointLocation', checkpoint_path). See this article on mine Processing Change Data Capture with Spark Structured S
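
A runnable sketch of the pattern the truncated snippet above describes, using the built-in rate source so it is self-contained; the checkpoint path is hypothetical:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-id-demo").getOrCreate()
df_stream = spark.readStream.format("rate").option("rowsPerSecond", "1").load()

# foreachBatch hands every micro-batch its batchId, which can be stamped
# onto log lines or passed into downstream logic.
def process(df, batch_id):
    if len(df.take(1)) > 0:
        print(f"processing micro-batch {batch_id}")  # do your logic here
    else:
        print(f"micro-batch {batch_id}: DataFrame is empty")

query = (df_stream.writeStream
         .foreachBatch(process)
         .option("checkpointLocation", "/tmp/checkpoints/batch-id-demo")
         .start())
query.awaitTermination()
```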

[Spark streaming]: Microbatch id in logs

2023-06-25 Thread Anil Dasari
Hi, I am using the spark 3.3.1 distribution and spark streaming in my application. Is there a way to add a microbatch id to all logs generated by spark and spark applications? Thanks.

Re: [ANNOUNCE] Apache Spark 3.4.1 released

2023-06-24 Thread yangjie01
to:mri...@gmail.com>> wrote: >> >> >> Thanks Dongjoon ! >> >> Regards, >> Mridul >> >> On Fri, Jun 23, 2023 at 6:58 PM Dongjoon Hyun > <mailto:dongj...@apache.org>> wrote: >>> >>> We are happy to announce the availability of A

Re:[ANNOUNCE] Apache Spark 3.4.1 released

2023-06-24 Thread beliefer
Thanks, Dongjoon Hyun! Congratulations! At 2023-06-24 07:57:05, "Dongjoon Hyun" wrote: We are happy to announce the availability of Apache Spark 3.4.1! Spark 3.4.1 is a maintenance release containing stability fixes. This release is based on the branch-3.4 maintenance branc

Apache Spark with watermark - processing data different LogTypes in same kafka topic

2023-06-24 Thread karan alang
Hello All - I'm using Apache Spark Structured Streaming to read data from Kafka topic, and do some processing. I'm using watermark to account for late-coming records and the code works fine. Here is the working (sample) code: ``` from pyspark.sql import SparkSession from pyspark.sql
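
Since the code block above lost its line breaks in the archive, here is a minimal watermark sketch in the same spirit, assuming the Kafka connector package is on the classpath; broker, topic, and window sizes are hypothetical:

```
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("watermark-demo").getOrCreate()

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical
          .option("subscribe", "events")                     # hypothetical topic
          .load()
          .select(F.col("timestamp").alias("event_time"), "value"))

# Records arriving more than 10 minutes behind the max seen event time
# are dropped from the 5-minute aggregation windows.
counts = (events
          .withWatermark("event_time", "10 minutes")
          .groupBy(F.window("event_time", "5 minutes"))
          .count())
```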

Re: [ANNOUNCE] Apache Spark 3.4.1 released

2023-06-23 Thread L. C. Hsieh
M Dongjoon Hyun wrote: >>> >>> We are happy to announce the availability of Apache Spark 3.4.1! >>> >>> Spark 3.4.1 is a maintenance release containing stability fixes. This >>> release is based on the branch-3.4 maintenance branch of Spark. We strongly

Re: [ANNOUNCE] Apache Spark 3.4.1 released

2023-06-23 Thread Hyukjin Kwon
Thanks! On Sat, Jun 24, 2023 at 11:01 AM Mridul Muralidharan wrote: > > Thanks Dongjoon ! > > Regards, > Mridul > > On Fri, Jun 23, 2023 at 6:58 PM Dongjoon Hyun wrote: > >> We are happy to announce the availability of Apache Spark 3.4.1! >> >> Spark

Re: [ANNOUNCE] Apache Spark 3.4.1 released

2023-06-23 Thread Mridul Muralidharan
Thanks Dongjoon ! Regards, Mridul On Fri, Jun 23, 2023 at 6:58 PM Dongjoon Hyun wrote: > We are happy to announce the availability of Apache Spark 3.4.1! > > Spark 3.4.1 is a maintenance release containing stability fixes. This > release is based on the branch-3.4 maintenance bra

[ANNOUNCE] Apache Spark 3.4.1 released

2023-06-23 Thread Dongjoon Hyun
We are happy to announce the availability of Apache Spark 3.4.1! Spark 3.4.1 is a maintenance release containing stability fixes. This release is based on the branch-3.4 maintenance branch of Spark. We strongly recommend all 3.4 users to upgrade to this stable release. To download Spark 3.4.1

Re: Spark using iceberg

2023-06-15 Thread Gaurav Agarwal
> HI > > I am using spark with iceberg, updating the table with 1700 columns , > We are loading 0.6 Million rows from parquet files ,in future it will be > 16 Million rows and trying to update the data in the table which has 16 > buckets . > Using the default partitioner of sp

Spark using iceberg

2023-06-15 Thread Gaurav Agarwal
Hi, I am using spark with iceberg, updating a table with 1700 columns. We are loading 0.6 million rows from parquet files; in future it will be 16 million rows. We are trying to update the data in the table, which has 16 buckets, using the default partitioner of spark. Also we don't d

Re: [Feature Request] create *permanent* Spark View from DataFrame via PySpark

2023-06-09 Thread Wenchen Fan
at group) > > You need to raise a JIRA for this request plus related doc related > > > Example JIRA > > https://issues.apache.org/jira/browse/SPARK-42485 > > and the related *Spark project improvement proposals (SPIP) *to be filled > in > > https://spark.apache.org/imp

Re: Apache Spark not reading UTC timestamp from MongoDB correctly

2023-06-08 Thread Enrico Minack
Sean is right, casting timestamps to strings (which is what show() does) uses the local timezone, either the Java default zone `user.timezone`, the Spark default zone `spark.sql.session.timeZone`, or the default DataFrameWriter zone `timeZone` (when writing to file). You say you are in PST
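
A small sketch combining both suggestions: inspect the zone-independent epoch value, and pin the session zone to UTC so show() renders in UTC. The sample data is hypothetical:

```
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("tz-check").getOrCreate()
spark.conf.set("spark.sql.session.timeZone", "UTC")  # display in UTC, not local time

df = (spark.createDataFrame([("2023-06-08 12:00:00",)], ["ts_str"])
      .withColumn("ts", F.to_timestamp("ts_str")))

# If the long value matches what MongoDB stored, only the display differed.
df.select(F.col("ts").cast("long").alias("epoch_seconds"), "ts").show()
```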

Re: Apache Spark not reading UTC timestamp from MongoDB correctly

2023-06-08 Thread Sean Owen
You sure it is not just that it's displaying in your local TZ? Check the actual value as a long for example. That is likely the same time. On Thu, Jun 8, 2023, 5:50 PM karan alang wrote: > ref : > https://stackoverflow.com/questions/76436159/apache-spark-not-reading-utc-timestamp-f

Apache Spark not reading UTC timestamp from MongoDB correctly

2023-06-08 Thread karan alang
ref : https://stackoverflow.com/questions/76436159/apache-spark-not-reading-utc-timestamp-from-mongodb-correctly Hello All, I've data stored in MongoDB collection and the timestamp column is not being read by Apache Spark correctly. I'm running Apache Spark on GCP Dataproc. Here is s

Re: [Feature Request] create *permanent* Spark View from DataFrame via PySpark

2023-06-04 Thread Mich Talebzadeh
Try sending it to d...@spark.apache.org (and join that group) You need to raise a JIRA for this request plus the related doc. Example JIRA: https://issues.apache.org/jira/browse/SPARK-42485 and the related *Spark project improvement proposals (SPIP) *to be filled in https

Re: [Feature Request] create *permanent* Spark View from DataFrame via PySpark

2023-06-04 Thread keen
Do Spark **devs** read this mailing list? Is there another/a better way to make feature requests? I tried in the past to write a mail to the dev mailing list but it did not show up at all. Cheers keen wrote on Thu, 1 June 2023, 07:11: > Hi all, > currently only *temporary* Spark Views

[Feature Request] create *permanent* Spark View from DataFrame via PySpark

2023-06-01 Thread keen
Hi all, currently only *temporary* Spark Views can be created from a DataFrame (df.createOrReplaceTempView or df.createOrReplaceGlobalTempView). When I want a *permanent* Spark View I need to specify it via Spark SQL (CREATE VIEW AS SELECT ...). Sometimes it is easier to specify the desired
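
For context, a sketch of the two paths the request contrasts; the table and view names are hypothetical:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df = spark.range(10).withColumnRenamed("id", "n")
df.write.mode("overwrite").saveAsTable("some_table")  # hypothetical catalog table

# DataFrame API today: temporary only, gone when the session ends.
df.createOrReplaceTempView("tmp_n")

# A permanent view has to be spelled out in SQL over catalog objects,
# which is exactly the detour this feature request wants to remove.
spark.sql("CREATE OR REPLACE VIEW perm_n AS SELECT n FROM some_table WHERE n > 5")
```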

Re: ChatGPT and prediction of Spark future

2023-06-01 Thread Mich Talebzadeh
Great stuff Winston. I added a channel in Slack Community for Spark https://sparkcommunitytalk.slack.com/archives/C05ACMS63RT cheers Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited London United Kingdom view my Linkedin profile <ht

Comparison of Trino, Spark, and Hive-MR3

2023-05-31 Thread Sungwoo Park
Hi everyone, We published an article on the performance and correctness of Trino, Spark, and Hive-MR3, and thought that it could be of interest to Spark users. https://www.datamonad.com/post/2023-05-31-trino-spark-hive-performance-1.7/ Omitted in the article is the performance of Spark 2.3.1 vs

Re: Viewing UI for spark jobs running on K8s

2023-05-31 Thread Qian Sun
Hi Nikhil Spark operator supports ingress for exposing all UIs of running spark applications. reference: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/quick-start-guide.md#driver-ui-access-and-ingress On Thu, Jun 1, 2023 at 6:19 AM Nikhil Goyal wrote: >

Re: ChatGPT and prediction of Spark future

2023-05-31 Thread Winston Lai
Hi Mich, I have been using ChatGPT free version, Bing AI, Google Bard and other AI chatbots. My use cases so far include writing, debugging code, generating documentation and explanation on Spark key terminologies for beginners to quickly pick up new concepts, summarizing pros and cons or

Viewing UI for spark jobs running on K8s

2023-05-31 Thread Nikhil Goyal
Hi folks, Is there an equivalent of the Yarn RM page for Spark on Kubernetes? We can port-forward the UI from the driver pod for each job, but this process is tedious given we have multiple jobs running. Is there a clever solution for exposing all Driver UIs in a centralized place? Thanks Nikhil

ChatGPT and prediction of Spark future

2023-05-31 Thread Mich Talebzadeh
I have started looking into ChatGPT as a consumer. The one I have tried is the free not plus version. I asked a question entitled "what is the future for spark" and asked for a concise response This was the answer "Spark has a promising future due to its capabilities in

Re: [Spark Structured Streaming]: Dynamic Scaling of Executors

2023-05-29 Thread Aishwarya Panicker
Hi, Thanks for your response. I understand there is no explicit way to configure dynamic scaling for Spark Structured Streaming as the ticket is still open for that. But is there a way to manage dynamic scaling with the existing Batch Dynamic scaling algorithm as this kicks in when Dynamic

Re: Re: maven with Spark 3.4.0 fails compilation

2023-05-29 Thread Bjørn Jørgensen
2.13.8 <https://github.com/apache/spark/blob/87a5442f7ed96b11051d8a9333476d080054e5a0/pom.xml#LL3606C46-L3606C46> you must change 2.13.6 to 2.13.8 man. 29. mai 2023 kl. 18:02 skrev Mich Talebzadeh : > Thanks everyone. Still not much progress :(. It is becoming a bit > confusing as

Re: Re: maven with Spark 3.4.0 fails compilation

2023-05-29 Thread Mich Talebzadeh
parkApplication.scala:52) at org.apache.spark.deploy.SparkSubmit.org $apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1020) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:215) at org.apac

Re: Re: maven with Spark 3.4.0 fails compilation

2023-05-29 Thread Bjørn Jørgensen
Change org.scala-lang scala-library 2.13.11-M2 to org.scala-lang scala-library ${scala.version} man. 29. mai 2023 kl. 13:20 skrev Lingzhe Sun : > Hi Mich, > > Spark 3.4.0 prebuilt with scala 2.13 is built with version 2.13.8 > <https://github.com/ap

Re: Re: maven with Spark 3.4.0 fails compilation

2023-05-29 Thread Lingzhe Sun
Hi Mich, Spark 3.4.0 prebuilt with scala 2.13 is built with version 2.13.8. Since you are using spark-core_2.13 and spark-sql_2.13, you should stick to the major (13) and the minor version (8). Not using any of these may cause unexpected behaviour (though scala claims compatibility among minor

Re: maven with Spark 3.4.0 fails compilation

2023-05-29 Thread Mich Talebzadeh
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) at org.apache.spark.deploy.SparkSubmit.org $apache$sp

Re: maven with Spark 3.4.0 fails compilation

2023-05-28 Thread Bjørn Jørgensen
ion` property matches the version of `scala-library` defined in your dependencies. In your case, you may want to change `scala.version` to `2.13.6`. Here's the corrected part of your POM: ```xml 1.7 1.7 UTF-8 2.13.6 2.15.2 ``` Additionally, ensure that the Scala versions in the

Re: [Spark Structured Streaming]: Dynamic Scaling of Executors

2023-05-25 Thread Mich Talebzadeh
Hi, Autoscaling is not compatible with Spark Structured Streaming <https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html> since Spark Structured Streaming currently does not support dynamic allocation (see SPARK-24815: Structured Streaming should support d

[Spark Structured Streaming]: Dynamic Scaling of Executors

2023-05-25 Thread Aishwarya Panicker
Hi Team, I have been working on Spark Structured Streaming and trying to autoscale our application through dynamic allocation. But I couldn't find any documentation or configurations that support dynamic scaling in Spark Structured Streaming, due to which I had been using Spark Batch

Re: Incremental Value dependents on another column of Data frame Spark

2023-05-24 Thread Enrico Minack
a shuffle involved. Raghavendra proposed to iterate the partitions. I presume that you partition by "Date" and order within partition by "Row", which puts multiple dates into one partition. Even if you have one date per partition, AQE might coale

Re: Incremental Value dependents on another column of Data frame Spark

2023-05-23 Thread Raghavendra Ganesh
can you guys > suggest a method to this scenario. Also for your concern this dataset is > really large; it has around 1 records and I am using spark with > scala > > Thank You, > Best Regards

Incremental Value dependents on another column of Data frame Spark

2023-05-23 Thread Nipuna Shantha
as a partition so row numbers cannot jumble. So can you guys suggest a method to this scenario. Also for your concern this dataset is really large; it has around 1 records and I am using spark with scala Thank You, Best Regards

Re: Re: [spark-core] Can executors recover/reuse shuffle files upon failure?

2023-05-22 Thread Mich Talebzadeh
Just to correct the last sentence, if we end up starting a new instance of Spark, I don't think it will be able to read the shuffle data from storage from another instance, I stand corrected. Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited London U

Re: Re: [spark-core] Can executors recover/reuse shuffle files upon failure?

2023-05-22 Thread Mich Talebzadeh
back to the source. In case of failure, there will be delay in getting the results. The amount of delay depends on how much reprocessing Spark needs to do. Spark, by itself, doesn't add executors when executors fail. It just moves the tasks to other executors. If you are installing plain va

RE: Re: [spark-core] Can executors recover/reuse shuffle files upon failure?

2023-05-22 Thread Maksym M
levant. We are running spark standalone mode. Best regards, maksym On 2023/05/17 12:28:35 vaquar khan wrote: > Following link you will get all required details > > https://aws.amazon.com/blogs/containers/best-practices-for-running-spark-on-amazon-eks/ > > Let me know if you

Re: Spark shuffle and inevitability of writing to Disk

2023-05-17 Thread Mich Talebzadeh
functions as F from pyspark.sql.functions import * from pyspark.sql.types import StructType, StructField, StringType,IntegerType, FloatType, TimestampType import time def main(): appName = "skew" spark = SparkSession.builder.appName(appName).getOrCreate() spa

Re: [spark-core] Can executors recover/reuse shuffle files upon failure?

2023-05-17 Thread vaquar khan
Following link you will get all required details https://aws.amazon.com/blogs/containers/best-practices-for-running-spark-on-amazon-eks/ Let me know if you require further information. Regards, Vaquar khan On Mon, May 15, 2023, 10:14 PM Mich Talebzadeh wrote: > Couple of points >

RE: Understanding Spark S3 Read Performance

2023-05-16 Thread info
: Understanding Spark S3 Read Performance Hi, I'm trying to set up a Spark pipeline which reads data from S3 and writes it into Google Big Query. Environment Details: --- Java 8, AWS EMR-6.10.0, Spark v3.3.1, 2 m5.xlarge executor nodes. S3 Directory structure: --- bucket-name: |---fo

Understanding Spark S3 Read Performance

2023-05-16 Thread Shashank Rao
Hi, I'm trying to set up a Spark pipeline which reads data from S3 and writes it into Google Big Query. Environment Details: --- Java 8 AWS EMR-6.10.0 Spark v3.3.1 2 m5.xlarge executor nodes S3 Directory structure: --- bucket-name: |---folder1: |---fo

Spark shuffle and inevitability of writing to Disk

2023-05-16 Thread Mich Talebzadeh
Hi, On the issue of Spark shuffle, it is accepted that shuffle *often involves* some if not all of the following: - Disk I/O - Data serialization and deserialization - Network I/O Excluding the external shuffle service, and without relying on the configuration options provided by spark for

Re: [spark-core] Can executors recover/reuse shuffle files upon failure?

2023-05-15 Thread Mich Talebzadeh
On Mon, 15 May 2023 at 13:11, Faiz Halde wrote: > Hello, > > We've been in touch with a few spark specialists who suggested us a > potential solution to improve the reliability of our jobs that are shuffle > heavy > > Here is what our setup looks like > >-

[spark-core] Can executors recover/reuse shuffle files upon failure?

2023-05-15 Thread Faiz Halde
Hello, We've been in touch with a few spark specialists who suggested us a potential solution to improve the reliability of our jobs that are shuffle heavy Here is what our setup looks like - Spark version: 3.3.1 - Java version: 1.8 - We do not use external shuffle service - W

Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-11 Thread Vijay B
In my view spark is behaving as expected. TL;DR: Every time a dataframe is reused, branched, or forked, the sequence of operations is evaluated again. Use cache or persist to avoid this behavior, and un-persist when no longer required; spark does not un-persist automatically. Couple of things
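
A short sketch of that advice, with a hypothetical input file:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("data.csv", header=True, inferSchema=True)  # hypothetical input

df.cache()                       # lazy; materialized by the first action below
total = df.count()               # branch 1: triggers the single file scan
sample = df.limit(10).collect()  # branch 2: served from the cached data
df.unpersist()                   # release explicitly; spark won't do it for you
```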

RE: Can Spark SQL (not DataFrame or Dataset) aggregate array into map of element of count?

2023-05-10 Thread Vijay B
Please see if this works -- aggregate array into map of element of count SELECT aggregate(array(1,2,3,4,5), map('cnt',0), (acc,x) -> map('cnt', acc.cnt+1)) as array_count thanks Vijay On 2023/05/05 19:32:04 Yong Zhang wrote: > Hi, This is on Spark 3.1 environment.
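
The snippet above counts all elements under one key; here is a hedged sketch of the per-element frequency variant the thread is after. The CAST fixes the empty map's value type, and map_filter/map_concat avoid duplicate keys:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
  SELECT aggregate(
           array('a', 'b', 'a', 'c', 'b', 'a'),
           CAST(map() AS MAP<STRING, INT>),
           (acc, x) -> map_concat(
                         map_filter(acc, (k, v) -> k != x),
                         map(x, coalesce(acc[x], 0) + 1))
         ) AS freq_cnt
""").show(truncate=False)
# expected: {a -> 3, b -> 2, c -> 1}
```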

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-09 Thread Mich Talebzadeh
When I run this job in local mode spark-submit --master local[4] with spark = SparkSession.builder \ .appName("tests") \ .enableHiveSupport() \ .getOrCreate() spark.conf.set("spark.sql.adaptive.enabled", "true") df3.explain(extended=Tru

Re: Can Spark SQL (not DataFrame or Dataset) aggregate array into map of element of count?

2023-05-09 Thread Yong Zhang
a'), map(), (acc, x) -> ???, acc -> acc) AS feq_cnt Here are my questions: * Is using "map()" above the best way? The "start" structure in this case should be Map.empty[String, Int], but of course, it won

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-09 Thread Nitin Siwach
v], PartitionFilters: [], PushedFilters: [], ReadSchema: struct ``` On Mon, May 8, 2023 at 1:07 AM Mich Talebzadeh wrote: > When I run this job in local mode spark-submit --master local[4] > > with > > spark = SparkSession.builder \ > .appName(&

Re: Can Spark SQL (not DataFrame or Dataset) aggregate array into map of element of count?

2023-05-09 Thread Yong Zhang
map(), (acc, x) -> ???, acc -> acc) AS feq_cnt Here are my questions: * Is using "map()" above the best way? The "start" structure in this case should be Map.empty[String, Int], but of course, it won't wor
