Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Dipayan Dev
Thanks Jay, I will try that option. Any insight on the file committer algorithms? I tried the v2 algorithm but it's not improving the runtime. What’s the best practice in Dataproc for dynamic updates in Spark? On Mon, 17 Jul 2023 at 7:05 PM, Jay wrote: > You can try increasing

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Jay
You can try increasing fs.gs.batch.threads and fs.gs.max.requests.per.batch. The definitions for these flags are available here - https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md On Mon, 17 Jul 2023 at 14:59, Dipayan Dev wrote: > No, I am using Spark
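For reference, the GCS connector options mentioned are Hadoop configuration keys, so in Spark they are typically passed with the spark.hadoop. prefix. A minimal sketch (the values here are illustrative, not tuned recommendations):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("gcs-committer-tuning")
    # GCS connector flags from the thread; values are placeholders to tune.
    .config("spark.hadoop.fs.gs.batch.threads", "32")
    .config("spark.hadoop.fs.gs.max.requests.per.batch", "30")
    .getOrCreate()
)
```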

Unsubscribe

2023-07-17 Thread Bode, Meikel
Unsubscribe

Spark Scala SBT Local build fails

2023-07-17 Thread Varun Shah
Resending this message with a proper subject line. Hi Spark Community, I am trying to set up my forked apache/spark project locally for my first open-source contribution, by building and creating a package as mentioned here under Running Individual Tests

Re: Unsubscribe

2023-07-17 Thread srini subramanian
Unsubscribe  On Monday, July 17, 2023 at 11:19:41 AM GMT+5:30, Bode, Meikel wrote: Unsubscribe

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Dipayan Dev
No, I am using Spark 2.4 to update the GCS partitions. I have a managed Hive table on top of this. [image: image.png] When I do a dynamic partition update in Spark, it creates the new file in a staging area as shown here. But the GCS blob renaming takes a lot of time. I have a partition based on

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Mich Talebzadeh
So you are using GCP and your Hive is installed on Dataproc which happens to run your Spark as well. Is that correct? What version of Hive are you using? HTH Mich Talebzadeh, Solutions Architect/Engineering Lead Palantir Technologies Limited London United Kingdom view my Linkedin profile

Spark File Output Committer algorithm for GCS

2023-07-17 Thread Dipayan Dev
Hi All, Of late, I have encountered the issue where I have to overwrite a lot of partitions of the Hive table through Spark. It looks like writing to the hive_staging_directory takes 25% of the total time, whereas 75% or more of the time goes into moving the ORC files from the staging directory to the final

Unsubscribe

2023-07-16 Thread Bode, Meikel
Unsubscribe

Re: Contributing to Spark MLLib

2023-07-16 Thread Brian Huynh
Good morning Dipayan, Happy to see another contributor! Please go through this document for contributors. Please note the MLlib-specific contribution guidelines section in particular. https://spark.apache.org/contributing.html Since you are looking for something to start with, take a look at

Contributing to Spark MLLib

2023-07-16 Thread Dipayan Dev
Hi Spark Community, A very good morning to you. I have been using Spark for the last few years and am new to the community. I am very much interested in becoming a contributor. I am looking to contribute to Spark MLLib. Can anyone please suggest how to start contributing to any new MLLib feature? Is

[no subject]

2023-07-16 Thread Varun Shah
Hi Spark Community, I am trying to set up my forked apache/spark project locally by building and creating a package as mentioned here under Running Individual Tests. Here are the steps I have followed: >> ./build/sbt # this

[Spark RPC]: Yarn - Application Master / executors to Driver communication issue

2023-07-14 Thread Sunayan Saikia
Hey Spark Community, Our Jupyterhub/Jupyterlab (with spark client) runs behind two layers of HAProxy and the Yarn cluster runs remotely. We want to use deploy mode 'client' so that we can capture the output of any spark sql query in jupyterlab. I'm aware of other technologies like Livy and Spark

Re: Unable to populate spark metrics using custom metrics API

2023-07-13 Thread Surya Soma
Gentle reminder on this. On Sat, Jul 8, 2023 at 7:59 PM Surya Soma wrote: > Hello, > > I am trying to publish custom metrics using Spark CustomMetric API as > supported since spark 3.2 https://github.com/apache/spark/pull/31476, > > >

Re: Spark Not Connecting

2023-07-12 Thread Artemis User
Well, in that case, you may want to make sure your Spark server is running properly and you can access the Spark UI using your browser. If you don't own the Spark cluster, contact your Spark admin. On 7/12/23 1:56 PM, timi ayoade wrote: I can't even connect to the spark UI On Wed, Jul

Re: [EXTERNAL] Spark Not Connecting

2023-07-12 Thread Daniel Tavares de Santana
unsubscribe From: timi ayoade Sent: Wednesday, July 12, 2023 6:11 AM To: user@spark.apache.org Subject: [EXTERNAL] Spark Not Connecting Hi Apache spark community, I am a Data Engineer. I have been using Apache spark for some time now. I recently tried to use it

Spark Not Connecting

2023-07-12 Thread timi ayoade
Hi Apache spark community, I am a Data Engineer. I have been using Apache Spark for some time now. I recently tried to use it but I have been getting some errors. I have tried debugging the error but to no avail; the screenshot is attached below. I will be glad if responded to. Thanks

Re: Loading in custom Hive jars for spark

2023-07-11 Thread Mich Talebzadeh
Are you using Spark 3.4? Under the directory $SPARK_HOME, get a list of jar files for Hive and Hadoop. This one is for version 3.4.0 /opt/spark/jars> ltr *hive* *hadoop* -rw-r--r--. 1 hduser hadoop 717820 Apr 7 03:43 spark-hive_2.12-3.4.0.jar -rw-r--r--. 1 hduser hadoop 563632 Apr 7 03:43

Loading in custom Hive jars for spark

2023-07-11 Thread Yeachan Park
Hi all, We made some changes to hive which require changes to the hive jars that Spark is bundled with. Since Spark 3.3.1 comes bundled with Hive 2.3.9 jars, we built our changes in Hive 2.3.9 and put the necessary jars under $SPARK_HOME/jars (replacing the original jars that were there),

Jobs that have join & have .rdd calls get executed 2x when AQE is enabled.

2023-07-11 Thread Priyanka Raju
We have a few Spark Scala jobs that are currently running in production. Most jobs typically use Dataset and Dataframes. There is a small piece of code in our custom library that makes RDD calls, for example to check if the dataframe is empty: df.rdd.getNumPartitions == 0. When I enable AQE for these jobs,
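A sketch of the pattern being described, with a Dataset-native alternative that avoids dropping to the RDD API (the sample dataframe is made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("aqe-empty-check").getOrCreate()
df = spark.range(10).toDF("id")

# The pattern from the thread: df.rdd leaves the Dataset API, which can
# cause the query plan to be executed again outside AQE's control.
is_empty_via_rdd = df.rdd.getNumPartitions() == 0

# A Dataset-native alternative: take(1) stops after finding one row.
is_empty = len(df.take(1)) == 0
```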

Re: PySpark error java.lang.IllegalArgumentException

2023-07-10 Thread elango vaidyanathan
Finally I was able to solve this issue by setting this conf: "spark.driver.extraJavaOptions=-Dorg.xerial.snappy.tempdir=/my_user/temp_folder" Thanks all! On Sat, 8 Jul 2023 at 3:45 AM, Brian Huynh wrote: > Hi Khalid, > > Elango mentioned the file is working fine in our other environment
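Spelled out as session configuration, the fix looks roughly like this (the temp directory is an example path and must exist and be writable on the driver host):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Point snappy-java at a usable temp directory on the driver.
    .config(
        "spark.driver.extraJavaOptions",
        "-Dorg.xerial.snappy.tempdir=/my_user/temp_folder",
    )
    .getOrCreate()
)
```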

Unsubscribe

2023-07-09 Thread chen...@birdiexx.com
Unsubscribe

Unable to populate spark metrics using custom metrics API

2023-07-08 Thread Surya Soma
Hello, I am trying to publish custom metrics using Spark CustomMetric API as supported since spark 3.2 https://github.com/apache/spark/pull/31476, https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/connector/metric/CustomMetric.html I have created a custom metric implementing

Unsubscribe

2023-07-08 Thread yixu2...@163.com
Unsubscribe yixu2...@163.com

Re: PySpark error java.lang.IllegalArgumentException

2023-07-07 Thread Brian Huynh
Hi Khalid, Elango mentioned the file is working fine in our other environment with the same driver and executor memory. Brian. On Jul 7, 2023, at 10:18 AM, Khalid Mammadov wrote: Perhaps that parquet file is corrupted or got that is in that folder? To check, try to read that file with pandas or other

Re: PySpark error java.lang.IllegalArgumentException

2023-07-07 Thread Khalid Mammadov
Perhaps that parquet file is corrupted or got that is in that folder? To check, try to read that file with pandas or other tools to see if you can read without Spark. On Wed, 5 Jul 2023, 07:25 elango vaidyanathan, wrote: > > Hi team, > > Any updates on this below issue > > On Mon, 3 Jul 2023 at
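A quick way to run that check outside Spark (assumes pyarrow or fastparquet is installed; the path is a placeholder):

```python
import pandas as pd

# If this also fails, the file itself is likely corrupt rather than Spark.
pdf = pd.read_parquet("/path/to/suspect/file.parquet")
print(pdf.head())
```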

Re: Unsubscribe

2023-07-07 Thread Atheeth SH
Please send an empty email to user-unsubscr...@spark.apache.org to unsubscribe yourself from the list. Thanks On Fri, 7 Jul 2023 at 12:05, Mihai Musat wrote: > Unsubscribe >

Unsubscribe

2023-07-07 Thread Mihai Musat
Unsubscribe

Spark UI - Bug Executors tab when using proxy port

2023-07-06 Thread Bruno Pistone
Hello everyone, I’m really sorry to use this mailing list, but it seems impossible to report a strange behaviour that is happening with the Spark UI. I’m also sending the link to the Stack Overflow question here: https://stackoverflow.com/questions/76632692/spark-ui-executors-tab-its-empty I’m

Re: PySpark error java.lang.IllegalArgumentException

2023-07-05 Thread elango vaidyanathan
Hi team, Any updates on the below issue? On Mon, 3 Jul 2023 at 6:18 PM, elango vaidyanathan wrote: > > > Hi all, > > I am reading a parquet file like this and it gives > java.lang.IllegalArgumentException. > However i can work with other parquet files (such as nyc taxi parquet > files)

Performance Issue with Column Addition in Spark 3.4.x: Time Doubling with Increased Columns

2023-07-04 Thread KO Dukhyun
Dear spark users, I'm experiencing an unusual issue with Spark 3.4.x. When creating a new column as the sum of several existing columns, the time taken almost doubles as the number of columns increases. This operation doesn't require many resources, so I suspect there might be a problem with

Re: Filtering JSON records when there isn't an exact schema match in Spark

2023-07-04 Thread Shashank Rao
Z is just an example. It could be anything. Basically, anything that's not in the schema should be filtered out. On Tue, 4 Jul 2023, 13:27 Hill Liu, wrote: > I think you can define schema with column z and filter out records with z > is null. > > On Tue, Jul 4, 2023 at 3:24 PM Shashank Rao >

Re: Filtering JSON records when there isn't an exact schema match in Spark

2023-07-04 Thread Hill Liu
I think you can define the schema with column z and filter out records where z is null. On Tue, Jul 4, 2023 at 3:24 PM Shashank Rao wrote: > Yes, drop malformed does filter out record4. However, record 5 is not. > > On Tue, 4 Jul 2023 at 07:41, Vikas Kumar wrote: > >> Have you tried dropmalformed

Re: Filtering JSON records when there isn't an exact schema match in Spark

2023-07-04 Thread Shashank Rao
Yes, dropMalformed does filter out record 4. However, record 5 is not filtered. On Tue, 4 Jul 2023 at 07:41, Vikas Kumar wrote: > Have you tried dropmalformed option ? > > On Mon, Jul 3, 2023, 1:34 PM Shashank Rao wrote: > >> Update: Got it working by using the *_corrupt_record *field for the >> first

Re: Filtering JSON records when there isn't an exact schema match in Spark

2023-07-03 Thread Vikas Kumar
Have you tried the dropMalformed option? On Mon, Jul 3, 2023, 1:34 PM Shashank Rao wrote: > Update: Got it working by using the *_corrupt_record *field for the first > case (record 4) > > schema = schema.add("_corrupt_record", DataTypes.StringType); > Dataset ds =

Re: Introducing English SDK for Apache Spark - Seeking Your Feedback and Contributions

2023-07-03 Thread Gavin Ray
Wow, really neat -- thanks for sharing! On Mon, Jul 3, 2023 at 8:12 PM Gengliang Wang wrote: > Dear Apache Spark community, > > We are delighted to announce the launch of a groundbreaking tool that aims > to make Apache Spark more user-friendly and accessible - the English SDK >

Re: Introducing English SDK for Apache Spark - Seeking Your Feedback and Contributions

2023-07-03 Thread Hyukjin Kwon
The demo was really amazing. On Tue, 4 Jul 2023 at 09:17, Farshid Ashouri wrote: > This is wonderful news! > > On Tue, 4 Jul 2023 at 01:14, Gengliang Wang wrote: > >> Dear Apache Spark community, >> >> We are delighted to announce the launch of a groundbreaking tool that >> aims to make Apache

Re: Introducing English SDK for Apache Spark - Seeking Your Feedback and Contributions

2023-07-03 Thread Farshid Ashouri
This is wonderful news! On Tue, 4 Jul 2023 at 01:14, Gengliang Wang wrote: > Dear Apache Spark community, > > We are delighted to announce the launch of a groundbreaking tool that aims > to make Apache Spark more user-friendly and accessible - the English SDK >

Introducing English SDK for Apache Spark - Seeking Your Feedback and Contributions

2023-07-03 Thread Gengliang Wang
Dear Apache Spark community, We are delighted to announce the launch of a groundbreaking tool that aims to make Apache Spark more user-friendly and accessible - the English SDK. Powered by the application of Generative AI, the English SDK

Re: Filtering JSON records when there isn't an exact schema match in Spark

2023-07-03 Thread Shashank Rao
Update: Got it working by using the _corrupt_record field for the first case (record 4) schema = schema.add("_corrupt_record", DataTypes.StringType); Dataset ds = spark.read().schema(schema).option("mode", "PERMISSIVE").json("path").collect(); ds =
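The thread's snippet is Java and cut off; a PySpark sketch of the same _corrupt_record technique (the fields x and y follow the example records earlier in the thread; the input path is a placeholder):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, LongType, StringType

spark = SparkSession.builder.appName("json-schema-filter").getOrCreate()

# Expected schema plus a _corrupt_record column to catch mismatches.
schema = StructType([
    StructField("x", LongType()),
    StructField("y", LongType()),
    StructField("_corrupt_record", StringType()),
])

ds = spark.read.schema(schema).option("mode", "PERMISSIVE").json("s3a://bucket/path/")

# Keep only records that parsed cleanly against the expected schema.
# (Some Spark versions require caching before filtering on _corrupt_record.)
clean = ds.cache().filter(F.col("_corrupt_record").isNull()).drop("_corrupt_record")
```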

Re: [Spark SQL] Data objects from query history

2023-07-03 Thread Jack Wells
Hi Ruben, I’m not sure if this answers your question, but if you’re interested in exploring the underlying tables, you could always try something like the below in a Databricks notebook: display(spark.read.table('samples.nyctaxi.trips')) (For vanilla Spark users, it would be
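The vanilla equivalent is presumably along these lines (a guess at where the truncated snippet was heading):

```python
# Without Databricks' display(), just show the rows directly.
spark.read.table("samples.nyctaxi.trips").show(5)
```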

Filtering JSON records when there isn't an exact schema match in Spark

2023-07-03 Thread Shashank Rao
Hi all, I'm trying to read around 1,000,000 JSONL files present in S3 using Spark. Once read, I need to write them to BigQuery. I have a schema that may not be an exact match with all the records. How can I filter records where there isn't an exact schema match: Eg: if my records were: {"x": 1,

CFP for the 2nd Performance Engineering track at Community over Code NA 2023

2023-07-03 Thread Brebner, Paul
Hi Apache Spark people - There are only 10 days left to submit a talk proposal (title and abstract only) for Community over Code NA 2023. The 2nd Performance Engineering track is on this year, so any Apache project-related performance and scalability talks are welcome. Here's the CFP for more

PySpark error java.lang.IllegalArgumentException

2023-07-03 Thread elango vaidyanathan
Hi all, I am reading a parquet file like this and it gives java.lang.IllegalArgumentException. However i can work with other parquet files (such as nyc taxi parquet files) without any issue. I have copied the full error log as well. Can you please check once and let me know how to fix this?

[Spark SQL] Data objects from query history

2023-06-30 Thread Ruben Mennes
Dear Apache Spark community, I hope this email finds you well. My name is Ruben, and I am an enthusiastic user of Apache Spark, specifically through the Databricks platform. I am reaching out to you today to seek your assistance and guidance regarding a specific use case. I have been

checkpoint file deletion

2023-06-29 Thread Lingzhe Sun
Hi all, I'm running a stateful structured streaming job, and configured spark.cleaner.referenceTracking.cleanCheckpoints to true and spark.cleaner.periodicGC.interval to 1min in the config. But the checkpoint files are not deleted and the number of them keeps growing. Did I miss something?
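For context, those two settings are usually passed like this (a sketch; note they drive Spark's ContextCleaner, which handles RDD checkpoint data and may not remove structured streaming checkpoint locations):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
    .config("spark.cleaner.periodicGC.interval", "1min")
    .getOrCreate()
)
```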

Unsubscribe

2023-06-29 Thread lee
Unsubscribe. 李杰 (leedd1...@163.com)

Unsubscribe

2023-06-28 Thread Ghazi Naceur
Unsubscribe

Re:subscribe

2023-06-28 Thread mojianan2015
test At 2023-06-29 10:21:56, "mojianan2015" wrote: Test. mojianan2015 (mojianan2...@163.com)

subscribe

2023-06-28 Thread mojianan2015
Test. mojianan2015 (mojianan2...@163.com)

[PySpark] Intermittent Spark session initialization error on M1 Mac

2023-06-27 Thread BeoumSuk Kim
Hi, When I launch pyspark CLI on my M1 Macbook (standalone mode), I intermittently get the following error and the Spark session doesn't get initialized. 7~8 times out of 10, it doesn't have the issue, but it intermittently fails. And, this occurs only when I specify `spark.jars.packages` option.

[k8s] Fail to expose custom port on executor container specified in my executor pod template

2023-06-26 Thread James Yu
Hi Team, I have no luck in trying to expose port 5005 (for remote debugging purposes) on my executor container using the following pod template and Spark configuration: s3a://mybucket/pod-template-executor-debug.yaml

[Spark-SQL] Dataframe write saveAsTable failed

2023-06-26 Thread Anil Dasari
Hi, We have upgraded Spark from 2.4.x to 3.3.1 recently, and managed table creation while writing a dataframe with saveAsTable failed with the error below. Can not create the managed table(``) The associated location('hdfs:') already exists. At a high level, our code does the following before writing the dataframe as

Re: Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Mich Talebzadeh
OK, good news. You have made some progress here :) bzip (bzip2) works (it is splittable) because it is block-oriented, whereas gzip is stream-oriented. I also noticed that you are creating a managed ORC file. You can bucket and partition an ORC (Optimized Row Columnar) format table. An example below:
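The example itself is cut off above; a sketch of the kind of bucketed, partitioned ORC CTAS being described (table and column names are made up):

```python
# Create a partitioned, bucketed ORC table from the staging table.
spark.sql("""
    CREATE TABLE test.orc_target
    USING ORC
    PARTITIONED BY (event_date)
    CLUSTERED BY (customer_id) INTO 16 BUCKETS
    AS SELECT * FROM test.stg_t2
""")
```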

Re: Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Patrick Tucci
Hi Mich, Thanks for the reply. I started running ANALYZE TABLE on the external table, but the progress was very slow. The stage had only read about 275MB in 10 minutes. That equates to about 5.5 hours just to analyze the table. This might just be the reality of trying to process a 240m record

Unable to populate spark metrics using custom metrics API

2023-06-26 Thread Surya Soma
Hello, I am trying to publish custom metrics using Spark CustomMetric API as supported since spark 3.2 https://github.com/apache/spark/pull/31476, https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/connector/metric/CustomMetric.html I have created a custom metric implementing

Re: Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Mich Talebzadeh
OK, for now, have you analyzed statistics on the Hive external table? spark-sql (default)> ANALYZE TABLE test.stg_t2 COMPUTE STATISTICS FOR ALL COLUMNS; spark-sql (default)> DESC EXTENDED test.stg_t2; Hive external tables have little optimization. HTH Mich Talebzadeh, Solutions Architect/Engineering

Unsubscribe

2023-06-26 Thread Ghazi Naceur
Unsubscribe

Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Patrick Tucci
Hello, I'm using Spark 3.4.0 in standalone mode with Hadoop 3.3.5. The master node has 2 cores and 8GB of RAM. There is a single worker node with 8 cores and 64GB of RAM. I'm trying to process a large pipe delimited file that has been compressed with gzip (9.2GB zipped, ~58GB unzipped, ~241m

Re: [Spark streaming]: Microbatch id in logs

2023-06-26 Thread Mich Talebzadeh
In SSS: writeStream. \ outputMode('append'). \ option("truncate", "false"). \ foreachBatch(SendToBigQuery). \ option('checkpointLocation', checkpoint_path). \ so this writeStream will call
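A runnable skeleton of that pattern (assumes streaming_df is an existing streaming DataFrame; the sink function and checkpoint path are placeholders):

```python
def send_to_bigquery(batch_df, batch_id):
    # Each micro-batch arrives here as a plain DataFrame; write it anywhere.
    print(f"batch {batch_id}: {batch_df.count()} rows")  # stand-in for the sink

query = (
    streaming_df.writeStream
    .outputMode("append")
    .option("checkpointLocation", "/tmp/checkpoints/send_to_bq")
    .foreachBatch(send_to_bigquery)
    .start()
)
```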

[Spark streaming]: Microbatch id in logs

2023-06-25 Thread Anil Dasari
Hi, I am using the Spark 3.3.1 distribution and Spark streaming in my application. Is there a way to add a microbatch id to all logs generated by Spark and Spark applications? Thanks.

Re: [ANNOUNCE] Apache Spark 3.4.1 released

2023-06-24 Thread yangjie01
Thanks Dongjoon ~ On 2023/6/24 10:29, "L. C. Hsieh" <vii...@gmail.com> wrote: Thanks Dongjoon! On Fri, Jun 23, 2023 at 7:10 PM Hyukjin Kwon <gurwls...@apache.org> wrote: > > Thanks! > > On Sat, Jun 24, 2023 at 11:01 AM Mridul Muralidharan > wrote: >> >> >>

Re:[ANNOUNCE] Apache Spark 3.4.1 released

2023-06-24 Thread beliefer
Thanks, Dongjoon Hyun. Congratulations too! At 2023-06-24 07:57:05, "Dongjoon Hyun" wrote: We are happy to announce the availability of Apache Spark 3.4.1! Spark 3.4.1 is a maintenance release containing stability fixes. This release is based on the branch-3.4 maintenance branch of Spark.

Apache Spark with watermark - processing data different LogTypes in same kafka topic

2023-06-24 Thread karan alang
Hello All - I'm using Apache Spark Structured Streaming to read data from a Kafka topic and do some processing. I'm using a watermark to account for late-coming records and the code works fine. Here is the working (sample) code: ``` from pyspark.sql import SparkSession from pyspark.sql.functions
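The code above is truncated; a skeleton of the Kafka-plus-watermark setup being described (broker and topic names are placeholders):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("kafka-watermark").getOrCreate()

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "logs-topic")
    .load()
)

events = raw.select(
    F.col("timestamp"),
    F.col("value").cast("string").alias("payload"),
)

# Accept records up to 10 minutes late relative to the max event time seen.
counts = (
    events.withWatermark("timestamp", "10 minutes")
    .groupBy(F.window("timestamp", "5 minutes"))
    .count()
)
```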

Re: [ANNOUNCE] Apache Spark 3.4.1 released

2023-06-23 Thread L. C. Hsieh
Thanks Dongjoon! On Fri, Jun 23, 2023 at 7:10 PM Hyukjin Kwon wrote: > > Thanks! > > On Sat, Jun 24, 2023 at 11:01 AM Mridul Muralidharan wrote: >> >> >> Thanks Dongjoon ! >> >> Regards, >> Mridul >> >> On Fri, Jun 23, 2023 at 6:58 PM Dongjoon Hyun wrote: >>> >>> We are happy to announce the

Re: [ANNOUNCE] Apache Spark 3.4.1 released

2023-06-23 Thread Hyukjin Kwon
Thanks! On Sat, Jun 24, 2023 at 11:01 AM Mridul Muralidharan wrote: > > Thanks Dongjoon ! > > Regards, > Mridul > > On Fri, Jun 23, 2023 at 6:58 PM Dongjoon Hyun wrote: > >> We are happy to announce the availability of Apache Spark 3.4.1! >> >> Spark 3.4.1 is a maintenance release containing

Re: [ANNOUNCE] Apache Spark 3.4.1 released

2023-06-23 Thread Mridul Muralidharan
Thanks Dongjoon ! Regards, Mridul On Fri, Jun 23, 2023 at 6:58 PM Dongjoon Hyun wrote: > We are happy to announce the availability of Apache Spark 3.4.1! > > Spark 3.4.1 is a maintenance release containing stability fixes. This > release is based on the branch-3.4 maintenance branch of Spark.

[ANNOUNCE] Apache Spark 3.4.1 released

2023-06-23 Thread Dongjoon Hyun
We are happy to announce the availability of Apache Spark 3.4.1! Spark 3.4.1 is a maintenance release containing stability fixes. This release is based on the branch-3.4 maintenance branch of Spark. We strongly recommend all 3.4 users to upgrade to this stable release. To download Spark 3.4.1,

Re: Rename columns without manually setting them all

2023-06-21 Thread Bjørn Jørgensen
data = { "Employee ID": [12345, 12346, 12347, 12348, 12349], "Name": ["Dummy x", "Dummy y", "Dummy z", "Dummy a", "Dummy b"], "Client": ["Dummy a", "Dummy b", "Dummy c", "Dummy d", "Dummy e"], "Project": ["abc", "def", "ghi", "jkl", "mno"], "Team": ["team a", "team b", "team

Re: Rename columns without manually setting them all

2023-06-21 Thread Farshid Ashouri
You can use selectExpr and stack to achieve the same effect in PySpark: df = spark.read.csv("your_file.csv", header=True, inferSchema=True) date_columns = [col for col in df.columns if '/' in col] df = df.selectExpr(["`Employee ID`", "`Name`", "`Client`", "`Project`", "`Team`"] +
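The snippet cuts off at the stack() call; a hedged completion under the same column layout (the output column names work_date and status are invented for illustration):

```python
df = spark.read.csv("your_file.csv", header=True, inferSchema=True)
date_columns = [c for c in df.columns if "/" in c]

# stack(n, 'label1', `col1`, ...) unpivots the date columns into rows.
stack_expr = "stack({n}, {args}) as (work_date, status)".format(
    n=len(date_columns),
    args=", ".join("'{0}', `{0}`".format(c) for c in date_columns),
)

df = df.selectExpr(
    ["`Employee ID`", "`Name`", "`Client`", "`Project`", "`Team`", stack_expr]
)
```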

Rename columns without manually setting them all

2023-06-21 Thread John Paul Jayme
Hi, This is currently my column definition:

Employee ID | Name | Client | Project | Team | 01/01/2022 | 02/01/2022 | 03/01/2022 | 04/01/2022 | 05/01/2022
12345 | Dummy x | Dummy a | abc | team a | OFF | WO | WH | WH | WH

As you can see, the outer columns are just

Re: How to read excel file in PySpark

2023-06-20 Thread Mich Talebzadeh
OK thanks for the info. Regards, Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh Disclaimer:

Unsubscribe

2023-06-20 Thread Bhargava Sukkala
-- Thanks, Bhargava Sukkala. Cell no:216-278-1066 MS in Business Analytics, Arizona State University.

Re: How to read excel file in PySpark

2023-06-20 Thread Bjørn Jørgensen
Yes, p_df = DF.toPandas() is THE pandas, the one you know. Change p_df = DF.toPandas() to p_df = DF.pandas_on_spark() or p_df = DF.to_pandas_on_spark() or p_df = DF.pandas_api() or p_df = DF.to_koalas() https://spark.apache.org/docs/latest/api/python/migration_guide/koalas_to_pyspark.html

Re: How to read excel file in PySpark

2023-06-20 Thread Mich Talebzadeh
OK thanks. So the issue seems to be creating a Pandas DF from a Spark DF (I do it for plotting with something like: import matplotlib.pyplot as plt; p_df = DF.toPandas(); p_df.plot()). I guess that stays on the driver. Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir

Re: Shuffle data on pods which get decomissioned

2023-06-20 Thread Mich Talebzadeh
If one executor fails, it moves the processing over to another executor. However, if the data is lost, it re-executes the processing that generated the data, and might have to go back to the source. Does this mean that only those tasks that the dead executor was executing at the time need to be

Re: How to read excel file in PySpark

2023-06-20 Thread Sean Owen
No, a pandas on Spark DF is distributed. On Tue, Jun 20, 2023, 1:45 PM Mich Talebzadeh wrote: > Thanks but if you create a Spark DF from Pandas DF that Spark DF is not > distributed and remains on the driver. I recall a while back we had this > conversation. I don't think anything has changed.

Re: How to read excel file in PySpark

2023-06-20 Thread Mich Talebzadeh
Thanks, but if you create a Spark DF from a Pandas DF, that Spark DF is not distributed and remains on the driver. I recall a while back we had this conversation. I don't think anything has changed. Happy to be corrected. Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir

Re: How to read excel file in PySpark

2023-06-20 Thread Bjørn Jørgensen
Pandas API on Spark is an API so that users can use Spark as they use pandas. This was known as Koalas. Is this limitation still valid for Pandas? For pandas, yes. But what I did show was pandas API on Spark, so it's Spark. Additionally, when we convert from a Pandas DF to a Spark DF, what process is

Shuffle data on pods which get decomissioned

2023-06-20 Thread Nikhil Goyal
Hi folks, When running Spark on K8s, what would happen to shuffle data if an executor is terminated or lost. Since there is no shuffle service, does all the work done by that executor gets recomputed? Thanks Nikhil

Re: How to read excel file in PySpark

2023-06-20 Thread Mich Talebzadeh
Whenever someone mentions Pandas I automatically think of it as an excel sheet for Python. OK, my point below needs some qualification. Why Spark here? Generally, parallel architecture comes into play when the data size is significantly large and cannot be handled on a single machine; hence, the

Re: How to read excel file in PySpark

2023-06-20 Thread Bjørn Jørgensen
This is the pandas API on Spark: from pyspark import pandas as ps df = ps.read_excel("testexcel.xlsx") [image: image.png] This will convert it to PySpark: [image: image.png] On Tue, 20 June 2023 at 13:42, John Paul Jayme wrote: > Good day, > > > > I have a task to read excel files in databricks but I
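Spelled out, the flow in those screenshots is roughly (requires the openpyxl package; the file name is an example):

```python
from pyspark import pandas as ps

psdf = ps.read_excel("testexcel.xlsx")  # pandas-on-Spark DataFrame
sdf = psdf.to_spark()                   # convert to a regular PySpark DataFrame
sdf.show()
```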

Re: How to read excel file in PySpark

2023-06-20 Thread Sean Owen
It is indeed not part of SparkSession. See the link you cite. It is part of the pyspark pandas API On Tue, Jun 20, 2023, 5:42 AM John Paul Jayme wrote: > Good day, > > > > I have a task to read excel files in databricks but I cannot seem to > proceed. I am referencing the API documents -

How to read excel file in PySpark

2023-06-20 Thread John Paul Jayme
Good day, I have a task to read excel files in databricks but I cannot seem to proceed. I am referencing the API documents - read_excel, but there is an error: sparksession object has

Re: implement a distribution without shuffle like RDD.coalesce for DataSource V2 write

2023-06-18 Thread Mich Talebzadeh
OK, the number of partitions n, or more to the point the "optimum" number of partitions, depends on the size of your batch data DF among other things, and on the degree of parallelism at the endpoint where you will be writing to the sink. If you require high parallelism because your tasks are fine-grained, then

Re: implement a distribution without shuffle like RDD.coalesce for DataSource V2 write

2023-06-18 Thread Mich Talebzadeh
Is this the point you are trying to implement? I have a state data source which enables the state in SS (Structured Streaming) to be rewritten, which enables repartitioning, schema evolution, etc. via a batch query. The writer requires hash partitioning against the group key, with the "desired number of

implement a distribution without shuffle like RDD.coalesce for DataSource V2 write

2023-06-18 Thread Pengfei Li
Hi All, I'm developing a DataSource on Spark 3.2 to write data to our system, and using DataSource V2 API. I want to implement the interface RequiresDistributionAndOrdering

TAC Applications for Community Over Code North America and Asia now open

2023-06-16 Thread Gavin McDonald
Hi All, (This email goes out to all our user and dev project mailing lists, so you may receive this email more than once.) The Travel Assistance Committee has opened up applications to help get people to the following events: *Community Over Code Asia 2023 - * *August 18th to August 20th in

Fwd: iceberg queries

2023-06-15 Thread Gaurav Agarwal
Hi Team, Sample merge query: df.createOrReplaceTempView("source") MERGE INTO iceberg_hive_cat.iceberg_poc_db.iceberg_tab target USING (SELECT * FROM source) ON target.col1 = source.col1 -- this is my bucket column WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT * The source dataset

Re: Spark using iceberg

2023-06-15 Thread Gaurav Agarwal
> HI > > I am using spark with iceberg, updating the table with 1700 columns , > We are loading 0.6 Million rows from parquet files ,in future it will be > 16 Million rows and trying to update the data in the table which has 16 > buckets . > Using the default partitioner of spark .Also we don't do

Spark using iceberg

2023-06-15 Thread Gaurav Agarwal
Hi, I am using Spark with Iceberg, updating a table with 1700 columns. We are loading 0.6 million rows from parquet files; in future it will be 16 million rows. We are trying to update the data in the table, which has 16 buckets, using the default partitioner of Spark. Also we don't do any

Unsubscribe

2023-06-11 Thread Yu voidy

Re: [Feature Request] create *permanent* Spark View from DataFrame via PySpark

2023-06-09 Thread Wenchen Fan
DataFrame view stores the logical plan, while SQL view stores SQL text. I don't think we can support this feature until we have a reliable way to materialize logical plans. On Sun, Jun 4, 2023 at 10:31 PM Mich Talebzadeh wrote: > Try sending it to d...@spark.apache.org (and join that group) > >

Announcing the Community Over Code 2023 Streaming Track

2023-06-09 Thread James Hughes
Hi all, Community Over Code, the ASF conference, will be held in Halifax, Nova Scotia, October 7-10, 2023. The call for presentations is open now through July 13, 2023. I am one of the co-chairs for the

Re: Apache Spark not reading UTC timestamp from MongoDB correctly

2023-06-08 Thread Enrico Minack
Sean is right, casting timestamps to strings (which is what show() does) uses the local timezone: either the Java default zone `user.timezone`, the Spark default zone `spark.sql.session.timeZone`, or the default DataFrameWriter zone `timeZone` (when writing to file). You say you are in PST,
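A quick way to verify this: compare the rendered string (session-zone dependent) with the underlying epoch value (zone independent):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("tz-check").getOrCreate()
print(spark.conf.get("spark.sql.session.timeZone"))  # zone used by show()

df = spark.sql("SELECT timestamp'2023-06-08 10:30:00' AS ts")
df.select(
    F.col("ts"),                                   # rendered in session zone
    F.col("ts").cast("long").alias("epoch_secs"),  # zone-independent instant
).show(truncate=False)
```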

Re: Apache Spark not reading UTC timestamp from MongoDB correctly

2023-06-08 Thread Sean Owen
You sure it is not just that it's displaying in your local TZ? Check the actual value as a long for example. That is likely the same time. On Thu, Jun 8, 2023, 5:50 PM karan alang wrote: > ref : >

Apache Spark not reading UTC timestamp from MongoDB correctly

2023-06-08 Thread karan alang
ref : https://stackoverflow.com/questions/76436159/apache-spark-not-reading-utc-timestamp-from-mongodb-correctly Hello All, I've data stored in MongoDB collection and the timestamp column is not being read by Apache Spark correctly. I'm running Apache Spark on GCP Dataproc. Here is sample data :

Getting SparkRuntimeException: Unexpected value for length in function slice: length must be greater than or equal to 0

2023-06-06 Thread Bariudin, Daniel
I'm using PySpark (version 3.2) and I've encountered the following exception while trying to perform a slice on an array in a DataFrame: "org.apache.spark.SparkRuntimeException: Unexpected value for length in function slice: length must be greater than or equal to 0", but the length is greater than
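For reference, a minimal demonstration of slice and one way to guard a computed length so it never goes negative (the guard is a suggested workaround, not from the thread):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("slice-demo").getOrCreate()
df = spark.createDataFrame([([1, 2, 3, 4, 5],)], ["arr"])

# slice(arr, start, length): 1-based start; length must be >= 0.
df.select(F.slice("arr", 2, 3).alias("middle")).show()  # [2, 3, 4]

# Clamp a computed length at zero with greatest() to avoid the error.
df.select(
    F.slice("arr", 1, F.greatest(F.lit(0), F.size("arr") - 2)).alias("head")
).show()
```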

Re: [Feature Request] create *permanent* Spark View from DataFrame via PySpark

2023-06-04 Thread Mich Talebzadeh
Try sending it to d...@spark.apache.org (and join that group). You need to raise a JIRA for this request plus a related doc. Example JIRA: https://issues.apache.org/jira/browse/SPARK-42485, and the related Spark Project Improvement Proposal (SPIP) to be filled in

Re: [Feature Request] create *permanent* Spark View from DataFrame via PySpark

2023-06-04 Thread keen
Do Spark **devs** read this mailing list? Is there another/a better way to make feature requests? I tried in the past to write a mail to the dev mailing list but it did not show up at all. Cheers, keen. On Thu, 1 June 2023 at 07:11, keen wrote: > Hi all, > currently only *temporary* Spark Views can be
