Re: Rename columns without manually setting them all

2023-06-21 Thread Bjørn Jørgensen
data = { "Employee ID": [12345, 12346, 12347, 12348, 12349], "Name": ["Dummy x", "Dummy y", "Dummy z", "Dummy a", "Dummy b"], "Client": ["Dummy a", "Dummy b", "Dummy c", "Dummy d", "Dummy e"], "Project": ["abc", "def", "ghi", "jkl", "mno"], "Team": ["team a", "team b", "team

Re: Rename columns without manually setting them all

2023-06-21 Thread Farshid Ashouri
You can use selectExpr and stack to achieve the same effect in PySpark: df = spark.read.csv("your_file.csv", header=True, inferSchema=True) date_columns = [col for col in df.columns if '/' in col] df = df.selectExpr(["`Employee ID`", "`Name`", "`Client`", "`Project`", "`Team`"] +
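A minimal PySpark sketch of the selectExpr/stack approach above (the file name and fixed column names follow the thread; the rest is illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.csv("your_file.csv", header=True, inferSchema=True)
    date_columns = [c for c in df.columns if "/" in c]

    # stack() unpivots the date columns into (date, value) rows, so they never need renaming one by one
    stack_expr = "stack({}, {}) as (date, value)".format(
        len(date_columns),
        ", ".join("'{0}', `{0}`".format(c) for c in date_columns))

    long_df = df.selectExpr(["`Employee ID`", "`Name`", "`Client`", "`Project`", "`Team`", stack_expr])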

Re: How to read excel file in PySpark

2023-06-20 Thread Mich Talebzadeh
OK thanks for the info. Regards Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:*

Re: How to read excel file in PySpark

2023-06-20 Thread Bjørn Jørgensen
yes, p_df = DF.toPandas() is THE pandas, the one you know. Change p_df = DF.toPandas() to p_df = DF.pandas_on_spark() or p_df = DF.to_pandas_on_spark() or p_df = DF.pandas_api() or p_df = DF.to_koalas() https://spark.apache.org/docs/latest/api/python/migration_guide/koalas_to_pyspark.html
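A minimal sketch of the difference, assuming DF is an existing Spark DataFrame on Spark 3.2+:

    p_df = DF.toPandas()       # plain pandas: collects all rows to the driver
    ps_df = DF.pandas_api()    # pandas API on Spark: pandas-style methods, data stays distributed
    ps_df.head()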

Re: How to read excel file in PySpark

2023-06-20 Thread Mich Talebzadeh
OK thanks. So the issue seems to be creating a Pandas DF from a Spark DF (I do it for plotting with something like import matplotlib.pyplot as plt; p_df = DF.toPandas(); p_df.plt()). I guess that stays in the driver. Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir

Re: Shuffle data on pods which get decomissioned

2023-06-20 Thread Mich Talebzadeh
If one executor fails, it moves the processing over to another executor. However, if the data is lost, it re-executes the processing that generated the data, and might have to go back to the source. Does this mean that only those tasks that the dead executor was executing at the time need

Re: How to read excel file in PySpark

2023-06-20 Thread Sean Owen
No, a pandas on Spark DF is distributed. On Tue, Jun 20, 2023, 1:45 PM Mich Talebzadeh wrote: > Thanks but if you create a Spark DF from Pandas DF that Spark DF is not > distributed and remains on the driver. I recall a while back we had this > conversation. I don't think anything has changed.

Re: How to read excel file in PySpark

2023-06-20 Thread Mich Talebzadeh
Thanks but if you create a Spark DF from Pandas DF that Spark DF is not distributed and remains on the driver. I recall a while back we had this conversation. I don't think anything has changed. Happy to be corrected Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir

Re: How to read excel file in PySpark

2023-06-20 Thread Bjørn Jørgensen
Pandas API on Spark is an API so that users can use Spark as they use pandas. This was known as Koalas. Is this limitation still valid for Pandas? For pandas, yes. But what I did show was pandas API on Spark, so it's Spark. Additionally when we convert from Pandas DF to Spark DF, what process is

Re: How to read excel file in PySpark

2023-06-20 Thread Mich Talebzadeh
Whenever someone mentions Pandas I automatically think of it as an Excel sheet for Python. OK, my point below needs some qualification. Why Spark here? Generally, parallel architecture comes into play when the data size is significantly large and cannot be handled on a single machine; hence, the

Re: How to read excel file in PySpark

2023-06-20 Thread Bjørn Jørgensen
This is pandas API on Spark: from pyspark import pandas as ps; df = ps.read_excel("testexcel.xlsx") [image: image.png] this will convert it to pyspark [image: image.png] On Tue, 20 June 2023 at 13:42, John Paul Jayme wrote: > Good day, > I have a task to read excel files in databricks but I
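A minimal sketch of the same flow (reading .xlsx through the pandas API on Spark typically needs openpyxl available on the cluster; that dependency is an assumption here):

    from pyspark import pandas as ps

    psdf = ps.read_excel("testexcel.xlsx")   # pandas-on-Spark DataFrame
    sdf = psdf.to_spark()                    # convert to a regular PySpark DataFrame
    sdf.show()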

Re: How to read excel file in PySpark

2023-06-20 Thread Sean Owen
It is indeed not part of SparkSession. See the link you cite. It is part of the pyspark pandas API On Tue, Jun 20, 2023, 5:42 AM John Paul Jayme wrote: > Good day, > > > > I have a task to read excel files in databricks but I cannot seem to > proceed. I am referencing the API documents -

Re: implement a distribution without shuffle like RDD.coalesce for DataSource V2 write

2023-06-18 Thread Mich Talebzadeh
OK the number of partitions n or more to the point the "optimum" no of partitions depends on the size of your batch data DF among other things and the degree of parallelism at the end point where you will be writing to sink. If you require high parallelism because your tasks are fine grained, then

Re: implement a distribution without shuffle like RDD.coalesce for DataSource V2 write

2023-06-18 Thread Mich Talebzadeh
Is this the point you are trying to implement? I have state data source which enables the state in SS --> Structured Streaming to be rewritten, which enables repartitioning, schema evolution, etc via batch query. The writer requires hash partitioning against group key, with the "desired number of

Re: Spark using iceberg

2023-06-15 Thread Gaurav Agarwal
> Hi, > I am using Spark with Iceberg, updating a table with 1700 columns. > We are loading 0.6 million rows from parquet files; in future it will be > 16 million rows, and we are trying to update the data in the table, which has 16 > buckets, using the default partitioner of Spark. Also we don't do

Re: [Feature Request] create *permanent* Spark View from DataFrame via PySpark

2023-06-09 Thread Wenchen Fan
DataFrame view stores the logical plan, while SQL view stores SQL text. I don't think we can support this feature until we have a reliable way to materialize logical plans. On Sun, Jun 4, 2023 at 10:31 PM Mich Talebzadeh wrote: > Try sending it to d...@spark.apache.org (and join that group) > >

Re: Apache Spark not reading UTC timestamp from MongoDB correctly

2023-06-08 Thread Enrico Minack
Sean is right, casting timestamps to strings (which is what show() does) uses the local timezone, either the Java default zone `user.timezone`, the Spark default zone `spark.sql.session.timeZone` or the default DataFrameWriter zone `timeZone`(when writing to file). You say you are in PST,
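A minimal sketch for checking this, with 'ts' as a hypothetical timestamp column: pin the session time zone and compare the rendered string against the underlying epoch value.

    spark.conf.set("spark.sql.session.timeZone", "UTC")
    df.select("ts", df.ts.cast("long").alias("epoch_seconds")).show(truncate=False)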

Re: Apache Spark not reading UTC timestamp from MongoDB correctly

2023-06-08 Thread Sean Owen
You sure it is not just that it's displaying in your local TZ? Check the actual value as a long for example. That is likely the same time. On Thu, Jun 8, 2023, 5:50 PM karan alang wrote: > ref : >

Re: [Feature Request] create *permanent* Spark View from DataFrame via PySpark

2023-06-04 Thread Mich Talebzadeh
Try sending it to d...@spark.apache.org (and join that group). You need to raise a JIRA for this request plus the related docs. Example JIRA: https://issues.apache.org/jira/browse/SPARK-42485 and the related *Spark project improvement proposal (SPIP)* to be filled in

Re: [Feature Request] create *permanent* Spark View from DataFrame via PySpark

2023-06-04 Thread keen
Do Spark **devs** read this mailing list? Is there another/a better way to make feature requests? I tried in the past to write a mail to the dev mailing list but it did not show at all. Cheers keen schrieb am Do., 1. Juni 2023, 07:11: > Hi all, > currently only *temporary* Spark Views can be

Re: ChatGPT and prediction of Spark future

2023-06-01 Thread Mich Talebzadeh
Great stuff Winston. I added a channel in Slack Community for Spark https://sparkcommunitytalk.slack.com/archives/C05ACMS63RT cheers Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited London United Kingdom view my Linkedin profile

Re: Viewing UI for spark jobs running on K8s

2023-05-31 Thread Qian Sun
Hi Nikhil Spark operator supports ingress for exposing all UIs of running spark applications. reference: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/quick-start-guide.md#driver-ui-access-and-ingress On Thu, Jun 1, 2023 at 6:19 AM Nikhil Goyal wrote: > Hi

Re: ChatGPT and prediction of Spark future

2023-05-31 Thread Winston Lai
Hi Mich, I have been using ChatGPT free version, Bing AI, Google Bard and other AI chatbots. My use cases so far include writing, debugging code, generating documentation and explanation on Spark key terminologies for beginners to quickly pick up new concepts, summarizing pros and cons or

Re: [Spark Structured Streaming]: Dynamic Scaling of Executors

2023-05-29 Thread Aishwarya Panicker
Hi, Thanks for your response. I understand there is no explicit way to configure dynamic scaling for Spark Structured Streaming as the ticket is still open for that. But is there a way to manage dynamic scaling with the existing Batch Dynamic scaling algorithm as this kicks in when Dynamic

Re: Re: maven with Spark 3.4.0 fails compilation

2023-05-29 Thread Bjørn Jørgensen
<https://github.com/apache/spark/blob/88f69d6f92860823b1a90bc162ebca2b7c8132fc/pom.xml#L170>. Since you are using spark-core_2.13 and spark-sql_2.13, you should stick to the major(13) and the minor version(8). Not using any of these may cause unex

Re: Re: maven with Spark 3.4.0 fails compilation

2023-05-29 Thread Mich Talebzadeh
upgrade of scala itself.). And although I did not encounter such a problem, this <https://stackoverflow.com/a/26411339/19476830> can be a pitfall for you. -- Best Regards! ...

Re: Re: maven with Spark 3.4.0 fails compilation

2023-05-29 Thread Bjørn Jørgensen
/26411339/19476830> can be a pitfall for you. -- Best Regards! ... Lingzhe Sun, Hirain Technology. *From:* Mich Talebzadeh *Date:* 2023-05-29 17:55 *To:* Bjørn Jørgense

Re: JDK version support information

2023-05-29 Thread Sean Owen
Per the docs, it is Java 8. It's possible Java 11 partly works with 2.x, but it is not supported. But then again, 2.x is not supported either. On Mon, May 29, 2023, 6:43 AM Poorna Murali wrote: > We are currently using JDK 11 and spark 2.4.5.1 is working fine with that. > So, we wanted to check the maximum

Re: JDK version support information

2023-05-29 Thread Poorna Murali
We are currently using JDK 11 and spark 2.4.5.1 is working fine with that. So, we wanted to check the maximum JDK version supported for 2.4.5.1. On Mon, 29 May, 2023, 5:03 pm Aironman DirtDiver, wrote: > Spark version 2.4.5.1 is based on Apache Spark 2.4.5. According to the > official Spark

Re: JDK version support information

2023-05-29 Thread Aironman DirtDiver
Spark version 2.4.5.1 is based on Apache Spark 2.4.5. According to the official Spark documentation for version 2.4.5, the maximum supported JDK (Java Development Kit) version is JDK 8 (Java 8). Spark 2.4.5 is not compatible with JDK versions higher than Java 8. Therefore, you should use JDK 8 to

Re: Re: maven with Spark 3.4.0 fails compilation

2023-05-29 Thread Lingzhe Sun
... Lingzhe Sun, Hirain Technology. From: Mich Talebzadeh Date: 2023-05-29 17:55 To: Bjørn Jørgensen CC: user @spark Subject: Re: maven with Spark 3.4.0 fails compilation. Thanks for your helpful comments Bjorn. I managed to compile the code

Re: maven with Spark 3.4.0 fails compilation

2023-05-29 Thread Mich Talebzadeh
Thanks for your helpful comments Bjorn. I managed to compile the code with Maven but when it runs it fails with: Application is ReduceByKey Exception in thread "main" java.lang.NoSuchMethodError: scala.package$.Seq()Lscala/collection/immutable/Seq$; at

Re: maven with Spark 3.4.0 fails compilation

2023-05-28 Thread Bjørn Jørgensen
From chatgpt4: The problem appears to be that there is a mismatch between the version of Scala used by the Scala Maven plugin and the version of the Scala library defined as a dependency in your POM. You've defined your Scala version in your properties as `2.12.17` but you're pulling in

Re: [Spark Structured Streaming]: Dynamic Scaling of Executors

2023-05-25 Thread Mich Talebzadeh
Hi, Autoscaling is not compatible with Spark Structured Streaming since Spark Structured Streaming currently does not support dynamic allocation (see SPARK-24815: Structured Streaming should support dynamic

Re: [MLlib] how-to find implementation of Decision Tree Regressor fit function

2023-05-25 Thread Sean Owen
Are you looking for https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala On Thu, May 25, 2023 at 6:54 AM Max wrote: > Good day, I'm working on an implementation of Joint Probability Trees > (JPT) using the Spark framework. For this

Re: Incremental Value dependents on another column of Data frame Spark

2023-05-24 Thread Enrico Minack
Hi, given your dataset: val df=Seq( (1, 20230523, "M01"), (2, 20230523, "M01"), (3, 20230523, "M01"), (4, 20230523, "M02"), (5, 20230523, "M02"), (6, 20230523, "M02"), (7, 20230523, "M01"), (8, 20230523, "M01"), (9, 20230523, "M02"), (10, 20230523, "M02"), (11, 20230523, "M02"), (12,

Re: Incremental Value dependents on another column of Data frame Spark

2023-05-23 Thread Raghavendra Ganesh
Given, you are already stating the above can be imagined as a partition, I can think of mapPartitions iterator. val inputSchema = inputDf.schema val outputRdd = inputDf.rdd.mapPartitions(rows => new SomeClass(rows)) val outputDf = sparkSession.createDataFrame(outputRdd,
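A rough PySpark equivalent of the mapPartitions idea; the 'marker' column and the increment rule are illustrative assumptions, not the poster's exact logic:

    def add_incremental(rows):
        counter, prev = 0, None
        for row in rows:
            if row["marker"] != prev:   # bump the counter whenever the driving column changes
                counter += 1
                prev = row["marker"]
            yield (*row, counter)

    out_rdd = input_df.rdd.mapPartitions(add_incremental)
    out_df = spark.createDataFrame(out_rdd, input_df.schema.add("inc_value", "long"))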

Re: Shuffle with Window().partitionBy()

2023-05-23 Thread ashok34...@yahoo.com.INVALID
Thanks, great Rauf. Regards. On Tuesday, 23 May 2023 at 13:18:55 BST, Rauf Khan wrote: Hi, PartitionBy() is analogous to group by; all rows that have the same value in the specified column will form one window. The data will be shuffled to form groups. Regards, Raouf. On Fri, May 12,

Re: Shuffle with Window().partitionBy()

2023-05-23 Thread Rauf Khan
Hi, PartitionBy() is analogous to group by; all rows that have the same value in the specified column will form one window. The data will be shuffled to form groups. Regards, Raouf. On Fri, May 12, 2023, 18:48 ashok34...@yahoo.com.INVALID wrote: > Hello, > > In Spark windowing does call
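A minimal PySpark sketch of the point above ('dept' and 'salary' are hypothetical columns): all rows with the same dept are shuffled into one window before the aggregate runs over it.

    from pyspark.sql import Window, functions as F

    w = Window.partitionBy("dept")
    df2 = df.withColumn("dept_total", F.sum("salary").over(w))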

Re: Re: [spark-core] Can executors recover/reuse shuffle files upon failure?

2023-05-22 Thread Mich Talebzadeh
here first. My thoughts: Spark replicates the partitions among multiple nodes. If one executor fails, it moves the processing over to the other executor. However, if the data is lost, it re-executes the processing that generated the data, and might have to go back to

Re: Re: [spark-core] Can executors recover/reuse shuffle files upon failure?

2023-05-22 Thread Mich Talebzadeh
Hi Maksym. Let us understand the basics here first. My thoughts: Spark replicates the partitions among multiple nodes. If one executor fails, it moves the processing over to the other executor. However, if the data is lost, it re-executes the processing that generated the data, and might have to go

RE: Re: [spark-core] Can executors recover/reuse shuffle files upon failure?

2023-05-22 Thread Maksym M
Hey vaquar, The link doesn't explain the crucial detail we're interested in: does the executor reuse the data that exists on a node from a previous executor and, if not, how can we configure it to do so? We are not running on Kubernetes, so EKS/Kubernetes-specific advice isn't very relevant. We

Re: Spark shuffle and inevitability of writing to Disk

2023-05-17 Thread Mich Talebzadeh
Ok, I did a bit of a test that shows that the shuffle does spill to memory then to disk if my assertion is valid. The sample code I wrote is as follows: import sys from pyspark.sql import SparkSession from pyspark import SparkContext from pyspark.sql import SQLContext from pyspark.sql import

Re: [spark-core] Can executors recover/reuse shuffle files upon failure?

2023-05-17 Thread vaquar khan
; about to get evicted, a new host is created and the EBS volume is attached to it. When Spark assigns a new executor to the newly created instance, it basically can recover all the shuffle files that are already persisted in the migrated EBS volume

RE: Understanding Spark S3 Read Performance

2023-05-16 Thread info
Hi, For clarification, are those 12 / 14 minutes cumulative CPU time or wall clock time? How many executors executed those 1 / 375 tasks? Cheers, Enrico. Original message: From: Shashank Rao Date: 16.05.23 19:48 (GMT+01:00) To: user@spark.apache.org Subject:

Re: [spark-core] Can executors recover/reuse shuffle files upon failure?

2023-05-15 Thread Mich Talebzadeh
ration which basically means if a host is about to get evicted, a new host is created and the EBS volume is attached to it. When Spark assigns a new executor to the newly created instance, it basically can recover all the shuffle files that are already persisted in th

Re: Error while merge in delta table

2023-05-12 Thread Farhan Misarwala
Hi Karthick, If you have confirmed that the incompatibility between Delta and spark versions is not the case, then I would say the same what Jacek said earlier, there’s not enough “data” here. To further comment on it, we would need to know more on how you are structuring your multi threaded

Re: Error while merge in delta table

2023-05-12 Thread Karthick Nk
Hi Farhan, Thank you for your response. I am using Databricks with 11.3.x-scala2.12. Here I am overwriting all the tables in the same database in concurrent threads, but when I do it in an iterative manner it works fine. For example, I have 200 tables in the same database and I am overwriting the

Re: Error while merge in delta table

2023-05-11 Thread Farhan Misarwala
Hi Karthick, I think I have seen this before and this probably could be because of an incompatibility between your spark and delta versions. Or an incompatibility between the delta version you are using now vs the one you used earlier on the existing table you are merging with. Let me know if

Re: Error while merge in delta table

2023-05-11 Thread Jacek Laskowski
Hi Karthick, Sorry to say it but there's not enough "data" to help you. There should be something more above or below this exception snippet you posted that could pinpoint the root cause. Regards, Jacek Laskowski, "The Internals Of" Online Books. Follow me on

RE: Can Spark SQL (not DataFrame or Dataset) aggregate array into map of element of count?

2023-05-10 Thread Vijay B
Please see if this works -- aggregate array into map of element of count SELECT aggregate(array(1,2,3,4,5), map('cnt',0), (acc,x) -> map('cnt', acc.cnt+1)) as array_count thanks Vijay On 2023/05/05 19:32:04 Yong Zhang wrote: > Hi, This is on Spark 3.1 environment. > > For some reason, I can
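The same pure-SQL expression run from PySpark via spark.sql; map values are read with acc['cnt'] here as a precaution, since bracket access is the documented way to index a map (dot access is for structs):

    spark.sql("""
        SELECT aggregate(array(1,2,3,4,5),
                         map('cnt', 0),
                         (acc, x) -> map('cnt', acc['cnt'] + 1)) AS array_count
    """).show(truncate=False)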

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-09 Thread Mich Talebzadeh
rule "The DAG overlaps wont run several times for one action" seems >>>>> not to be apocryphal. If you can shed some light on this matter I would >>>>> appreciate it >>>>> >>>>> @weiruanl...@gmail.com My datasets are

Re: Can Spark SQL (not DataFrame or Dataset) aggregate array into map of element of count?

2023-05-09 Thread Yong Zhang
ve code won't work in Spark SQL. * As I said, I am NOT running in either a Scala or PySpark session, but in pure Spark SQL. * Is it possible to do the above logic in Spark SQL, without using "exploding"? Thanks. From: Mich Talebzadeh Sent: Saturda

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-09 Thread Nitin Siwach
However, in my case here I am calling just one action. Within the purview of one action Spark should not rerun the overlapping parts of the DAG. I do not understand why the file scan is happening several times. I ca

Re: Can Spark SQL (not DataFrame or Dataset) aggregate array into map of element of count?

2023-05-09 Thread Yong Zhang
in Spark SQL. * As I said, I am NOT running in either a Scala or PySpark session, but in pure Spark SQL. * Is it possible to do the above logic in Spark SQL, without using "exploding"? Thanks. From: Mich Talebzadeh Sent: Saturday, May 6, 2023

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-09 Thread Mich Talebzadeh
arising from such loss, damage or destruction. On Sun, 7 May 2023 at 14:13, Nitin Siwach wrote:

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-07 Thread Nitin Siwach
On Sun, May 7, 2023 at 12:23 PM Winston Lai wrote: When your memory is not sufficient to keep the cached data for your jobs in two different stages, it might be read twice because Spark might

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-07 Thread Mich Talebzadeh
Spark write your data from memory to disk. One way to check is to read the Spark UI. When Spark caches the data, you will see a little green dot connected to the blue rectangle in the Spark UI. If you see this green dot twice on your two stages, likely S

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-07 Thread Nitin Siwach
ely Spark spill the data after your first job and read it again in the second run. You can also confirm it in other metrics from the Spark UI. That is my personal understanding based on what I have read and seen on my job runs. If there is any mistake, feel free to correct me.

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-07 Thread Mich Talebzadeh
read it again in the second run. You can also confirm it in other metrics from the Spark UI. That is my personal understanding based on what I have read and seen on my job runs. If there is any mistake, feel free to correct me. Thank You & Best

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-07 Thread Nitin Siwach
*From:* Nitin Siwach *Sent:* Sunday, May 7, 2023 12:22:32 PM *To:* Vikas Kumar *Cc:* User *Subject:* Re: Does spark read the same file twice, if two stages are using the same DataFrame? Thank you tons, Vikas :). That makes so much sens

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-07 Thread Winston Lai
2 PM To: Vikas Kumar Cc: User Subject: Re: Does spark read the same file twice, if two stages are using the same DataFrame? Thank you tons, Vikas :). That makes so much sense now. I'm in the learning phase and was just browsing through various concepts of Spark with self-made small examples. It didn't

Re: Can Spark SQL (not DataFrame or Dataset) aggregate array into map of element of count?

2023-05-06 Thread Mich Talebzadeh
You can create a DF from your SQL result set and work with that in Python the way you want. ## you don't need all these: import findspark findspark.init() from pyspark.sql import SparkSession from pyspark import SparkContext from pyspark.sql import SQLContext from pyspark.sql.functions import udf, col,

Re: Write DataFrame with Partition and choose Filename in PySpark

2023-05-06 Thread Mich Talebzadeh
So what are you intending to do with the resultset produced? Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited London United Kingdom view my Linkedin profile

Re: Write DataFrame with Partition and choose Filename in PySpark

2023-05-05 Thread Marco Costantini
Hi Mich, Thank you. Ah, I want to avoid bringing all data to the driver node. That is my understanding of what will happen in that case. Perhaps, I'll trigger a Lambda to rename/combine the files after PySpark writes them. Cheers, Marco. On Thu, May 4, 2023 at 5:25 PM Mich Talebzadeh wrote: >

Re: Write DataFrame with Partition and choose Filename in PySpark

2023-05-04 Thread Mich Talebzadeh
you can try df2.coalesce(1).write.mode("overwrite").json("/tmp/pairs.json") hdfs dfs -ls /tmp/pairs.json Found 2 items -rw-r--r-- 3 hduser supergroup 0 2023-05-04 22:21 /tmp/pairs.json/_SUCCESS -rw-r--r-- 3 hduser supergroup 96 2023-05-04 22:21

Re: Write DataFrame with Partition and choose Filename in PySpark

2023-05-04 Thread Marco Costantini
Hi Mich, Thank you. Are you saying this satisfies my requirement? On the other hand, I am smelling something going on. Perhaps the Spark 'part' files should not be thought of as files, but rather pieces of a conceptual file. If that is true, then your approach (of which I'm well aware) makes

Re: Write DataFrame with Partition and choose Filename in PySpark

2023-05-04 Thread Mich Talebzadeh
AWS S3, or Google gs are hadoop compatible file systems (HCFS) , so they do sharding to improve read performance when writing to HCFS file systems. Let us take your code for a drive import findspark findspark.init() from pyspark.sql import SparkSession from pyspark.sql.functions import struct

Re: Write custom JSON from DataFrame in PySpark

2023-05-04 Thread Marco Costantini
Hi Enrico, What a great answer. Thank you. Seems like I need to get comfortable with the 'struct' and then I will be golden. Thank you again, friend. Marco. On Thu, May 4, 2023 at 3:00 AM Enrico Minack wrote: > Hi, > > You could rearrange the DataFrame so that writing the DataFrame as-is >

Re: Write custom JSON from DataFrame in PySpark

2023-05-04 Thread Enrico Minack
Hi, You could rearrange the DataFrame so that writing the DataFrame as-is produces your structure: df = spark.createDataFrame([(1, "a1"), (2, "a2"), (3, "a3")], "id int, datA string") +---+----+ | id|datA| +---+----+ |  1|  a1| |  2|  a2| |  3|  a3| +---+----+ df2 = df.select(df.id,
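A minimal PySpark sketch completing the idea above: nesting datA under a struct changes the shape of the JSON that write.json produces (the output path is hypothetical).

    from pyspark.sql import functions as F

    df = spark.createDataFrame([(1, "a1"), (2, "a2"), (3, "a3")], "id int, datA string")
    df2 = df.select("id", F.struct(F.col("datA")).alias("data"))
    df2.write.mode("overwrite").json("/tmp/custom_json")   # rows come out as {"id":1,"data":{"datA":"a1"}}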

Re: How to determine the function of tasks on each stage in an Apache Spark application?

2023-05-02 Thread Trường Trần Phan An
Hi all, I have written a program and overridden two events onStageCompleted and onTaskEnd. However, these two events do not provide information on when a Task/Stage is completed. What I want to know is which Task corresponds to which stage of a DAG (the Spark history server only tells me how

Re: Change column values using several when conditions

2023-05-01 Thread Bjørn Jørgensen
you can check if the value exists by using distinct before you loop over the dataset. On Mon, 1 May 2023 at 10:38, marc nicole wrote: > Hello > > I want to change values of a column in a dataset according to a mapping > list that maps original values of that column to other new values. Each >
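A minimal sketch of the when-chain being discussed; the 'status' column and the mapping dict are hypothetical:

    from pyspark.sql import functions as F

    mapping = {"old_a": "new_a", "old_b": "new_b"}
    expr = None
    for old, new in mapping.items():
        cond = F.col("status") == old
        expr = F.when(cond, new) if expr is None else expr.when(cond, new)
    df = df.withColumn("status", expr.otherwise(F.col("status")))   # unmapped values keep their original value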

Re: Tensorflow on Spark CPU

2023-04-30 Thread Sean Owen
_co...@yahoo.com> wrote: > I re-tested with the cifar10 example and below is the result. Can you advise why a lower num_slots is faster compared with more slots? > num_slots=20: 231 seconds > num_slots=5: 52 seconds > num_slot=1: 34 seco

Re: Tensorflow on Spark CPU

2023-04-30 Thread second_co...@yahoo.com.INVALID
I re-tested with the cifar10 example and below is the result. Can you advise why a lower num_slots is faster compared with more slots? num_slots=20: 231 seconds; num_slots=5: 52 seconds; num_slot=1: 34 seconds. The code is at https://gist.github.com/cometta/240bbc549155e22f80f6ba670c9a2e32 Do you

Re: Tensorflow on Spark CPU

2023-04-29 Thread Sean Owen
You don't want to use CPUs with Tensorflow. If it's not scaling, you may have a problem that is far too small to distribute. On Sat, Apr 29, 2023 at 7:30 AM second_co...@yahoo.com.INVALID wrote: > Anyone successfully run native tensorflow on Spark ? i tested example at >

Re: ***pyspark.sql.functions.monotonically_increasing_id()***

2023-04-28 Thread Winston Lai
Hi Karthick, A few points that may help you: As stated in the URL you posted, "The function is non-deterministic because its result depends on partition IDs." Hence, the generated ID is dependent on partition IDs. Based on the code snippet you provided, I didn't see the partition columns you
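A minimal sketch of the trade-off: monotonically_increasing_id is unique but partition-dependent, while row_number over an ordered window is consecutive at the cost of a global sort ('event_time' is a hypothetical ordering column).

    from pyspark.sql import Window, functions as F

    df = df.withColumn("mono_id", F.monotonically_increasing_id())   # unique, not consecutive, depends on partition IDs
    w = Window.orderBy("event_time")
    df = df.withColumn("row_id", F.row_number().over(w))             # consecutive, but shuffles everything into one partition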

Re: config: minOffsetsPerTrigger not working

2023-04-27 Thread Abhishek Singla
Thanks, Mich for acknowledging. Yes, I am providing the checkpoint path. I omitted it here in the code snippet. I believe this is due to Spark version 3.1.x; this config is only available in versions 3.2.x and above. On Thu, Apr 27, 2023 at 9:26 PM Mich Talebzadeh wrote: > Is this all of your

Re: config: minOffsetsPerTrigger not working

2023-04-27 Thread Mich Talebzadeh
Is this all of your writeStream? df.writeStream() .foreachBatch(new KafkaS3PipelineImplementation(applicationId, appConfig)) .start() .awaitTermination(); What happened to the checkpoint location? option('checkpointLocation', checkpoint_path). example checkpoint_path =
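A minimal PySpark sketch of the same writeStream shape with the checkpoint option spelled out; the batch handler and path are hypothetical:

    (df.writeStream
        .foreachBatch(process_batch)   # process_batch(batch_df, batch_id) defined elsewhere
        .option("checkpointLocation", "s3a://my-bucket/checkpoints/kafka-s3-pipeline")
        .start()
        .awaitTermination())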

Re: What is the best way to organize a join within a foreach?

2023-04-27 Thread Amit Joshi
Hi Marco, I am not sure if you will get access to the data frame inside the foreach, as the Spark context used to be non-serializable, if I remember correctly. One thing you can do: use the cogroup operation on both datasets. This will give you (key, iter(v1), iter(v2)). And then use for each partition

RE: Spark Kubernetes Operator

2023-04-26 Thread Aldo Culquicondor
We are welcoming contributors, as announced in the Kubernetes WG Batch https://docs.google.com/document/d/1XOeUN-K0aKmJJNq7H07r74n-mGgSFyiEDQ3ecwsGhec/edit#bookmark=id.gfgjt0nmbgjl If you are interested, you can find us in slack.k8s.io #wg-batch or ping @mwielgus on github/slack. Thanks On

Re: What is the best way to organize a join within a foreach?

2023-04-26 Thread Mich Talebzadeh
Again, one try is worth many opinions. Try it, gather metrics from the Spark UI and see how it performs. On Wed, 26 Apr 2023 at 14:57, Marco Costantini < marco.costant...@rocketfncl.com> wrote: > Thanks team, > Email was just an example. The point was to illustrate that some actions > could be

Re: What is the best way to organize a join within a foreach?

2023-04-26 Thread Marco Costantini
Thanks team, Email was just an example. The point was to illustrate that some actions could be chained using Spark's foreach. In reality, this is an S3 write and a Kafka message production, which I think is quite reasonable for spark to do. To answer Ayan's first question. Yes, all a users

Re: What is the best way to organize a join within a foreach?

2023-04-26 Thread Mich Talebzadeh
Indeed very valid points by Ayan. How is email going to handle 1000s of records? As a solution architect I tend to replace users by customers, and for each order there must be products, a sort of many-to-many relationship. If I were a customer I would also be interested in product details as

Re: What is the best way to organize a join within a foreach?

2023-04-26 Thread ayan guha
Adding to what Mitch said, 1. Are you trying to send statements of all orders to all users? Or the latest order only? 2. Sending email is not a good use of spark. instead, I suggest to use a notification service or function. Spark should write to a queue (kafka, sqs...pick your choice here).

Re: What is the best way to organize a join within a foreach?

2023-04-26 Thread Mich Talebzadeh
Well OK, in a nutshell you want the result set for every user prepared and emailed to that user, right? This is a form of ETL where those result sets need to be posted somewhere. Say you create a table based on the result set prepared for each user. You may have many raw target tables at the end of

Re: What is the best way to organize a join within a foreach?

2023-04-25 Thread Marco Costantini
Hi Mich, First, thank you for that. Great effort put into helping. Second, I don't think this tackles the technical challenge here. I understand the windowing as it serves those ranks you created, but I don't see how the ranks contribute to the solution. Third, the core of the challenge is about

Re: unsubscribe

2023-04-25 Thread santhosh Gandhe
To remove your address from the list, send a message to: On Mon, Apr 24, 2023 at 10:41 PM wrote: > unsubscribe

Re: What is the best way to organize a join within a foreach?

2023-04-25 Thread Mich Talebzadeh
Hi Marco, First thoughts. foreach() is an action operation that iterates/loops over each element in the dataset, meaning it is cursor based. That is different from operating over the dataset as a set, which is far more efficient. So in your case, if I understand it correctly, you want to get order

Re: What is the best way to organize a join within a foreach?

2023-04-25 Thread Marco Costantini
Thanks Mich, Great idea. I have done it. Those files are attached. I'm interested to know your thoughts. Let's imagine this same structure, but with huge amounts of data as well. Please and thank you, Marco. On Tue, Apr 25, 2023 at 12:12 PM Mich Talebzadeh wrote: > Hi Marco, > > Let us start

Re: What is the best way to organize a join within a foreach?

2023-04-25 Thread Mich Talebzadeh
Hi Marco, Let us start simple, Provide a csv file of 5 rows for the users table. Each row has a unique user_id and one or two other columns like fictitious email etc. Also for each user_id, provide 10 rows of orders table, meaning that orders table has 5 x 10 rows for each user_id. both as

Re: What is the best way to organize a join within a foreach?

2023-04-25 Thread Marco Costantini
Thanks Mich, I have not but I will certainly read up on this today. To your point that all of the essential data is in the 'orders' table; I agree! That distills the problem nicely. Yet, I still have some questions on which someone may be able to shed some light. 1) If my 'orders' table is very

Re: What is the best way to organize a join within a foreach?

2023-04-25 Thread Mich Talebzadeh
Have you thought of using windowing functions to achieve this? Effectively all your information is in the orders table. HTH Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited London United

Re: Use Spark Aggregator in PySpark

2023-04-24 Thread Enrico Minack
Hi, For an aggregating UDF, use spark.udf.registerJavaUDAF(name, className). Enrico. On 23.04.23 at 23:42, Thomas Wang wrote: Hi Spark Community, I have implemented a custom Spark Aggregator (a subclass of org.apache.spark.sql.expressions.Aggregator). Now I'm trying to use it in a
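A minimal sketch of registering a JVM-side aggregate function from PySpark; the class name, SQL name, table and columns are hypothetical, and the compiled class must already be on the Spark classpath:

    spark.udf.registerJavaUDAF("my_agg", "com.example.MyAggregator")
    spark.sql("SELECT group_col, my_agg(value_col) FROM my_table GROUP BY group_col").show()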

Re: Spark Aggregator with ARRAY input and ARRAY output

2023-04-23 Thread Thomas Wang
Thanks Raghavendra, Could you be more specific about how I can use ExpressionEncoder()? More specifically, how can I conform to the return type of Encoder>? Thomas On Sun, Apr 23, 2023 at 9:42 AM Raghavendra Ganesh wrote: > For simple array types setting encoder to ExpressionEncoder() should

Re: Spark Aggregator with ARRAY input and ARRAY output

2023-04-23 Thread Raghavendra Ganesh
For simple array types setting encoder to ExpressionEncoder() should work. -- Raghavendra On Sun, Apr 23, 2023 at 9:20 PM Thomas Wang wrote: > Hi Spark Community, > > I'm trying to implement a custom Spark Aggregator (a subclass to > org.apache.spark.sql.expressions.Aggregator). Correct me if

Re: Partition by on dataframe causing a Sort

2023-04-20 Thread Nikhil Goyal
Is it possible to use MultipleOutputs and define a custom OutputFormat and then use `saveAsHadoopFile` to be able to achieve this? On Thu, Apr 20, 2023 at 1:29 PM Nikhil Goyal wrote: > Hi folks, > > We are writing a dataframe and doing a partitionby() on it. >

Re: [Spark on SBT] Executor just keeps running

2023-04-18 Thread Dhruv Singla
You can reproduce the behavior in ordinary Scala code if you keep reduce in an object outside the main method. Hope it might help. On Mon, Apr 17, 2023 at 10:22 PM Dhruv Singla wrote: > Hi Team > I was trying to run spark using `sbt console` on the terminal. I am > able to build the

Re: Spark Multiple Hive Metastore Catalog Support

2023-04-17 Thread Ankit Gupta
Thanks Elliot ! Let me check it out ! On Mon, 17 Apr, 2023, 10:08 pm Elliot West, wrote: > Hi Ankit, > > While not a part of Spark, there is a project called 'WaggleDance' that > can federate multiple Hive metastores so that they are accessible via a > single URI:

Re: Spark Multiple Hive Metastore Catalog Support

2023-04-17 Thread Cheng Pan
There is a DSv2-based Hive connector in Apache Kyuubi[1] that supports connecting multiple HMS in a single Spark application. Some limitations - currently only supports Spark 3.3 - has a known issue when using w/ `spark-sql`, but OK w/ spark-shell and normal jar-based Spark application. [1]

Re: Spark Multiple Hive Metastore Catalog Support

2023-04-17 Thread Elliot West
Hi Ankit, While not a part of Spark, there is a project called 'WaggleDance' that can federate multiple Hive metastores so that they are accessible via a single URI: https://github.com/ExpediaGroup/waggle-dance This may be useful or perhaps serve as inspiration. Thanks, Elliot. On Mon, 17 Apr
