Re: [Spark Streaming]: Save the records that are dropped by watermarking in spark structured streaming

2024-05-08 Thread Mich Talebzadeh
You may consider: - Increase Watermark Retention: increase the watermark retention duration. This allows keeping records for a longer period before dropping them. However, this might increase processing latency and violate at-least-once semantics if the watermark lags behind real-time.
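For illustration, a minimal PySpark sketch of widening the watermark delay on a streaming DataFrame (streaming_df, event_time and device_id are assumed names, not taken from the original thread):

    from pyspark.sql import functions as F

    # a longer watermark keeps late records around longer before they are dropped
    windowed_counts = (streaming_df
        .withWatermark("event_time", "2 hours")      # e.g. raised from "10 minutes"
        .groupBy(F.window("event_time", "5 minutes"), "device_id")
        .count())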

Re: ********Spark streaming issue to Elastic data**********

2024-05-06 Thread Mich Talebzadeh
Hi Karthick, Unfortunately materialised views are not available in Spark yet. I raised Jira [SPARK-48117] Spark Materialized Views: Improve Query Performance and Data Management - ASF JIRA (apache.org) as a feature request. Let me think

Re: ********Spark streaming issue to Elastic data**********

2024-05-06 Thread Karthick Nk
Thanks Mich, can you please confirm whether my understanding is correct? First, we have to create the materialized view based on the mapping details we have, using multiple tables as the source (since we have multiple join conditions from different tables). From the materialised view we can stream the

Re: Issue with Materialized Views in Spark SQL

2024-05-03 Thread Mich Talebzadeh
Sadly, it sounds like Apache Spark has nothing to do with materialised views. I was hoping it could read it! >>> *spark.sql("SELECT * FROM test.mv ").show()* Traceback (most recent call last): File "", line 1, in File "/opt/spark/python/pyspark/sql/session.py", line 1440, in

Re: ********Spark streaming issue to Elastic data**********

2024-05-03 Thread Mich Talebzadeh
My recommendation: using materialized views (MVs) created in Hive with Spark Structured Streaming and Change Data Capture (CDC) is a good combination for efficiently streaming view data updates in your scenario. HTH Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI |

Re: Issue with Materialized Views in Spark SQL

2024-05-03 Thread Mich Talebzadeh
Thanks for the comments I received. So in summary, Apache Spark itself doesn't directly manage materialized views (MVs), but it can work with them through integration with the underlying data storage systems like Hive or through Iceberg. I believe Databricks, through Unity Catalog, supports MVs as

Re: Issue with Materialized Views in Spark SQL

2024-05-02 Thread Jungtaek Lim
(removing dev@ as I don't think this is a dev@-related thread but more of a "question") My understanding is that Apache Spark does not support Materialized Views. That's all. IMHO it's not a proper expectation that all operations in Apache Hive will be supported in Apache Spark. They are different

Re: Issue with Materialized Views in Spark SQL

2024-05-02 Thread Walaa Eldin Moustafa
I do not think the issue is with DROP MATERIALIZED VIEW only, but also with CREATE MATERIALIZED VIEW, because neither is supported in Spark. I guess you must have created the view from Hive and are trying to drop it from Spark, and that is why you are running into the issue with DROP first. There is

Re: [spark-graphframes]: Generating incorrect edges

2024-05-01 Thread Mich Talebzadeh
Hi Steve, Thanks for your statement. I tend to use uuid myself to avoid collisions. This built-in function generates random IDs that are highly likely to be unique across systems. My concerns are on edge so to speak. If the Spark application runs for a very long time or encounters restarts, the

Re: [spark-graphframes]: Generating incorrect edges

2024-04-30 Thread Stephen Coy
Hi Mich, I was just reading random questions on the user list when I noticed that you said: On 25 Apr 2024, at 2:12 AM, Mich Talebzadeh wrote: 1) You are using monotonically_increasing_id(), which is not collision-resistant in distributed environments like Spark. Multiple hosts can

Re: spark.sql.shuffle.partitions=auto

2024-04-30 Thread Mich Talebzadeh
spark.sql.shuffle.partitions=auto does not work because Apache Spark itself does not build clusters. This configuration option is specific to Databricks and their managed Spark offering; it allows Databricks to automatically determine an optimal number of shuffle partitions for your workload. HTH Mich Talebzadeh,
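For reference, the closest open-source analogue is Adaptive Query Execution coalescing shuffle partitions at runtime; a hedged sketch (values are illustrative only, and this is not the same mechanism as the Databricks "auto" setting):

    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
    # AQE starts from this number and merges small partitions after each shuffle
    spark.conf.set("spark.sql.shuffle.partitions", "400")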

Re: Spark on Kubernetes

2024-04-30 Thread Mich Talebzadeh
Hi, In k8s the driver is responsible for executor creation. The likely cause of your problem is insufficient memory allocated for executors in the K8s cluster. Even with dynamic allocation, k8s won't schedule executor pods if there is not enough free memory to fulfill their resource requests.
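One way to sanity-check the requests against node capacity is to set them explicitly; a minimal sketch (the values are illustrative assumptions, not a recommendation):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        .appName("k8s-sizing-check")
        .config("spark.executor.instances", "4")
        .config("spark.executor.memory", "4g")
        .config("spark.executor.memoryOverhead", "1g")            # counted in the pod memory request
        .config("spark.kubernetes.executor.request.cores", "2")
        .getOrCreate())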

Re: Python for the kids and now PySpark

2024-04-28 Thread Meena Rajani
Mich, you are right, these days the attention span is getting shorter. Christian could work on a completely new thing for 3 hours and is proud to explain it. It is amazing. Thanks for sharing. On Sat, Apr 27, 2024 at 9:40 PM Farshid Ashouri wrote: > Mich, this is absolutely amazing. > > Thanks

Re: Python for the kids and now PySpark

2024-04-27 Thread Farshid Ashouri
Mich, this is absolutely amazing. Thanks for sharing. On Sat, 27 Apr 2024, 22:26 Mich Talebzadeh, wrote: > Python for the kids. Slightly off-topic but worthwhile sharing. > > One of the things that may benefit kids is starting to learn something > new. Basically anything that can focus their

Re: [spark-graphframes]: Generating incorrect edges

2024-04-25 Thread Nijland, J.G.W. (Jelle, Student M-CS)
"128G" ).set("spark.executor.memoryOverhead", "32G" ).set("spark.driver.cores", "16" ).set("spark.driver.memory", "64G" ) I dont think b) applies as its a single machine. Kind regards, Jelle Fr

Re: [spark-graphframes]: Generating incorrect edges

2024-04-24 Thread Mich Talebzadeh
o 100K records. > Once I go past that amount of records the results become inconsistent and > incorrect. > > Kind regards, > Jelle Nijland > > > -- > *From:* Mich Talebzadeh > *Sent:* Wednesday, April 24, 2024 4:40 PM > *To:* Nijland, J.G.W. (Jelle, Student M-CS) < > j.g.

Re: [spark-graphframes]: Generating incorrect edges

2024-04-24 Thread Nijland, J.G.W. (Jelle, Student M-CS)
___ From: Mich Talebzadeh Sent: Wednesday, April 24, 2024 4:40 PM To: Nijland, J.G.W. (Jelle, Student M-CS) Cc: user@spark.apache.org Subject: Re: [spark-graphframes]: Generating incorrect edges OK few observations 1) ID Generation Method: How are you generating unique IDs (UUIDs, seque

Re: [spark-graphframes]: Generating incorrect edges

2024-04-24 Thread Mich Talebzadeh
ithub issue of someone with significantly larger graphs >have similar issues. >One suggestion there blamed indexing using strings rather than ints or >longs. >I rewrote my system to use int for IDs but I ran into the same issue. >The amount of incorrect edges wa

RE: How to add MaxDOP option in spark mssql JDBC

2024-04-24 Thread Appel, Kevin
You might be able to leverage the prepareQuery option, which is documented at https://spark.apache.org/docs/3.5.1/sql-data-sources-jdbc.html#data-source-option ... this was introduced in Spark 3.4.0 to handle temp table and CTE queries against MSSQL Server, since what you send in is not actually what
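A hedged PySpark sketch of combining prepareQuery with query for a CTE against SQL Server (the connection details, credentials and SQL are placeholders, not from the original thread):

    df = (spark.read.format("jdbc")
        .option("url", "jdbc:sqlserver://<host>:1433;databaseName=<db>")
        .option("prepareQuery", "WITH cte AS (SELECT id, amount FROM dbo.sales)")
        .option("query", "SELECT id, SUM(amount) AS total FROM cte GROUP BY id")
        .option("user", "<user>")
        .option("password", "<password>")
        .load())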

Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-14 Thread Kidong Lee
Thanks, Mich, for your reply. I agree, it is not so scalable and efficient. But it works correctly for Kafka transactions, and there is no problem with committing offsets to Kafka asynchronously for now. Let me tell you some more details about my streaming job. CustomReceiver does not receive anything

Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-14 Thread Mich Talebzadeh
Interesting. My concern is the infinite loop in *foreachRDD*: the *while(true)* loop within foreachRDD creates an infinite loop within each Spark executor. This might not be the most efficient approach, especially since offsets are committed asynchronously. HTH Mich Talebzadeh, Technologist |

Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-14 Thread Kidong Lee
Because Spark streaming for Kafka transactions does not work correctly to suit my needs, I moved to another approach using a raw Kafka consumer, which handles read_committed messages from Kafka correctly. My code looks like the following. JavaDStream stream = ssc.receiverStream(new CustomReceiver());

Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-13 Thread Kidong Lee
Thank you Mich for your reply. Actually, I tried to do most of your advice. When spark.streaming.kafka.allowNonConsecutiveOffsets=false, I got the following error. Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 3)

Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-13 Thread Mich Talebzadeh
Hi Kidong, There may be a few potential reasons why the message counts from your Kafka producer and Spark Streaming consumer might not match, especially with transactional messages and the read_committed isolation level. 1) Just ensure that both your Spark Streaming job and the Kafka consumer written

Re: External Spark shuffle service for k8s

2024-04-11 Thread Bjørn Jørgensen
I think this answers your question about what to do if you need more space on nodes. https://spark.apache.org/docs/latest/running-on-kubernetes.html#local-storage Local Storage Spark supports using volumes to spill
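A hedged sketch of what that looks like in practice, following the local-storage section linked above (the volume name prefix spark-local-dir- makes Spark use the volume as scratch space; storage class and size are illustrative assumptions):

    from pyspark.sql import SparkSession

    pvc = "spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1"
    spark = (SparkSession.builder
        .appName("k8s-local-storage")
        .config(f"{pvc}.options.claimName", "OnDemand")     # one PVC per executor, created on demand
        .config(f"{pvc}.options.storageClass", "gp2")
        .config(f"{pvc}.options.sizeLimit", "500Gi")
        .config(f"{pvc}.mount.path", "/data/spark-local-dir")
        .config(f"{pvc}.mount.readOnly", "false")
        .getOrCreate())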

Re: [Spark SQL]: Source code for PartitionedFile

2024-04-11 Thread Ashley McManamon
Hi Mich, Thanks for the reply. I did come across that file but it didn't align with the appearance of `PartitionedFile`: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/PartitionedFileUtil.scala In fact, the code snippet you shared also

Re: External Spark shuffle service for k8s

2024-04-11 Thread Bjørn Jørgensen
" In the end for my usecase I started using pvcs and pvc aware scheduling along with decommissioning. So far performance is good with this choice." How did you do this? tor. 11. apr. 2024 kl. 04:13 skrev Arun Ravi : > Hi Everyone, > > I had to explored IBM's and AWS's S3 shuffle plugins (some

Re: External Spark shuffle service for k8s

2024-04-10 Thread Arun Ravi
Hi Everyone, I had explored IBM's and AWS's S3 shuffle plugins (some time back), and I had also explored AWS FSx for Lustre in a few of my production jobs which have ~20TB of shuffle operations with 200-300 executors. What I have observed is that S3 and FSx behaviour was fine during the write phase, however I

Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-09 Thread Mich Talebzadeh
interesting. So below should be the corrected code with the suggestion in the [SPARK-47718] .sql() does not recognize watermark defined upstream - ASF JIRA (apache.org) # Define schema for parsing Kafka messages schema = StructType([

Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-09 Thread 刘唯
Sorry, this is not a bug but essentially a user error. Spark throws a really confusing error and I'm also confused. Please see the reply in the ticket for how to make things correct. https://issues.apache.org/jira/browse/SPARK-47718 刘唯 wrote on Sat, 6 Apr 2024 at 11:41: > This indeed looks like a bug. I will

Re: How to get db related metrics when use spark jdbc to read db table?

2024-04-08 Thread Femi Anthony
If you're using just Spark you could try turning on the history server and try to glean statistics from there. But there is no one location or log file which stores them all. Databricks, which is a managed Spark solution, provides such

Re: [Spark SQL]: Source code for PartitionedFile

2024-04-08 Thread Mich Talebzadeh
Hi, I believe this is the package https://raw.githubusercontent.com/apache/spark/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala And the code case class FilePartition(index: Int, files: Array[PartitionedFile]) extends Partition with

Re: How to get db related metrics when use spark jdbc to read db table?

2024-04-08 Thread Mich Talebzadeh
Well, you can do a fair bit with the available tools. The Spark UI, particularly the Stages and Executors tabs, does provide some valuable insights related to database health metrics for applications using a JDBC source. Stage Overview: This section provides a summary of all the stages executed

Re: External Spark shuffle service for k8s

2024-04-08 Thread Mich Talebzadeh
Hi, First thanks everyone for their contributions I was going to reply to @Enrico Minack but noticed additional info. As I understand for example, Apache Uniffle is an incubating project aimed at providing a pluggable shuffle service for Spark. So basically, all these "external shuffle

Re: External Spark shuffle service for k8s

2024-04-08 Thread Vakaris Baškirov
I see that both Uniffle and Celeborn support S3/HDFS backends, which is great. In case someone is using S3/HDFS, I wonder what the advantages would be of using Celeborn or Uniffle vs the IBM shuffle service plugin or the Cloud Shuffle Storage Plugin from AWS

Re: External Spark shuffle service for k8s

2024-04-08 Thread roryqi
Apache Uniffle (incubating) may be another solution. You can see https://github.com/apache/incubator-uniffle https://uniffle.apache.org/blog/2023/07/21/Uniffle%20-%20New%20chapter%20for%20the%20shuffle%20in%20the%20cloud%20native%20era Mich Talebzadeh wrote on Mon, 8 Apr 2024 at 07:15: > Splendid > > The

Re: External Spark shuffle service for k8s

2024-04-07 Thread Enrico Minack
There is Apache incubator project Uniffle: https://github.com/apache/incubator-uniffle It stores shuffle data on remote servers in memory, on local disk and HDFS. Cheers, Enrico Am 06.04.24 um 15:41 schrieb Mich Talebzadeh: I have seen some older references for shuffle service for k8s,

Re: Idiomatic way to rate-limit streaming sources to avoid OutOfMemoryError?

2024-04-07 Thread Mich Talebzadeh
OK, This is a common issue in Spark Structured Streaming (SSS), where the source generates data faster than Spark can process it. SSS doesn't have a built-in mechanism for directly rate-limiting the incoming data stream itself. However, consider the following: - Limit the rate at which data
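A minimal sketch of capping ingestion per micro-batch at the source (Kafka shown; file sources have an analogous maxFilesPerTrigger option). The broker, topic and the cap are assumptions for illustration:

    stream_df = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")
        .option("subscribe", "events")
        .option("maxOffsetsPerTrigger", "100000")   # upper bound on records pulled per micro-batch
        .load())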

Re: External Spark shuffle service for k8s

2024-04-07 Thread Mich Talebzadeh
Thanks Cheng for the heads up. I will have a look. Cheers Mich Talebzadeh, Technologist | Solutions Architect | Data Engineer | Generative AI London United Kingdom view my Linkedin profile

Re: External Spark shuffle service for k8s

2024-04-07 Thread Vakaris Baškirov
There is an IBM shuffle service plugin that supports S3 https://github.com/IBM/spark-s3-shuffle Though I would think a feature like this could be a part of the main Spark repo. Trino already has out-of-box support for s3 exchange (shuffle) and it's very useful. Vakaris On Sun, Apr 7, 2024 at

Re: External Spark shuffle service for k8s

2024-04-07 Thread Cheng Pan
Instead of the External Shuffle Service, Apache Celeborn might be a good option as a Remote Shuffle Service for Spark on K8s. There are some useful resources you might be interested in. [1] https://celeborn.apache.org/ [2] https://www.youtube.com/watch?v=s5xOtG6Venw [3]

Re: External Spark shuffle service for k8s

2024-04-07 Thread Mich Talebzadeh
Splendid. The configurations below can be used with k8s deployments of Spark. Spark applications running on k8s can utilize these configurations to seamlessly access data stored in Google Cloud Storage (GCS) and Amazon S3. For Google GCS we may have spark_config_gcs = {
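For illustration only, such config dictionaries might look roughly like the following; the key names come from the GCS connector and hadoop-aws, and the values, paths and credentials are placeholder assumptions, not the original message's content:

    from pyspark.sql import SparkSession

    spark_config_gcs = {
        "spark.hadoop.fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
        "spark.hadoop.google.cloud.auth.service.account.enable": "true",
        "spark.hadoop.google.cloud.auth.service.account.json.keyfile": "/secrets/gcs-key.json",
    }
    spark_config_s3 = {
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
        "spark.hadoop.fs.s3a.access.key": "<access-key>",
        "spark.hadoop.fs.s3a.secret.key": "<secret-key>",
    }

    builder = SparkSession.builder.appName("gcs-s3-on-k8s")
    for key, value in {**spark_config_gcs, **spark_config_s3}.items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()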

Re: External Spark shuffle service for k8s

2024-04-06 Thread Mich Talebzadeh
Thanks for your suggestion that I take it as a workaround. Whilst this workaround can potentially address storage allocation issues, I was more interested in exploring solutions that offer a more seamless integration with large distributed file systems like HDFS, GCS, or S3. This would ensure

Re: External Spark shuffle service for k8s

2024-04-06 Thread Bjørn Jørgensen
You can make a PVC on K8S, call it 300GB, make a folder in your Dockerfile WORKDIR /opt/spark/work-dir RUN chmod g+w /opt/spark/work-dir start spark with adding this .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.options.claimName", "300gb") \

Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-06 Thread 刘唯
This indeed looks like a bug. I will take some time to look into it. Mich Talebzadeh wrote on Wed, 3 Apr 2024 at 01:55: > > hm. you are getting below > > AnalysisException: Append output mode not supported when there are > streaming aggregations on streaming DataFrames/DataSets without watermark; > > The

Re: [External] Re: Issue of spark with antlr version

2024-04-06 Thread Bjørn Jørgensen
*To:* Bjørn Jørgensen ; user@spark.apache.org < > user@spark.apache.org> > *Cc:* Sahni, Ashima ; > user@spark.apache.org ; Misra Parashar, Jyoti < > jyoti.misra.paras...@accenture.com> > *Subject:* Re: [External] Re: Issue of spark with antlr version > > Hi Team, > Any

Re: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix

2024-04-03 Thread Oxlade, Dan
ind regards, Dan From: Oxlade, Dan Sent: 03 April 2024 15:49 To: Oxlade, Dan ; Aaron Grubb ; user@spark.apache.org Subject: Re: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix Swapping out the iceberg-aws-bundle for the very latest aws provided sdk ('software.amaz

Re: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix

2024-04-03 Thread Oxlade, Dan
(ParquetFileFormat.scala:429) From: Oxlade, Dan Sent: 03 April 2024 14:33 To: Aaron Grubb ; user@spark.apache.org Subject: Re: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix [sorry; replying all this time] With hadoop-*-3.3.6 in place of the 3.4.0

Re: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix

2024-04-03 Thread Oxlade, Dan
April 2024 13:52 To: user@spark.apache.org Subject: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix Downgrade to hadoop-*:3.3.x, Hadoop 3.4.x is based on the AWS SDK v2 and should probably be considered as breaking for tools that build on < 3.4.0 while using

Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix

2024-04-03 Thread Aaron Grubb
Downgrade to hadoop-*:3.3.x, Hadoop 3.4.x is based on the AWS SDK v2 and should probably be considered as breaking for tools that build on < 3.4.0 while using AWS. From: Oxlade, Dan Sent: Wednesday, April 3, 2024 2:41:11 PM To: user@spark.apache.org Subject:

Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Mich Talebzadeh
hm. you are getting below AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark; The problem seems to be that you are using the append output mode when writing the streaming query results to Kafka. This mode
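A minimal sketch of the pattern append mode expects: an event-time watermark plus a windowed aggregation, then a Kafka sink. Here streaming_df, the timestamp column, broker and topic are assumed names used only for illustration:

    from pyspark.sql import functions as F

    agg = (streaming_df
        .withWatermark("timestamp", "10 minutes")
        .groupBy(F.window("timestamp", "5 minutes"))
        .count()
        .selectExpr("to_json(struct(*)) AS value"))   # the Kafka sink expects a 'value' column

    query = (agg.writeStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")
        .option("topic", "agg_output")
        .option("checkpointLocation", "/tmp/checkpoints/agg")
        .outputMode("append")
        .start())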

RE: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Chloe He
Hi Mich, Thank you so much for your response. I really appreciate your help! You mentioned "defining the watermark using the withWatermark function on the streaming_df before creating the temporary view” - I believe this is what I’m doing and it’s not working for me. Here is the exact code

Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Mich Talebzadeh
ok let us take it for a test. The original code of mine def fetch_data(self): self.sc.setLogLevel("ERROR") schema = StructType() \ .add("rowkey", StringType()) \ .add("timestamp", TimestampType()) \ .add("temperature", IntegerType())

Re: [External] Re: Issue of spark with antlr version

2024-04-01 Thread Chawla, Parul
; user@spark.apache.org ; Misra Parashar, Jyoti ; Mekala, Rajesh ; Grandhi, Venkatesh ; George, Rejish ; Tayal, Aayushi Subject: Re: [External] Re: Issue of spark with antlr version [image.png] ons. 28. feb. 2024 kl. 11:28 skrev Chawla, Parul mailto:parul.cha...@accenture.com>>: Hi , Can

Re: [DISCUSS] MySQL version support policy

2024-03-25 Thread Dongjoon Hyun
Hi, Cheng. Thank you for the suggestion. Your suggestion seems to have at least two themes. A. Adding a new Apache Spark community policy (contract) to guarantee MySQL LTS Versions Support. B. Dropping the support of non-LTS version support (MySQL 8.3/8.2/8.1) And, it brings me three questions.

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-23 Thread Winston Lai
+1 -- Thank You & Best Regards Winston Lai From: Jay Han Date: Sunday, 24 March 2024 at 08:39 To: Kiran Kumar Dusi Cc: Farshid Ashouri , Matei Zaharia , Mich Talebzadeh , Spark dev list , user @spark Subject: Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Communit

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-23 Thread Jay Han
+1. It sounds awesome! Kiran Kumar Dusi wrote on Thu, 21 Mar 2024 at 14:16: > +1 > > On Thu, 21 Mar 2024 at 7:46 AM, Farshid Ashouri < > farsheed.asho...@gmail.com> wrote: > >> +1 >> >> On Mon, 18 Mar 2024, 11:00 Mich Talebzadeh, >> wrote: >> >>> Some of you may be aware that Databricks community Home |

Re: Feature article: Leveraging Generative AI with Apache Spark: Transforming Data Engineering

2024-03-22 Thread Mich Talebzadeh
Sorry, from this link: Leveraging Generative AI with Apache Spark: Transforming Data Engineering | LinkedIn Mich Talebzadeh, Technologist | Data | Generative AI |

Re:

2024-03-21 Thread Mich Talebzadeh
You can try this val kafkaReadStream = spark .readStream .format("kafka") .option("kafka.bootstrap.servers", broker) .option("subscribe", topicName) .option("startingOffsets", startingOffsetsMode) .option("maxOffsetsPerTrigger", maxOffsetsPerTrigger) .load() kafkaReadStream

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-20 Thread Kiran Kumar Dusi
+1 On Thu, 21 Mar 2024 at 7:46 AM, Farshid Ashouri wrote: > +1 > > On Mon, 18 Mar 2024, 11:00 Mich Talebzadeh, > wrote: > >> Some of you may be aware that Databricks community Home | Databricks >> have just launched a knowledge sharing hub. I thought it would be a >> good idea for the Apache

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-20 Thread Farshid Ashouri
+1 On Mon, 18 Mar 2024, 11:00 Mich Talebzadeh, wrote: > Some of you may be aware that Databricks community Home | Databricks > have just launched a knowledge sharing hub. I thought it would be a > good idea for the Apache Spark user group to have the same, especially > for repeat questions on

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-19 Thread Mich Talebzadeh
d idea. Will be useful >>> >>> >>> >>> +1 >>> >>> >>> >>> >>> >>> >>> >>> *From: *ashok34...@yahoo.com.INVALID >>> *Date: *Monday, March 18, 2024 at 6:36 AM >>> *To: *user @s

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-19 Thread Joris Billen
com.INVALID Date: Monday, March 18, 2024 at 6:36 AM To: user @spark mailto:user@spark.apache.org>>, Spark dev list mailto:d...@spark.apache.org>>, Mich Talebzadeh mailto:mich.talebza...@gmail.com>> Cc: Matei Zaharia mailto:matei.zaha...@gmail.com>> Subject: Re: A pro

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-19 Thread Varun Shah
Hi @Mich Talebzadeh, community, Where can I find such insights on the Spark architecture? I found a few sites below which did/do cover internals: 1. https://github.com/JerryLead/SparkInternals 2. https://books.japila.pl/apache-spark-internals/overview/ 3.

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-19 Thread Varun Shah
+1 Great initiative. QQ: Stack Overflow has a similar feature called "Collectives", but I am not sure of the expense of creating one for Apache Spark. With SO being used (at least before ChatGPT became quite the norm for searching questions), it already has a lot of questions asked and answered

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Deepak Sharma
>> >> >> >> >> >> >> *From: *ashok34...@yahoo.com.INVALID >> *Date: *Monday, March 18, 2024 at 6:36 AM >> *To: *user @spark , Spark dev list < >> d...@spark.apache.org>, Mich Talebzadeh >> *Cc: *Matei Zaharia >> *Subject: *R

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Hyukjin Kwon
org/wiki/Wernher_von_Braun>)". > > > On Mon, 18 Mar 2024 at 16:23, Parsian, Mahmoud > wrote: > >> Good idea. Will be useful >> >> >> >> +1 >> >> >> >> >> >> >> >> *From: *ashok34...@yahoo.com.INVALI

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Mich Talebzadeh
;>> >>> man. 18. mars 2024 kl. 17:26 skrev Parsian, Mahmoud < >>> mpars...@illumina.com.invalid>: >>> >>>> Good idea. Will be useful >>>> >>>> >>>> >>>> +1 >>>> >>&

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Reynold Xin
;>> >>> >>> +1 >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> *From:* ashok34668@ yahoo. com. INVALID ( ashok3

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Mich Talebzadeh
ars 2024 kl. 17:26 skrev Parsian, Mahmoud > : > >> Good idea. Will be useful >> >> >> >> +1 >> >> >> >> >> >> >> >> *From: *ashok34...@yahoo.com.INVALID >> *Date: *Monday, March 18, 2024 at 6:36 AM >

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Bjørn Jørgensen
y, March 18, 2024 at 6:36 AM > *To: *user @spark , Spark dev list < > d...@spark.apache.org>, Mich Talebzadeh > *Cc: *Matei Zaharia > *Subject: *Re: A proposal for creating a Knowledge Sharing Hub for Apache > Spark Community > > External message, be mindful when

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Code Tutelage
ark dev list < > d...@spark.apache.org>, Mich Talebzadeh > *Cc: *Matei Zaharia > *Subject: *Re: A proposal for creating a Knowledge Sharing Hub for Apache > Spark Community > > External message, be mindful when clicking links or attachments > > > > Good idea. Will be us

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-18 Thread Sreyan Chakravarty
On Mon, Mar 18, 2024 at 1:16 PM Mich Talebzadeh wrote: > > "I may need something like that for synthetic data for testing. Any way to > do that ?" > > Have a look at this. > > https://github.com/joke2k/faker > No I was not actually referring to data that can be faked. I want data to actually

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Mich Talebzadeh
v list < > d...@spark.apache.org>, Mich Talebzadeh > *Cc: *Matei Zaharia > *Subject: *Re: A proposal for creating a Knowledge Sharing Hub for Apache > Spark Community > > External message, be mindful when clicking links or attachments > > > > Good idea. Will be useful >

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Parsian, Mahmoud
Good idea. Will be useful +1 From: ashok34...@yahoo.com.INVALID Date: Monday, March 18, 2024 at 6:36 AM To: user @spark , Spark dev list , Mich Talebzadeh Cc: Matei Zaharia Subject: Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community External message, be mindful

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread ashok34...@yahoo.com.INVALID
Good idea. Will be useful +1 On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh wrote: Some of you may be aware that Databricks community Home | Databricks have just launched a knowledge sharing hub. I thought it would be a good idea for the Apache Spark user group to have the

Re: [GraphX]: Prevent recomputation of DAG

2024-03-18 Thread Mich Talebzadeh
Hi, I must admit I don't know much about this Fruchterman-Reingold (call it FR) visualization using GraphX and Kubernetes. But you are suggesting this slowdown issue starts after the second iteration, and caching/persisting the graph after each iteration does not help. FR involves many

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-18 Thread Mich Talebzadeh
Yes, transformations are indeed executed on the worker nodes, but they are only performed when necessary, usually when an action is called. This lazy evaluation helps in optimizing the execution of Spark jobs by allowing Spark to optimize the execution plan and perform optimizations such as

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-18 Thread Sreyan Chakravarty
On Fri, Mar 15, 2024 at 3:10 AM Mich Talebzadeh wrote: > > No Data Transfer During Creation: --> Data transfer occurs only when an > action is triggered. > Distributed Processing: --> DataFrames are distributed for parallel > execution, not stored entirely on the driver node. > Lazy Evaluation

Re: [External] Re: [GraphFrames Spark Package]: Why is there not a distribution for Spark 3.3?

2024-03-17 Thread Ofir Manor
ases> Ofir From: Russell Jurney Sent: Friday, March 15, 2024 11:43 PM To: brad.boil...@fcc-fac.ca.invalid Cc: user@spark.apache.org Subject: [External] Re: [GraphFrames Spark Package]: Why is there not a distribution for Spark 3.3? There is an implementa

Re: [GraphFrames Spark Package]: Why is there not a distribution for Spark 3.3?

2024-03-15 Thread Russell Jurney
There is an implementation for Spark 3, but GraphFrames isn't released often enough to match every point version. It supports Spark 3.4. Try it - it will probably work. https://spark-packages.org/package/graphframes/graphframes Thanks, Russell Jurney @rjurney

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-14 Thread Mich Talebzadeh
Hi, When you create a DataFrame from Python objects using spark.createDataFrame, here it goes: *Initial Local Creation:* The DataFrame is initially created in the memory of the driver node. The data is not yet distributed to executors at this point. *The role of lazy Evaluation:* Spark
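A tiny sketch of that sequence (the data and column names are illustrative):

    data = [("alice", 1), ("bob", 2)]                      # plain Python objects, held on the driver
    df = spark.createDataFrame(data, ["name", "value"])   # builds a plan; nothing shipped to executors yet
    df_filtered = df.filter(df.value > 1)                  # still lazy: only the plan grows
    print(df_filtered.count())                             # action: data is distributed and the job runs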

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-12 Thread Mich Talebzadeh
Thanks for the clarification. That makes sense. In the code below, we can see def onQueryProgress(self, event): print("onQueryProgress") # Access micro-batch data microbatch_data = event.progress #print("microbatch_data received") # Check if data is received

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-11 Thread 刘唯
Oh I see why the confusion. microbatch_data = event.progress means that microbatch_data is a StreamingQueryProgress instance, it's not a dictionary, so you should use ` microbatch_data.processedRowsPerSecond`, instead of the `get` method which is used for dictionaries. But weirdly, for
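A minimal listener sketch of that attribute-style access (assuming the PySpark 3.4+ StreamingQueryListener API; the class name is arbitrary):

    from pyspark.sql.streaming import StreamingQueryListener

    class ProgressListener(StreamingQueryListener):
        def onQueryStarted(self, event):
            pass

        def onQueryProgress(self, event):
            progress = event.progress                     # a StreamingQueryProgress object, not a dict
            print("rows/sec:", progress.processedRowsPerSecond)

        def onQueryTerminated(self, event):
            pass

    spark.streams.addListener(ProgressListener())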

Re: Bugs with joins and SQL in Structured Streaming

2024-03-11 Thread Andrzej Zera
Hi, Do you think there is any chance for this issue to get resolved? Should I create another bug report? As mentioned in my message, there is one open already: https://issues.apache.org/jira/browse/SPARK-45637 but it covers only one of the problems. Andrzej wt., 27 lut 2024 o 09:58 Andrzej Zera

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-11 Thread Mich Talebzadeh
Hi, Thank you for your advice. This is the amended code: def onQueryProgress(self, event): print("onQueryProgress") # Access micro-batch data microbatch_data = event.progress #print("microbatch_data received") # Check if data is received

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-10 Thread 刘唯
*now -> not 刘唯 wrote on Sun, 10 Mar 2024 at 22:04: > Have you tried using microbatch_data.get("processedRowsPerSecond")? > Camel case now snake case > > Mich Talebzadeh wrote on Sun, 10 Mar 2024 at 11:46: > >> >> There is a paper from Databricks on this subject >> >> >>

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-10 Thread 刘唯
Have you tried using microbatch_data.get("processedRowsPerSecond")? Camel case now snake case Mich Talebzadeh wrote on Sun, 10 Mar 2024 at 11:46: > > There is a paper from Databricks on this subject > > > https://www.databricks.com/blog/2022/05/27/how-to-monitor-streaming-queries-in-pyspark.html > > But

Re: Creating remote tables using PySpark

2024-03-08 Thread Mich Talebzadeh
The error message shows a mismatch between the configured warehouse directory and the actual location accessible by the Spark application running in the container. You have configured the SparkSession with spark.sql.warehouse.dir="file:/data/hive/warehouse". This tells Spark where to store
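A minimal sketch of the configuration being discussed; the key point is that the path must resolve to storage both the driver and the executors can actually reach (for example a shared mount, HDFS or object storage), not a container-local directory:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        .appName("remote-tables")
        .config("spark.sql.warehouse.dir", "file:/data/hive/warehouse")  # must exist on every node/container
        .enableHiveSupport()
        .getOrCreate())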

Re: Creating remote tables using PySpark

2024-03-07 Thread Tom Barber
Okay, that was some caching issue. Now there is a shared mount point between the place where the pyspark code is executed and the Spark nodes it runs on. Hrmph, I was hoping that wouldn't be the case. Fair enough! On Thu, Mar 7, 2024 at 11:23 PM Tom Barber wrote: > Okay interesting, maybe my assumption

Re: Creating remote tables using PySpark

2024-03-07 Thread Tom Barber
Okay interesting, maybe my assumption was incorrect, although I'm still confused. I tried to mount a central mount point that would be the same on my local machine and the container. Same error although I moved the path to /tmp/hive/data/hive/ but when I rerun the test code to save a table,

Re: It seems --py-files only takes the first two arguments. Can someone please confirm?

2024-03-05 Thread Mich Talebzadeh
Sorry, I forgot. The below is catered for YARN mode. If your application code primarily consists of Python files and does not require a separate virtual environment with specific dependencies, you can use the --py-files argument in spark-submit: spark-submit --verbose \ --master yarn \
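As a hedged sketch with placeholder file names: --py-files takes a single comma-separated list (no spaces), and the entry-point script is passed separately after all options as the primary resource, not inside --py-files:

    spark-submit --verbose \
      --master yarn \
      --deploy-mode cluster \
      --py-files dependencies.zip,utils.py,helpers.py \
      main.py --app-arg value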

Re: It seems --py-files only takes the first two arguments. Can someone please confirm?

2024-03-05 Thread Mich Talebzadeh
I use a zip file personally and pass the application name (in your case main.py) as the last input line, like below. APPLICATION is your main.py. It does not need to be called main.py; it could be anything, like testpython.py. CODE_DIRECTORY_CLOUD="gs://spark-on-k8s/codes" ## replace gs with s3 #

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Jungtaek Lim
2024-03-05 17:09:07 > *To:* Pan,Bingkun > *Cc:* Dongjoon Hyun; dev; user > *Subject:* Re: [ANNOUNCE] Apache Spark 3.5.1 released > > Let me be more specific. > > We have two active release version lines, 3.4.x and 3.5.x. We just > released Spark 3.5.1, having a dropdown as 3.5.1 and

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Jungtaek Lim
possible. > > Only by sharing the same version.json file in each version. > -- > *From:* Jungtaek Lim > *Sent:* 2024-03-05 16:47:30 > *To:* Pan,Bingkun > *Cc:* Dongjoon Hyun; dev; user > *Subject:* Re: [ANNOUNCE] Apache Spark 3.5.1 released &

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Jungtaek Lim
https://github.com/apache/spark/pull/42881 > > So, we need to manually update this file. I can manually submit an update > first to get this feature working. > -- > *From:* Jungtaek Lim > *Sent:* 2024-03-04 6:34:42 > *To:* Dongjoon Hyun > *Cc:* dev;

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-04 Thread yangjie01
That sounds like a great suggestion. From: Jungtaek Lim Date: Tuesday, 5 March 2024, 10:46 To: Hyukjin Kwon Cc: yangjie01, Dongjoon Hyun, dev, user Subject: Re: [ANNOUNCE] Apache Spark 3.5.1 released Yes, it's relevant to that PR. I wonder, if we want to expose version switcher, it should

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-04 Thread Jungtaek Lim
Yes, it's relevant to that PR. I wonder, if we want to expose version switcher, it should be in versionless doc (spark-website) rather than the doc being pinned to a specific version. On Tue, Mar 5, 2024 at 11:18 AM Hyukjin Kwon wrote: > Is this related to

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-04 Thread Hyukjin Kwon
Is this related to https://github.com/apache/spark/pull/42428? cc @Yang,Jie(INF) On Mon, 4 Mar 2024 at 22:21, Jungtaek Lim wrote: > Shall we revisit this functionality? The API doc is built with individual > versions, and for each individual version we depend on other released > versions.

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-03 Thread Jungtaek Lim
Shall we revisit this functionality? The API doc is built with individual versions, and for each individual version we depend on other released versions. This does not seem to be right to me. Also, the functionality is only in PySpark API doc which does not seem to be consistent as well. I don't
