Re: Dstream HasOffsetRanges equivalent in Structured streaming
Hello, what options are you considering yourself? On Wednesday 22 May 2024 at 07:37:30 BST, Anil Dasari wrote: Hello, We are on Spark 3.x, currently using Spark DStream + Kafka, and planning to move to Structured Streaming + Kafka. Is there an equivalent of the DStream HasOffsetRanges in Structured Streaming to get the micro-batch end offsets so that we can write them to our external checkpoint store? Thanks in advance. Regards
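There is no direct HasOffsetRanges in Structured Streaming: the Kafka offsets consumed by each micro-batch are written to the query's checkpoint (the offsets/ and commits/ subdirectories) and surfaced through StreamingQueryProgress. A minimal sketch of capturing the end offsets with a listener, assuming Spark 3.4+ for the Python StreamingQueryListener API (the Scala equivalent has existed far longer):

from pyspark.sql.streaming import StreamingQueryListener

class OffsetTracker(StreamingQueryListener):
    def onQueryStarted(self, event):
        pass

    def onQueryProgress(self, event):
        # event.progress.sources carries per-source start/end offsets as JSON strings
        for src in event.progress.sources:
            # write (batchId, endOffset) to your external checkpoint store here
            print(event.progress.batchId, src.description, src.endOffset)

    def onQueryIdle(self, event):
        pass

    def onQueryTerminated(self, event):
        pass

# spark is your active SparkSession
spark.streams.addListener(OffsetTracker())

The same JSON is also available after the fact from query.lastProgress["sources"], or you can switch the sink to foreachBatch and persist offsets keyed by the batch_id it hands you.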
Re: A handy tool called spark-column-analyser
Great work. Very handy for identifying problems, thanks.

On Tuesday 21 May 2024 at 18:12:15 BST, Mich Talebzadeh wrote:

A colleague kindly pointed out about giving an example of output, which will be added to the README.

Doing analysis for column Postcode

JSON formatted output

{
  "Postcode": {
    "exists": true,
    "num_rows": 93348,
    "data_type": "string",
    "null_count": 21921,
    "null_percentage": 23.48,
    "distinct_count": 38726,
    "distinct_percentage": 41.49
  }
}

Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime
London, United Kingdom
view my Linkedin profile
https://en.everybodywiki.com/Mich_Talebzadeh

Disclaimer: The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, quote "one test result is worth one-thousand expert opinions (Werner Von Braun)".

On Tue, 21 May 2024 at 16:21, Mich Talebzadeh wrote:

I just wanted to share a tool I built called spark-column-analyzer. It's a Python package that helps you dig into your Spark DataFrames with ease. Ever spend ages figuring out what's going on in your columns? Like, how many null values are there, or how many unique entries? Built with data preparation for Generative AI in mind, it aids in data imputation and augmentation – key steps for creating realistic synthetic data.

Basics
- Effortless Column Analysis: It calculates all the important stats you need for each column, like null counts, distinct values, percentages, and more. No more manual counting or head scratching!
- Simple to Use: Just toss in your DataFrame and call the analyze_column function. Bam! Insights galore.
- Makes Data Cleaning Easier: Knowing your data's quality helps you clean it up way faster. This package helps you figure out where the missing values are hiding and how much variety you've got in each column.
- Detecting skewed columns.
- Open Source and Friendly: Feel free to tinker, suggest improvements, or even contribute some code yourself! We love collaboration in the Spark community.

Installation:

Using pip from the link https://pypi.org/project/spark-column-analyzer/

pip install spark-column-analyzer

Also you can clone the project from GitHub

git clone https://github.com/michTalebzadeh/spark_column_analyzer.git

The details are in the attached README file.

Let me know what you think! Feedback is always welcome.

HTH

Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime
London, United Kingdom
view my Linkedin profile
https://en.everybodywiki.com/Mich_Talebzadeh

Disclaimer: The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, quote "one test result is worth one-thousand expert opinions (Werner Von Braun)".
Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community
Good idea. Will be useful. +1

On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh wrote:

Some of you may be aware that the Databricks community have just launched a knowledge sharing hub. I thought it would be a good idea for the Apache Spark user group to have the same, especially for repeat questions on Spark core, Spark SQL, Spark Structured Streaming, Spark MLlib and so forth.

Apache Spark user and dev groups have been around for a good while. They are serving their purpose. We went through creating a Slack community that managed to create more heat than light. This is what the Databricks community came up with, and I quote: "Knowledge Sharing Hub. Dive into a collaborative space where members like YOU can exchange knowledge, tips, and best practices. Join the conversation today and unlock a wealth of collective wisdom to enhance your experience and drive success."

I don't know the logistics of setting it up, but I am sure that should not be that difficult. If anyone is supportive of this proposal, let the usual +1, 0, -1 decide.

HTH

Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer
London, United Kingdom
view my Linkedin profile
https://en.everybodywiki.com/Mich_Talebzadeh

Disclaimer: The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, quote "one test result is worth one-thousand expert opinions (Werner Von Braun)".
Re: Spark Structured Streaming and Flask REST API for Real-Time Data Ingestion and Analytics.
Hey Mich,

Thanks for this introduction on your forthcoming proposal "Spark Structured Streaming and Flask REST API for Real-Time Data Ingestion and Analytics". I recently came across an article by Databricks titled Scalable Spark Structured Streaming for REST API Destinations. Their use case is similar to your suggestion, but what they are saying is that they have an incoming stream of data from sources like Kafka, AWS Kinesis, or Azure Event Hub. In other words, a continuous flow of data where messages are sent to a REST API as soon as they are available in the streaming source. Their approach is practical, but I wanted to get your thoughts on their article to better understand your proposal and the differences.

Thanks

On Tuesday, 9 January 2024 at 00:24:19 GMT, Mich Talebzadeh wrote:

Please also note that Flask, by default, is a single-threaded web framework. While it is suitable for development and small-scale applications, it may not handle concurrent requests efficiently in a production environment. In production, one can utilise Gunicorn (Green Unicorn), a WSGI (Web Server Gateway Interface) HTTP server that is commonly used to serve Flask applications in production. It provides multiple worker processes, each capable of handling a single request at a time. This makes Gunicorn suitable for handling multiple simultaneous requests and improves the concurrency and performance of your Flask application.

HTH

Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer
London, United Kingdom
view my Linkedin profile
https://en.everybodywiki.com/Mich_Talebzadeh

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On Mon, 8 Jan 2024 at 19:30, Mich Talebzadeh wrote:

Thought it might be useful to share my idea with fellow forum members. During the breaks, I worked on the seamless integration of Spark Structured Streaming with Flask REST API for real-time data ingestion and analytics. The use case revolves around a scenario where data is generated through REST API requests in real time. The Flask REST API efficiently captures and processes this data, saving it to a Spark Structured Streaming DataFrame. Subsequently, the processed data could be channelled into any sink of your choice, including a Kafka pipeline, showing a robust end-to-end solution for dynamic and responsive data streaming. I will delve into the architecture, implementation, and benefits of this combination, enabling one to build an agile and efficient real-time data application. I will put the code in GitHub for everyone's benefit. Hopefully your comments will help me to improve it.

Cheers

Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer
London, United Kingdom
view my Linkedin profile
https://en.everybodywiki.com/Mich_Talebzadeh

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
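For what it's worth, here is a minimal sketch of the kind of glue described above: a Flask endpoint that lands each POSTed event as a JSON file, and a Structured Streaming job that picks those files up. All paths, names and the schema are illustrative assumptions rather than Mich's actual code; in production you would serve the app under Gunicorn (e.g. gunicorn -w 4 app:app) and quite possibly publish to Kafka instead of a directory.

# --- app.py : Flask side, writes one JSON file per request ---
import json, os, uuid
from flask import Flask, request

app = Flask(__name__)
LANDING_DIR = "/tmp/landing"          # hypothetical landing zone watched by Spark
os.makedirs(LANDING_DIR, exist_ok=True)

@app.route("/ingest", methods=["POST"])
def ingest():
    record = request.get_json(force=True)
    with open(f"{LANDING_DIR}/{uuid.uuid4()}.json", "w") as f:
        json.dump(record, f)
    return {"status": "accepted"}, 202

# --- streaming_job.py : Spark side, a separate process ---
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("rest_ingest").getOrCreate()
schema = StructType([StructField("event_id", StringType()),
                     StructField("value", DoubleType())])      # illustrative schema

events = spark.readStream.schema(schema).json("/tmp/landing")
query = (events.writeStream
         .format("console")                                    # swap for Kafka or any other sink
         .option("checkpointLocation", "/tmp/rest_ingest_chk")
         .start())
query.awaitTermination()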
Re: Clarification with Spark Structured Streaming
Thank you for your feedback Mich. In general, how can one optimise the cloud data warehouses (the sink part) to handle streaming Spark data efficiently, avoiding the bottlenecks discussed?

AK

On Monday, 9 October 2023 at 11:04:41 BST, Mich Talebzadeh wrote:

Hi,

Please see my responses below:

1) In Spark Structured Streaming does commit mean streaming data has been delivered to the sink like Snowflake?

No. A commit does not refer to data being delivered to a sink like Snowflake or BigQuery. The term commit refers to Spark Structured Streaming (SSS) internals. Specifically it means that a micro-batch of data has been processed by SSS. In the checkpoint directory there is a subdirectory called commits that marks the micro-batch process as completed.

2) If sinks like Snowflake cannot absorb or digest streaming data in a timely manner, will there be an impact on Spark streaming itself?

Yes, it can potentially impact SSS. If the sink cannot absorb data in a timely manner, the batches will start to back up in SSS. This can cause Spark to run out of memory and the streaming job to fail. As I understand it, Spark will use a combination of memory and disk storage (checkpointing). This can also happen if the network interface between Spark and the sink is disrupted. On the other hand Spark may slow down as it tries to process the backed-up batches of data. You want to avoid these scenarios.

HTH

Mich Talebzadeh, Distinguished Technologist, Solutions Architect & Engineer
London, United Kingdom
view my Linkedin profile
https://en.everybodywiki.com/Mich_Talebzadeh

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On Sun, 8 Oct 2023 at 19:50, ashok34...@yahoo.com.INVALID wrote:

Hello team

1) In Spark Structured Streaming does commit mean streaming data has been delivered to the sink like Snowflake?
2) If sinks like Snowflake cannot absorb or digest streaming data in a timely manner, will there be an impact on Spark streaming itself?

Thanks

AK
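To make the first point concrete, the commit marker is just a file in the checkpoint. A minimal sketch (rate source and console sink as stand-ins, hypothetical paths):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("commit_demo").getOrCreate()
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

query = (stream.writeStream
         .format("console")                                     # stand-in for the real sink
         .option("checkpointLocation", "/tmp/commit_demo_chk")  # hypothetical path
         .start())

# After each completed micro-batch Spark writes a marker file
#   /tmp/commit_demo_chk/commits/<batchId>
# next to /tmp/commit_demo_chk/offsets/<batchId>. The commit means the micro-batch
# was fully processed by SSS; it does not by itself prove the external warehouse
# has durably absorbed the data.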
Clarification with Spark Structured Streaming
Hello team

1) In Spark Structured Streaming does commit mean streaming data has been delivered to the sink like Snowflake?
2) If sinks like Snowflake cannot absorb or digest streaming data in a timely manner, will there be an impact on Spark streaming itself?

Thanks

AK
Need to split incoming data into PM on time column and find the top 5 by volume of data
Hello gurus,

I have a Hive table created as below (there are more columns):

CREATE TABLE hive.sample_data (
  incoming_ip STRING,
  time_in TIMESTAMP,
  volume INT
);

Data is stored in that table. In PySpark, I want to select the top 5 incoming IP addresses with the highest total volume of data transferred during the PM hours. PM hours are determined by the time_in column, which has values like '00:45:00', '11:35:00', '18:25:00'.

Any advice is appreciated.
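A minimal PySpark sketch of one way to do this, assuming time_in really is a timestamp and taking "PM" to mean 12:00:00 onwards (adjust the filter if your definition differs); the table and column names come from the CREATE TABLE above:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df = spark.table("hive.sample_data")

top5_pm = (df.filter(F.hour("time_in") >= 12)                 # keep PM hours only
             .groupBy("incoming_ip")
             .agg(F.sum("volume").alias("total_volume"))
             .orderBy(F.col("total_volume").desc())
             .limit(5))

top5_pm.show()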
Re: Filter out 20% of rows
;).agg(F.sum("gbps").alias("total_gbps")) windowRank = Window.orderBy(F.col("total_gbps").desc()) agg_df = agg_df.withColumn("rank", F.percent_rank().over(windowRank)) top_80_ips = agg_df.filter(F.col("rank") <= 0.80) result = df.join(top_80_ips, on="incoming_ips", how="inner").select("incoming_ips", "gbps", "date_time") result.show() print(df.count()) print(result_df.count()) +---+---+---+ | incoming_ips| gbps| date_time| +---+---+---+ | 66.186.8.130| 5.074283124722104|2022-03-12 05:09:16| | 155.45.76.235| 0.6736194760917324|2021-06-19 03:36:28| | 237.51.137.200|0.43334812775057685|2022-04-27 08:08:47| |78.4.48.171| 7.5675453578753435|2022-08-21 18:55:48| | 241.84.163.17| 3.5681655964070815|2021-01-24 20:39:50| |130.255.202.138| 6.066112278135983|2023-07-07 22:26:15| | 198.33.206.140| 1.9147905257021836|2023-03-01 04:44:14| | 84.183.253.20| 7.707176860385722|2021-08-26 23:24:31| |218.163.165.232| 9.458673015973213|2021-02-22 12:13:15| | 62.57.20.153| 1.5764916247359229|2021-11-06 12:41:59| | 245.24.168.152|0.07452805411698016|2021-06-04 16:14:36| | 98.171.202.249| 3.546118349483626|2022-07-05 10:55:26| | 210.5.246.85|0.02430730260109759|2022-04-08 17:26:04| | 13.236.170.177| 2.41361938344535|2021-08-11 02:19:06| |180.140.248.193| 0.9512956363005021|2021-06-27 18:16:58| | 26.140.88.127| 7.51335778127692|2023-06-02 14:13:30| | 7.118.207.252| 6.450499049816286|2022-12-11 06:36:20| |11.8.10.136| 8.750329246667354|2023-02-03 05:33:16| | 232.140.56.86| 4.289740988237201|2023-02-22 20:10:09| | 68.117.9.255| 5.384340363304169|2022-12-03 09:55:26| +---+---+---+ +---+--+---+ | incoming_ips| gbps| date_time| +---+--+---+ | 66.186.8.130| 5.074283124722104|2022-03-12 05:09:16| | 241.84.163.17|3.5681655964070815|2021-01-24 20:39:50| |78.4.48.171|7.5675453578753435|2022-08-21 18:55:48| |130.255.202.138| 6.066112278135983|2023-07-07 22:26:15| | 198.33.206.140|1.9147905257021836|2023-03-01 04:44:14| | 84.183.253.20| 7.707176860385722|2021-08-26 23:24:31| |218.163.165.232| 9.458673015973213|2021-02-22 12:13:15| | 62.57.20.153|1.5764916247359229|2021-11-06 12:41:59| | 98.171.202.249| 3.546118349483626|2022-07-05 10:55:26| |180.140.248.193|0.9512956363005021|2021-06-27 18:16:58| | 13.236.170.177| 2.41361938344535|2021-08-11 02:19:06| | 26.140.88.127| 7.51335778127692|2023-06-02 14:13:30| | 7.118.207.252| 6.450499049816286|2022-12-11 06:36:20| |11.8.10.136| 8.750329246667354|2023-02-03 05:33:16| | 232.140.56.86| 4.289740988237201|2023-02-22 20:10:09| | 68.117.9.255| 5.384340363304169|2022-12-03 09:55:26| +---+--+---+ 20 16 fre. 15. sep. 2023 kl. 20:14 skrev ashok34...@yahoo.com.INVALID : Hi team, I am using PySpark 3.4 I have a table of million rows that has few columns. among them incoming ips and what is known as gbps (Gigabytes per second) and date and time of incoming ip. I want to filter out 20% of low active ips and work on the remainder of data. How can I do thiis in PySpark? Thanks -- Bjørn Jørgensen Vestre Aspehaug 4, 6010 Ålesund Norge +47 480 94 297 -- Bjørn Jørgensen Vestre Aspehaug 4, 6010 Ålesund Norge +47 480 94 297
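The first line of Bjørn's snippet lost its beginning in the archive; below is a self-contained sketch of the same approach (sum gbps per IP, rank with percent_rank, keep the top 80% and join back), using the column names from the thread:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
# df is assumed to have the columns incoming_ips, gbps and date_time

agg_df = (df.groupBy("incoming_ips")
            .agg(F.sum("gbps").alias("total_gbps")))

w = Window.orderBy(F.col("total_gbps").desc())
ranked = agg_df.withColumn("rank", F.percent_rank().over(w))

# keep the IPs making up the top 80% of the ranking, i.e. drop the 20% least active
top_80_ips = ranked.filter(F.col("rank") <= 0.80)

result = (df.join(top_80_ips, on="incoming_ips", how="inner")
            .select("incoming_ips", "gbps", "date_time"))

print(df.count(), result.count())

One design note: a Window.orderBy with no partitionBy pulls all rows into a single partition, which is fine here because there is only one row per IP after the groupBy, but it would not scale if applied to the raw data.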
Re: Seeking Professional Advice on Career and Personal Growth in the Apache Spark Community
Hello Mich, Thanking you for providing these useful feedbacks and responses. We appreciate your contribution to this community forum. I for myself find your posts insightful. +1 for me Best, AK On Wednesday, 6 September 2023 at 18:34:27 BST, Mich Talebzadeh wrote: Hi Varun, In answer to your questions, these are my views. However, they are just views and cannot be taken as facts so to speak - Focus and Time Management: I often struggle with maintaining focus and effectively managing my time. This leads to productivity issues and affects my ability to undertake and complete projects efficiently. - Set clear goals. - Prioritize tasks. - Create a to-do list. - Avoid multitasking. - Eliminate distractions. - Take regular breaks. - Go to the gym and try to rest your mind and refresh yourself . - Graduate Studies Dilemma: - Your mileage varies and it all depends on what you are trying to achieve. Graduate Studies will help you to think independently and out of the box. Will also lead you on "how to go about solving the problem". So it will give you that experience. - Long-Term Project Building: I am interested in working on long-term projects, but I am uncertain about the right approach and how to stay committed throughout the project's lifecycle. - I assume you have a degree. That means that you had the discipline to wake up in the morning, go to lectures and not to miss the lectures (hopefully you did not!). In other words, it proves that you have already been through a structured discipline and you have the will to do it. - Overcoming Fear of Failure and Procrastination: I often find myself in a constant fear mode of failure, which leads to abandoning pet projects shortly after starting them or procrastinating over initiating new ones. - Failure is natural and can and do happen. However, the important point is that you learn from your failures. Just call them experience. You need to overcome fear of failure and embrace the challenges. - Risk Aversion: With no inherited wealth or financial security, I am often apprehensive about taking risks, even when they may potentially lead to significant personal or professional growth. - Welcome to the club! In 2020, it was estimated that in the UK, the richest 10% of households hold 43% of all wealth. The poorest 50% by contrast own just 9% Risk is part of life. When crossing the street, you are taking a calculated view of the cars coming and going.In short, risk assessment is a fundamental aspect of life! HTH Mich Talebzadeh,Distinguished Technologist, Solutions Architect & Engineer London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destructionof data or any other property which may arise from relying on this email's technical content is explicitly disclaimed.The author will in no case be liable for any monetary damages arising from suchloss, damage or destruction. On Tue, 5 Sept 2023 at 22:17, Varun Shah wrote: Dear Apache Spark Community, I hope this email finds you well. I am writing to seek your valuable insights and advice on some challenges I've been facing in my career and personal development journey, particularly in the context of Apache Spark and the broader big data ecosystem. A little background about myself: I graduated in 2019 and have since been working in the field of AWS cloud and big data tools such as Spark, Airflow, AWS services, Databricks, and Snowflake. 
My interest in the world of big data tools dates back to 2016-17, where I initially began exploring concepts like big data with spark using scala, and the Scala ecosystem, including technologies like Akka. Additionally, I have a keen interest in functional programming and data structures and algorithms (DSA) applied to big data optimizations. However, despite my enthusiasm and passion for these areas, I am encountering some challenges that are hindering my growth: - Focus and Time Management: I often struggle with maintaining focus and effectively managing my time. This leads to productivity issues and affects my ability to undertake and complete projects efficiently. - Graduate Studies Dilemma: I am unsure about whether to pursue a master's degree. The fear of GRE and uncertainty about getting into a reputable university have been holding me back. I'm unsure whether further education would significantly benefit my career in big data. - Long-Term Project Building: I am interested in working on long-term projects, but I am uncertain about the right approach and how to stay committed throughout the project's lifecycle. - Overcoming Fear of Failure and Procrastination: I often find myself in a constant fear mode of fail
Re: Shuffle with Window().partitionBy()
Thanks Rauf, great. Regards

On Tuesday, 23 May 2023 at 13:18:55 BST, Rauf Khan wrote:

Hi,

partitionBy() is analogous to GROUP BY: all rows that have the same value in the specified column(s) will form one window. The data will be shuffled to form those groups.

Regards
Raouf

On Fri, May 12, 2023, 18:48 ashok34...@yahoo.com.INVALID wrote:

Hello,

In Spark windowing, can a call to Window().partitionBy() cause a shuffle to take place? If so, what is the performance impact, if any, when the result set is large?

Thanks
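The shuffle Rauf describes shows up directly in the physical plan; a minimal sketch with synthetic data:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).withColumn("key", (F.col("id") % 100).cast("int"))

w = Window.partitionBy("key").orderBy("id")
out = df.withColumn("rn", F.row_number().over(w))

# The plan contains an "Exchange hashpartitioning(key, ...)" node: that exchange
# is the shuffle introduced by partitionBy().
out.explain()

On a large result set the cost is mostly this exchange plus the per-partition sort; spark.sql.shuffle.partitions (200 by default) controls how many shuffle partitions the exchange produces.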
Shuffle with Window().partitionBy()
Hello,

In Spark windowing, can a call to Window().partitionBy() cause a shuffle to take place? If so, what is the performance impact, if any, when the result set is large?

Thanks
Portability of Docker images built on different cloud platforms
Hello team, Is it possible to use a Spark Docker image built on GCP on AWS, without rebuilding it from scratch on AWS? Will that work please? AK
Re: Online classes for spark topics
Hello Mich. Greetings. Would you be able to arrange a Spark Structured Streaming learning webinar? This is something I have been struggling with recently; it would be very helpful.

Thanks and regards
AK

On Tuesday, 7 March 2023 at 20:24:36 GMT, Mich Talebzadeh wrote:

Hi,

This might be a worthwhile exercise on the assumption that the contributors will find the time and bandwidth to chip in, so to speak. I am sure there are many, but off the top of my head I can think of Holden Karau for k8s, and Sean Owen for data science stuff. They are both very experienced. Anyone else 🤔

HTH

view my Linkedin profile
https://en.everybodywiki.com/Mich_Talebzadeh

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On Tue, 7 Mar 2023 at 19:17, ashok34...@yahoo.com.INVALID wrote:

Hello gurus,

Does Spark arrange online webinars for special topics like Spark on K8s, data science and Spark Structured Streaming? I would be most grateful if experts could share their experience with learners with intermediate knowledge like myself. Hopefully we will find the practical experience shared valuable.

Respectfully,
AK
Online classes for spark topics
Hello gurus, Does Spark arrange online webinars for special topics like Spark on K8s, data science and Spark Structured Streaming? I would be most grateful if experts could share their experience with learners with intermediate knowledge like myself. Hopefully we will find the practical experience shared valuable. Respectfully, AK
Re: spark+kafka+dynamic resource allocation
Hi,

Worth checking this link: https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation

On Saturday, 28 January 2023 at 06:18:28 GMT, Lingzhe Sun wrote:

Hi all,

I'm wondering if dynamic resource allocation works in Spark + Kafka streaming applications. Here are some questions:

- Will Structured Streaming be supported?
- Is the number of consumers always equal to the number of partitions of the subscribed topic (let's say there's only one topic)?
- If consumers are evenly distributed across executors, will a newly added executor (through dynamic resource allocation) trigger a consumer reassignment?
- Would it simply be a bad idea to use dynamic resource allocation in a streaming app, because there's no way to scale down the number of executors unless no data is coming in?

Any thoughts are welcomed.

Lingzhe Sun, Hirain Technology
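For reference, the core settings from the page linked above can be put on the session as below (values are purely illustrative; shuffle tracking is what lets dynamic allocation work without an external shuffle service, e.g. on Kubernetes). Note that the DStream API has its own spark.streaming.dynamicAllocation.enabled flag, and, as the last question above suggests, a Structured Streaming job that is never idle may rarely release executors:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("kafka_stream")
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
         .config("spark.dynamicAllocation.minExecutors", "2")
         .config("spark.dynamicAllocation.maxExecutors", "20")
         .config("spark.dynamicAllocation.executorIdleTimeout", "120s")
         .getOrCreate())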
Re: Issue while creating spark app
Thanks for all these useful info Hi all What is the current trend. Is it Spark on Scala with intellij or Spark on python with pycharm. I am curious because I have moderate experience with Spark on both Scala and python and want to focus on Scala OR python going forward with the intention of joining new companies in Big Data. thanking you On Sunday, 27 February 2022, 21:32:05 GMT, Mich Talebzadeh wrote: Might as well update the artefacts to the correct versions hopefully. Downloaded scala 2.12.8 scala -version Scala code runner version 2.12.8 -- Copyright 2002-2018, LAMP/EPFL and Lightbend, Inc. Edited the pom.xml as below http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd";> 4.0.0 spark MichTest 1.0 8 8 org.scala-lang scala-library 2.12.8 org.apache.spark spark-core_2.13 3.2.1 org.apache.spark spark-sql_2.13 3.2.1 and built with maven. All good view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destructionof data or any other property which may arise from relying on this email's technical content is explicitly disclaimed.The author will in no case be liable for any monetary damages arising from suchloss, damage or destruction. On Sun, 27 Feb 2022 at 20:16, Mich Talebzadeh wrote: Thanks Bjorn. I am aware of that. I just really wanted to create the uber jar files with both sbt and maven in Intellij. cheers view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destructionof data or any other property which may arise from relying on this email's technical content is explicitly disclaimed.The author will in no case be liable for any monetary damages arising from suchloss, damage or destruction. On Sun, 27 Feb 2022 at 20:12, Bjørn Jørgensen wrote: Mitch: You are using scala 2.11 to do this. Have a look at Building Spark "Spark requires Scala 2.12/2.13; support for Scala 2.11 was removed in Spark 3.0.0." søn. 27. feb. 2022 kl. 20:55 skrev Mich Talebzadeh : OK I decided to give a try to maven. Downloaded maven and unzipped the file WSL-Ubuntu terminal as unzip apache-maven-3.8.4-bin.zip Then added to Windows env variable as MVN_HOME and added the bin directory to path in windows. Restart intellij to pick up the correct path. Again on the command line in intellij do mvn -vApache Maven 3.8.4 (9b656c72d54e5bacbed989b64718c159fe39b537)Maven home: d:\temp\apache-maven-3.8.4Java version: 1.8.0_73, vendor: Oracle Corporation, runtime: C:\Program Files\Java\jdk1.8.0_73\jreDefault locale: en_GB, platform encoding: Cp1252OS name: "windows 10", version: "10.0", arch: "amd64", family: "windows" in Intellij add maven support to your project. Follow this link Add Maven support to an existing project There will be a pom.xml file under project directory Edit that pom.xml file and add the following http://maven.apache.org/POM/4.0.0"; xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"; xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd";> 4.0.0 spark MichTest 1.0 8 8 org.scala-lang scala-library 2.11.7 org.apache.spark spark-core_2.10 2.0.0 org.apache.spark spark-sql_2.10 2.0.0 In intellij open a Terminal under project sub-directory where the pom file is created and you edited. mvn clean [INFO] Scanning for projects... 
[INFO] [INFO] ---< spark:MichTest >--- [INFO] Building MichTest 1.0 [INFO] [ jar ]- [INFO] [INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ MichTest --- [INFO] Deleting D:\temp\intellij\MichTest\target [INFO] [INFO] BUILD SUCCESS [INFO] [INFO] Total time: 4.451 s [INFO] Finished at: 2022-02-27T19:37:57Z [INFO] mvn compile [INFO] Scanning for projects... [INFO] [INFO] ---< spark:MichTest >--- [INFO] Building MichTest 1.0 [INFO] [ jar ]- [INFO] [INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ MichTest
Re: Deploying Spark on Google Kubernetes (GKE) autopilot, preliminary findings
Thanks Mich. Very insightful. AK

On Monday, 14 February 2022, 11:18:19 GMT, Mich Talebzadeh wrote:

Good question. However, we ought to look at what options we have, so to speak. Let us consider Spark on Dataproc, Spark on Kubernetes and Spark on Dataflow.

Spark on Dataproc is proven and it is in use at many organizations; I have deployed it extensively. It is infrastructure as a service provided, including Spark, Hadoop and other artefacts. You have to manage cluster creation, automate cluster creation and tear down, submitting jobs etc. However, it is another stack that needs to be managed. It now has an autoscaling policy (enabling cluster worker VM autoscaling) as well.

Spark on GKE is something newer. Worth adding that the Spark DEV team are working hard to improve the performance of Spark on Kubernetes, for example through Support for Customized Kubernetes Scheduler. As I explained in the first thread, Spark on Kubernetes relies on containerisation. Containers make applications more portable. Moreover, they simplify the packaging of dependencies, especially with PySpark, and enable repeatable and reliable build workflows, which is cost effective. They also reduce the overall devops load and allow one to iterate on the code faster. From a purely cost perspective it would be cheaper with Docker, as you can share resources with your other services. You can create Spark dockers with different versions of Spark, Scala, Java, OS etc. That docker file is portable. It can be used on-prem, AWS, GCP etc in container registries, and devops and data science people can share it as well. Built once, used by many. Kubernetes with autopilot helps scale the nodes of the Kubernetes cluster depending on the load. That is what I am currently looking into.

With regard to Dataflow, which I believe is similar to AWS Glue, it is a managed service for executing data processing patterns. Patterns or pipelines are built with the Apache Beam SDK, which is an open source programming model that supports Java, Python and Go. It enables batch and streaming pipelines. You create your pipelines with an Apache Beam program and then run them on the Dataflow service. The Apache Spark Runner can be used to execute Beam pipelines using Spark. When you run a job on Dataflow, it spins up a cluster of virtual machines, distributes the tasks in the job to the VMs, and dynamically scales the cluster based on how the job is performing. As I understand it, both iterative processing and notebooks, plus machine learning with Spark ML, are not currently supported by Dataflow.

So we have three choices here. If you are migrating from an on-prem Hadoop/Spark/YARN set-up, you may go for Dataproc, which will provide the same look and feel. If you want to use microservices and containers in your event driven architecture, you can adopt docker images that run on Kubernetes clusters, including Multi-Cloud Kubernetes Cluster. Dataflow is probably best suited for green-field projects: less operational overhead, and a unified approach for batch and streaming pipelines.

So as ever your mileage varies. If you want to migrate from your existing Hadoop/Spark cluster to GCP, or take advantage of your existing workforce, choose Dataproc or GKE. In many cases, a big consideration is that one already has a codebase written against a particular framework, and one just wants to deploy it on GCP, so even if, say, the Beam programming model/Dataflow is superior to Hadoop, someone with a lot of Hadoop code might still choose Dataproc or GKE for the time being, rather than rewriting their code on Beam to run on Dataflow.
HTH

view my Linkedin profile
https://en.everybodywiki.com/Mich_Talebzadeh

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On Mon, 14 Feb 2022 at 05:46, Gourav Sengupta wrote:

Hi, maybe this is useful in case someone is testing SPARK in containers for developing SPARK. From a production scale work point of view: if I am in AWS, I will just use GLUE if I want to use containers for SPARK, without massively increasing my costs for operations unnecessarily. Also, in case I am not wrong, GCP already has SPARK running in serverless mode. Personally I would never create the overhead of additional costs and issues to my clients of deploying SPARK when those solutions are already available from Cloud vendors. In fact, that is one of the precise reasons why people use cloud - to reduce operational costs. Sorry, just trying to understand what is the scope of this work.

Regards, Gourav Sengupta

On Fri, Feb 11, 2022 at 8:35 PM Mich Talebzadeh wrote:

The equivalent of Google GKE autopilot in AWS is AWS Fargate. I have not used AWS Fargate so I can only mens
What are the most common operators for shuffle in Spark
Hello,

I know some operators in Spark are expensive because of shuffle. This document describes shuffle https://www.educba.com/spark-shuffle/ and says:

"More shufflings in numbers are not always bad. Memory constraints and other impossibilities can be overcome by shuffling. In RDD, the below are a few operations and examples of shuffle:
– subtractByKey
– groupBy
– foldByKey
– reduceByKey
– aggregateByKey
– transformations of a join of any type
– distinct
– cogroup"

I know some operations like reduceByKey are well known for creating a shuffle, but what I don't understand is why the distinct operation should cause a shuffle!

Thanking you
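On the distinct question: distinct() is planned as an aggregation over all columns, and duplicate rows can sit in different partitions, so Spark must hash-partition by the row values to bring equal rows together before it can drop duplicates, and that repartitioning is the shuffle. A small sketch that makes the exchange visible:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, "a"), (1, "a"), (2, "b")], ["id", "val"])

# The physical plan shows HashAggregate ... Exchange hashpartitioning(id, val, ...)
# -- the Exchange is the shuffle that lets duplicates on different partitions meet.
df.distinct().explain()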
Spark with parallel processing and event driven architecture
Hi gurus, I am trying to understand the role of Spark in an event driven architecture. I know Spark deals with massive parallel processing. However, does Spark follow an event driven architecture like Kafka as well? Say, handling producers, filtering and pushing the events to consumers such as a database etc. Thanking you
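Spark is not itself a message broker, but Structured Streaming is routinely used as the processing stage of an event driven pipeline: producers publish to Kafka, Spark subscribes, filters or transforms, and pushes results to a downstream consumer such as a database. A hedged sketch (broker address, topic, schema and JDBC details are all illustrative assumptions; it also needs the spark-sql-kafka package and a JDBC driver on the classpath):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("event_pipeline").getOrCreate()

schema = StructType([StructField("device", StringType()),
                     StructField("temperature", DoubleType())])   # illustrative event schema

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
          .option("subscribe", "sensor-events")               # hypothetical topic
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

alerts = events.filter(F.col("temperature") > 75.0)           # the "filtering" step

def write_to_db(batch_df, batch_id):
    # push each micro-batch to a downstream consumer, here a JDBC database
    (batch_df.write.format("jdbc")
        .option("url", "jdbc:postgresql://db:5432/events")    # hypothetical database
        .option("dbtable", "alerts")
        .option("user", "spark").option("password", "secret")
        .mode("append").save())

query = (alerts.writeStream
         .foreachBatch(write_to_db)
         .option("checkpointLocation", "/tmp/event_pipeline_chk")
         .start())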
Re: How to change a DataFrame column from nullable to not nullable in PySpark
Many thanks all, especially to Mich. That is what I was looking for.

On Friday, 15 October 2021, 09:28:24 BST, Mich Talebzadeh wrote:

Spark allows one to define the column format as StructType or list. By default Spark assumes that all fields are nullable when creating a dataframe. To change nullability you need to provide the structure of the columns.

Assume that I have created an RDD in the form

rdd = sc.parallelize(Range). \
      map(lambda x: (x, usedFunctions.clustered(x,numRows), \
                     usedFunctions.scattered(x,numRows), \
                     usedFunctions.randomised(x,numRows), \
                     usedFunctions.randomString(50), \
                     usedFunctions.padString(x," ",50), \
                     usedFunctions.padSingleChar("x",4000)))

For the above I create a schema with StructType as below:

Schema = StructType([ StructField("ID", IntegerType(), False),
                      StructField("CLUSTERED", FloatType(), True),
                      StructField("SCATTERED", FloatType(), True),
                      StructField("RANDOMISED", FloatType(), True),
                      StructField("RANDOM_STRING", StringType(), True),
                      StructField("SMALL_VC", StringType(), True),
                      StructField("PADDING", StringType(), True)
                    ])

Note that the first column ID is defined as NOT NULL (nullable = False).

Then I can create a dataframe df as below

df = spark.createDataFrame(rdd, schema = Schema)

df.printSchema()

root
 |-- ID: integer (nullable = false)
 |-- CLUSTERED: float (nullable = true)
 |-- SCATTERED: float (nullable = true)
 |-- RANDOMISED: float (nullable = true)
 |-- RANDOM_STRING: string (nullable = true)
 |-- SMALL_VC: string (nullable = true)
 |-- PADDING: string (nullable = true)

HTH

view my Linkedin profile

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On Thu, 14 Oct 2021 at 12:50, ashok34...@yahoo.com.INVALID wrote:

Gurus,

I have an RDD in PySpark that I can convert to a DF through

df = rdd.toDF()

However, when I do

df.printSchema()

I see the columns as nullable = true by default

root
 |-- COL-1: long (nullable = true)
 |-- COl-2: double (nullable = true)
 |-- COl-3: string (nullable = true)

What would be the easiest way to make COL-1 NOT NULLABLE

Thanking you
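Mich's reply shows how to get the nullability right when building the DataFrame from scratch. For an existing DataFrame, one common workaround (a sketch, not the only way, and note that Spark will not re-validate that the column is really null-free) is to rebuild the schema with the flag flipped and re-create the DataFrame from the underlying RDD:

from pyspark.sql.types import StructType, StructField

def set_not_nullable(df, col_name, spark):
    """Return a copy of df with col_name marked nullable=False."""
    new_fields = [StructField(f.name, f.dataType, False) if f.name == col_name else f
                  for f in df.schema.fields]
    return spark.createDataFrame(df.rdd, StructType(new_fields))

df2 = set_not_nullable(df, "COL-1", spark)   # spark is your active SparkSession
df2.printSchema()                            # COL-1: long (nullable = false)

The DataFrame-to-RDD round trip has a cost, so it is best done once, early in the pipeline.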
How to change a DataFrame column from nullable to not nullable in PySpark
Gurus,

I have an RDD in PySpark that I can convert to a DF through

df = rdd.toDF()

However, when I do

df.printSchema()

I see the columns as nullable = true by default

root
 |-- COL-1: long (nullable = true)
 |-- COl-2: double (nullable = true)
 |-- COl-3: string (nullable = true)

What would be the easiest way to make COL-1 NOT NULLABLE

Thanking you
Well balanced Python code with Pandas compared to PySpark
Hello team, Someone asked me about well developed Python code using Pandas DataFrames, compared to PySpark. Under what situations would one choose PySpark instead of plain Python and Pandas? Appreciate it. AK
Re: Recovery when two spark nodes out of 6 fail
Thank you for the detailed explanation. Please advise on the below:

"If one executor fails, it moves the processing over to another executor. However, if the data is lost, it re-executes the processing that generated the data, and might have to go back to the source."

Does this mean that only those tasks that the dead executor was executing at the time need to be rerun to regenerate the processing stages? I read somewhere that RDD lineage keeps track of what needs to be re-executed.

best

On Friday, 25 June 2021, 16:23:32 BST, Lalwani, Jayesh wrote:

Spark replicates the partitions among multiple nodes. If one executor fails, it moves the processing over to another executor. However, if the data is lost, it re-executes the processing that generated the data, and might have to go back to the source. In case of failure, there will be a delay in getting results. The amount of delay depends on how much reprocessing Spark needs to do.

When the driver executes an action, it submits a job to the Cluster Manager. The Cluster Manager starts submitting tasks to executors and monitoring them. In case executors die, the Cluster Manager does the work of reassigning the tasks. While all of this is going on, the driver is just sitting there waiting for the action to complete. So, the driver does nothing, really. The Cluster Manager is doing most of the work of managing the workload.

Spark, by itself, doesn't add executors when executors fail. It just moves the tasks to other executors. If you are installing plain vanilla Spark on your own cluster, you need to figure out how to bring back executors. Most of the popular platforms built on top of Spark (Glue, EMR, Kubernetes) will replace failed nodes. You need to look into the capabilities of your chosen platform.

If the driver dies, the Spark job dies. There's no recovering from that. The only way to recover is to run the job again. Batch jobs do not have checkpointing, so they will need to reprocess everything from the beginning. You need to write your jobs to be idempotent; i.e. rerunning them shouldn't change the outcome. Streaming jobs have checkpointing, and they will start from the last microbatch. This means that they might have to repeat the last microbatch.

From: "ashok34...@yahoo.com.INVALID"
Date: Friday, June 25, 2021 at 10:38 AM
To: "user@spark.apache.org"
Subject: [EXTERNAL] Recovery when two spark nodes out of 6 fail

Greetings,

This is a scenario that we need to come up with comprehensive answers to fulfil please.

If we have 6 spark VMs each running two executors via spark-submit.
- we have two VM failures at H/W level, rack failure
- we lose 4 executors of spark out of 12
- happening half way through the spark-submit job

So my humble questions are:

- Will there be any data lost from the final result due to missing nodes?
- How will RDD lineage handle this?
- Will there be any delay in getting the final result?
- How will the driver handle these two node failures?
- Will there be additional executors added to the existing nodes, or will the existing executors handle the job of the 4 failing executors?
- What if running in client mode and the node holding the driver dies?
- What if this happens running in cluster mode?

Did search in Google, no satisfactory answers gurus, hence turning to the forum.

Best
A.K.
Recovery when two spark nodes out of 6 fail
Greetings,

This is a scenario that we need to come up with comprehensive answers to fulfil please.

If we have 6 spark VMs each running two executors via spark-submit.

- we have two VM failures at H/W level, rack failure
- we lose 4 executors of spark out of 12
- happening half way through the spark-submit job

So my humble questions are:

- Will there be any data lost from the final result due to missing nodes?
- How will RDD lineage handle this?
- Will there be any delay in getting the final result?
- How will the driver handle these two node failures?
- Will there be additional executors added to the existing nodes, or will the existing executors handle the job of the 4 failing executors?
- What if running in client mode and the node holding the driver dies?
- What if this happens running in cluster mode?

Did search in Google, no satisfactory answers gurus, hence turning to the forum.

Best
A.K.
Re: Spark Streaming non functional requirements
Hello Mich

Thank you for your great explanation.

Best
A.

On Tuesday, 27 April 2021, 11:25:19 BST, Mich Talebzadeh wrote:

Hi,

Any design (in whatever framework) needs to consider both functional and non-functional requirements. Functional requirements are those which relate to the technical functionality of the system that we cover daily in this forum. A non-functional requirement is a requirement that specifies criteria that can be used to judge the operation of a system, rather than specific behaviours. From my experience the non-functional requirements are equally important, and in some cases they are underestimated when systems go to production. Probably most importantly, they need to cover the following:

- Processing time meeting a service-level agreement (SLA). You can get some of this from the Spark GUI. Are you comfortably satisfying the requirements? How about total delay, back pressure etc.? Are you within your SLA? In streaming applications, there is no such thing as an answer which is supposed to be late and correct. The timeliness is part of the application. If we get the right answer too slowly it becomes useless or wrong. We also need to be aware that latency trades off against throughput.
- Capacity, current and forecast. What is the current capacity? Have you accounted for extra demands, sudden surges and loads such as year-end? Can your pipeline handle 1.6-2 times the current load?
- Scalability. How does your application scale if you have to handle multiple topics or new topics added at later stages? Scalability also includes additional nodes, on-prem, or having the ability to add more resources such as Google Dataproc compute engines etc.
- Supportability and Maintainability. Have you updated docs and procedures in Confluence or equivalent, or are they typically a year old :). Is there any single point of failure due to skill set? Can ops support the design and maintain BAU themselves? How about training and hand-over?
- Disaster recovery and Fault tolerance. What provisions are made for disaster recovery? Is there any single point of failure in the design (end to end pipeline)? Are you using standalone Dockers or Google Kubernetes Engine (GKE or equivalent)?

HTH

view my Linkedin profile

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On Mon, 26 Apr 2021 at 17:16, ashok34...@yahoo.com.INVALID wrote:

Hello,

When we design a typical Spark streaming process, the focus is to get the functional requirements. However, I have been asked to provide non-functional requirements as well. Likely things I can consider are fault tolerance and reliability (component failures). Is there a standard list of non-functional requirements for Spark streaming that one needs to consider and verify?

Thanking you
Spark Streaming non functional requirements
Hello,

When we design a typical Spark streaming process, the focus is to get the functional requirements. However, I have been asked to provide non-functional requirements as well. Likely things I can consider are fault tolerance and reliability (component failures). Is there a standard list of non-functional requirements for Spark streaming that one needs to consider and verify?

Thanking you
Python level of knowledge for Spark and PySpark
Hi gurus,

I have knowledge of Java, Scala and good enough knowledge of Spark, Spark SQL and Spark functional programming with Scala. I have started using Python with Spark (PySpark). Wondering, in order to be proficient in PySpark, how much knowledge of Python programming is needed? I know the answer may be "very good knowledge", but in practice how much is good enough? I can write Python in an IDE like PyCharm, similar to the way Scala works, and can run the programs. Is expert knowledge of Python a prerequisite for PySpark? I also know Pandas and am familiar with plotting routines like matplotlib.

Warmest

Ashok
repartition in Spark
Hi,

Just need some advice.

- When we have multiple spark nodes running code, under what conditions does a repartition make sense?
- Can we repartition and cache the result --> df = spark.sql("select from ...").repartition(4).cache()
- If we choose a repartition (4), will that repartition apply to all nodes running the code, and how can one see that?

Thanks
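A hedged sketch covering the last two bullets: repartition(n) is itself a shuffle that redistributes the DataFrame into n partitions across whatever executors the job has, cache() is lazy until an action runs, and both effects can be inspected programmatically and in the Spark UI (the table name below is hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.sql("SELECT * FROM some_table").repartition(4).cache()

df.count()                              # an action, so the cache is materialised
print(df.rdd.getNumPartitions())        # -> 4

# In the Spark UI, the Storage tab lists the cached partitions and the executors
# holding them; the SQL / Stages tabs show the Exchange added by repartition(4).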