Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread ashok34...@yahoo.com.INVALID
 Hello,
what options are you considering yourself?
On Wednesday 22 May 2024 at 07:37:30 BST, Anil Dasari 
 wrote:  
 
 Hello,

We are on Spark 3.x and using Spark dstream + kafka and planning to use 
structured streaming + Kafka. Is there an equivalent of Dstream HasOffsetRanges 
in structure streaming to get the microbatch end offsets to the checkpoint in 
our external checkpoint store ? Thanks in advance. 
Regards
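
For context, one commonly used approach (a sketch only, assuming a Kafka source; not necessarily what the original poster will choose) is to read the per-source end offsets from the streaming query progress after each micro-batch and persist them to the external checkpoint store. A StreamingQueryListener's onQueryProgress callback exposes the same information if you prefer a callback style.

# Minimal sketch: `df` is assumed to be a streaming DataFrame read from Kafka,
# and save_offsets() is a hypothetical helper that writes to the external store.
query = (df.writeStream
         .format("console")                      # placeholder sink
         .option("checkpointLocation", "/tmp/chk/offsets_demo")
         .start())

progress = query.lastProgress                    # dict describing the last micro-batch
if progress:
    for src in progress["sources"]:
        # for Kafka sources, endOffset describes the topic/partition/offset map
        save_offsets(src["description"], src["endOffset"])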
  

Re: A handy tool called spark-column-analyser

2024-05-21 Thread ashok34...@yahoo.com.INVALID
 Great work. Very handy for identifying problems
thanks
On Tuesday 21 May 2024 at 18:12:15 BST, Mich Talebzadeh 
 wrote:  
 
A colleague kindly suggested giving an example of the output, which will be
added to the README.
Doing analysis for column Postcode
Json formatted output
{    "Postcode": {        "exists": true,        "num_rows": 93348,        
"data_type": "string",        "null_count": 21921,        "null_percentage": 
23.48,        "distinct_count": 38726,        "distinct_percentage": 41.49    }}
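
For readers who want to see what such per-column stats involve, here is a rough hand-rolled PySpark equivalent (my sketch, not the package's actual implementation; the function name column_stats is made up):

from pyspark.sql import functions as F

def column_stats(df, col):
    # rough equivalent of the per-column stats shown above (not the package's code)
    if col not in df.columns:
        return {col: {"exists": False}}
    num_rows = df.count()
    null_count = df.filter(F.col(col).isNull()).count()
    distinct_count = df.select(col).distinct().count()
    return {
        col: {
            "exists": True,
            "num_rows": num_rows,
            "data_type": dict(df.dtypes).get(col),
            "null_count": null_count,
            "null_percentage": round(100.0 * null_count / num_rows, 2),
            "distinct_count": distinct_count,
            "distinct_percentage": round(100.0 * distinct_count / num_rows, 2),
        }
    }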
Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime
London
United Kingdom



   view my Linkedin profile




 https://en.everybodywiki.com/Mich_Talebzadeh

 



Disclaimer: The information provided is correct to the best of my knowledge but 
of course cannot be guaranteed . It is essential to note that, as with any 
advice, quote "one test result is worth one-thousand expert opinions (Werner 
Von Braun)".


On Tue, 21 May 2024 at 16:21, Mich Talebzadeh  wrote:


I just wanted to share a tool I built called spark-column-analyzer. It's a 
Python package that helps you dig into your Spark DataFrames with ease. 

Ever spend ages figuring out what's going on in your columns? Like, how many 
null values are there, or how many unique entries? Built with data preparation 
for Generative AI in mind, it aids in data imputation and augmentation – key 
steps for creating realistic synthetic data.

Basics
   
   - Effortless Column Analysis: It calculates all the important stats you need 
for each column, like null counts, distinct values, percentages, and more. No 
more manual counting or head scratching!
   - Simple to Use: Just toss in your DataFrame and call the analyze_column 
function. Bam! Insights galore.
   - Makes Data Cleaning easier: Knowing your data's quality helps you clean it 
up way faster. This package helps you figure out where the missing values are 
hiding and how much variety you've got in each column.
   - Detecting skewed columns
   - Open Source and Friendly: Feel free to tinker, suggest improvements, or 
even contribute some code yourself! We love collaboration in the Spark 
community.

Installation: 


Using pip from the link: https://pypi.org/project/spark-column-analyzer/

pip install spark-column-analyzer

Also you can clone the project from gitHub

git clone https://github.com/michTalebzadeh/spark_column_analyzer.git

The details are in the attached README file
Let me know what you think! Feedback is always welcome.

HTH

Mich Talebzadeh,

Technologist | Architect | Data Engineer  | Generative AI | FinCrime
London
United Kingdom



   view my Linkedin profile




 https://en.everybodywiki.com/Mich_Talebzadeh

 



Disclaimer: The information provided is correct to the best of my knowledge but 
of course cannot be guaranteed . It is essential to note that, as with any 
advice, quote "one test result is worth one-thousand expert opinions (Werner 
Von Braun)".

  

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread ashok34...@yahoo.com.INVALID
 Good idea. Will be useful
+1
On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh 
 wrote:  
 
 Some of you may be aware that Databricks community Home | Databricks
have just launched a knowledge sharing hub. I thought it would be a
good idea for the Apache Spark user group to have the same, especially
for repeat questions on Spark core, Spark SQL, Spark Structured
Streaming, Spark Mlib and so forth.

Apache Spark user and dev groups have been around for a good while.
They are serving their purpose. We went through creating a Slack
community that managed to create more heat than light. This is
what Databricks community came up with and I quote

"Knowledge Sharing Hub
Dive into a collaborative space where members like YOU can exchange
knowledge, tips, and best practices. Join the conversation today and
unlock a wealth of collective wisdom to enhance your experience and
drive success."

I don't know the logistics of setting it up, but I am sure that should
not be that difficult. If anyone is supportive of this proposal, let
the usual +1, 0, -1 decide

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


  view my Linkedin profile


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner Von Braun)".

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

  

Re: Spark Structured Streaming and Flask REST API for Real-Time Data Ingestion and Analytics.

2024-01-09 Thread ashok34...@yahoo.com.INVALID
 Hey Mich,
Thanks for this introduction on your forthcoming proposal "Spark Structured 
Streaming and Flask REST API for Real-Time Data Ingestion and Analytics". I 
recently came across an article by Databricks titled Scalable Spark 
Structured Streaming for REST API Destinations. Their use case is similar to 
your suggestion, but what they describe is an incoming stream of data from 
sources like Kafka, AWS Kinesis, or Azure Event Hub. In other words, a 
continuous flow of data where messages are sent to a REST API as soon as they 
are available in the streaming source. Their approach is practical, but I wanted 
to get your thoughts on their article to better understand your proposal and the 
differences.
Thanks

On Tuesday, 9 January 2024 at 00:24:19 GMT, Mich Talebzadeh 
 wrote:  
 
 Please also note that Flask, by default, is a single-threaded web framework. 
While it is suitable for development and small-scale applications, it may not 
handle concurrent requests efficiently in a production environment. In 
production, one can use Gunicorn (Green Unicorn), a WSGI (Web Server Gateway 
Interface) server that is commonly used to serve Flask applications. It provides 
multiple worker processes, each capable of handling a single request at a time. 
This makes Gunicorn suitable for handling multiple simultaneous requests and 
improves the concurrency and performance of your Flask application.
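
To illustrate, a minimal sketch (the module and route names are placeholders, not from the thread):

# app.py - a minimal Flask app (names are placeholders)
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/health")
def health():
    return jsonify(status="ok")

# In production, run it under Gunicorn with several worker processes, e.g.:
#   gunicorn --workers 4 --bind 0.0.0.0:8000 app:app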

HTH
Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom



   view my Linkedin profile




 https://en.everybodywiki.com/Mich_Talebzadeh

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from relying 
on this email's technical content is explicitly disclaimed. The author will in 
no case be liable for any monetary damages arising from such loss, damage or 
destruction. 

 


On Mon, 8 Jan 2024 at 19:30, Mich Talebzadeh  wrote:

Thought it might be useful to share my idea with fellow forum members.  During 
the breaks, I worked on the seamless integration of Spark Structured Streaming 
with Flask REST API for real-time data ingestion and analytics. The use case 
revolves around a scenario where data is generated through REST API requests in 
real time. The Flask REST API efficiently captures and processes this data, 
saving it to a Spark Structured Streaming DataFrame. Subsequently, the 
processed data could be channelled into any sink of your choice including Kafka 
pipeline, showing a robust end-to-end solution for dynamic and responsive data 
streaming. I will delve into the architecture, implementation, and benefits of 
this combination, enabling one to build an agile and efficient real-time data 
application. I will put the code in GitHub for everyone's benefit. Hopefully 
your comments will help me to improve it.
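
As a rough illustration of one possible shape for such an integration (my own sketch with placeholder paths and an assumed payload schema, not necessarily the design described above), the Flask endpoint could land each request as a JSON file and a separate Structured Streaming job could pick the files up and forward them to a sink:

# --- Flask side (runs as its own process) ---
import json, uuid
from flask import Flask, request

app = Flask(__name__)
LANDING_DIR = "/tmp/landing"                     # placeholder path

@app.route("/ingest", methods=["POST"])
def ingest():
    # persist each incoming payload as a small JSON file for Spark to pick up
    with open(f"{LANDING_DIR}/{uuid.uuid4()}.json", "w") as f:
        json.dump(request.get_json(), f)
    return {"status": "accepted"}, 202

# --- Spark Structured Streaming side (separate process) ---
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("flask_sss_demo").getOrCreate()

schema = StructType([StructField("event", StringType(), True),   # assumed payload
                     StructField("value", DoubleType(), True)])

stream_df = spark.readStream.schema(schema).json(LANDING_DIR)

query = (stream_df.writeStream
         .format("console")                      # swap for Kafka or another sink
         .option("checkpointLocation", "/tmp/chk/flask_sss_demo")
         .start())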
Cheers
Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom



   view my Linkedin profile




 https://en.everybodywiki.com/Mich_Talebzadeh

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from relying 
on this email's technical content is explicitly disclaimed. The author will in 
no case be liable for any monetary damages arising from such loss, damage or 
destruction. 

 

  

Re: Clarification with Spark Structured Streaming

2023-10-09 Thread ashok34...@yahoo.com.INVALID
 Thank you for your feedback Mich.
In general, how can one optimise cloud data warehouses (the sink part) to 
handle streaming Spark data efficiently, avoiding the bottlenecks discussed?

AK
On Monday, 9 October 2023 at 11:04:41 BST, Mich Talebzadeh 
 wrote:  
 
 Hi,
Please see my responses below:
1) In Spark Structured Streaming does commit mean streaming data has been 
delivered to the sink like Snowflake?

No. A commit does not refer to data being delivered to a sink like Snowflake or 
BigQuery. The term commit refers to Spark Structured Streaming (SSS) internals. 
Specifically, it means that a micro-batch of data has been processed by SSS. In 
the checkpoint directory there is a subdirectory called commits that marks the 
micro-batch as completed.
2) if sinks like Snowflake  cannot absorb or digest streaming data in a timely 
manner, will there be an impact on spark streaming itself?

Yes, it can potentially impact SSS. If the sink cannot absorb data in a timely 
manner, the batches will start to back up in SSS. This can cause Spark to run 
out of memory and the streaming job to fail. As I understand, Spark will use a 
combination of memory and disk storage (checkpointing). This can also happen if 
the network interface between Spark and the sink is disrupted. On the other 
hand Spark may slow down, as it tries to process the backed-up batches of data. 
You want to avoid these scenarios.
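
To make the commit mechanics concrete, here is a minimal sketch (paths and sink are placeholders; foreachBatch is a common way to feed a warehouse such as Snowflake or BigQuery through its batch connector):

# df is assumed to be a streaming DataFrame, e.g. read from Kafka
def write_to_warehouse(batch_df, batch_id):
    # placeholder: call the sink's batch connector (Snowflake, BigQuery, ...) here
    batch_df.write.mode("append").parquet("/tmp/warehouse_stage")

query = (df.writeStream
         .foreachBatch(write_to_warehouse)
         .option("checkpointLocation", "/tmp/chk/my_query")
         .trigger(processingTime="1 minute")     # paces micro-batches against a slow sink
         .start())

# After each successful micro-batch, Spark writes a marker under
# /tmp/chk/my_query/commits/ (files named 0, 1, 2, ...), while offsets/ records
# what each batch read; a Kafka source can also be throttled with maxOffsetsPerTrigger.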
HTH
Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom



   view my Linkedin profile




 https://en.everybodywiki.com/Mich_Talebzadeh

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from relying 
on this email's technical content is explicitly disclaimed. The author will in 
no case be liable for any monetary damages arising from such loss, damage or 
destruction. 

 


On Sun, 8 Oct 2023 at 19:50, ashok34...@yahoo.com.INVALID 
 wrote:

Hello team
1) In Spark Structured Streaming does commit mean streaming data has been 
delivered to the sink like Snowflake?
2) if sinks like Snowflake  cannot absorb or digest streaming data in a timely 
manner, will there be an impact on spark streaming itself?
Thanks

AK
  

Clarification with Spark Structured Streaming

2023-10-08 Thread ashok34...@yahoo.com.INVALID
Hello team
1) In Spark Structured Streaming does commit mean streaming data has been 
delivered to the sink like Snowflake?
2) if sinks like Snowflake  cannot absorb or digest streaming data in a timely 
manner, will there be an impact on spark streaming itself?
Thanks

AK

Need to split incoming data into PM on time column and find the top 5 by volume of data

2023-09-21 Thread ashok34...@yahoo.com.INVALID
Hello gurus,

I have a Hive table created as below (there are more columns)

CREATE TABLE hive.sample_data ( incoming_ip STRING, time_in TIMESTAMP, volume 
INT );

Data is stored in that table

In PySpark, I want to  select the top 5 incoming IP addresses with the highest 
total volume of data transferred during the PM hours. PM hours are decided by 
the column time_in with values like '00:45:00', '11:35:00', '18:25:00'

Any advice is appreciated.
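
Not part of the original post, but one possible sketch of the kind of query being asked for (assuming time_in is a proper timestamp as in the DDL, and treating PM as 12:00 onwards):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df = spark.table("hive.sample_data")             # table name as in the DDL above

top5 = (df.filter(F.hour("time_in") >= 12)       # keep only PM rows
          .groupBy("incoming_ip")
          .agg(F.sum("volume").alias("total_volume"))
          .orderBy(F.col("total_volume").desc())
          .limit(5))

top5.show()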

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Filter out 20% of rows

2023-09-16 Thread ashok34...@yahoo.com.INVALID
from pyspark.sql import functions as F
from pyspark.sql.window import Window

agg_df = df.groupBy("incoming_ips").agg(F.sum("gbps").alias("total_gbps"))

windowRank = Window.orderBy(F.col("total_gbps").desc())
agg_df = agg_df.withColumn("rank", F.percent_rank().over(windowRank))

top_80_ips = agg_df.filter(F.col("rank") <= 0.80)
result = df.join(top_80_ips, on="incoming_ips",
                 how="inner").select("incoming_ips", "gbps", "date_time")
result.show()

print(df.count())
print(result.count())

+---------------+-------------------+-------------------+
|   incoming_ips|               gbps|          date_time|
+---------------+-------------------+-------------------+
|   66.186.8.130|  5.074283124722104|2022-03-12 05:09:16|
|  155.45.76.235| 0.6736194760917324|2021-06-19 03:36:28|
| 237.51.137.200|0.43334812775057685|2022-04-27 08:08:47|
|78.4.48.171| 7.5675453578753435|2022-08-21 18:55:48|
|  241.84.163.17| 3.5681655964070815|2021-01-24 20:39:50|
|130.255.202.138|  6.066112278135983|2023-07-07 22:26:15|
| 198.33.206.140| 1.9147905257021836|2023-03-01 04:44:14|
|  84.183.253.20|  7.707176860385722|2021-08-26 23:24:31|
|218.163.165.232|  9.458673015973213|2021-02-22 12:13:15|
|   62.57.20.153| 1.5764916247359229|2021-11-06 12:41:59|
| 245.24.168.152|0.07452805411698016|2021-06-04 16:14:36|
| 98.171.202.249|  3.546118349483626|2022-07-05 10:55:26|
|   210.5.246.85|0.02430730260109759|2022-04-08 17:26:04|
| 13.236.170.177|   2.41361938344535|2021-08-11 02:19:06|
|180.140.248.193| 0.9512956363005021|2021-06-27 18:16:58|
|  26.140.88.127|   7.51335778127692|2023-06-02 14:13:30|
|  7.118.207.252|  6.450499049816286|2022-12-11 06:36:20|
|11.8.10.136|  8.750329246667354|2023-02-03 05:33:16|
|  232.140.56.86|  4.289740988237201|2023-02-22 20:10:09|
|   68.117.9.255|  5.384340363304169|2022-12-03 09:55:26|
+---------------+-------------------+-------------------+

+---------------+------------------+-------------------+
|   incoming_ips|              gbps|          date_time|
+---------------+------------------+-------------------+
|   66.186.8.130| 5.074283124722104|2022-03-12 05:09:16|
|  241.84.163.17|3.5681655964070815|2021-01-24 20:39:50|
|78.4.48.171|7.5675453578753435|2022-08-21 18:55:48|
|130.255.202.138| 6.066112278135983|2023-07-07 22:26:15|
| 198.33.206.140|1.9147905257021836|2023-03-01 04:44:14|
|  84.183.253.20| 7.707176860385722|2021-08-26 23:24:31|
|218.163.165.232| 9.458673015973213|2021-02-22 12:13:15|
|   62.57.20.153|1.5764916247359229|2021-11-06 12:41:59|
| 98.171.202.249| 3.546118349483626|2022-07-05 10:55:26|
|180.140.248.193|0.9512956363005021|2021-06-27 18:16:58|
| 13.236.170.177|  2.41361938344535|2021-08-11 02:19:06|
|  26.140.88.127|  7.51335778127692|2023-06-02 14:13:30|
|  7.118.207.252| 6.450499049816286|2022-12-11 06:36:20|
|11.8.10.136| 8.750329246667354|2023-02-03 05:33:16|
|  232.140.56.86| 4.289740988237201|2023-02-22 20:10:09|
|   68.117.9.255| 5.384340363304169|2022-12-03 09:55:26|
+---------------+------------------+-------------------+

20
16

On Fri, 15 Sep 2023 at 20:14, ashok34...@yahoo.com.INVALID wrote:

Hi team,
I am using PySpark 3.4
I have a table of a million rows with a few columns, among them the incoming 
IPs, what is known as gbps (Gigabytes per second), and the date and time of the 
incoming IP.
I want to filter out the 20% least active IPs and work on the remainder of the 
data. How can I do this in PySpark?
Thanks
Thanks






-- 
Bjørn Jørgensen 
Vestre Aspehaug 4, 6010 Ålesund 
Norge

+47 480 94 297




  

Re: Seeking Professional Advice on Career and Personal Growth in the Apache Spark Community

2023-09-06 Thread ashok34...@yahoo.com.INVALID
 Hello Mich,
Thanking you for providing these useful feedbacks and responses.
We appreciate your contribution to this community forum. I for myself find your 
posts insightful.
+1 for me
Best,
AK
On Wednesday, 6 September 2023 at 18:34:27 BST, Mich Talebzadeh 
 wrote:  
 
 Hi Varun,
In answer to your questions, these are my views. However, they are just views
and cannot be taken as facts, so to speak.

   - Focus and Time Management: I often struggle with maintaining focus and
effectively managing my time. This leads to productivity issues and affects my
ability to undertake and complete projects efficiently.

     - Set clear goals.
     - Prioritize tasks.
     - Create a to-do list.
     - Avoid multitasking.
     - Eliminate distractions.
     - Take regular breaks.
     - Go to the gym and try to rest your mind and refresh yourself.

   - Graduate Studies Dilemma:

     - Your mileage varies and it all depends on what you are trying to achieve.
Graduate studies will help you to think independently and out of the box. They
will also teach you how to go about solving a problem, so they will give you
that experience.

   - Long-Term Project Building: I am interested in working on long-term
projects, but I am uncertain about the right approach and how to stay committed
throughout the project's lifecycle.

     - I assume you have a degree. That means you had the discipline to wake up
in the morning, go to lectures and not miss them (hopefully you did not!). In
other words, it proves that you have already been through a structured
discipline and you have the will to do it.

   - Overcoming Fear of Failure and Procrastination: I often find myself in a
constant fear of failure, which leads to abandoning pet projects shortly after
starting them or procrastinating over initiating new ones.

     - Failure is natural and does happen. However, the important point is that
you learn from your failures. Just call them experience. You need to overcome
the fear of failure and embrace the challenges.

   - Risk Aversion: With no inherited wealth or financial security, I am often
apprehensive about taking risks, even when they may potentially lead to
significant personal or professional growth.

     - Welcome to the club! In 2020, it was estimated that in the UK the richest
10% of households held 43% of all wealth, while the poorest 50% owned just 9%.
Risk is part of life. When crossing the street, you are taking a calculated view
of the cars coming and going. In short, risk assessment is a fundamental aspect
of life!

HTH
Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom



   view my Linkedin profile




 https://en.everybodywiki.com/Mich_Talebzadeh

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from relying 
on this email's technical content is explicitly disclaimed. The author will in 
no case be liable for any monetary damages arising from such loss, damage or 
destruction. 

 


On Tue, 5 Sept 2023 at 22:17, Varun Shah  wrote:


Dear Apache Spark Community,

I hope this email finds you well. I am writing to seek your valuable insights 
and advice on some challenges I've been facing in my career and personal 
development journey, particularly in the context of Apache Spark and the 
broader big data ecosystem.

A little background about myself: I graduated in 2019 and have since been 
working in the field of AWS cloud and big data tools such as Spark, Airflow, 
AWS services, Databricks, and Snowflake. My interest in the world of big data 
tools dates back to 2016-17, where I initially began exploring concepts like 
big data with spark using scala, and the Scala ecosystem, including 
technologies like Akka. Additionally, I have a keen interest in functional 
programming and data structures and algorithms (DSA) applied to big data 
optimizations.

However, despite my enthusiasm and passion for these areas, I am encountering 
some challenges that are hindering my growth:

   - Focus and Time Management: I often struggle with maintaining focus and
effectively managing my time. This leads to productivity issues and affects my
ability to undertake and complete projects efficiently.

   - Graduate Studies Dilemma: I am unsure about whether to pursue a master's
degree. The fear of GRE and uncertainty about getting into a reputable
university have been holding me back. I'm unsure whether further education
would significantly benefit my career in big data.

   - Long-Term Project Building: I am interested in working on long-term
projects, but I am uncertain about the right approach and how to stay committed
throughout the project's lifecycle.

   - Overcoming Fear of Failure and Procrastination: I often find myself in a
constant fear mode of fail

Re: Shuffle with Window().partitionBy()

2023-05-23 Thread ashok34...@yahoo.com.INVALID
 Thanks great Rauf.
Regards
On Tuesday, 23 May 2023 at 13:18:55 BST, Rauf Khan  
wrote:  
 
 Hi,
partitionBy() is analogous to a group by: all rows that have the same value in 
the specified column will form one window. The data will be shuffled to form 
the groups.
Regards
Raouf
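
A quick way to see the shuffle for yourself (a small sketch, assuming a local session):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.range(1_000_000).withColumn("key", F.col("id") % 10)

w = Window.partitionBy("key").orderBy("id")
ranked = df.withColumn("rn", F.row_number().over(w))

# The physical plan normally shows an Exchange hashpartitioning(key, ...) step,
# i.e. a shuffle that brings all rows for a given key into the same partition.
ranked.explain()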
On Fri, May 12, 2023, 18:48 ashok34...@yahoo.com.INVALID 
 wrote:

Hello,
In Spark windowing, can a call with Window().partitionBy() cause a 
shuffle to take place?
If so, what is the performance impact, if any, when the data result set is large?
Thanks
  

Shuffle with Window().partitionBy()

2023-05-12 Thread ashok34...@yahoo.com.INVALID
Hello,
In Spark windowing, can a call with Window().partitionBy() cause a 
shuffle to take place?
If so, what is the performance impact, if any, when the data result set is large?
Thanks

Portability of dockers built on different cloud platforms

2023-04-05 Thread ashok34...@yahoo.com.INVALID
Hello team
Is it possible to use a Spark Docker image built on GCP on AWS, without 
rebuilding it from scratch on AWS?
Will that work, please?
AK

Re: Online classes for spark topics

2023-03-08 Thread ashok34...@yahoo.com.INVALID
 
Hello Mich.
Greetings. Would you be able to arrange a Spark Structured Streaming learning 
webinar?
This is something I have been struggling with recently. It will be very 
helpful.
Thanks and Regards
AK
On Tuesday, 7 March 2023 at 20:24:36 GMT, Mich Talebzadeh 
 wrote:  
 
 Hi,
This might  be a worthwhile exercise on the assumption that the contributors 
will find the time and bandwidth to chip in so to speak.
I am sure there are many but on top of my head I can think of Holden Karau for 
k8s, and Sean Owen for data science stuff. They are both very experienced.
Anyone else 🤔
HTH







   view my Linkedin profile




 https://en.everybodywiki.com/Mich_Talebzadeh

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from relying 
on this email's technical content is explicitly disclaimed. The author will in 
no case be liable for any monetary damages arising from such loss, damage or 
destruction. 

 


On Tue, 7 Mar 2023 at 19:17, ashok34...@yahoo.com.INVALID 
 wrote:

Hello gurus,
Does Spark arrange online webinars for special topics like Spark on K8s, data 
science and Spark Structured Streaming?
I would be most grateful if experts could share their experience with learners 
of intermediate knowledge like myself. Hopefully we will find the practical 
experiences shared valuable.
Respectively,
AK 
  

Online classes for spark topics

2023-03-07 Thread ashok34...@yahoo.com.INVALID
Hello gurus,
Does Spark arrange online webinars for special topics like Spark on K8s, data 
science and Spark Structured Streaming?
I would be most grateful if experts could share their experience with learners 
of intermediate knowledge like myself. Hopefully we will find the practical 
experiences shared valuable.
Respectively,
AK 

Re: spark+kafka+dynamic resource allocation

2023-01-28 Thread ashok34...@yahoo.com.INVALID
 Hi,
Worth checking this link
https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
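
For reference, the settings described on that page typically look something like this for a job without an external shuffle service (the values are just placeholders; how well this behaves for a continuous streaming job is exactly the open question in the mail below):

spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=10 \
  --conf spark.dynamicAllocation.executorIdleTimeout=60s \
  my_streaming_app.py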

On Saturday, 28 January 2023 at 06:18:28 GMT, Lingzhe Sun 
 wrote:  
 
Hi all,
I'm wondering if dynamic resource allocation works in spark+kafka streaming 
applications. Here're some questions:   
   - Will structured streaming be supported?
   - Is the number of consumers always equal to the number of the partitions of 
subscribed topic (let's say there's only one topic)?
   - If consumers are evenly distributed across executors, will a newly added 
executor (through dynamic resource allocation) trigger a consumer reassignment?
   - Would it be simply a bad idea to use dynamic resource allocation in a 
streaming app, because there is no way to scale down the number of executors 
unless no data is coming in?
Any thoughts are welcomed.
Lingzhe Sun
Hirain Technology

Re: Issue while creating spark app

2022-02-28 Thread ashok34...@yahoo.com.INVALID
 Thanks for all these useful info
Hi all
What is the current trend? Is it Spark on Scala with IntelliJ or Spark on 
Python with PyCharm?
I am curious because I have moderate experience with Spark on both Scala and 
Python and want to focus on either Scala or Python going forward, with the 
intention of joining new companies in Big Data.
Thanking you
On Sunday, 27 February 2022, 21:32:05 GMT, Mich Talebzadeh 
 wrote:  
 
 Might as well update the artefacts to the correct versions hopefully. 
Downloaded scala 2.12.8

 scala -version



Scala code runner version 2.12.8 -- Copyright 2002-2018, LAMP/EPFL and 
Lightbend, Inc.

Edited the pom.xml as below
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>spark</groupId>
    <artifactId>MichTest</artifactId>
    <version>1.0</version>

    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.12.8</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.13</artifactId>
            <version>3.2.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.13</artifactId>
            <version>3.2.1</version>
        </dependency>
    </dependencies>
</project>
and built with maven. All good
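
For building an uber jar with Maven, as discussed later in the thread, a commonly used addition (a sketch, not part of the pom above) is the maven-shade-plugin build section:

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>3.2.4</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

With this in place, mvn package produces a single jar containing the project classes together with its non-provided dependencies.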




   view my Linkedin profile




 https://en.everybodywiki.com/Mich_Talebzadeh

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from relying 
on this email's technical content is explicitly disclaimed. The author will in 
no case be liable for any monetary damages arising from such loss, damage or 
destruction. 

 


On Sun, 27 Feb 2022 at 20:16, Mich Talebzadeh  wrote:

Thanks Bjorn. I am aware of that. I  just really wanted to create the uber jar 
files with both sbt and maven in Intellij.




cheers







   view my Linkedin profile




 https://en.everybodywiki.com/Mich_Talebzadeh

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from relying 
on this email's technical content is explicitly disclaimed. The author will in 
no case be liable for any monetary damages arising from such loss, damage or 
destruction. 

 


On Sun, 27 Feb 2022 at 20:12, Bjørn Jørgensen  wrote:

Mitch: You are using scala 2.11 to do this. Have a look at Building Spark 
"Spark requires Scala 2.12/2.13; support for Scala 2.11 was removed in Spark 
3.0.0."
On Sun, 27 Feb 2022 at 20:55, Mich Talebzadeh wrote:

OK I decided to give a try to maven.
Downloaded maven and unzipped the file WSL-Ubuntu terminal as unzip 
apache-maven-3.8.4-bin.zip
Then added to Windows env variable as MVN_HOME and added the bin directory to 
path in windows. Restart intellij to pick up the correct path.
Again on the command line in intellij do
mvn -v
Apache Maven 3.8.4 (9b656c72d54e5bacbed989b64718c159fe39b537)
Maven home: d:\temp\apache-maven-3.8.4
Java version: 1.8.0_73, vendor: Oracle Corporation, runtime: C:\Program Files\Java\jdk1.8.0_73\jre
Default locale: en_GB, platform encoding: Cp1252
OS name: "windows 10", version: "10.0", arch: "amd64", family: "windows"
in Intellij add maven support to your project. Follow this link Add Maven 
support to an existing project
There will be a pom.xml file under project directory 


Edit that pom.xml file and add the following
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>spark</groupId>
    <artifactId>MichTest</artifactId>
    <version>1.0</version>

    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.11.7</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>2.0.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.10</artifactId>
            <version>2.0.0</version>
        </dependency>
    </dependencies>
</project>

In IntelliJ, open a Terminal under the project sub-directory where the pom.xml 
file you edited is located.




 mvn clean

[INFO] Scanning for projects...

[INFO] 

[INFO] ---< spark:MichTest >---

[INFO] Building MichTest 1.0

[INFO] [ jar ]-

[INFO] 

[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ MichTest ---

[INFO] Deleting D:\temp\intellij\MichTest\target

[INFO] 

[INFO] BUILD SUCCESS

[INFO] 

[INFO] Total time:  4.451 s

[INFO] Finished at: 2022-02-27T19:37:57Z



[INFO] 


mvn compile

[INFO] Scanning for projects...

[INFO] 

[INFO] ---< spark:MichTest >---

[INFO] Building MichTest 1.0

[INFO] [ jar ]-

[INFO] 

[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ MichTest

Re: Deploying Spark on Google Kubernetes (GKE) autopilot, preliminary findings

2022-02-14 Thread ashok34...@yahoo.com.INVALID
 Thanks Mich. Very insightful.

AK
On Monday, 14 February 2022, 11:18:19 GMT, Mich Talebzadeh 
 wrote:  
 
 Good question. However, we ought to look at what options we have, so to speak. 
Let us consider Spark on Dataproc, Spark on Kubernetes and Spark on Dataflow.


Spark on Dataproc is proven and it is in use at many organizations; I have 
deployed it extensively. It is infrastructure as a service provided including 
Spark, Hadoop and other artefacts. You have to manage cluster creation, automate 
cluster creation and tear down, submitting jobs etc. However, it is another 
stack that needs to be managed. It now has an autoscaling policy (enables 
cluster worker VM autoscaling) as well.


Spark on GKE is something newer. Worth adding that the Spark DEV team are 
working hard to improve the performance of Spark on Kubernetes, for example, 
through Support for Customized Kubernetes Scheduler. As I explained in the first 
thread, Spark on Kubernetes relies on containerisation. Containers make 
applications more portable. Moreover, they simplify the packaging of 
dependencies, especially with PySpark, and enable repeatable and reliable build 
workflows which is cost effective. They also reduce the overall devops load and 
allow one to iterate on the code faster. From a purely cost perspective it would 
be cheaper with Docker as you can share resources with your other services. You 
can create Spark dockers with different versions of Spark, Scala, Java, OS etc. 
That docker file is portable. It can be used on-prem, AWS, GCP etc in container 
registries, and devops and data science people can share it as well. Built once, 
used by many. Kubernetes with autopilot helps scale the nodes of the Kubernetes 
cluster depending on the load. That is what I am currently looking into.

With regard to Dataflow, which I believe is similar to AWS Glue, it is a managed 
service for executing data processing patterns. Patterns or pipelines are built 
with the Apache Beam SDK, which is an open source programming model that 
supports Java, Python and Go. It enables batch and streaming pipelines. You 
create your pipelines with an Apache Beam program and then run them on the 
Dataflow service. The Apache Spark Runner can be used to execute Beam pipelines 
using Spark. When you run a job on Dataflow, it spins up a cluster of virtual 
machines, distributes the tasks in the job to the VMs, and dynamically scales 
the cluster based on how the job is performing. As I understand it, both 
iterative processing and notebooks, plus machine learning with Spark ML, are not 
currently supported by Dataflow.

So we have three choices here. If you are migrating from an on-prem 
Hadoop/Spark/YARN set-up, you may go for Dataproc which will provide the same 
look and feel. If you want to use microservices and containers in your event 
driven architecture, you can adopt docker images that run on Kubernetes 
clusters, including a Multi-Cloud Kubernetes Cluster. Dataflow is probably best 
suited for green-field projects: less operational overhead, unified approach 
for batch and streaming pipelines.

So as ever your mileage varies. If you want to migrate from your existing 
Hadoop/Spark cluster to GCP, or take advantage of your existing workforce, 
choose Dataproc or GKE. In many cases, a big consideration is that one already 
has a codebase written against a particular framework, and one just wants to 
deploy it on GCP, so even if, say, the Beam programming model/Dataflow is 
superior to Hadoop, someone with a lot of Hadoop code might still choose 
Dataproc or GKE for the time being, rather than rewriting their code on Beam to 
run on Dataflow.

HTH




   view my Linkedin profile




 https://en.everybodywiki.com/Mich_Talebzadeh

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from relying 
on this email's technical content is explicitly disclaimed. The author will in 
no case be liable for any monetary damages arising from such loss, damage or 
destruction. 

 


On Mon, 14 Feb 2022 at 05:46, Gourav Sengupta  wrote:

Hi,

Maybe this is useful in case someone is testing SPARK in containers for 
developing SPARK.

From a production scale work point of view: if I am in AWS, I will just use 
GLUE if I want to use containers for SPARK, without massively increasing my 
costs for operations unnecessarily.

Also, in case I am not wrong, GCP already has SPARK running in serverless mode. 
Personally I would never create the overhead of additional costs and issues to 
my clients of deploying SPARK when those solutions are already available from 
Cloud vendors. In fact, that is one of the precise reasons why people use cloud 
- to reduce operational costs.

Sorry, just trying to understand what is the scope of this work.

Regards,
Gourav Sengupta
On Fri, Feb 11, 2022 at 8:35 PM Mich Talebzadeh  
wrote:

The equivalent of Google GKE autopilot in AWS is AWS Fargate




I have not used the AWS Fargate so I can only mens

What are the most common operators for shuffle in Spark

2022-01-23 Thread ashok34...@yahoo.com.INVALID
Hello,
I know some operators in Spark are expensive because of shuffle.
This document describes shuffle
https://www.educba.com/spark-shuffle/

and says: "More shufflings in numbers are not always bad. Memory constraints and 
other impossibilities can be overcome by shuffling."

In RDD, the below are a few operations and examples of shuffle:
– subtractByKey
– groupBy
– foldByKey
– reduceByKey
– aggregateByKey
– transformations of a join of any type
– distinct
– cogroup
I know some operations like reduceByKey are well known for creating a shuffle, 
but what I don't understand is why the distinct operation should cause a shuffle.
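
A quick way to see why (a sketch, assuming a local PySpark session): distinct() is implemented as an aggregation over all columns, so equal rows must be brought to the same partition, which requires a shuffle.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.range(1_000_000).withColumn("key", F.col("id") % 100).drop("id")

# The physical plan typically shows HashAggregate -> Exchange hashpartitioning
# -> HashAggregate, i.e. a partial aggregate, a shuffle, then a final aggregate.
df.distinct().explain()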

Thanking






Spark with parallel processing and event driven architecture

2022-01-14 Thread ashok34...@yahoo.com.INVALID
Hi gurus,
I am trying to understand the role of Spark in an event driven architecture. I 
know Spark deals with massively parallel processing. However, does Spark follow 
an event driven architecture like Kafka as well? Say, handling producers, 
filtering and pushing the events to consumers such as a database, etc.
thanking you

Re: How to change a DataFrame column from nullable to not nullable in PySpark

2021-10-15 Thread ashok34...@yahoo.com.INVALID
 Many thanks all, especially to Mich. That is what I was looking for.
On Friday, 15 October 2021, 09:28:24 BST, Mich Talebzadeh 
 wrote:  
 
 Spark allows one to define the column format as StructType or list. By default 
Spark assumes that all fields are nullable when creating a dataframe.
To change nullability you need to provide the structure of the columns.
Assume that I have created an RDD in the form
rdd = sc.parallelize(Range). \
    map(lambda x: (x,
                   usedFunctions.clustered(x, numRows),
                   usedFunctions.scattered(x, numRows),
                   usedFunctions.randomised(x, numRows),
                   usedFunctions.randomString(50),
                   usedFunctions.padString(x, " ", 50),
                   usedFunctions.padSingleChar("x", 4000)))
For the above I create a schema with StructType as below:
Schema = StructType([StructField("ID", IntegerType(), False),
                     StructField("CLUSTERED", FloatType(), True),
                     StructField("SCATTERED", FloatType(), True),
                     StructField("RANDOMISED", FloatType(), True),
                     StructField("RANDOM_STRING", StringType(), True),
                     StructField("SMALL_VC", StringType(), True),
                     StructField("PADDING", StringType(), True)])
Note that the first column ID is defined as  NOT NULL
Then I can create a dataframe df as below
df= spark.createDataFrame(rdd, schema = Schema)
df.printSchema()
root
 |-- ID: integer (nullable = false)
 |-- CLUSTERED: float (nullable = true)
 |-- SCATTERED: float (nullable = true)
 |-- RANDOMISED: float (nullable = true)
 |-- RANDOM_STRING: string (nullable = true)
 |-- SMALL_VC: string (nullable = true)
 |-- PADDING: string (nullable = true)
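
An alternative sketch for an existing DataFrame (assuming a column called COL-1 as in the question): copy the schema, flip the nullable flag, and rebuild the DataFrame from the underlying RDD.

from pyspark.sql.types import StructType, StructField

# flip nullability only for COL-1, keep everything else as-is
new_schema = StructType([
    StructField(f.name, f.dataType, False if f.name == "COL-1" else f.nullable)
    for f in df.schema.fields
])
df_not_null = spark.createDataFrame(df.rdd, schema=new_schema)
df_not_null.printSchema()   # COL-1 now shows nullable = false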


HTH




   view my Linkedin profile

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from relying 
on this email's technical content is explicitly disclaimed. The author will in 
no case be liable for any monetary damages arising from such loss, damage or 
destruction. 

 


On Thu, 14 Oct 2021 at 12:50, ashok34...@yahoo.com.INVALID 
 wrote:

Gurus,
I have an RDD in PySpark that I can convert to DF through
df = rdd.toDF()

However, when I do
df.printSchema()

I see the columns as nullable = true by default:

root
 |-- COL-1: long (nullable = true)
 |-- COl-2: double (nullable = true)
 |-- COl-3: string (nullable = true)

What would be the easiest way to make COL-1 NOT NULLABLE?
Thanking you
  

How to change a DataFrame column from nullable to not nullable in PySpark

2021-10-14 Thread ashok34...@yahoo.com.INVALID
Gurus,
I have an RDD in PySpark that I can convert to DF through
df = rdd.toDF()

However, when I do
df.printSchema()

I see the columns as nullable = true by default:

root
 |-- COL-1: long (nullable = true)
 |-- COl-2: double (nullable = true)
 |-- COl-3: string (nullable = true)

What would be the easiest way to make COL-1 NOT NULLABLE?
Thanking you

Well balanced Python code with Pandas compared to PySpark

2021-07-29 Thread ashok34...@yahoo.com.INVALID
Hello team
Someone asked me about well developed Python code with Pandas dataframes, 
comparing that to PySpark.
Under what situations would one choose PySpark instead of Python with Pandas?
Appreciate

AK
 

Re: Recovery when two spark nodes out of 6 fail

2021-06-25 Thread ashok34...@yahoo.com.INVALID
 Thank you for detailed explanation.
Please on below:
  If one executor fails, it moves the processing over to other executor. 
However, if the data is lost, it re-executes the processing that generated the 
data, and might have to go back to the source.

Does this mean that only those tasks that the dead executor was executing at 
the time need to be rerun to regenerate the processing stages? I read somewhere 
that RDD lineage keeps a record of what needs to be re-executed.
best
On Friday, 25 June 2021, 16:23:32 BST, Lalwani, Jayesh 
 wrote:  
 
Spark replicates the partitions among multiple nodes. If one executor fails, it 
moves the processing over to other executor. However, if the data is lost, it 
re-executes the processing that generated the data, and might have to go back 
to the source.
 
  
 
In case of failure, there will be delay in getting results. The amount of delay 
depends on how much reprocessing Spark needs to do.
 
  
 
When the driver executes an action, it submits a job to the Cluster Manager. 
The Cluster Manager starts submitting tasks to executors and monitoring them. 
In case, executors dies, the Cluster Manager does the work of reassigning the 
tasks. While all of this is going on, the driver is just sitting there waiting 
for the action to complete. SO, driver does nothing, really. The Cluster 
Manager is doing most of the work of managing the workload
 
  
 
Spark, by itself doesn’t add executors when executors fail. It just moves the 
tasks to other executors. If you are installing plain vanilla Spark on your own 
cluster, you need to figure out how to bring back executors. Most of the 
popular platforms built on top of Spark (Glue, EMR, Kubernetes) will replace 
failed nodes. You need to look into the capabilities of your chosen platform.
 
  
 
If the driver dies, the Spark job dies. There’s no recovering from that. The 
only way to recover is to run the job again. Batch jobs do not have 
checkpointing, so they will need to reprocess everything from the beginning. 
You need to write your jobs to be idempotent, i.e. rerunning them shouldn't 
change the outcome. Streaming jobs have checkpointing, and they will start from 
the last microbatch. This means that they might have to repeat the last 
microbatch.
 
  
 
From: "ashok34...@yahoo.com.INVALID" 
Date: Friday, June 25, 2021 at 10:38 AM
To: "user@spark.apache.org" 

Subject: [EXTERNAL] Recovery when two spark nodes out of 6 fail
  
 


  
 
Greetings,
 
  
 
This is a scenario that we need to come up with a comprehensive answers to 
fulfil please.
 
  
 
If we have 6 spark VMs each running two executors via spark-submit.
 
  

   - we have two VM failures at H/W level (rack failure)
   - we lose 4 executors of Spark out of 12
   - this happens halfway through the spark-submit job
 
So my humble questions are:
 
  

   - Will there be any data lost from the final result due to missing nodes?
   - How will RDD lineage handle this?
   - Will there be any delay in getting the final result?
   - How will the driver handle these two node failures?
   - Will additional executors be added to the existing nodes, or will the 
existing executors handle the work of the 4 failed executors?
   - What if it is running in client mode and the node holding the driver dies?
   - What if this happens while running in cluster mode?
  
 
Did search in Google no satisfactory answers gurus, hence turning to forum.
 
  
 
Best
 
  
 
A.K.
   

Recovery when two spark nodes out of 6 fail

2021-06-25 Thread ashok34...@yahoo.com.INVALID
Greetings,
This is a scenario that we need to come up with a comprehensive answers to 
fulfil please.
If we have 6 spark VMs each running two executors via spark-submit.
   
   - we have two VM failures at H/W level (rack failure)
   - we lose 4 executors of Spark out of 12
   - this happens halfway through the spark-submit job

So my humble questions are:
   
   - Will there be any data lost from the final result due to missing nodes?
   - How will RDD lineage handle this?
   - Will there be any delay in getting the final result?
   - How will the driver handle these two node failures?
   - Will additional executors be added to the existing nodes, or will the 
existing executors handle the work of the 4 failed executors?
   - What if it is running in client mode and the node holding the driver dies?
   - What if this happens while running in cluster mode?

Did search in Google no satisfactory answers gurus, hence turning to forum.
Best
A.K.

Re: Spark Streaming non functional requirements

2021-04-27 Thread ashok34...@yahoo.com.INVALID
 Hello Mich
Thank you for your great explanation.
Best
A.
On Tuesday, 27 April 2021, 11:25:19 BST, Mich Talebzadeh 
 wrote:  
 
 
Hi,
Any design (in whatever framework) needs to consider both Functional and 
non-functional requirements. Functional requirements are those which are 
related to the technical functionality of the system that we cover daily in 
this forum. The non-functional requirement is a requirement that specifies 
criteria that can be used to judge the operation of a system conditions, rather 
than specific behaviours.  From my experience the non-functional requirements 
are equally important and in some cases they are underestimated when systems go 
to production. Probably, most importantly they need to cover the following:
   
   - Processing time meeting a service-level agreement (SLA).
  You can get some of this from the Spark GUI. Are you comfortably satisfying the 
requirements? How about total delay, back pressure etc.? Are you within your 
SLA? In streaming applications, there is no such thing as an answer which is 
supposed to be late and correct. The timeliness is part of the application. If 
we get the right answer too slowly it becomes useless or wrong. We also need to 
be aware that latency trades off with throughput.

   - Capacity, current and forecast.
  What is the current capacity? Have you accounted for extra demands, sudden 
surges and loads such as year-end? Can your pipeline handle 1.6-2 times the 
current load?

   - Scalability
  How does your application scale if you have to handle multiple topics or new 
topics added at later stages? Scalability also includes additional nodes, 
on-prem, or having the ability to add more resources such as Google Dataproc 
compute engines etc.

   - Supportability and Maintainability
  Have you updated docs and procedures in Confluence or equivalent, or are they 
typically a year old :)? Is there any single point of failure due to skill 
set? Can ops support the design and maintain BAU themselves? How about training 
and hand-over?

   - Disaster recovery and Fault tolerance
  What provisions are made for disaster recovery? Is there any single point of 
failure in the design (end-to-end pipeline)? Are you using standalone dockers 
or Google Kubernetes Engine (GKE or equivalent)?
HTH

   view my Linkedin profile

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from relying 
on this email's technical content is explicitly disclaimed. The author will in 
no case be liable for any monetary damages arising from such loss, damage or 
destruction. 

 


On Mon, 26 Apr 2021 at 17:16, ashok34...@yahoo.com.INVALID 
 wrote:

Hello,
When we design a typical spark streaming process, the focus is to get 
functional requirements.
However, I have been asked to provide non-functional requirements as well. 
Likely things I can consider are Fault tolerance and Reliability (component 
failures). Is there a standard list of non-functional requirements for Spark 
streaming that one needs to consider and verify?
Thanking you
  

Spark Streaming non functional requirements

2021-04-26 Thread ashok34...@yahoo.com.INVALID
Hello,
When we design a typical spark streaming process, the focus is to get 
functional requirements.
However, I have been asked to provide non-functional requirements as well. 
Likely things I can consider are Fault tolerance and Reliability (component 
failures). Is there a standard list of non-functional requirements for Spark 
streaming that one needs to consider and verify?
Thanking you

Python level of knowledge for Spark and PySpark

2021-04-14 Thread ashok34...@yahoo.com.INVALID
Hi gurus,
I have knowledge of Java, Scala and good enough knowledge of Spark, Spark SQL 
and Spark Functional programing with Scala.
I have started using Python with Spark PySpark.
I am wondering, in order to be proficient in PySpark, how much knowledge of 
Python programming is needed? I know the answer may be "very good knowledge", 
but in practice how much is good enough? I can write Python in an IDE like 
PyCharm, similar to the way Scala works, and can run the programs. Is expert 
knowledge of Python a prerequisite for PySpark? I also know Pandas and am 
familiar with plotting routines like matplotlib. 
Warmest
Ashok

repartition in Spark

2020-11-09 Thread ashok34...@yahoo.com.INVALID
Hi,
Just need some advise.
   
   - When we have multiple Spark nodes running code, under what conditions does a 
repartition make sense?
   - Can we repartition and cache the result --> df = spark.sql("select from 
...").repartition(4).cache()
   - If we choose repartition(4), will that repartition apply to all nodes 
running the code, and how can one see that?
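
On the last point, a small sketch of one way to see the effect (the table name is a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.sql("select * from some_table").repartition(4).cache()

# the repartition applies to the DataFrame (and hence to whichever executors
# run its tasks); the resulting number of partitions is visible here:
print(df.rdd.getNumPartitions())    # -> 4

# the physical plan shows the repartition as an Exchange step
df.explain()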

Thanks