Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread ashok34...@yahoo.com.INVALID
Good idea. Will be useful. +1 On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh wrote: Some of you may be aware that the Databricks community has just launched a knowledge sharing hub. I thought it would be a good idea for the Apache Spark user group to have the

Re: Spark Structured Streaming and Flask REST API for Real-Time Data Ingestion and Analytics.

2024-01-09 Thread ashok34...@yahoo.com.INVALID
Hey Mich, Thanks for this introduction to your forthcoming proposal "Spark Structured Streaming and Flask REST API for Real-Time Data Ingestion and Analytics". I recently came across an article by Databricks titled "Scalable Spark Structured Streaming for REST API Destinations". Their use

Re: Clarification with Spark Structured Streaming

2023-10-09 Thread ashok34...@yahoo.com.INVALID
il's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. On Sun, 8 Oct 2023 at 19:50, ashok34...@yahoo.com.INVALID wrote: Hello team 1) In Spark Structured Streaming does commit mean streaming data

Clarification with Spark Structured Streaming

2023-10-08 Thread ashok34...@yahoo.com.INVALID
Hello team 1) In Spark Structured Streaming, does commit mean that the streaming data has been delivered to a sink like Snowflake? 2) If sinks like Snowflake cannot absorb or digest streaming data in a timely manner, will there be an impact on Spark streaming itself? Thanks AK

Need to split incoming data into PM on time column and find the top 5 by volume of data

2023-09-21 Thread ashok34...@yahoo.com.INVALID
Hello gurus, I have a Hive table created as below (there are more columns): CREATE TABLE hive.sample_data ( incoming_ip STRING, time_in TIMESTAMP, volume INT ); Data is stored in that table. In PySpark, I want to select the top 5 incoming IP addresses with the highest total volume of data
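The selection described in this question (keep the PM rows, total the volume per IP, take the five largest) can be sketched in plain Python on in-memory tuples. This is only an illustration of the logic, not the poster's code and not a PySpark solution: the sample rows are invented, "PM" is assumed to mean an hour of 12 or later in time_in, and `top_pm_ips` is a hypothetical helper name.

```python
from collections import Counter
from datetime import datetime


def top_pm_ips(rows, n=5):
    """Return the n (incoming_ip, total_volume) pairs with the largest PM volume.

    rows is an iterable of (incoming_ip, time_in, volume) tuples, mirroring
    the three columns shown in the table definition.
    """
    totals = Counter()
    for incoming_ip, time_in, volume in rows:
        # Assumption: "PM" means the hour of time_in is 12 or later.
        if time_in.hour >= 12:
            totals[incoming_ip] += volume
    # most_common sorts by total volume, largest first.
    return totals.most_common(n)


# Invented sample rows, for illustration only.
sample_rows = [
    ("10.0.0.1", datetime(2023, 9, 21, 9, 15), 500),  # AM row, ignored
    ("10.0.0.1", datetime(2023, 9, 21, 13, 5), 120),
    ("10.0.0.2", datetime(2023, 9, 21, 18, 30), 300),
    ("10.0.0.1", datetime(2023, 9, 21, 23, 59), 80),
    ("10.0.0.3", datetime(2023, 9, 21, 12, 0), 50),
]

print(top_pm_ips(sample_rows))
```

In PySpark the same steps would typically map to a filter on `hour(time_in)`, a `groupBy("incoming_ip")` with a sum of `volume`, a descending `orderBy`, and `limit(5)`.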

Re: Filter out 20% of rows

2023-09-16 Thread ashok34...@yahoo.com.INVALID
:44:14| | 84.183.253.20| 7.707176860385722|2021-08-26 23:24:31| |218.163.165.232| 9.458673015973213|2021-02-22 12:13:15| | 62.57.20.153|1.5764916247359229|2021-11-06 12:41:59| | 98.171.202.249| 3.546118349483626|2022-07-05 10:55:26| |180.140.248.193|0.9512956363005021|2021-06-27 18:16:58| | 13

Re: Seeking Professional Advice on Career and Personal Growth in the Apache Spark Community

2023-09-06 Thread ashok34...@yahoo.com.INVALID
Hello Mich, Thank you for providing this useful feedback and these responses. We appreciate your contribution to this community forum. I myself find your posts insightful. +1 from me. Best, AK On Wednesday, 6 September 2023 at 18:34:27 BST, Mich Talebzadeh wrote: Hi Varun, In answer

Re: Shuffle with Window().partitionBy()

2023-05-23 Thread ashok34...@yahoo.com.INVALID
, 2023, 18:48 ashok34...@yahoo.com.INVALID wrote: Hello, In Spark windowing, can a call to Window().partitionBy() cause a shuffle to take place? If so, what is the performance impact, if any, when the result set is large? Thanks

Shuffle with Window().partitionBy()

2023-05-12 Thread ashok34...@yahoo.com.INVALID
Hello, In Spark windowing, can a call to Window().partitionBy() cause a shuffle to take place? If so, what is the performance impact, if any, when the result set is large? Thanks

Portability of dockers built on different cloud platforms

2023-04-05 Thread ashok34...@yahoo.com.INVALID
Hello team, Is it possible to use a Spark Docker image built on GCP on AWS without rebuilding it from scratch on AWS? Will that work, please? AK

Re: Online classes for spark topics

2023-03-08 Thread ashok34...@yahoo.com.INVALID
disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. On Tue, 7 Mar 2023 at 19:17, ashok34...@yahoo.com.INVALID wrote: Hello gurus, Does Spark arrange online webinars for special topics like Spark on K8s, data science and Spark

Online classes for spark topics

2023-03-07 Thread ashok34...@yahoo.com.INVALID
Hello gurus, Does Spark arrange online webinars for special topics like Spark on K8s, data science and Spark Structured Streaming? I would be most grateful if experts could share their experience with learners with intermediate knowledge like myself. Hopefully we will find the practical

Re: spark+kafka+dynamic resource allocation

2023-01-28 Thread ashok34...@yahoo.com.INVALID
Hi, Worth checking this link https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation On Saturday, 28 January 2023 at 06:18:28 GMT, Lingzhe Sun wrote:

Re: Issue while creating spark app

2022-02-28 Thread ashok34...@yahoo.com.INVALID
Thanks for all this useful info. Hi all, What is the current trend? Is it Spark on Scala with IntelliJ or Spark on Python with PyCharm? I am curious because I have moderate experience with Spark on both Scala and Python and want to focus on Scala OR Python going forward with the intention of

Re: Deploying Spark on Google Kubernetes (GKE) autopilot, preliminary findings

2022-02-14 Thread ashok34...@yahoo.com.INVALID
Thanks Mich. Very insightful. AK On Monday, 14 February 2022, 11:18:19 GMT, Mich Talebzadeh wrote: Good question. However, we ought to look at what options we have so to speak. Let us consider Spark on Dataproc, Spark on Kubernetes and Spark on Dataflow Spark on DataProc is proven

What are the most common operators for shuffle in Spark

2022-01-23 Thread ashok34...@yahoo.com.INVALID
Hello, I know some operators in Spark are expensive because of shuffle. This document describes shuffle https://www.educba.com/spark-shuffle/ and says: More shufflings in numbers are not always bad. Memory constraints and other impossibilities can be overcome by shuffling. In RDD, the below are a

Spark with parallel processing and event driven architecture

2022-01-14 Thread ashok34...@yahoo.com.INVALID
Hi gurus, I am trying to understand the role of Spark in an event driven architecture. I know Spark deals with massive parallel processing. However, does Spark follow event driven architecture like Kafka as well? Say handling producers, filtering and pushing the events to consumers like

Re: How to change a DataFrame column from nullable to not nullable in PySpark

2021-10-15 Thread ashok34...@yahoo.com.INVALID
arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. On Thu, 14 Oct 2021 at 12:50, ashok34...@yahoo.com.INVALID wrote: Gurus, I have an RDD in PySpark th

How to change a DataFrame column from nullable to not nullable in PySpark

2021-10-14 Thread ashok34...@yahoo.com.INVALID
Gurus, I have an RDD in PySpark that I can convert to a DF through df = rdd.toDF(). However, when I do df.printSchema() I see the columns as nullable = true by default: root |-- COL-1: long (nullable = true) |-- COl-2: double (nullable = true) |-- COl-3: string (nullable = true) What would be the

Well balanced Python code with Pandas compared to PySpark

2021-07-29 Thread ashok34...@yahoo.com.INVALID
Hello team, Someone asked me about well-developed Python code using Pandas DataFrames and how it compares to PySpark. Under what situations would one choose PySpark instead of Python and Pandas? Appreciated, AK

Re: Recovery when two spark nodes out of 6 fail

2021-06-25 Thread ashok34...@yahoo.com.INVALID
to be idempotent, i.e., rerunning them shouldn’t change the outcome. Streaming jobs have checkpointing, and they will start from the last microbatch. This means that they might have to repeat the last microbatch. From: "ashok34...@yahoo.com.INVALID" Date: Friday, June 25, 2021 at 10:38 AM
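The point in this reply about idempotence and a repeated last microbatch can be illustrated with a small plain-Python sketch. It is not Spark code: `IdempotentSink` and its `write_batch` method are hypothetical names, and the sink is just a dictionary keyed by batch id, so replaying a batch after a restart overwrites the earlier attempt instead of duplicating rows.

```python
class IdempotentSink:
    """Toy sink that stores each microbatch under its batch id."""

    def __init__(self):
        self._batches = {}

    def write_batch(self, batch_id, rows):
        # Writing the same batch id again replaces the earlier attempt,
        # so a replay after a failure does not duplicate rows.
        self._batches[batch_id] = list(rows)

    def all_rows(self):
        # Return the stored rows in batch-id order.
        return [row
                for batch_id in sorted(self._batches)
                for row in self._batches[batch_id]]


sink = IdempotentSink()
sink.write_batch(0, ["a", "b"])
sink.write_batch(1, ["c"])
# Simulate a restart that repeats the last microbatch.
sink.write_batch(1, ["c"])
print(sink.all_rows())
```

In Structured Streaming, the function passed to `foreachBatch` receives a batch id alongside the batch DataFrame, which could play a similar role when writing to an external sink.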

Recovery when two spark nodes out of 6 fail

2021-06-25 Thread ashok34...@yahoo.com.INVALID
Greetings, This is a scenario for which we need to come up with comprehensive answers, please. Suppose we have 6 Spark VMs, each running two executors via spark-submit. - We have two VM failures at H/W level (rack failure) - We lose 4 Spark executors out of 12 - Happening half

Re: Spark Streaming non functional requirements

2021-04-27 Thread ashok34...@yahoo.com.INVALID
, ashok34...@yahoo.com.INVALID wrote: Hello, When we design a typical spark streaming process, the focus is to get functional requirements. However, I have been asked to provide non-functional requirements as well. Likely things I can consider are Fault tolerance and Reliability (component

Spark Streaming non functional requirements

2021-04-26 Thread ashok34...@yahoo.com.INVALID
Hello, When we design a typical Spark streaming process, the focus is on getting the functional requirements. However, I have been asked to provide non-functional requirements as well. Likely things I can consider are fault tolerance and reliability (component failures). Is there a standard list of

Python level of knowledge for Spark and PySpark

2021-04-14 Thread ashok34...@yahoo.com.INVALID
Hi gurus, I have knowledge of Java and Scala, and good enough knowledge of Spark, Spark SQL and Spark functional programming with Scala. I have started using Python with Spark (PySpark). I am wondering how much knowledge of Python programming is needed in order to be proficient in PySpark. I know the

repartition in Spark

2020-11-09 Thread ashok34...@yahoo.com.INVALID
Hi, Just need some advice. - When we have multiple Spark nodes running code, under what conditions does a repartition make sense? - Can we repartition and cache the result --> df = spark.sql("select from ...").repartition(4).cache - If we choose a repartition(4), will that repartition