Re: StructuredStreaming error - pyspark.sql.utils.StreamingQueryException: batch 44 doesn't exist

2022-02-27 Thread karan alang
Hi Gourav, Please see my responses below: Can you please let us know: 1. the Spark version, and the kind of streaming query that you are running? KA: Apache Spark 3.1.2 on Dataproc using Ubuntu 18.04 (the highest Spark version supported on Dataproc is 3.1.2), 2. whether you are using at

Re: StructuredStreaming error - pyspark.sql.utils.StreamingQueryException: batch 44 doesn't exist

2022-02-27 Thread karan alang
Hi Mich, thanks. I'll check the thread you forwarded and get back to you. Regards, Karan Alang On Sat, Feb 26, 2022 at 2:44 AM Mich Talebzadeh wrote: > Check the thread I forwarded on how to gracefully shut down Spark > structured streaming > > HTH > > On Fri, 25 Feb 2022 at 22:31, karan alang

Re: StructuredStreaming error - pyspark.sql.utils.StreamingQueryException: batch 44 doesn't exist

2022-02-27 Thread karan alang
Hi Gabor, I just responded to your comment on Stack Overflow. Regards, Karan Alang On Sat, Feb 26, 2022 at 3:06 PM Gabor Somogyi wrote: > Hi Karan, > > Please have a look at the Stack Overflow comment I posted 2 days ago > > G > > On Fri, 25 Feb 2022, 23:31 karan alang, wrote: > >> Hello All, >>

Re: Difference between windowing functions and aggregation functions on big data

2022-02-27 Thread Raghavendra Ganesh
What is optimal depends on the context of the problem. Is the intent here to find the best solution for top n values with a group by? Both solutions look sub-optimal to me. A window function would be expensive as it needs an order by (which a top n solution shouldn't need). It would be best to
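The remark that a top n result should not need a full ordering can be illustrated with a small sketch in plain Python (not Spark). The message is cut off before its own recommendation, so this is only one possible approach and an assumption on our part: keep a bounded min-heap of size n per group, so each group is scanned once and never fully sorted. The function name `top_n_per_group` and the sample rows are hypothetical.

```python
import heapq
from collections import defaultdict


def top_n_per_group(rows, n):
    """Return the n largest values per key without sorting each whole group.

    rows is an iterable of (key, value) pairs. Hypothetical helper for
    illustration only; it is not a Spark API.
    """
    heaps = defaultdict(list)  # key -> min-heap holding at most n values
    for key, value in rows:
        heap = heaps[key]
        if len(heap) < n:
            heapq.heappush(heap, value)
        elif value > heap[0]:
            # New value beats the smallest kept value: swap it in.
            heapq.heapreplace(heap, value)
    # Only the n survivors per group are sorted for presentation.
    return {key: sorted(heap, reverse=True) for key, heap in heaps.items()}


# Made-up sample data: (department, salary)
rows = [
    ("IT", 85000), ("IT", 90000), ("IT", 69000), ("IT", 70000),
    ("Sales", 80000), ("Sales", 60000),
]
result = top_n_per_group(rows, 3)
```

Each group costs O(m log n) for m rows instead of O(m log m) for a full sort, which is the kind of saving the message alludes to when n is small.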

Re: Difference between windowing functions and aggregation functions on big data

2022-02-27 Thread Mich Talebzadeh
Am I correct that with .. WHERE (SELECT COUNT(DISTINCT(Salary)).. you will have to shuffle because of DISTINCT, as each worker will have to read data separately and perform the reduce task to get the local distinct value, and one final shuffle to get the actual distinct for all the data?
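The two-step pattern described in the question (a local distinct on each worker, then a final merge across workers) can be modelled with a toy sketch in plain Python. This is a conceptual illustration only, assuming the data is split into in-memory lists standing in for partitions; it does not claim to reproduce Spark's actual physical plan, and the names `local_distinct` and `count_distinct` are hypothetical.

```python
from functools import reduce


def local_distinct(partition):
    """Step 1: each worker reduces its own partition to a set of distinct values."""
    return set(partition)


def count_distinct(partitions):
    """Step 2: merge the per-partition sets (the stand-in for the final shuffle)."""
    partials = [local_distinct(p) for p in partitions]
    merged = reduce(set.union, partials, set())
    return len(merged)


# Three made-up partitions of salary values with overlap between them.
partitions = [[100, 200, 200], [200, 300], [100, 400, 400]]
total = count_distinct(partitions)
```

The point of the sketch is that only the already-deduplicated partial sets need to cross partition boundaries, not every input row.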

Re: Issue while creating spark app

2022-02-27 Thread Mich Talebzadeh
Might as well update the artefacts to the correct versions, hopefully. Downloaded Scala 2.12.8. scala -version Scala code runner version 2.12.8 -- Copyright 2002-2018, LAMP/EPFL and Lightbend, Inc. Edited the pom.xml as below http://maven.apache.org/POM/4.0.0;

Re: Issue while creating spark app

2022-02-27 Thread Bjørn Jørgensen
Rajat Kumer wrote: "Cannot find project Scala library 2.12.12 for module SparkSimpleApp". So when I google this error message I find: scala project maven sync failed. On Sun, 27 Feb 2022 at 21:59, Mich Talebzadeh < mich.talebza...@gmail.com> wrote:

Re: Issue while creating spark app

2022-02-27 Thread Mich Talebzadeh
Sorry, which error?

Re: Issue while creating spark app

2022-02-27 Thread Bjørn Jørgensen
Anyway, I did google your error and found this: scala project maven sync failed. Is this the same as the one you are getting? On Sun, 27 Feb 2022 at 21:16, Mich Talebzadeh < mich.talebza...@gmail.com> wrote: > Thanks Bjorn. I am aware of that. I

Re: Difference between windowing functions and aggregation functions on big data

2022-02-27 Thread Sean Owen
"count distinct" does not have that problem, whether in a group-by or not. I'm still not sure these are equivalent queries, but maybe I'm not seeing it. Windowing makes sense when you need the whole window, or when you need sliding windows to express the desired groups. It may be unnecessary when your

Re: Issue while creating spark app

2022-02-27 Thread Mich Talebzadeh
Thanks Bjorn. I am aware of that. I just really wanted to create the uber jar files with both sbt and maven in IntelliJ. Cheers

Re: Difference between windowing functions and aggregation functions on big data

2022-02-27 Thread Bjørn Jørgensen
You are using distinct which collects everything to the driver. So use the other one :) On Sun, 27 Feb 2022 at 21:00, Sid wrote: > Basically, I am trying two different approaches for the same problem and > my concern is how it will behave in the case of big data if you talk about > millions of

Re: Issue while creating spark app

2022-02-27 Thread Bjørn Jørgensen
Mich: You are using Scala 2.11 to do this. Have a look at Building Spark: "Spark requires Scala 2.12/2.13; support for Scala 2.11 was removed in Spark 3.0.0." On Sun, 27 Feb 2022 at 20:55, Mich Talebzadeh <

Re: Difference between windowing functions and aggregation functions on big data

2022-02-27 Thread Sid
Basically, I am trying two different approaches for the same problem and my concern is how it will behave in the case of big data if you talk about millions of records. Which one would be faster? Is using windowing functions a better way since it will load the entire dataset into a single window

Re: Issue while creating spark app

2022-02-27 Thread Mich Talebzadeh
OK, I decided to give Maven a try. Downloaded Maven and unzipped the file in a WSL-Ubuntu terminal with unzip apache-maven-3.8.4-bin.zip. Then added it to the Windows env variables as MVN_HOME and added the bin directory to the path in Windows. Restarted IntelliJ to pick up the correct path. Again on the command

Re: Difference between windowing functions and aggregation functions on big data

2022-02-27 Thread Sid
Hi Enrico, Thanks for your time :) Consider a huge data volume scenario. If I don't use any keywords like distinct, which one would be faster: Window with partitionBy or normal SQL aggregation methods? And how does df.groupBy().reduceByGroups() work internally? Thanks, Sid On Mon, Feb 28,

Re: Difference between windowing functions and aggregation functions on big data

2022-02-27 Thread Enrico Minack
Sid, Your Aggregation Query selects all employees where fewer than three distinct salaries exist that are larger. So, both queries seem to do the same. The Windowing Query is explicit in what it does: give me the rank for salaries per department in the given order and pick the top 3 per
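The equivalence described in this reply can be checked on a small table. The sketch below uses Python's built-in sqlite3 module rather than Spark, and makes several assumptions that are not in the thread: the sample rows are made up, the correlated subquery is completed as "fewer than three distinct larger salaries in the same department" following the description above (the posted query is cut off), and the windowing query is assumed to use DENSE_RANK, since the exact ranking function is not shown. Window functions need SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Department (Id INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Employee (
    Id INTEGER PRIMARY KEY, Name TEXT, Salary INTEGER, DepartmentId INTEGER
);
-- Made-up sample data, for illustration only.
INSERT INTO Department VALUES (1, 'IT'), (2, 'Sales');
INSERT INTO Employee VALUES
    (1, 'Joe',   85000, 1), (2, 'Henry', 80000, 2), (3, 'Sam',   60000, 2),
    (4, 'Max',   90000, 1), (5, 'Janet', 69000, 1), (6, 'Randy', 85000, 1),
    (7, 'Will',  70000, 1);
""")

# Assumed completion of the aggregation query: keep an employee when fewer
# than three distinct salaries in the same department are larger.
AGGREGATION_QUERY = """
SELECT D.Name AS Department, E.Name AS Employee, E.Salary AS Salary
FROM Employee E
INNER JOIN Department D ON E.DepartmentId = D.Id
WHERE (SELECT COUNT(DISTINCT Salary)
       FROM Employee
       WHERE DepartmentId = E.DepartmentId AND Salary > E.Salary) < 3
"""

# Assumed windowing query: dense rank per department, keep ranks 1 to 3.
WINDOW_QUERY = """
SELECT Department, Employee, Salary
FROM (
    SELECT D.Name AS Department, E.Name AS Employee, E.Salary AS Salary,
           DENSE_RANK() OVER (
               PARTITION BY E.DepartmentId ORDER BY E.Salary DESC
           ) AS rnk
    FROM Employee E
    INNER JOIN Department D ON E.DepartmentId = D.Id
) AS ranked
WHERE rnk <= 3
"""

agg_rows = sorted(conn.execute(AGGREGATION_QUERY).fetchall())
win_rows = sorted(conn.execute(WINDOW_QUERY).fetchall())
same_result = agg_rows == win_rows
```

On this data both queries keep the same six rows and drop Janet, whose salary sits below three distinct higher IT salaries. Agreement on a toy table says nothing about which form is faster on large data; it only supports the point that the two forms can express the same result.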

Re: Difference between windowing functions and aggregation functions on big data

2022-02-27 Thread Sean Owen
Those queries look like they do fairly different things. One is selecting top employees by salary, the other is ... selecting where there are less than 3 distinct salaries or something. Not sure what the intended comparison is then; these are not equivalent ways of doing the same thing, or does

Re: Difference between windowing functions and aggregation functions on big data

2022-02-27 Thread Sid
My bad. Aggregation Query:
# Write your MySQL query statement below
SELECT D.Name AS Department, E.Name AS Employee, E.Salary AS Salary
FROM Employee E
INNER JOIN Department D ON E.DepartmentId = D.Id
WHERE (SELECT COUNT(DISTINCT(Salary)) FROM Employee WHERE DepartmentId =

Re: Difference between windowing functions and aggregation functions on big data

2022-02-27 Thread Sean Owen
Those two queries are identical? On Sun, Feb 27, 2022 at 11:30 AM Sid wrote: > Hi Team, > > I am aware that if windowing functions are used, then at first it loads > the entire dataset into one window, scans and then performs the other > mentioned operations for that particular window which

Difference between windowing functions and aggregation functions on big data

2022-02-27 Thread Sid
Hi Team, I am aware that if windowing functions are used, then at first it loads the entire dataset into one window, scans and then performs the other mentioned operations for that particular window, which could be slower when dealing with trillions/billions of records. I did a POC where I used

Re: Issue while creating spark app

2022-02-27 Thread Mich Talebzadeh
Got curious with this IntelliJ stuff. I recall using sbt rather than MVN, so go to the terminal in your IntelliJ and verify what is installed:
sbt -version
sbt version in this project: 1.3.4
sbt script version: 1.3.4
scala -version
Scala code runner version 2.11.7 -- Copyright 2002-2013, LAMP/EPFL