Re: Need to split incoming data into PM on time column and find the top 5 by volume of data

2023-09-21 Thread Mich Talebzadeh
In general you can probably do all this in Spark SQL by reading the Hive table into a DataFrame in PySpark, creating a TempView on that DF, selecting the PM data through the CAST() function, and then using a windowing function with DENSE_RANK() to pick the top 5.

  # Read Hive table as a DataFrame
  df = spark.rea
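The code snippet above is cut off in the archive. Below is a minimal sketch of the approach described, assuming the hive.sample_data table and columns from the original question; the PM filter (HOUR() >= 12) and the exact ranking query are an interpretation of the steps listed, not the original code.

  from pyspark.sql import SparkSession

  spark = (SparkSession.builder
           .appName("top5_pm_by_volume")
           .enableHiveSupport()
           .getOrCreate())

  # Read the Hive table as a DataFrame
  df = spark.read.table("hive.sample_data")

  # Create a TempView so the rest can be expressed in Spark SQL
  df.createOrReplaceTempView("sample_data")

  # Keep PM rows only (hour >= 12), total the volume per incoming IP,
  # then rank the totals with DENSE_RANK() and keep the top 5.
  top5_pm = spark.sql("""
      WITH pm_data AS (
          SELECT incoming_ip, volume
          FROM   sample_data
          WHERE  HOUR(CAST(time_in AS TIMESTAMP)) >= 12
      ),
      totals AS (
          SELECT incoming_ip, SUM(volume) AS total_volume
          FROM   pm_data
          GROUP  BY incoming_ip
      )
      SELECT incoming_ip, total_volume
      FROM (
          SELECT incoming_ip, total_volume,
                 DENSE_RANK() OVER (ORDER BY total_volume DESC) AS rnk
          FROM   totals
      ) ranked
      WHERE  rnk <= 5
  """)

  top5_pm.show()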

Need to split incoming data into PM on time column and find the top 5 by volume of data

2023-09-21 Thread ashok34...@yahoo.com.INVALID
Hello gurus,

I have a Hive table created as below (there are more columns):

  CREATE TABLE hive.sample_data (
      incoming_ip STRING,
      time_in     TIMESTAMP,
      volume      INT
  );

Data is stored in that table. In PySpark, I want to select the top 5 incoming IP addresses with the highest total volume of data transferred.
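For reference, the same result can also be sketched with the DataFrame API alone, again assuming the hive.sample_data table and columns above and treating hours >= 12 as PM; this is an illustration, not code from the thread.

  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F

  spark = SparkSession.builder.enableHiveSupport().getOrCreate()

  # Filter to PM rows, total the volume per incoming IP,
  # and take the 5 IPs with the largest totals.
  top5_pm = (spark.table("hive.sample_data")
             .filter(F.hour("time_in") >= 12)
             .groupBy("incoming_ip")
             .agg(F.sum("volume").alias("total_volume"))
             .orderBy(F.desc("total_volume"))
             .limit(5))

  top5_pm.show()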