In general you can probably do all this in spark-sql: read the Hive
table into a DataFrame in PySpark, create a TempView on that DF, select
the PM data through the CAST() function, and then use a window function
with DENSE_RANK() to pick the top 5.
# Read the Hive table as a DataFrame
df = spark.read.table("hive.sample_data")
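
Continuing that as a rough sketch: register the DataFrame as a temp view and run the ranking query against it. The view name is made up here, and I'm assuming "PM data" means rows whose time_in falls at or after noon; the sketch also filters with HOUR() rather than a CAST() expression, which amounts to the same hour check.

# Expose the DataFrame to Spark SQL under a (hypothetical) view name
df.createOrReplaceTempView("sample_data_view")

# Top 5 incoming IPs by total volume, restricted to PM traffic.
# DENSE_RANK() handles ties, so more than 5 rows can come back if
# several IPs share the same total.
top5 = spark.sql("""
    SELECT incoming_ip, total_volume
    FROM (
        SELECT incoming_ip,
               SUM(volume) AS total_volume,
               DENSE_RANK() OVER (ORDER BY SUM(volume) DESC) AS rnk
        FROM sample_data_view
        WHERE HOUR(time_in) >= 12   -- assumption: "PM" = hour 12 or later
        GROUP BY incoming_ip
    ) ranked
    WHERE rnk <= 5
""")
top5.show()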
Hello gurus,
I have a Hive table created as below (there are more columns):
CREATE TABLE hive.sample_data (
    incoming_ip STRING,
    time_in     TIMESTAMP,
    volume      INT
);
Data is stored in that table
In PySpark, I want to select the top 5 incoming IP addresses with the highest
total volume of data transferred.