date:20230921

Re: Parallel write to different partitions

2023-09-21 Thread Shrikant Prasad

Found this issue reported earlier but was bulk closed: https://issues.apache.org/jira/browse/SPARK-27030

Regards,
Shrikant

On Fri, 22 Sep 2023 at 12:03 AM, Shrikant Prasad wrote:
> Hi all,
>
> We have multiple spark jobs running in parallel trying to write into same
> hive table but each job wr

Parallel write to different partitions

2023-09-21 Thread Shrikant Prasad

Hi all,

We have multiple Spark jobs running in parallel that write into the same Hive table, with each job writing into a different partition. This was working fine with Spark 2.3 and Hadoop 2.7. But after upgrading to Spark 3.2 and Hadoop 3.2.2, these parallel jobs are failing with FileNotFound exc

Re: Need to split incoming data into PM on time column and find the top 5 by volume of data

2023-09-21 Thread Mich Talebzadeh

In general, you can probably do all of this in Spark SQL: read the Hive table into a DataFrame in PySpark, create a temp view on that DataFrame, select the PM data with the CAST() function, and then use a window function with DENSE_RANK() to select the top 5.

    #Read Hive table as a DataFrame
    df = spark.rea
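The reply above is cut off in this listing, so the following is only a minimal sketch of the approach it describes, not the author's actual code. It assumes the `hive.sample_data` table from the original question, a hypothetical view name `sample_data_view`, and that "PM" means an hour of day of 12 or later. The PM filter here uses `HOUR()` as a stand-in, since the exact `CAST()` expression from the reply is not visible.

```python
# Table name taken from the original question; the view name is hypothetical.
HIVE_TABLE = "hive.sample_data"
VIEW_NAME = "sample_data_view"

# Assumption: "PM" means hour of day >= 12. The reply mentions CAST(), but its
# exact expression is truncated, so HOUR() is used here instead.
TOP5_PM_SQL = f"""
WITH pm_totals AS (
    SELECT incoming_ip, SUM(volume) AS total_volume
    FROM {VIEW_NAME}
    WHERE HOUR(time_in) >= 12
    GROUP BY incoming_ip
),
ranked AS (
    SELECT incoming_ip,
           total_volume,
           DENSE_RANK() OVER (ORDER BY total_volume DESC) AS rnk
    FROM pm_totals
)
SELECT incoming_ip, total_volume, rnk
FROM ranked
WHERE rnk <= 5
ORDER BY rnk, incoming_ip
"""


def top5_pm_by_volume(spark):
    """Read the Hive table, register a temp view, and run the ranking query."""
    df = spark.table(HIVE_TABLE)
    df.createOrReplaceTempView(VIEW_NAME)
    return spark.sql(TOP5_PM_SQL)


def run_example():
    """Build a Hive-enabled session and print the result (needs a Spark install)."""
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("top5-pm-volume")
        .enableHiveSupport()
        .getOrCreate()
    )
    top5_pm_by_volume(spark).show(truncate=False)
```

Calling `run_example()` on a cluster with Hive support would print the ranked rows. Because `DENSE_RANK()` gives tied totals the same rank, the result may contain more than five rows when totals tie.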

Need to split incoming data into PM on time column and find the top 5 by volume of data

2023-09-21 Thread ashok34...@yahoo.com.INVALID

Hello gurus,

I have a Hive table created as below (there are more columns):

    CREATE TABLE hive.sample_data (
        incoming_ip STRING,
        time_in TIMESTAMP,
        volume INT
    );

Data is stored in that table.

In PySpark, I want to select the top 5 incoming IP addresses with the highest total volume of data tran

4 matches
