Re: Spark with External Shuffle Service - using saved shuffle files in the event of executor failure

2021-05-12 Thread Attila Zsolt Piros
Hello, I have answered it on Stack Overflow. Best Regards, Attila. On Wed, May 12, 2021 at 4:57 PM Chris Thomas wrote: > Hi, > > I am pretty confident I have observed Spark, configured with the Shuffle > Service, continuing to fetch shuffle files on a node in the event of > executor

Re: Merge two dataframes

2021-05-12 Thread Andrew Melo
Hi, In the case where the left and right hand side share a common parent like:

    df = spark.read.someDataframe().withColumn('rownum', row_number())
    df1 = df.withColumn('c1', expensive_udf1('foo')).select('c1', 'rownum')
    df2 = df.withColumn('c2', expensive_udf2('bar')).select('c2', 'rownum')
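[Editor's note: Andrew's snippet is schematic — someDataframe() and the expensive UDFs are placeholders, and row_number() needs a window specification. A minimal runnable sketch of the same pattern, with cheap built-ins standing in for the expensive UDFs:]

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # Stand-in for the shared parent dataframe.
    df = spark.createDataFrame([("a", "x"), ("b", "y")], ["foo", "bar"])

    # row_number() requires a window; ordering by a stable column keeps
    # the numbering deterministic across the two derived dataframes.
    df = df.withColumn("rownum", F.row_number().over(Window.orderBy("foo")))

    # F.upper stands in for expensive_udf1 / expensive_udf2.
    df1 = df.withColumn("c1", F.upper("foo")).select("c1", "rownum")
    df2 = df.withColumn("c2", F.upper("bar")).select("c2", "rownum")

    # Joining on the shared rownum stitches the rows back together.
    df1.join(df2, "rownum").drop("rownum").show()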

Re: Calculate average from Spark stream

2021-05-12 Thread Mich Talebzadeh
Have you managed to sort out this problem, and the reason this solution is not working? Bottom line: your temperature data comes in a stream every two seconds and you want an average temperature over the past 300 seconds' worth of data; in other words, your window length is 300 seconds? You also
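[Editor's note: a sketch of what that looks like in Structured Streaming — a 300-second window sliding every 2 seconds, averaging a temperature column. The socket source, port, and column names are illustrative assumptions, not from the original thread:]

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Assumed input: lines of "<ISO timestamp> <temperature>" on a socket.
    raw = (spark.readStream.format("socket")
           .option("host", "localhost").option("port", 9999).load())

    readings = raw.select(
        F.split("value", " ")[0].cast("timestamp").alias("event_time"),
        F.split("value", " ")[1].cast("double").alias("temperature"))

    # Window length 300s, sliding every 2s, per the question; the watermark
    # bounds how much state Spark keeps for late-arriving readings.
    avg_temp = (readings
                .withWatermark("event_time", "5 minutes")
                .groupBy(F.window("event_time", "300 seconds", "2 seconds"))
                .agg(F.avg("temperature").alias("avg_temperature")))

    (avg_temp.writeStream.outputMode("update")
     .format("console").start().awaitTermination())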

Re: Merge two dataframes

2021-05-12 Thread Sean Owen
Yeah I don't think that's going to work - you aren't guaranteed to get 1, 2, 3, etc. I think row_number() might be what you need to generate a join ID. RDD has a .zip method, but (unless I'm forgetting!) DataFrame does not. You could .zip two RDDs you get from DataFrames and manually convert the
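[Editor's note: for the RDD route Sean mentions, a sketch — subject to zip's constraint that both RDDs have identical partitioning and element counts:]

    from pyspark.sql import SparkSession, Row

    spark = SparkSession.builder.getOrCreate()

    df1 = spark.createDataFrame([(100,), (200,), (300,)], ["amount_6m"])
    df2 = spark.createDataFrame([(500,), (600,), (700,)], ["amount_9m"])

    # RDD.zip requires the same number of partitions and the same number
    # of elements per partition in both RDDs, or it raises an error.
    zipped = df1.rdd.zip(df2.rdd)

    df3 = zipped.map(lambda ab: Row(amount_6m=ab[0]["amount_6m"],
                                    amount_9m=ab[1]["amount_9m"])).toDF()
    df3.show()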

Re: Merge two dataframes

2021-05-12 Thread kushagra deep
Thanks Raghavendra. Will the ids for corresponding rows always be the same? Since monotonically_increasing_id() returns a number based on the partition ID and the row number within the partition, will it be the same for corresponding rows? Also, is it guaranteed that the two dataframes will be divided into
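[Editor's note: per Spark's documentation, the generated id packs the partition ID into the upper 31 bits and the record number within the partition into the lower 33 bits, so ids only correspond across two dataframes if their partitioning is identical. A small demonstration, with partition counts chosen arbitrarily:]

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Same rows, different partition counts: the generated ids diverge,
    # because id = (partition_id << 33) + row_number_within_partition.
    a = spark.createDataFrame([(v,) for v in range(4)], ["v"]).repartition(2)
    b = spark.createDataFrame([(v,) for v in range(4)], ["v"]).repartition(4)

    a.withColumn("id", F.monotonically_increasing_id()).show()
    b.withColumn("id", F.monotonically_increasing_id()).show()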

Re: Merge two dataframes

2021-05-12 Thread Raghavendra Ganesh
You can add an extra id column and perform an inner join.

    val df1_with_id = df1.withColumn("id", monotonically_increasing_id())
    val df2_with_id = df2.withColumn("id", monotonically_increasing_id())
    df1_with_id.join(df2_with_id, Seq("id"), "inner").drop("id").show()
    +-+-+

Spark with External Shuffle Service - using saved shuffle files in the event of executor failure

2021-05-12 Thread Chris Thomas
Hi, I am pretty confident I have observed Spark, configured with the Shuffle Service, continuing to fetch shuffle files on a node in the event of executor failure, rather than recomputing the shuffle files as happens without the Shuffle Service. Can anyone confirm this? (I have a SO question 
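[Editor's note: for context, the external shuffle service keeps serving shuffle blocks written to local disk even after the executor that produced them exits. Below is an illustrative way to enable it from the application side; the service itself must also be running on each worker, e.g. as a YARN NodeManager auxiliary service, and none of these values come from the original thread:]

    from pyspark.sql import SparkSession

    # Illustrative settings only.
    spark = (SparkSession.builder
             .appName("shuffle-service-example")
             .config("spark.shuffle.service.enabled", "true")
             # Dynamic allocation is the usual reason to run the service.
             .config("spark.dynamicAllocation.enabled", "true")
             .getOrCreate())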

Merge two dataframes

2021-05-12 Thread kushagra deep
Hi All, I have two dataframes. df1:

    amount_6m
    100
    200
    300
    400
    500

And a second dataframe df2 below:

    amount_9m
    500
    600
    700
    800
    900

The number of rows is the same in both dataframes. Can I merge the two dataframes to achieve the df3 below?

    amount_6m | amount_9m
    100
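[Editor's note: following Sean Owen's row_number() suggestion elsewhere in the thread, one way to line the rows up is to give each dataframe a generated join key. A sketch; note that ordering by a constant collapses the data to a single partition, so the pairing simply follows the dataframes' current row order and only makes sense for small or already-ordered data:]

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()

    df1 = spark.createDataFrame([(v,) for v in (100, 200, 300, 400, 500)],
                                ["amount_6m"])
    df2 = spark.createDataFrame([(v,) for v in (500, 600, 700, 800, 900)],
                                ["amount_9m"])

    # Window.orderBy(F.lit(1)) assigns 1..N in the dataframes' row order;
    # Spark will warn that this moves all the data to one partition.
    w = Window.orderBy(F.lit(1))
    df3 = (df1.withColumn("rn", F.row_number().over(w))
           .join(df2.withColumn("rn", F.row_number().over(w)), "rn")
           .drop("rn"))
    df3.show()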