RE: Why is Spark 3.0.x faster than Spark 3.1.x

2021-05-17 Thread Rao, Abhishek (Nokia - IN/Bangalore)
Hi Maziyar, Mich. Do we have any ticket to track this? Any idea if this is going to be fixed in 3.1.2? Thanks and Regards, Abhishek From: Mich Talebzadeh Sent: Friday, April 9, 2021 2:11 PM To: Maziyar Panahi Cc: User Subject: Re: Why is Spark 3.0.x faster than Spark 3.1.x Hi, Regarding

Re: Merge two dataframes

2021-05-17 Thread Andrew Melo
On Mon, May 17, 2021 at 2:31 PM Lalwani, Jayesh wrote: > If the UDFs are computationally expensive, I wouldn't solve this problem with UDFs at all. If they are working in an iterative manner, and assuming each iteration is independent of other iterations (yes, I know that's a big

Re: Merge two dataframes

2021-05-17 Thread Lalwani, Jayesh
If the UDFs are computationally expensive, I wouldn't solve this problem with UDFs at all. If they are working in an iterative manner, and assuming each iteration is independent of other iterations (yes, I know that's a big assumption), I would think about exploding your dataframe to have a
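Jayesh's "explode" suggestion can be illustrated with a plain-Python sketch (not the Spark `explode` API itself; data and column names are hypothetical): turn one row whose column holds a list of per-iteration inputs into one flat row per iteration, so each iteration can then be processed independently.

```python
# Plain-Python sketch of the "explode" idea: one row with a list-valued
# column becomes one row per list element.
rows = [
    {"id": 1, "iterations": [10, 20, 30]},
    {"id": 2, "iterations": [40, 50]},
]

def explode(rows, list_col):
    """Yield one flat row per element of the list-valued column."""
    for row in rows:
        for value in row[list_col]:
            flat = {k: v for k, v in row.items() if k != list_col}
            flat[list_col] = value
            yield flat

exploded = list(explode(rows, "iterations"))
# each exploded row now carries exactly one iteration's input
```

In Spark the same reshaping would be done with the built-in `explode` function on an array column, which keeps the per-iteration work inside the engine rather than inside one long-running UDF call.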

Re: Merge two dataframes

2021-05-17 Thread Andrew Melo
In our case, these UDFs are quite expensive and worked on in an iterative manner, so being able to cache the two "sides" of the graphs independently will speed up the development cycle. Otherwise, if you modify foo() here, then you have to recompute bar and baz, even though they're unchanged.
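Andrew's caching concern can be shown with a plain-Python memoization sketch (this stands in for Spark's DataFrame caching; `foo`, `bar`, and `baz` are the hypothetical names from the thread): if the two expensive "sides" are cached independently, editing only `foo` does not force `bar` and `baz` to be recomputed.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def bar(x):          # expensive "side" 1 (stand-in for a cached DataFrame)
    return x * 2

@lru_cache(maxsize=None)
def baz(x):          # expensive "side" 2
    return x + 100

def foo(x):          # the part under active development; not cached
    return bar(x) - baz(x)

result_first = foo(5)    # computes bar(5) and baz(5) once each
result_second = foo(5)   # bar/baz served from cache; only foo's logic reruns
```

The same shape applies in Spark: `df.cache()` on each side of the diamond lets the development loop re-run only the modified combining step.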

Re: Merge two dataframes

2021-05-17 Thread Sean Owen
Why join here - just add two columns to the DataFrame directly? On Mon, May 17, 2021 at 1:04 PM Andrew Melo wrote: > Anyone have ideas about the below Q? > > It seems to me that given that "diamond" DAG, that spark could see > that the rows haven't been shuffled/filtered, it could do some type
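Sean's suggestion can be sketched in plain Python (a stand-in for Spark's `withColumn`; the data and UDF names are hypothetical): instead of deriving two separate tables from the same parent and joining them back together, compute both derived columns in a single pass over the parent rows.

```python
parent = [{"x": 1}, {"x": 2}, {"x": 3}]

def foo(x):          # hypothetical expensive UDF 1
    return x * 2

def bar(x):          # hypothetical expensive UDF 2
    return x + 100

# One pass over the parent, no join required: each output row keeps the
# original columns and gains both derived columns.
augmented = [dict(row, foo=foo(row["x"]), bar=bar(row["x"])) for row in parent]
```

In Spark this corresponds to chaining two `withColumn` calls on the same DataFrame, which avoids the shuffle a join would introduce.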

Re: Merge two dataframes

2021-05-17 Thread Andrew Melo
Anyone have ideas about the below Q? It seems to me that, given that "diamond" DAG, Spark could see that the rows haven't been shuffled/filtered and could do some type of "zip join" to push them together, but I've not been able to get a plan that doesn't do a hash/sort merge join. Cheers
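The "zip join" Andrew describes can be contrasted with a key-based join in a plain-Python sketch (not a Spark operator; the data is hypothetical): when two derived tables are known to preserve the parent's row order and row count, a linear positional merge produces the same result without building a hash table or sorting.

```python
left  = [("a", 1), ("b", 2), ("c", 3)]   # (key, derived value 1)
right = [("a", 9), ("b", 8), ("c", 7)]   # (key, derived value 2), same row order

def zip_join(l, r):
    """O(n) positional join; valid only when rows line up one-to-one."""
    return [(lk, lv, rv) for (lk, lv), (rk, rv) in zip(l, r)]

def hash_join(l, r):
    """Key-based join: must first build an index over one side."""
    index = {k: v for k, v in r}
    return [(k, v, index[k]) for k, v in l if k in index]

# When the one-to-one ordering assumption holds, both agree.
assert zip_join(left, right) == hash_join(left, right)
```

The catch, as the thread notes, is that the optimizer has to *prove* the no-shuffle/no-filter property before a positional merge is safe, which is why the planner falls back to a hash or sort-merge join.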

Spark History Server to S3 doesn't show up incomplete jobs

2021-05-17 Thread Tianbin Jiang
Hi all, I am using Spark 2.4.5. I am redirecting the spark event logs to a S3 with the following configuration: spark.eventLog.enabled = true spark.history.ui.port = 18080 spark.eventLog.dir = s3://livy-spark-log/spark-history/ spark.history.fs.logDirectory = s3://livy-spark-log/spark-history/
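For readability, the configuration quoted above laid out as a `spark-defaults.conf` fragment (values as given in the message; note that running applications write their logs with an `.inprogress` suffix, which the History Server must be able to list and read from the S3 path for incomplete jobs to appear):

```
spark.eventLog.enabled           true
spark.eventLog.dir               s3://livy-spark-log/spark-history/
spark.history.fs.logDirectory    s3://livy-spark-log/spark-history/
spark.history.ui.port            18080
```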

Re: Calculate average from Spark stream

2021-05-17 Thread Mich Talebzadeh
Hi Giuseppe, How have you defined your resultM above in qK? Cheers

Re: Calculate average from Spark stream

2021-05-17 Thread Mich Talebzadeh
Hi Giuseppe, Your error states --> Required attribute 'value' not found. First, can you read your streaming data OK? Here is my streaming data format in json. I have three columns in json format, example: {"rowkey":"f0577406-a7d3-4c52-9998-63835ea72133", "timestamp":"2021-05-17T15:17:27",
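Mich's first debugging step — check that the incoming records actually carry the expected attributes — can be sketched in plain Python (not Spark's streaming API; the sample record uses only the two fields visible in the truncated example above, with the third column unknown):

```python
import json

# Record shape from the message, truncated after the second field.
sample = ('{"rowkey":"f0577406-a7d3-4c52-9998-63835ea72133",'
          ' "timestamp":"2021-05-17T15:17:27"}')

def missing_attributes(raw, required):
    """Return the required attributes absent from a JSON record."""
    record = json.loads(raw)
    return [attr for attr in required if attr not in record]

# An error like "Required attribute 'value' not found" suggests the
# record (or the schema applied to it) lacks that field.
gaps = missing_attributes(sample, ["rowkey", "timestamp", "value"])
```

Running this kind of check on a few raw records before applying a schema quickly distinguishes a malformed input stream from a misdeclared schema.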