Re: Spark Sql group by less performant

2018-12-10 Thread Georg Heiler
See https://databricks.com/blog/2016/05/19/approximate-algorithms-in-apache-spark-hyperloglog-and-quantiles.html; you most probably do not require exact counts. On Tue, 11 Dec 2018 at 02:09, 15313776907 <15313776...@163.com> wrote: > I think you can add executor memory > > 15313776907 >
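[For illustration, a minimal Scala sketch contrasting an exact distinct count with the HyperLogLog-based approximation the blog post describes; the data and column names here are made up.]

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{approx_count_distinct, countDistinct}

val spark = SparkSession.builder().appName("approx-counts").getOrCreate()
import spark.implicits._

// Made-up visit data: (user_id, page).
val visits = Seq(("u1", "p1"), ("u2", "p1"), ("u1", "p2")).toDF("user_id", "page")

// Exact distinct count: shuffles every distinct key.
visits.agg(countDistinct($"user_id")).show()

// HyperLogLog-based estimate: one pass, bounded memory; the second argument
// is the maximum allowed relative standard deviation (here 5%).
visits.agg(approx_count_distinct($"user_id", 0.05)).show()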

Re: Why does join use rows that were sent after watermark of 20 seconds?

2018-12-10 Thread Jungtaek Lim
Please refer to the Structured Streaming programming guide, which is very clear about when the query will have unbounded state. http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#inner-joins-with-optional-watermarking Quoting the doc: In other words, you will
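[For reference, a minimal sketch of the pattern that doc section describes: watermarks on both inputs plus an event-time range condition in the join itself, which is what lets the engine eventually drop old state. The sources and column names below are illustrative, modeled on the guide's ad-impression/click example.]

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().appName("stream-stream-join").getOrCreate()

// Rate sources standing in for real streams; in practice these would be
// Kafka or file sources carrying the columns used below.
val impressions = spark.readStream.format("rate").load()
  .selectExpr("value AS impressionAdId", "timestamp AS impressionTime")
val clicks = spark.readStream.format("rate").load()
  .selectExpr("value AS clickAdId", "timestamp AS clickTime")

val impressionsWithWatermark = impressions.withWatermark("impressionTime", "2 hours")
val clicksWithWatermark = clicks.withWatermark("clickTime", "3 hours")

// The time-range condition is what bounds the state; without it the engine
// must keep every row from both sides, since any future row could still match.
val joined = impressionsWithWatermark.join(
  clicksWithWatermark,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 1 hour
  """))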

Re: Why does join use rows that were sent after watermark of 20 seconds?

2018-12-10 Thread Abhijeet Kumar
You mean to say that Spark will store all the data in memory forever? :) > On 10-Dec-2018, at 6:16 PM, Sandeep Katta wrote: > > Hi Abhijeet, > > You are using an inner join with unbounded state, which means every row in > one stream will match with the other stream indefinitely. > If you want the
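[One way to see whether that state really does grow without bound is to watch the streaming query's progress metrics. A minimal sketch, assuming a hypothetical stream-stream join result `joinedStream` like the one discussed in this thread:]

// joinedStream is a hypothetical stream-stream join result, as in this thread.
val query = joinedStream.writeStream
  .format("console")
  .outputMode("append")
  .start()

// Each progress update reports how many rows the stateful join is buffering.
// Without an event-time range condition in the join, numRowsTotal keeps growing.
Option(query.lastProgress).foreach { progress =>
  progress.stateOperators.foreach { op =>
    println(s"join state: ${op.numRowsTotal} rows, ${op.memoryUsedBytes} bytes")
  }
}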

Re: Spark Sql group by less performant

2018-12-10 Thread 15313776907
I think you can add executor memory. 15313776907 <15313776...@163.com> On 12/11/2018 08:28, lsn24 wrote: Hello, I have a requirement where I need to get the total count of rows and the total count of failedRows based on a grouping. The code looks like below:

Spark Sql group by less performant

2018-12-10 Thread lsn24
Hello, I have a requirement where I need to get the total count of rows and the total count of failedRows based on a grouping. The code looks like below: myDataset.createOrReplaceTempView("temp_view"); Dataset countDataset = sparkSession.sql("Select
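[A minimal sketch of one way to express that aggregation in a single pass, assuming a hypothetical status column that marks failed rows; the actual schema and query are cut off in the archive.]

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, sum, when}

val spark = SparkSession.builder().appName("groupby-counts").getOrCreate()
import spark.implicits._

// Hypothetical data: (grp, status). The real schema is not shown in the thread.
val myDataset = Seq(("a", "FAILED"), ("a", "OK"), ("b", "OK")).toDF("grp", "status")

// One aggregation pass computes both counts per group,
// instead of running two separate count queries.
val counts = myDataset
  .groupBy($"grp")
  .agg(
    count("*").as("totalRows"),
    sum(when(col("status") === "FAILED", 1).otherwise(0)).as("failedRows"))
counts.show()

// Equivalent Spark SQL form, closer to the temp-view approach in the original post.
myDataset.createOrReplaceTempView("temp_view")
spark.sql(
  """SELECT grp,
    |       count(*) AS totalRows,
    |       sum(CASE WHEN status = 'FAILED' THEN 1 ELSE 0 END) AS failedRows
    |FROM temp_view
    |GROUP BY grp""".stripMargin).show()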

Why does join use rows that were sent after watermark of 20 seconds?

2018-12-10 Thread Abhijeet Kumar
Hello, I’m using watermarks to join two streams, as you can see below: val order_wm = order_details.withWatermark("tstamp_trans", "20 seconds") val invoice_wm = invoice_details.withWatermark("tstamp_trans", "20 seconds") val join_df = order_wm .join(invoice_wm, order_wm.col("s_order_id") ===
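[A sketch, under assumptions, of how an event-time range condition could be added to this join so the 20-second watermark can actually be used to expire state. The invoice-side key column, the renaming step, and the 20-second range are assumptions, since the original code is cut off above; order_details and invoice_details are the streaming DataFrames from the message.]

import org.apache.spark.sql.functions.expr

// The invoice-side timestamp is renamed so the range condition is unambiguous.
val order_wm = order_details.withWatermark("tstamp_trans", "20 seconds")
val invoice_wm = invoice_details
  .withColumnRenamed("tstamp_trans", "invoice_tstamp_trans")
  .withWatermark("invoice_tstamp_trans", "20 seconds")

// Joining on the key alone leaves the join state unbounded; the range condition
// below lets buffered rows be dropped once they fall behind the watermark.
val join_df = order_wm.join(
  invoice_wm,
  order_wm.col("s_order_id") === invoice_wm.col("s_order_id") &&
    invoice_wm.col("invoice_tstamp_trans") >= order_wm.col("tstamp_trans") &&
    invoice_wm.col("invoice_tstamp_trans") <= order_wm.col("tstamp_trans") + expr("interval 20 seconds"))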