See https://databricks.com/blog/2016/05/19/approximate-algorithms-in-apache-spark-hyperloglog-and-quantiles.html — most probably you do not require exact counts.
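The linked post covers HyperLogLog, which Spark exposes via `approx_count_distinct`. To make the idea concrete, here is a minimal toy sketch in Python — my own illustration under simplified assumptions, not Spark's actual implementation (Spark uses a more refined HLL++ variant):

```python
import hashlib
import math

class HyperLogLog:
    """Toy HyperLogLog: estimates distinct count in O(2**p) memory."""

    def __init__(self, p=10):
        self.p = p                    # precision: 2**p registers
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash of the item (sha1 truncated, deterministic)
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                 # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining 64-p bits
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        # raw estimate: alpha_m * m^2 / sum(2^-M[j])
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        # small-range correction (linear counting) when registers are empty
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:
            est = self.m * math.log(self.m / zeros)
        return int(est)
```

The point is the memory profile: however many rows flow through, the state is a fixed array of small registers, which is why approximate counting sidesteps the memory pressure of exact per-group counts.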
On Tue, 11 Dec 2018 at 02:09, 15313776907 <15313776...@163.com> wrote:

> I think you can add executor memory.
>
> 15313776907
> Email: 15313776...@163.com
>
> On 12/11/2018 08:28, lsn24 <lekshmi.s...@gmail.com> wrote:
> > Hello,
> >
> > I have a requirement where I need to get the total count of rows and the
> > total count of failed rows, grouped by a set of columns.
> >
> > The code looks like this:
> >
> > myDataset.createOrReplaceTempView("temp_view");
> >
> > Dataset<Row> countDataset = sparkSession.sql(
> >     "SELECT column1, column2, column3, column4, column5, column6, column7, column8, "
> >     + "count(*) AS totalRows, "
> >     + "sum(CASE WHEN (column8 IS NULL) THEN 1 ELSE 0 END) AS failedRows "
> >     + "FROM temp_view "
> >     + "GROUP BY column1, column2, column3, column4, column5, column6, column7, column8");
> >
> > Up to around 50 million records the query performance was OK. After that
> > it gave up, mostly resulting in an out-of-memory exception.
> >
> > I read the documentation and blogs; most of them give examples of
> > RDD.reduceByKey. But here I have a Dataset and Spark SQL.
> >
> > What am I missing here?
> >
> > Any help will be appreciated.
> >
> > Thanks!
> >
> > --
> > Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
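For the "add executor memory" suggestion above, the usual knobs are the standard spark-submit flags; the values and application jar below are placeholders, not a recommendation for this workload:

```shell
# Increase per-executor heap and spread the shuffle over more partitions.
spark-submit \
  --executor-memory 8g \
  --driver-memory 4g \
  --conf spark.sql.shuffle.partitions=400 \
  --class com.example.MyApp \
  my-app.jar
```

`--executor-memory` maps to `spark.executor.memory`; `spark.sql.shuffle.partitions` controls how many tasks the GROUP BY shuffle fans out into.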