See https://databricks.com/blog/2016/05/19/approximate-algorithms-in-apache-spark-hyperloglog-and-quantiles.html — most probably you do not require exact counts.
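The linked post covers HyperLogLog, which Spark exposes via `approx_count_distinct`. To make the idea concrete, here is a minimal toy sketch in Python — my own illustration under simplified assumptions, not Spark's actual implementation (Spark uses a more refined HLL++ variant):

```python
import hashlib
import math

class HyperLogLog:
    """Toy HyperLogLog: estimates distinct count in O(2**p) memory."""

    def __init__(self, p=10):
        self.p = p                    # precision: 2**p registers
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash of the item (sha1 truncated, deterministic)
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                 # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining 64-p bits
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        # raw estimate: alpha_m * m^2 / sum(2^-M[j])
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        # small-range correction (linear counting) when registers are empty
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:
            est = self.m * math.log(self.m / zeros)
        return int(est)
```

The point is the memory profile: however many rows flow through, the state is a fixed array of small registers, which is why approximate counting sidesteps the memory pressure of exact per-group counts.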
On Tue, 11 Dec 2018 at 02:09, 15313776907 <15313776...@163.com> wrote:

> I think you can add executor memory.
>
> 15313776907
> Email: 15313776...@163.com
>
> On 12/11/2018 08:28, lsn24 <lekshmi.s...@gmail.com> wrote:
> > Hello,
> >
> > I have a requirement where I need to get the total count of rows and the
> > total count of failed rows, grouped by a set of columns.
> >
> > The code looks like this:
> >
> > myDataset.createOrReplaceTempView("temp_view");
> >
> > Dataset<Row> countDataset = sparkSession.sql(
> >     "SELECT column1, column2, column3, column4, column5, column6, column7, column8, "
> >     + "count(*) AS totalRows, "
> >     + "sum(CASE WHEN (column8 IS NULL) THEN 1 ELSE 0 END) AS failedRows "
> >     + "FROM temp_view "
> >     + "GROUP BY column1, column2, column3, column4, column5, column6, column7, column8");
> >
> > Up to around 50 million records the query performance was OK. After that
> > it gave up, mostly resulting in an out-of-memory exception.
> >
> > I read the documentation and blogs; most of them give examples of
> > RDD.reduceByKey. But here I have a Dataset and Spark SQL.
> >
> > What am I missing here?
> >
> > Any help will be appreciated.
> >
> > Thanks!
> >
> > --
> > Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
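For the "add executor memory" suggestion above, the usual knobs are the standard spark-submit flags; the values and application jar below are placeholders, not a recommendation for this workload:

```shell
# Increase per-executor heap and spread the shuffle over more partitions.
spark-submit \
  --executor-memory 8g \
  --driver-memory 4g \
  --conf spark.sql.shuffle.partitions=400 \
  --class com.example.MyApp \
  my-app.jar
```

`--executor-memory` maps to `spark.executor.memory`; `spark.sql.shuffle.partitions` controls how many tasks the GROUP BY shuffle fans out into.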