> Much appreciated! I am not comparing with "select count(*)" for
> performance, but it was one simple thing I tried to check the performance
> :). I think it now makes sense since Spark tries to extract all records
> before doing the count. I thought having an aggregate function query
> submitted over JDBC/Teradata would let Teradata do the heavy lifting.

We currently only push down filters, since there is a lot of variability in
what types of aggregations various databases support. You can manually push
down whatever you want by replacing the table name with a subquery (i.e.
"(SELECT ... FROM ...)").
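As a minimal sketch of that trick (the JDBC URL, table, and column names
here are hypothetical, and I am assuming the Teradata driver jar is on the
classpath):

    import org.apache.spark.sql.SparkSession

    object JdbcPushdownExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("jdbc-pushdown").getOrCreate()

        // Instead of naming the table directly, pass a parenthesized,
        // aliased subquery. The database runs the aggregation and Spark
        // receives only the single pre-aggregated row.
        val counts = spark.read
          .format("jdbc")
          .option("url", "jdbc:teradata://host/DATABASE=mydb") // hypothetical
          .option("dbtable", "(SELECT COUNT(*) AS cnt FROM big_table) t")
          .load()

        counts.show()
        spark.stop()
      }
    }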
> - How come my second query for (5B) records didn't return anything even
> after a long processing? If I understood correctly, Spark would try to
> fit it in memory and if not then might use disk space, which I have
> available?

Nothing should be held in memory for a query like this (other than a single
count per partition), so I don't think that is the problem. There is likely
an error buried somewhere.

> - Am I supposed to do any Spark related tuning to make it work?
>
> My main need is to access data from these large table(s) on demand and
> provide aggregated and calculated results much quicker; for that I was
> trying out Spark. Next I am thinking to export the data to Parquet files
> and give it a try. Do you have any suggestions for how to deal with the
> problem?

Exporting to Parquet will likely be a faster option than trying to query
through JDBC, since we have many more opportunities for parallelism there.
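A rough sketch of that workflow (again with hypothetical connection details
and column names; a real run would tune the partition bounds and count to
the actual table):

    import org.apache.spark.sql.SparkSession

    object ParquetExportExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("parquet-export").getOrCreate()

        // One-time export: partition the JDBC scan on a numeric column so
        // several tasks pull rows from the database in parallel.
        spark.read
          .format("jdbc")
          .option("url", "jdbc:teradata://host/DATABASE=mydb") // hypothetical
          .option("dbtable", "big_table")
          .option("partitionColumn", "id") // assumed numeric, roughly uniform
          .option("lowerBound", "0")
          .option("upperBound", "5000000000")
          .option("numPartitions", "200")
          .load()
          .write
          .parquet("/data/big_table_parquet")

        // Later queries scan the columnar Parquet copy in parallel and read
        // only the columns they need.
        val df = spark.read.parquet("/data/big_table_parquet")
        df.groupBy("region").count().show() // "region" is a hypothetical column
        spark.stop()
      }
    }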