Re: DataFrame/JDBC very slow performance
Thanks Michael, much appreciated!

> Nothing should be held in memory for a query like this (other than a
> single count per partition), so I don't think that is the problem.
> There is likely an error buried somewhere.

Regarding your comment above: I don't get any error, I just get NULL as
the return value. I have tried digging deeper in the logs etc. but
couldn't spot anything. Are there any other suggestions for spotting
such buried errors?

Thanks,
Dhaval

On Mon, Aug 24, 2015 at 6:38 PM, Michael Armbrust mich...@databricks.com wrote:

>> Much appreciated! I am not comparing with select count(*) for
>> performance, but it was one simple thing I tried to check the
>> performance :). I think it now makes sense, since Spark tries to
>> extract all records before doing the count. I thought having an
>> aggregate function query submitted over JDBC/Teradata would let
>> Teradata do the heavy lifting.
>
> We currently only push down filters, since there is a lot of
> variability in what types of aggregations various databases support.
> You can manually push down whatever you want by replacing the table
> name with a subquery (i.e. (SELECT ... FROM ...)).
>
>> - How come my second query for (5B) records didn't return anything
>> even after a long processing? If I understood correctly, Spark would
>> try to fit it in memory, and if not, might use disk space, which I
>> have available?
>
> Nothing should be held in memory for a query like this (other than a
> single count per partition), so I don't think that is the problem.
> There is likely an error buried somewhere.
>
>> - Am I supposed to do any Spark-related tuning to make it work? My
>> main need is to access data from these large table(s) on demand and
>> provide aggregated and calculated results much quicker; for that I
>> was trying out Spark. Next I am thinking to export the data to
>> Parquet files and give it a try. Do you have any suggestions to deal
>> with the problem?
>
> Exporting to Parquet will likely be a faster option than trying to
> query through JDBC, since we have many more opportunities for
> parallelism here.
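To make the subquery pushdown Michael describes concrete, here is a
minimal PySpark sketch against the Spark 1.4 read.jdbc API. The URL,
table name, and alias below are placeholders, not details from the
thread:

    # Sketch of manual pushdown: a parenthesized subquery (with an
    # alias) replaces the table name, so Teradata computes the count
    # and Spark only reads back the single-row result.
    # URL and table name are placeholders.
    agg = sqlContext.read.jdbc(
        url="jdbc:teradata://host/DATABASE=mydb",
        table="(SELECT COUNT(*) AS cnt FROM big_table) t",
    )
    agg.show()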
DataFrame/JDBC very slow performance
I am trying to access a mid-size Teradata table (~100 million rows) via
JDBC in standalone mode on a single node (local[*]). When I tried a BIG
table (5B records), no results were returned upon completion of the
query. I am using Spark 1.4.1, set up on a very powerful machine (2
CPUs, 24 cores, 126G RAM). I have tried several memory setups and
tuning options to make it work faster, but none of them made a huge
impact. I am sure there is something I am missing. Below is my final
try, which took about 11 minutes to get this simple count, versus only
40 seconds to get the same count using a JDBC connection through R.

bin/pyspark --driver-memory 40g --executor-memory 40g

df = sqlContext.read.jdbc(url="jdbc:teradata://..", table="...")
df.count()
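One likely culprit, offered as a guess: without partitioning
information, Spark 1.4's read.jdbc pulls the entire table through a
single connection in one task, regardless of cores or memory. Below is
a minimal sketch of a partitioned read using the Spark 1.4
DataFrameReader.jdbc parameters; the URL, table name, and partition
column are hypothetical, and the bounds must match your data:

    # Partitioned JDBC read (Spark 1.4 DataFrameReader.jdbc). All
    # names and bounds below are placeholders; the partition column
    # must be numeric.
    df = sqlContext.read.jdbc(
        url="jdbc:teradata://host/DATABASE=mydb",
        table="big_table",
        column="id",            # column used to split the table
        lowerBound=1,           # minimum value of that column
        upperBound=100000000,   # maximum value of that column
        numPartitions=48,       # concurrent reads (~2x the 24 cores)
    )
    df.count()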
Re: DataFrame/JDBC very slow performance
> Much appreciated! I am not comparing with select count(*) for
> performance, but it was one simple thing I tried to check the
> performance :). I think it now makes sense, since Spark tries to
> extract all records before doing the count. I thought having an
> aggregate function query submitted over JDBC/Teradata would let
> Teradata do the heavy lifting.

We currently only push down filters, since there is a lot of
variability in what types of aggregations various databases support.
You can manually push down whatever you want by replacing the table
name with a subquery (i.e. (SELECT ... FROM ...)).

> - How come my second query for (5B) records didn't return anything
> even after a long processing? If I understood correctly, Spark would
> try to fit it in memory, and if not, might use disk space, which I
> have available?

Nothing should be held in memory for a query like this (other than a
single count per partition), so I don't think that is the problem.
There is likely an error buried somewhere.

> - Am I supposed to do any Spark-related tuning to make it work? My
> main need is to access data from these large table(s) on demand and
> provide aggregated and calculated results much quicker; for that I
> was trying out Spark. Next I am thinking to export the data to
> Parquet files and give it a try. Do you have any suggestions to deal
> with the problem?

Exporting to Parquet will likely be a faster option than trying to
query through JDBC, since we have many more opportunities for
parallelism here.
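To illustrate the Parquet route, a minimal sketch under the same
assumptions (placeholder URL, table name, and output path). The
one-time export is still bounded by the JDBC read, but every query
after it scans the columnar copy in parallel:

    # One-time export: read over JDBC, write a Parquet copy.
    # URL, table name, and path are placeholders.
    df = sqlContext.read.jdbc(
        url="jdbc:teradata://host/DATABASE=mydb",
        table="big_table",
    )
    df.write.parquet("/data/big_table.parquet")

    # Subsequent queries hit the Parquet files, not Teradata.
    parquet_df = sqlContext.read.parquet("/data/big_table.parquet")
    parquet_df.count()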