I am trying to access a mid-size Teradata table (~100 million rows) via JDBC in standalone mode on a single node (local[*]). When I tried with a BIG table (5B records), the query completed but no results were returned.
I am using Spark 1.4.1, set up on a very powerful machine (2 CPUs, 24 cores, 126 GB RAM). I have tried several memory settings and tuning options to make it faster, but none of them made a significant impact. I am sure there is something I am missing. Below is my latest attempt, which took about 11 minutes to get this simple count, versus only 40 seconds through a JDBC connection from R:

bin/pyspark --driver-memory 40g --executor-memory 40g

df = sqlContext.read.jdbc("jdbc:teradata://......")
df.count()
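Is the problem that read.jdbc without partitioning options pulls the whole table through a single JDBC connection in one task? My understanding is that a partitioned read would look roughly like the sketch below, where the table name, partitioning column, and bounds are placeholder values for illustration, not my real schema:

# Partitioned JDBC read: splits the table across numPartitions
# concurrent connections by ranges of a numeric column.
df = sqlContext.read.jdbc(
    url="jdbc:teradata://......",   # same connection string as above
    table="mytable",                # placeholder table name
    column="id",                    # assumed numeric partitioning column
    lowerBound=1,
    upperBound=100000000,           # placeholder bounds for ~100M rows
    numPartitions=24)               # one partition per core
df.count()

Would this be the right direction, or is there something else I should be tuning?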