Hi,

I have been using a standalone Spark cluster (v1.4.x) with the following
configuration: 2 nodes, each running a worker with 1 core and 4g of memory.
So my app gets 2 executors with 2 cores and 8g of memory in total.
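
For reference, I set this up in the app roughly as follows (the app name and
master URL here are just placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // standalone mode: 4g per executor, 2 cores in total across the cluster
    val conf = new SparkConf()
      .setAppName("MyApp")                     // placeholder app name
      .setMaster("spark://<master-host>:7077") // placeholder master URL
      .set("spark.executor.memory", "4g")
      .set("spark.cores.max", "2")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)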

I have a table in a MySQL database with around 10 million rows and around 10
columns of integer, string, and date types (say table1, with columns c1 to
c10).

I run the following queries (a rough sketch of how I execute them through
Spark SQL follows the list):


   1. select count(*) from table1 - completes within seconds
   2. select c1, count(*) from table1 group by c1 - completes within seconds,
   but takes longer than Q1
   3. select c1, c2, count(*) from table1 group by c1, c2 - same behavior
   as Q2
   4. select c1, c2, c3, c4, count(*) from table1 group by c1, c2, c3, c4 -
   took a few minutes to finish
   5. select c1, c2, c3, c4, count(*) from table1 group by c1, c2, c3, c4, c5
   - the executor goes OOM within a few minutes! (this just adds one more
   column to the group by)
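
For context, this is roughly how I load the table and run the queries - the
exact calls are from memory and the connection details are placeholders:

    // load table1 through the JDBC data source (Spark 1.4 DataFrameReader)
    val df = sqlContext.read.format("jdbc")
      .option("url", "jdbc:mysql://<host>:3306/<db>") // placeholder connection URL
      .option("dbtable", "table1")
      .option("user", "<user>")
      .option("password", "<password>")
      .load()
    df.registerTempTable("table1")

    // e.g. query 5
    sqlContext.sql(
      "select c1, c2, c3, c4, count(*) from table1 group by c1, c2, c3, c4, c5")
      .show()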

It seems that the more group-by columns I add, the more the runtime grows -
almost exponentially! Is this the expected behavior?
I was monitoring the MySQL process list, and observed that the data was
transmitted to the executors within a few seconds without an issue.
NOTE: I am not using any partition columns here. So, AFAIU, there's
essentially only a single partition for the JDBC RDD.
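
If it matters, I believe a partitioned read would look something like this
(the partition column and bounds are just illustrative):

    // split the read into 4 partitions by range over a numeric column
    val partitionedDf = sqlContext.read.format("jdbc")
      .option("url", "jdbc:mysql://<host>:3306/<db>") // placeholder connection URL
      .option("dbtable", "table1")
      .option("partitionColumn", "c1") // assuming c1 is numeric
      .option("lowerBound", "1")
      .option("upperBound", "10000000")
      .option("numPartitions", "4")
      .load()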

I ran the same query (query 5) in the MySQL console and got a result within
3 minutes! So I'm wondering what the issue could be here. This OOM exception
is actually a blocker for me!

Is there any other tuning I should do? It certainly worries me to see MySQL
return a result significantly faster than Spark here!

Looking forward to hearing from you!

Best

-- 
Niranda Perera
@n1r44 <https://twitter.com/N1R44>
+94 71 554 8430
https://www.linkedin.com/in/niranda
https://pythagoreanscript.wordpress.com/
