Maybe your master or Zeppelin server is running out of memory, and the more data it receives, the more memory swapping it has to do. Something to check.
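One rough way to check the JVM side of this is to print heap figures from a notebook paragraph. This is only a sketch: it assumes the paragraph runs inside the Spark interpreter's driver JVM, and `HeapStats` and `heapStats` are made-up names for illustration, not part of Spark or Zeppelin.

```scala
// Heap figures for the JVM that executes this code (assumed here to be the
// Spark driver started by the Zeppelin Spark interpreter).
// HeapStats / heapStats are illustrative names only.
final case class HeapStats(usedMb: Long, committedMb: Long, maxMb: Long)

def heapStats(): HeapStats = {
  val rt = Runtime.getRuntime
  val mb = 1024L * 1024L
  HeapStats(
    usedMb = (rt.totalMemory() - rt.freeMemory()) / mb, // heap currently in use
    committedMb = rt.totalMemory() / mb,                // heap reserved from the OS so far
    maxMb = rt.maxMemory() / mb                         // upper bound the JVM may grow to
  )
}

val heap = heapStats()
println(s"heap used=${heap.usedMb} MB, committed=${heap.committedMb} MB, max=${heap.maxMb} MB")
```

If the used figure sits close to the max after the query runs, memory pressure on that process is plausible. This does not show swapping itself, which would have to be checked at the OS level on the master and Zeppelin hosts (for example with `free -m` or `vmstat`).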
On Wed, May 17, 2017 at 11:14 AM -0400, "Junaid Nasir" <jna...@an10.io> wrote:

I have a large data set of 1B records and want to run analytics using Apache Spark because of the scaling it provides, but I am seeing an anti-pattern here: the more nodes I add to the Spark cluster, the longer the job takes to complete. The data store is Cassandra, and queries are run from Zeppelin. I have tried many different queries, but even a simple `dataframe.count()` behaves like this.

Here is the Zeppelin notebook; the temp table has 18M records:

    val df = sqlContext
      .read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "temp", "keyspace" -> "mykeyspace"))
      .load().cache()
    df.registerTempTable("table")

    %sql
    SELECT first(devid), date, count(1) FROM table GROUP BY date, rtu ORDER BY date

When tested against different numbers of Spark worker nodes, these were the results:

    Spark nodes   Time
    4 nodes       22 min 58 sec
    3 nodes       15 min 49 sec
    2 nodes       12 min 51 sec
    1 node        17 min 59 sec

Increasing the number of nodes decreases performance, which should not happen, as it defeats the purpose of using Spark.

If you want me to run any query, or need further info about the setup, please ask. Any clues on why this is happening would be very helpful; I have been stuck on this for two days now. Thank you for your time.

**versions**

Zeppelin: 0.7.1
Spark: 2.1.0
Cassandra: 2.2.9
Connector: datastax:spark-cassandra-connector:2.0.1-s_2.11

Spark cluster specs: 6 vCPUs, 32 GB memory = 1 node
Cassandra + Zeppelin server specs: 8 vCPUs, 52 GB memory
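
To make the scaling behaviour in the reported timings easier to read, they can be converted to seconds and expressed as a speedup relative to the single-node run. This is a small plain-Scala sketch that uses only the four numbers from the message; the names `timings`, `baseline`, and `speedups` are illustrative.

```scala
// Wall-clock times copied from the message, as (worker nodes, seconds).
val timings: Seq[(Int, Int)] = Seq(
  1 -> (17 * 60 + 59),
  2 -> (12 * 60 + 51),
  3 -> (15 * 60 + 49),
  4 -> (22 * 60 + 58)
)

// Single-node run used as the reference point.
val baseline: Double = timings.head._2.toDouble

// Speedup relative to one node; ideal linear scaling would give speedup == nodes.
val speedups: Seq[(Int, Double)] =
  timings.map { case (nodes, secs) => nodes -> baseline / secs }

speedups.foreach { case (nodes, s) =>
  println(f"$nodes%d node(s): speedup $s%.2f")
}
```

Read this way, two nodes are roughly 1.4 times faster than one, three nodes are only about 1.14 times faster, and four nodes fall below the single-node run (about 0.78). So the slowdown sets in beyond two workers rather than from the first node added.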