[ https://issues.apache.org/jira/browse/LIVY-767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zef Wolffs updated LIVY-767:
----------------------------
    Summary: Spark-Cassandra-Connector's joinWithCassandraTable makes spark use less efficient DAG  (was: Spark-Cassandra-Connector's joinWithCassandraTable makes spark use only two executors)

> Spark-Cassandra-Connector's joinWithCassandraTable makes spark use less efficient DAG
> -------------------------------------------------------------------------------------
>
>                 Key: LIVY-767
>                 URL: https://issues.apache.org/jira/browse/LIVY-767
>             Project: Livy
>          Issue Type: Bug
>          Components: Interpreter
>    Affects Versions: 0.7.0
>         Environment: Spark and Livy running in containers on Kubernetes.
> Deployed using Microsoft's Helm chart:
> https://hub.helm.sh/charts/microsoft/spark.
>            Reporter: Zef Wolffs
>            Priority: Major
>
> I have a script that uses joinWithCassandraTable, a method available through
> the spark-cassandra-connector. This script usually claims all available
> executors (about 25 in my setup) for this method when run from anywhere
> directly through spark-submit. However, when I run it through Livy, Spark
> claims only two executors, which makes the job much slower.
>
> The script is the following:
>
> {noformat}
> import java.sql.Timestamp
> import com.datastax.spark.connector._ // provides joinWithCassandraTable
>
> case class TimeSeriesContainerID(timeseriescontainerid: Int) // defines the partition key
> val timeSeriesContainerIDofInterest = sc.parallelize(1 to 20).map(TimeSeriesContainerID(_))
> val rdd = timeSeriesContainerIDofInterest.joinWithCassandraTable[(Timestamp, Int)]("test_ts_partitions", "ts_0")
> val aggregatedRDD = rdd.select("timestamp", "tsvalue").map(s => s._2).reduceByKey((a, b) => a + b).collect.head
> {noformat}
>
> I am happy to provide more information as requested.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
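(As context for triage: a sketch of a Livy session-creation payload, assuming the executor shortfall could come from session-level defaults rather than from the connector itself. The field names `kind`, `numExecutors`, and `conf` are from Livy's POST /sessions REST API; the concrete values below are illustrative, not from this report.)

```json
{
  "kind": "spark",
  "numExecutors": 25,
  "conf": {
    "spark.dynamicAllocation.enabled": "true",
    "spark.default.parallelism": "25"
  }
}
```

If a session created with an explicit executor count still runs the join on two executors, that would point at the DAG produced under Livy rather than at resource allocation.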