Hi folks,

Hopefully someone with more Spark experience than me can explain this a bit.
I don't know if this is possible, impossible, or just an old design that could be better. I'm running Sparkler as a spark-submit job on a Databricks Spark cluster, and it gets to this point in the code (https://github.com/USCDataScience/sparkler/blob/master/sparkler-core/sparkler-app/src/main/scala/edu/usc/irds/sparkler/pipeline/Crawler.scala#L222-L226):

    val fetchedRdd = rdd.map(r => (r.getGroup, r))
      .groupByKey()
      .flatMap({ case (grp, rs) =>
        new FairFetcher(job, rs.iterator, localFetchDelay,
          FetchFunction, ParseFunction, OutLinkFilterFunction, StatusUpdateSolrTransformer)
      })
      .persist()

This takes the RDD and runs a web-based crawl over each group in it, returning the results. But when Spark executes it, it runs all the crawls on the driver node and doesn't distribute them. The key is pretty static in these tests, so I have also tried forcing the partition count (50 partitions on a cluster with 16 cores per node) and repartitioning, but every time all the tasks are scheduled on a single node.

What can I do to distribute the tasks better? Processing the data in the RDD isn't the bottleneck; fetching the crawl data is the bottleneck, but that happens after the work has already been assigned to a node.

Thanks
Tom
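P.S. In case it helps frame the question: my (possibly wrong) understanding of why forcing the partition count didn't help is that groupByKey hash-partitions records by key, so with an essentially static key every record maps to the same partition, and hence the same node, no matter how many partitions I ask for. A minimal sketch of that arithmetic (this is not Sparkler code; partitionFor just mimics what I understand Spark's HashPartitioner to do):

```scala
// Simplified version of Spark's HashPartitioner.getPartition:
// hash the key, take it modulo the partition count, and fix up
// negative results so the index is always in [0, numPartitions).
def partitionFor(key: Any, numPartitions: Int): Int = {
  val raw = key.hashCode % numPartitions
  if (raw < 0) raw + numPartitions else raw
}

// With one (static) group key, every record lands in the same partition,
// even with 50 partitions available:
val staticKeys = Seq.fill(1000)("default-group")
val usedStatic = staticKeys.map(partitionFor(_, 50)).distinct // one partition

// With many distinct keys, the same records spread over many partitions:
val variedKeys = (0 until 1000).map(i => s"group-$i")
val usedVaried = variedKeys.map(partitionFor(_, 50)).distinct // many partitions
```

If that reading is right, repartitioning before groupByKey can't help, because the shuffle re-collapses everything onto the single key's partition. Please correct me if I've misunderstood.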