Hi folks, 

Hopefully someone with more Spark experience than me can explain this a bit.

I don't know if this is possible, impossible, or just an old design that could 
be improved.

I'm running Sparkler as a spark-submit job on a Databricks Spark cluster, and 
it's getting to this point in the code 
(https://github.com/USCDataScience/sparkler/blob/master/sparkler-core/sparkler-app/src/main/scala/edu/usc/irds/sparkler/pipeline/Crawler.scala#L222-L226):

val fetchedRdd = rdd.map(r => (r.getGroup, r))
  .groupByKey()
  .flatMap({ case (grp, rs) => new FairFetcher(job, rs.iterator, localFetchDelay,
    FetchFunction, ParseFunction, OutLinkFilterFunction, StatusUpdateSolrTransformer) })
  .persist()

This takes the RDD, groups the records by key, and runs a web-based crawl over 
each group, returning the fetched results. But when Spark executes it, all the 
crawls run on the driver node and aren't distributed across the cluster.
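
For reference, here's a quick way to see where the groups land after the 
groupByKey (a sketch only, reusing the rdd and r.getGroup names from the 
snippet above):

// Count how many groups end up in each partition after the groupByKey.
// With a mostly-static key I'd expect nearly everything to land in a single
// partition, which would mean one task doing all of the fetching.
val grouped = rdd.map(r => (r.getGroup, r)).groupByKey()
val groupsPerPartition = grouped
  .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
  .collect()
groupsPerPartition.foreach { case (idx, n) => println(s"partition $idx -> $n group(s)") }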

The key is pretty static in these tests, so I have also tried forcing the 
partition count (50, on a cluster with 16 cores per node) and repartitioning, 
but every time all the tasks are scheduled onto one node.
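
In case the exact shape matters, what I tried was roughly the following (a 
sketch; the real calls may have differed slightly):

// Same pipeline as above, but with an explicit partition count of 50 on the
// groupByKey and a repartition of the grouped RDD before the fetch step.
val fetchedRdd = rdd.map(r => (r.getGroup, r))
  .groupByKey(50)
  .repartition(50)
  .flatMap({ case (grp, rs) => new FairFetcher(job, rs.iterator, localFetchDelay,
    FetchFunction, ParseFunction, OutLinkFilterFunction, StatusUpdateSolrTransformer) })
  .persist()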

What can I do to distribute these tasks better? Processing the data in the RDD 
isn't the bottleneck; fetching the crawl data is, and that only happens after 
the work has already been assigned to a single node.

Thanks

Tom

