I'd like to understand how parallelism works in the DBSCAN routine in scikit-learn when running on a Cray computer, and what I should do to improve the results I'm seeing.

I adapted the existing example at [https://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html#sphx-glr-auto-examples-cluster-plot-dbscan-py] to run with 100,000 points, so that the processing time is long enough to measure and compare reliably. I varied the parameter "n_jobs = x", with "x" ranging from 1 to 6. I repeated each experiment several times and calculated the average processing time.
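
For reference, the experiment described above could be sketched roughly as follows. This is an illustrative reconstruction, not the exact script that produced the numbers below: the blob centers, "eps" and "min_samples" are taken from the linked example, while the helper name `time_dbscan`, the repeat count and the use of `time.perf_counter` (which reports seconds) are assumptions.

```python
import time
from statistics import mean

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler


def time_dbscan(n_samples=100_000, n_jobs_values=range(1, 7), repeats=3):
    """Return {n_jobs: mean DBSCAN fit time in seconds} on synthetic blobs.

    Hypothetical helper: name, defaults and repeat count are assumptions.
    """
    # Same synthetic data recipe as the linked example, with more points.
    centers = [[1, 1], [-1, -1], [1, -1]]
    X, _ = make_blobs(
        n_samples=n_samples, centers=centers, cluster_std=0.4, random_state=0
    )
    X = StandardScaler().fit_transform(X)

    results = {}
    for n_jobs in n_jobs_values:
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            # Only the clustering step is timed, not the data generation.
            DBSCAN(eps=0.3, min_samples=10, n_jobs=n_jobs).fit(X)
            timings.append(time.perf_counter() - start)
        results[n_jobs] = mean(timings)
    return results
```

Calling `time_dbscan()` with the defaults runs the full 100,000-point case for "n_jobs" from 1 to 6; passing a smaller `n_samples` first is a quick way to check that the script works.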

n_jobs  time
1       21.3
2       15.1
3       14.8
4       15.2
5       15.5
6       15.0

The resulting times are shown in the table above and in the attached image. As can be seen, there was a real gain only at "n_jobs = 2", and essentially no further improvement for larger values. Even then, the gain was less than 30%!
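
The size of the gain can be checked directly from the averaged times in the table. The short sketch below uses only those six values; the function name `summarize` is illustrative.

```python
# Averaged times copied from the table above, keyed by n_jobs.
times = {1: 21.3, 2: 15.1, 3: 14.8, 4: 15.2, 5: 15.5, 6: 15.0}


def summarize(times):
    """Return {n_jobs: (speedup, percent_reduction)} relative to n_jobs=1."""
    baseline = times[1]
    summary = {}
    for n_jobs, t in times.items():
        speedup = baseline / t                         # how many times faster
        reduction = 100.0 * (baseline - t) / baseline  # % of time saved
        summary[n_jobs] = (speedup, reduction)
    return summary


for n_jobs, (speedup, reduction) in summarize(times).items():
    print(f"n_jobs={n_jobs}: speedup {speedup:.2f}x, time reduced by {reduction:.1f}%")
```

With perfect scaling, the speedup would equal "n_jobs"; the printed values show how far the measurements fall short of that.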

Why were the gains so small? Why was there no further gain for larger values of the "n_jobs" parameter? Is it possible to improve on the results I obtained?

--
Regards,
Mauricio Reis
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
