After you load the data, call `.repartition(<number of executors>).cache()` on the RDD. If the data is already evenly distributed, it is hard to guess the root cause. Do the two clusters have the same inter-node bandwidth?

-Xiangrui
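A minimal sketch of the repartition-and-cache suggestion, written as a standalone Scala job. The input path, the executor count, and the object name are hypothetical placeholders, not values from this thread. It also prints the record count per partition, which is one way to check whether the data is evenly distributed.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RepartitionSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical values: replace with your own input path and executor count.
    val inputPath = "hdfs:///path/to/training/data"
    val numExecutors = 16

    val sc = new SparkContext(new SparkConf().setAppName("RepartitionSketch"))

    // By default, reading from HDFS gives about one partition per block,
    // so the initial layout follows the HDFS block placement.
    val raw = sc.textFile(inputPath)

    // Shuffle into numExecutors partitions and keep the result in memory.
    val data = raw.repartition(numExecutors).cache()

    // Count records per partition; running this action also fills the cache.
    val sizes = data
      .mapPartitionsWithIndex((idx, iter) => Iterator((idx, iter.size)))
      .collect()

    sizes.sortBy(_._1).foreach { case (idx, n) =>
      println(s"partition $idx: $n records")
    }

    sc.stop()
  }
}
```

If the printed counts differ widely, the skew is in the data itself; if they are close, the slowdown is more likely elsewhere (network, driver, or hardware).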
On Tue, Jul 29, 2014 at 11:06 PM, Tan Tim <unname...@gmail.com> wrote:
> input data is evenly distributed to the executors.
> ----
> The input data is on HDFS, not on the Spark clusters. How can I make the
> data distributed to the executors?
>
>
> On Wed, Jul 30, 2014 at 1:52 PM, Xiangrui Meng <men...@gmail.com> wrote:
>>
>> The weight vector is usually dense and if you have many partitions,
>> the driver may slow down. You can also take a look at the driver
>> memory inside the Executor tab in WebUI. Another setting to check is
>> the HDFS block size and whether the input data is evenly distributed
>> to the executors. Are the hardware specs the same for the two
>> clusters? -Xiangrui
>>
>> On Tue, Jul 29, 2014 at 10:46 PM, Tan Tim <unname...@gmail.com> wrote:
>> > The application is Logistic Regression (OWLQN); we developed a sparse
>> > vector version. The feature dimension is 1M+, but it is very sparse.
>> > This application runs on another Spark cluster, where every stage
>> > takes about 50 seconds and every executor has high CPU usage. The only
>> > difference is the OS (the faster one is Ubuntu, and the slower one is
>> > CentOS).