Reducing the number of instances won't help in this case. We use the
driver to collect partial gradients. Even with tree aggregation, this
still puts a heavy workload on the driver with 20M features. Please try
to reduce the number of partitions before training. We are working on
a more scalable implementation of logistic regression now, which
should be able to solve this problem efficiently. -Xiangrui
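
The driver-side cost described above can be sized roughly with a
back-of-the-envelope sketch. It assumes the gradient is a dense vector of
8-byte doubles (one per feature) and that a depth-2 tree aggregation leaves
about sqrt(numPartitions) partial gradients for the driver to merge. Both
are simplifying assumptions, and the function names and partition counts
below are hypothetical, not part of Spark.

```python
import math

BYTES_PER_DOUBLE = 8  # assumption: gradient stored as dense 64-bit floats


def dense_gradient_bytes(num_features):
    """Size of one dense gradient vector, in bytes."""
    return num_features * BYTES_PER_DOUBLE


def partials_at_driver(num_partitions):
    """Rough count of partial gradients reaching the driver.

    Assumption: a depth-2 tree aggregation combines partitions down to
    about ceil(sqrt(num_partitions)) partial results before the driver
    merges them. The real count depends on the Spark version.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    # Integer ceil(sqrt(n)) without floating-point rounding issues.
    return math.isqrt(num_partitions - 1) + 1


def driver_bytes_per_iteration(num_features, num_partitions):
    """Approximate bytes shipped to the driver in one SGD iteration."""
    return dense_gradient_bytes(num_features) * partials_at_driver(num_partitions)


if __name__ == "__main__":
    features = 20_000_000  # feature count from this thread
    for parts in (64, 512, 4096):  # hypothetical partition counts
        mb = driver_bytes_per_iteration(features, parts) / 1e6
        print(f"{parts:>5} partitions -> ~{mb:,.0f} MB per iteration")
```

Under these assumptions a single gradient for 20M features is 160 MB, and
the volume arriving at the driver per iteration grows with the partition
count but not with the number of instances, which is consistent with the
advice to reduce partitions rather than data points.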

On Tue, Apr 28, 2015 at 3:43 PM, sarathkrishn...@gmail.com
<sarathkrishn...@gmail.com> wrote:
> Hi,
>
> I'm just calling the standard SVMWithSGD implementation of Spark's MLLib.
> I'm not using any method like "collect".
>
> Thanks,
> Sarath
>
> On Tue, Apr 28, 2015 at 4:35 PM, ai he <heai0...@gmail.com> wrote:
>>
>> Hi Sarath,
>>
>> It might be questionable to set num-executors to 64 if you only have 8
>> nodes. Do you use any action like "collect", which would overwhelm the
>> driver since you have a large dataset?
>>
>> Thanks
>>
>> On Tue, Apr 28, 2015 at 10:50 AM, sarath <sarathkrishn...@gmail.com>
>> wrote:
>> >
>> > I am trying to train on a large dataset consisting of 8 million data
>> > points and 20 million features using SVMWithSGD, but it fails after
>> > running for some time. I tried increasing num-partitions, driver-memory,
>> > executor-memory, and driver-max-resultSize. I also tried reducing the
>> > size of the dataset from 8 million to 25K (keeping the number of
>> > features the same, 20M), but after using the entire 64GB driver memory
>> > for 20 to 30 min it failed.
>> >
>> > I'm using a cluster of 8 nodes (each with 8 cores and 64G RAM).
>> > executor-memory - 60G
>> > driver-memory - 60G
>> > num-executors - 64
>> > And other default settings
>> >
>> > This is the error log :
>> >
>> > 15/04/20 11:51:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>> > 15/04/20 11:51:29 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
>> > 15/04/20 11:51:29 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
>> > 15/04/20 11:56:11 WARN TransportChannelHandler: Exception in connection from xxx.xxx.xxx.net/xxx.xxx.xxx.xxx:41029
>> > java.io.IOException: Connection reset by peer
>> >         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>> >         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>> >         .......
>> > 15/04/20 11:56:11 ERROR TransportResponseHandler: Still have 7 requests outstanding when connection from xxx.xxx.xxx.net/xxx.xxx.xxx.xxx:41029 is closed
>> > 15/04/20 11:56:11 ERROR OneForOneBlockFetcher: Failed while starting block fetches
>> > java.io.IOException: Connection reset by peer
>> >         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>> >         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>> >         .......
>> > 15/04/20 11:56:11 ERROR OneForOneBlockFetcher: Failed while starting block fetches
>> > java.io.IOException: Connection reset by peer
>> >         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>> >         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>> >         ...........
>> > 15/04/20 11:56:12 ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks
>> > java.io.IOException: Failed to connect to xxx.xxx.xxx.net/xxx.xxx.xxx.xxx:41029
>> >         at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:191)
>> >         at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
>> >         at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
>> >         at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>> >         at org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
>> >         at org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:87)
>> >         at org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:149)
>> >         at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:290)
>> >         at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:53)
>> >         at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>> >         at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>> >         at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>> >         at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:91)
>> >         at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:44)
>> >         at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
>> >         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>> >         at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>> >         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>> >         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>> >         at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>> >         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>> >         at org.apache.spark.scheduler.Task.run(Task.scala:64)
>> >         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>> >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> >         at java.lang.Thread.run(Thread.java:745)
>> > Caused by: java.net.ConnectException: Connection refused: xxx.xxx.xxx.net/xxx.xxx.xxx.xxx:41029
>> >         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>> >         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
>> >         at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208)
>> >         at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287)
>> >         at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
>> >         at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>> >         at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>> >         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>> >         at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
>> >         ... 1 more
>> > 15/04/20 11:56:15 ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks
>> > java.io.IOException: Failed to connect to xxx.xxx.xxx.net/xxx.xxx.xxx.xxx:41029
>> >         at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:191)
>> >
>> > Caused by: java.net.ConnectException: Connection refused: xxx.xxx.xxx.net/xxx.xxx.xxx.xxx:41029
>> >         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>> >         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
>> >         at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208)
>> >         at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287)
>> >         at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
>> >         at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>> >         at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>> >         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>> >         at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
>> >         ... 1 more
>> > 15/04/20 11:56:27 ERROR ShuffleBlockFetcherIterator: Failed to get block(s) from xxx.xxx.xxx.net:41029
>> > java.io.IOException: Failed to connect to xxx.xxx.xxx.net/xxx.xxx.xxx.xxx:41029
>> >         at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:191)
>> >         at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
>> >         at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
>> >         at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>> >         at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
>> >         at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
>> >         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>> >         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>> >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> >         at java.lang.Thread.run(Thread.java:745)
>> > Caused by: java.net.ConnectException: Connection refused: xxx.xxx.xxx.net/xxx.xxx.xxx.xxx:41029
>> >         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>> >         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
>> >         at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208)
>> >         at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287)
>> >         at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
>> >         at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>> >         at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>> >         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>> >         at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
>> >         ... 1 more
>> > 15/04/20 11:56:30 ERROR ShuffleBlockFetcherIterator: Failed to get block(s) from xxx.xxx.xxx.net:41029
>> > java.io.IOException: Failed to connect to xxx.xxx.xxx.net/xxx.xxx.xxx.xxx:41029
>> >         at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:191)
>> >         at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
>> >         at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
>> >         at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>> >         at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
>> >         at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
>> >         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>> >         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>> >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> >         at java.lang.Thread.run(Thread.java:745)
>> > Caused by: java.net.ConnectException: Connection refused: xxx.xxx.xxx.net/xxx.xxx.xxx.xxx:41029
>> >         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>> >         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
>> >         at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208)
>> >         at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287)
>> >         at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
>> >         at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>> >         at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>> >         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>> >         at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
>> >         ... 1 more
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> > http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-SVMWithSGD-is-failing-for-large-dataset-tp22694.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: user-h...@spark.apache.org
>> >
>>
>>
>>
>> --
>> Best
>> Ai
>
>
>
>
> --
> Sarath Krishna S
