Is It Feasible for Spark 1.1 Broadcast to Fully Utilize the Ethernet Card Throughput?

2015-01-09 Thread Jun Yang
Guys,

I have a question regarding the Spark 1.1 broadcast implementation.

In our pipeline, we have a large multi-class LR model, which is about 1 GiB
in size.
To take advantage of Spark's parallelism, the natural approach is to
broadcast this model file to the worker nodes.

However, broadcast performance does not look very good.

While the model file was being broadcast, I monitored the network card
throughput of the worker nodes. Their recv/write throughput was only around
30~40 MiB/s (our servers are equipped with 100 MiB/s Ethernet cards).
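
As a rough back-of-the-envelope check of what these figures mean, here is a small sketch. It assumes the throughput numbers are MiB per second and ignores protocol and serialization overhead, so treat the results as estimates only.

```python
# Back-of-the-envelope transfer time for the 1 GiB model.
# Assumption: the throughput figures quoted above are MiB per second.

MODEL_SIZE_MIB = 1024  # 1 GiB expressed in MiB


def transfer_seconds(size_mib, rate_mib_per_s):
    """Time to move size_mib at a steady rate, ignoring all overhead."""
    return size_mib / rate_mib_per_s


# Observed range (30 and 40 MiB/s) versus the card's nominal rate.
observed = [transfer_seconds(MODEL_SIZE_MIB, r) for r in (30, 40)]
line_rate = transfer_seconds(MODEL_SIZE_MIB, 100)

print("observed rate: %.1f to %.1f s" % (observed[1], observed[0]))
print("full card rate: %.1f s" % line_rate)
```

Under these assumptions, a worker needs roughly 26 to 34 seconds to receive the model at the observed rate, against about 10 seconds at the full card rate.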

Is this a real limitation of the Spark 1.1 broadcast implementation? Or are
there configuration settings or tricks
that can help Spark broadcast perform better?

Thanks



-- 
yangjun...@gmail.com
http://hi.baidu.com/yjpro


Re: Is It Feasible for Spark 1.1 Broadcast to Fully Utilize the Ethernet Card Throughput?

2015-01-09 Thread Akhil Das
You can try the following:

- Increase spark.akka.frameSize (default is 10 MB)
- Try using TorrentBroadcast
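
A minimal sketch of how these two settings might be passed on the command line. The frame size value of 128 is purely illustrative, and the helper `to_submit_flags` is a hypothetical name introduced here; the same properties could instead be placed in spark-defaults.conf or set on a SparkConf.

```python
# Sketch: render the two suggestions as spark-submit --conf flags.
# The frame size value (in MB) is an illustrative assumption,
# not a recommendation.

suggested_conf = {
    "spark.akka.frameSize": "128",
    "spark.broadcast.factory":
        "org.apache.spark.broadcast.TorrentBroadcastFactory",
}


def to_submit_flags(conf):
    """Turn a property dict into a list of spark-submit arguments."""
    flags = []
    for key in sorted(conf):
        flags += ["--conf", "%s=%s" % (key, conf[key])]
    return flags


print(" ".join(to_submit_flags(suggested_conf)))
```

Whether either setting actually raises broadcast throughput would need to be measured on the cluster in question.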

Thanks
Best Regards



Re: Is It Feasible for Spark 1.1 Broadcast to Fully Utilize the Ethernet Card Throughput?

2015-01-09 Thread Davies Liu
In the current implementation of TorrentBroadcast, the blocks are
fetched one by one
in a single thread, so it cannot fully utilize the network bandwidth.
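
A toy model of why fetching blocks one at a time can cap throughput: if each block pays a fixed wait before its bytes flow, the link sits idle during that wait. This is a crude sketch, not a description of Spark internals; the block size, per-block latency, and link rate below are assumed values chosen for illustration, and `effective_rate` is a hypothetical helper.

```python
# Toy model: sequential block fetches leave the link idle during each
# per-block wait. All numbers are illustrative assumptions.


def effective_rate(block_mib, latency_s, link_mib_per_s, parallel=1):
    """Approximate steady-state MiB/s when `parallel` fetches overlap."""
    per_block = latency_s + block_mib / link_mib_per_s
    rate = parallel * block_mib / per_block
    return min(rate, link_mib_per_s)  # the link itself is the ceiling


# Assumed: 4 MiB blocks, 60 ms fixed cost per block, 100 MiB/s link.
single = effective_rate(4, 0.06, 100)
overlapped = effective_rate(4, 0.06, 100, parallel=4)

print("one fetch at a time: %.0f MiB/s" % single)
print("four overlapping fetches: %.0f MiB/s" % overlapped)
```

With these assumed numbers, a single sequential fetcher reaches only about 40% of the link rate, while a few overlapping fetches would saturate the link in this model.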

Davies




Re: Is It Feasible for Spark 1.1 Broadcast to Fully Utilize the Ethernet Card Throughput?

2015-01-12 Thread lihu
What is your use case? Do you need to use a lot of broadcasts? If not, it
would be better to focus on other things.

At this time, there is no better method than TorrentBroadcast. Although
blocks are fetched one by one, once a node has received the data it can
immediately act as a data source.