Looking inside the 'mapPartitions' transformation: some confusing observations

2015-05-11 Thread myasuka
As we all know, a partition in Spark is actually an Iterator[T]. For some purposes, I want to treat each partition not as an Iterator but as one whole object, for example, to convert an Iterator[Int] into a breeze.linalg.DenseVector[Int]. Thus I use the 'mapPartitions' API to achieve this; however, during the implement
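A minimal sketch of that conversion, assuming a live SparkContext sc and Breeze on the classpath (the variable names are illustrative, not from the thread):

    import breeze.linalg.DenseVector
    import org.apache.spark.rdd.RDD

    val data: RDD[Int] = sc.parallelize(1 to 1000, numSlices = 4)

    // mapPartitions hands each partition over as an Iterator[Int];
    // materializing it with toArray yields one whole object per partition.
    val vectors: RDD[DenseVector[Int]] = data.mapPartitions { iter =>
      Iterator.single(DenseVector(iter.toArray))
    }

Note that mapPartitions must still return an Iterator, hence the Iterator.single wrapping the materialized vector; the result carries one DenseVector per partition.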

Re: How to use multiple threads in an RDD map function?

2014-09-27 Thread myasuka
Thank you for your reply. Actually, we have already used this parameter. Our cluster is a standalone cluster with 16 nodes, and every node has 16 cores. We have 256 matrix pairs along with 256 tasks. When we set --total-executor-cores to 64, each node can launch 4 tasks simultaneously, each task
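On a standalone cluster, --total-executor-cores corresponds to the spark.cores.max property; a sketch of the equivalent in-code setting (the application name is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    // Cap the whole application at 64 cores; on a 16-node cluster that
    // spreads out to 4 concurrent tasks per node, as described above.
    val conf = new SparkConf()
      .setAppName("MatrixMultiply")   // placeholder name
      .set("spark.cores.max", "64")   // what --total-executor-cores sets
    val sc = new SparkContext(conf)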

How to use multiple threads in an RDD map function?

2014-09-27 Thread myasuka
Hi, everyone. I have come across a problem with increasing the concurrency. In a program, after the shuffle write, each node should fetch 16 matrix pairs to do matrix multiplication, such as: import breeze.linalg.{DenseMatrix => BDM} pairs.map(t => { val b1 = t._2._1.asInstanceOf[BDM[Do
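Since the preview cuts off mid-cast, here is a guess at the complete shape of that snippet; the key type, the Any-typed values, and the random test matrices are assumptions, and a SparkContext sc is presumed:

    import breeze.linalg.{DenseMatrix => BDM}

    // Hypothetical input: keys paired with two matrices; Any mimics the
    // post-shuffle values that the original code casts with asInstanceOf.
    val pairs = sc.parallelize(Seq(
      (0, (BDM.rand(4, 4): Any, BDM.rand(4, 4): Any))
    ))

    val products = pairs.map { t =>
      val b1 = t._2._1.asInstanceOf[BDM[Double]]  // first matrix of the pair
      val b2 = t._2._2.asInstanceOf[BDM[Double]]  // second matrix of the pair
      (t._1, b1 * b2)                             // Breeze dense matrix product
    }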

Re: Why recommend 2-3 tasks per CPU core?

2014-09-24 Thread myasuka
> You get better utilization when you're higher than 1. Aaron Davidson goes into this more somewhere in this talk -- https://www.youtube.com/watch?v=dmL0N3qfSc8 On Mon, Sep 22, 2014 at 11:52 PM, Nicholas Chammas <nicholas.chammas@> wrote:

Why recommend 2-3 tasks per CPU core?

2014-09-22 Thread myasuka
We are now implementing a matrix multiplication algorithm on Spark that was previously designed in the traditional MPI working style. It assumes every core in the grid computes in parallel. Now, in our development environment, each executor node has 16 cores, and I assign 16 tasks to each executor node to
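The 2-3 recommendation amounts to oversubscribing partitions relative to cores, so that a core finishing one task early can immediately pick up another instead of idling. A sketch, assuming the 16-node, 16-core-per-node cluster described elsewhere in this digest and an existing RDD pairs:

    // 16 nodes x 16 cores = 256 cores in total; 2-3 tasks per core means
    // roughly 512-768 partitions rather than exactly one task per core.
    val totalCores = 16 * 16
    val oversubscribed = pairs.repartition(totalCores * 2)  // 512 partitions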