Hmm, I misread that you need a sliding window. Thinking out loud here: one way of dealing with this is to improve NLineInputFormat so that partitions have a small overlapping portion; in this case the overlapping portion is 50 columns. So, say the matrix is divided into overlapping partitions like this: [100 x col[1, n*50]], [100 x col[(n-1)*50+1, (2n-1)*50]], ... Then we can assign each partition to a mapper and run mapPartitions on it, as sketched below.
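Roughly, the windowing arithmetic could look like the following. This is a sketch only: totalCols, windowCols, overlap, and processWindow are illustrative names, not an existing Spark API, and sc is assumed to be an in-scope SparkContext (e.g. in the spark-shell).

// Sketch only: generate overlapping column windows of a 100-row matrix
// and fan them out, one Spark task per window.
val totalCols  = 800                 // illustrative width of the matrix
val windowCols = 200                 // n*50 columns per window
val overlap    = 50                  // columns shared with the neighbour
val stride     = windowCols - overlap

// (startCol, endCol) of each overlapping partition, end exclusive:
// (0,200), (150,350), (300,500), (450,650), (600,800)
val windowRanges = (0 to (totalCols - windowCols) by stride)
  .map(start => (start, start + windowCols))

// Placeholder for the real per-window computation.
def processWindow(startCol: Int, endCol: Int): Double = ???

// One independent task per overlapping window.
val perWindow = sc.parallelize(windowRanges)
  .map { case (start, end) => processWindow(start, end) }

An NLineInputFormat-style input format would produce these overlapping partitions directly from HDFS instead of computing the ranges on the driver.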
--------------------------------------------
Michael (Bach) Bui, PhD,
Senior Staff Architect, ADATAO Inc.
www.adatao.com

On Dec 20, 2013, at 1:11 PM, Michael (Bach) Bui <free...@adatao.com> wrote:

> Here, Tom assumed that you have your big matrix already loaded on one
> machine. Now if you want to distribute it to slave nodes you will need to
> broadcast it. I would expect this broadcast to be done once at the
> beginning of your algorithm, and the computation time to dominate the
> overall execution time.
>
> On the other hand, a better way to deal with a huge matrix is to store the
> data in HDFS and load the data into each slave partition-by-partition.
> This is a fundamental data-processing pattern in the Spark/Hadoop world.
> If you opt to do this, you will have to use a suitable InputFormat to make
> sure each partition has the right number of rows.
> For example, if you are lucky and each HDFS partition has exactly n*50
> rows, then you can use rdd.mapPartitions(func), where func will take care
> of splitting the n*50-row partition into n sub-matrices.
>
> However, HDFS TextInputFormat or SequenceFileInputFormat will not
> guarantee that each partition has a certain number of rows. What you want
> is NLineInputFormat, which I think has not been pulled into Spark yet.
> If everyone thinks this is needed, I can implement it quickly; it should
> be pretty easy.
>
> --------------------------------------------
> Michael (Bach) Bui, PhD,
> Senior Staff Architect, ADATAO Inc.
> www.adatao.com
>
> On Dec 20, 2013, at 12:38 PM, Aureliano Buendia <buendia...@gmail.com> wrote:
>
>> On Fri, Dec 20, 2013 at 6:00 PM, Tom Vacek <minnesota...@gmail.com> wrote:
>> Oh, I see. I was thinking that there was a computational dependency from
>> one window to the next. If the computations are independent, then I think
>> Spark can help you out quite a bit.
>>
>> I think you would want an RDD where each element is a window of your
>> dense matrix. I'm not aware of a way to distribute the windows of the big
>> matrix in a way that doesn't involve broadcasting the whole thing. You
>> might have to tweak some config options, but I think it would work
>> straightaway. I would initialize the data structure like this:
>>
>> val matB = sc.broadcast(myBigDenseMatrix)
>> val distributedChunks = sc.parallelize(0 until numWindows)
>>   .mapPartitions(it => it.map(windowID => getWindow(matB.value, windowID)))
>>
>> Here broadcast is used instead of calling parallelize on myBigDenseMatrix.
>> Is it okay to broadcast a huge amount of data? Does sharing big data mean
>> a big network I/O overhead compared to calling parallelize, or is this
>> overhead optimized away by the partitioning?
>>
>> Then just apply your matrix ops as a map on that RDD.
>>
>> You may have your own tool for dense matrix ops, but I would suggest
>> Scala Breeze. You'll have to use an old version of Breeze (current builds
>> are for 2.10); Spark with Scala 2.10 is a little way off.
>>
>> On Fri, Dec 20, 2013 at 11:40 AM, Aureliano Buendia <buendia...@gmail.com> wrote:
>>
>> On Fri, Dec 20, 2013 at 5:21 PM, Tom Vacek <minnesota...@gmail.com> wrote:
>> If you use an RDD[Array[Double]] with a row decomposition of the matrix,
>> you can index windows of the rows all you want, but you're limited to 100
>> concurrent tasks. You could use a column decomposition and access subsets
>> of the columns with a PartitionPruningRDD.
>> I have to say, though, if you're doing dense matrix operations, they will
>> be hundreds of times faster on a shared-memory platform. This particular
>> matrix, at 800 MB, could be a Breeze on a single node.
>>
>> The computation for every submatrix is very expensive; it takes days on a
>> single node. I was hoping this could be reduced to hours or minutes with
>> Spark.
>>
>> Are you saying that Spark is not suitable for this type of job?
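For reference, here is a self-contained version of the broadcast-and-window pattern Tom sketched above, using Breeze for the dense matrix. The getWindow helper, the 50-column window width, and the random stand-in data are illustrative assumptions, not an existing Spark or Breeze API.

import breeze.linalg.{DenseMatrix, sum}
import org.apache.spark.{SparkConf, SparkContext}

object WindowedMatrixOps {
  // Assumed helper: extract the windowID-th block of 50 columns
  // as a fresh matrix. Illustrative only; not part of Spark or Breeze.
  def getWindow(m: DenseMatrix[Double], windowID: Int): DenseMatrix[Double] =
    m(::, windowID * 50 until (windowID + 1) * 50).copy

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("windowed-matrix-ops"))

    // Stand-in for the real 100 x N dense matrix already loaded on the driver.
    val myBigDenseMatrix = DenseMatrix.rand(100, 1000)
    val numWindows = myBigDenseMatrix.cols / 50

    // Broadcast the whole matrix once; each task materializes only its window.
    val matB = sc.broadcast(myBigDenseMatrix)
    val distributedChunks = sc.parallelize(0 until numWindows)
      .map(windowID => getWindow(matB.value, windowID))

    // The expensive per-window computation runs as independent tasks;
    // sum(...) is just a placeholder for the real days-long job.
    val results = distributedChunks.map(w => sum(w)).collect()
    sc.stop()
  }
}

Whether broadcasting an 800 MB matrix is acceptable depends on cluster memory and broadcast settings; the alternative discussed earlier in the thread is to keep the matrix in HDFS and read it partition-by-partition with an NLineInputFormat-style input format.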