[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/3200 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-71884182 LGTM. Merged into master. Thanks!
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-71871259 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26225/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-71871237 [Test build #26225 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26225/consoleFull) for PR 3200 at commit [`a8eace2`](https://github.com/apache/spark/commit/a8eace2681840b90e6c5690afd317273d461e2c4).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class BlockMatrix(`
Github user brkyvz commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-71859402 @mengxr I don't know if `rows` and `cols` will be confusing in terms of naming in GridPartitioner... However, since it is private and internal, maybe it's not that big of a problem?
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-71856884 [Test build #26225 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26225/consoleFull) for PR 3200 at commit [`a8eace2`](https://github.com/apache/spark/commit/a8eace2681840b90e6c5690afd317273d461e2c4).
* This patch merges cleanly.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-71725639 [Test build #26178 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26178/consoleFull) for PR 3200 at commit [`5eecd48`](https://github.com/apache/spark/commit/5eecd487c54f65a9eacdf8b29eea218e0a27eb20).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class DenseMatrix(`
  * `class BlockMatrix(`
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-71725650 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26178/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-71713783 [Test build #26178 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26178/consoleFull) for PR 3200 at commit [`5eecd48`](https://github.com/apache/spark/commit/5eecd487c54f65a9eacdf8b29eea218e0a27eb20).
* This patch merges cleanly.
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23596665

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---
@@ -0,0 +1,242 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.linalg.distributed
+
+import breeze.linalg.{DenseMatrix => BDM}
+
+import org.apache.spark.{Logging, Partitioner}
+import org.apache.spark.mllib.linalg._
+import org.apache.spark.mllib.rdd.RDDFunctions._
+import org.apache.spark.rdd.RDD
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * A grid partitioner, which stores every block in a separate partition.
+ *
+ * @param numRowBlocks Number of blocks that form the rows of the matrix.
+ * @param numColBlocks Number of blocks that form the columns of the matrix.
+ */
+private[mllib] class GridPartitioner(
+    val numRowBlocks: Int,
+    val numColBlocks: Int,
+    val numParts: Int) extends Partitioner {
+  // Having the number of partitions greater than the number of sub matrices does not help
+  override val numPartitions = math.min(numParts, numRowBlocks * numColBlocks)
+
+  /**
+   * Returns the index of the partition the SubMatrix belongs to. Tries to achieve block-wise
+   * partitioning.
+   *
+   * @param key The key for the SubMatrix. Can be its position in the grid (its column-major index)
+   *            or a tuple of three integers that are the final row index after the multiplication,
+   *            the index of the block to multiply with, and the final column index after the
+   *            multiplication.
+   * @return The index of the partition, which the SubMatrix belongs to.
+   */
+  override def getPartition(key: Any): Int = {
+    key match {
+      case (blockRowIndex: Int, blockColIndex: Int) =>
+        getBlockId(blockRowIndex, blockColIndex)
+      case (blockRowIndex: Int, innerIndex: Int, blockColIndex: Int) =>
+        getBlockId(blockRowIndex, blockColIndex)
+      case _ =>
+        throw new IllegalArgumentException(s"Unrecognized key. key: $key")
+    }
+  }
+
+  /** Partitions sub-matrices as blocks with neighboring sub-matrices. */
+  private def getBlockId(blockRowIndex: Int, blockColIndex: Int): Int = {
+    val totalBlocks = numRowBlocks * numColBlocks
+    // Gives the number of blocks that need to be in each partition
+    val partitionRatio = math.ceil(totalBlocks * 1.0 / numPartitions).toInt
+    // Number of neighboring blocks to take in each row
+    val subBlocksPerRow = math.ceil(numRowBlocks * 1.0 / partitionRatio).toInt
+    // Number of neighboring blocks to take in each column
+    val subBlocksPerCol = math.ceil(numColBlocks * 1.0 / partitionRatio).toInt
+    // Coordinates of the block
+    val i = blockRowIndex / subBlocksPerRow
+    val j = blockColIndex / subBlocksPerCol
+    val blocksPerRow = math.ceil(numRowBlocks * 1.0 / subBlocksPerRow).toInt
+    j * blocksPerRow + i
+  }
+
+  /** Checks whether the partitioners have the same characteristics */
+  override def equals(obj: Any): Boolean = {
+    obj match {
+      case r: GridPartitioner =>
+        (this.numRowBlocks == r.numRowBlocks) && (this.numColBlocks == r.numColBlocks) &&
+          (this.numPartitions == r.numPartitions)
+      case _ =>
+        false
+    }
+  }
+}
+
+/**
+ * Represents a distributed matrix in blocks of local matrices.
+ *
+ * @param rdd The RDD of SubMatrices (local matrices) that form this matrix
+ * @param nRows Number of rows of this matrix
+ * @param nCols Number of columns of this matrix
+ * @param numRowBlocks Number of blocks that form the rows of this matrix
+ * @param numColBlocks Number of blocks that form the columns of this matrix
+ * @param rowsPerBlock Number of rows that make up each block. The blocks forming the fi
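The `getBlockId` arithmetic in the quoted diff maps a block's (row, column) coordinates to a partition id by grouping neighborhoods of adjacent sub-matrices together. A minimal Python re-implementation of that same arithmetic, for illustration only (the actual class is the Scala `GridPartitioner` above):

```python
import math

def grid_partition(block_row, block_col,
                   num_row_blocks, num_col_blocks, num_parts):
    """Mirrors GridPartitioner.getBlockId: assign neighboring blocks
    of the grid to the same partition."""
    # Having more partitions than sub-matrices does not help
    num_partitions = min(num_parts, num_row_blocks * num_col_blocks)
    total_blocks = num_row_blocks * num_col_blocks
    # Number of blocks that need to be in each partition
    partition_ratio = math.ceil(total_blocks / num_partitions)
    # Number of neighboring blocks to take in each row / column
    sub_blocks_per_row = math.ceil(num_row_blocks / partition_ratio)
    sub_blocks_per_col = math.ceil(num_col_blocks / partition_ratio)
    # Coordinates of the neighborhood the block falls into
    i = block_row // sub_blocks_per_row
    j = block_col // sub_blocks_per_col
    blocks_per_row = math.ceil(num_row_blocks / sub_blocks_per_row)
    return j * blocks_per_row + i

# A 6x6 grid of blocks over 9 partitions: each partition covers a
# 2x2 neighborhood of blocks.
print(grid_partition(0, 0, 6, 6, 9))  # → 0
print(grid_partition(5, 5, 6, 6, 9))  # → 8
```

Note that, as in the Scala version, the key during multiplication may also be a 3-tuple `(blockRowIndex, innerIndex, blockColIndex)`; the middle index is ignored, so every partial product destined for the same output block lands in the same partition.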
Github user brkyvz commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23589329 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala (same excerpt as quoted above) ---
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23582378 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala (same excerpt as quoted above), on `@param nRows Number of rows of this matrix` --- End diff -- Document the behavior when nRows is negative, e.g., whether this is allowed. I'm okay with marking them as required.
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23582393 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala (same excerpt as quoted above) ---
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23582384 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala (same excerpt as quoted above) ---
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23582389 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala (same excerpt as quoted above) ---
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23582376 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala --- @@ -0,0 +1,242 @@ [...] +// Number of neighboring blocks to take in each row +val subBlocksPerRow = math.ceil(numRowBlocks * 1.0 / partitionRatio).toInt --- End diff -- This is wrong. If we have 10x20 blocks and 10 partitions, then `partitionRatio = 20`, `subBlocksPerRow = 1`, and `subBlocksPerCol = 1`, so we will end up with 200 partitions. Please update the assignment logic and add tests. Btw, it may be worth computing `subBlocksPerRow` and `subBlocksPerCol` in the constructor. I would recommend renaming them to `numRowBlocksPerPartition` and `numColBlocksPerPartition`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
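The counterexample in the comment above is easy to check numerically. The following is a Python transcription of the quoted Scala `getBlockId` logic (names kept close to the diff; this is an illustration, not Spark code), showing that a 10x20 block grid with 10 requested partitions yields 200 distinct partition ids even though `numPartitions` is capped at 10:

```python
import math

def get_block_id(block_row, block_col, num_row_blocks, num_col_blocks, num_parts):
    # Transcription of the getBlockId logic quoted in the diff above.
    num_partitions = min(num_parts, num_row_blocks * num_col_blocks)
    total_blocks = num_row_blocks * num_col_blocks
    partition_ratio = math.ceil(total_blocks / num_partitions)
    sub_blocks_per_row = math.ceil(num_row_blocks / partition_ratio)
    sub_blocks_per_col = math.ceil(num_col_blocks / partition_ratio)
    i = block_row // sub_blocks_per_row
    j = block_col // sub_blocks_per_col
    blocks_per_row = math.ceil(num_row_blocks / sub_blocks_per_row)
    return j * blocks_per_row + i

# mengxr's counterexample: 10x20 blocks, 10 requested partitions.
ids = {get_block_id(r, c, 10, 20, 10) for r in range(10) for c in range(20)}
print(len(ids))  # 200 distinct ids, although numPartitions is only 10
```

With `partitionRatio = 20`, both `subBlocksPerRow` and `subBlocksPerCol` collapse to 1, so every block lands in its own "partition" and the returned ids run from 0 to 199, violating the `Partitioner` contract.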
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23582382 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala --- @@ -0,0 +1,242 @@ [...]
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23582386 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala --- @@ -0,0 +1,242 @@ [...]
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23582388 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala --- @@ -0,0 +1,242 @@ [...]
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23582372 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala --- @@ -0,0 +1,242 @@ [...] + /** Partitions sub-matrices as blocks with neighboring sub-matrices. */ + private def getBlockId(blockRowIndex: Int, blockColIndex: Int): Int = { --- End diff -- Should it be called `getPartition` or `getPartitionId`?
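For context, Spark's `Partitioner` contract only requires `getPartition(key)` to map any key into `[0, numPartitions)`. The notable part of the quoted override is that both key shapes collapse to the same grid cell: the inner index of a three-tuple key is ignored, so blocks that will be multiplied together colocate. A minimal Python sketch of that dispatch (the `block_id` lambda here is a hypothetical stand-in for the real grid computation):

```python
def get_partition(key, block_id):
    # Two-tuple keys are (blockRow, blockCol); three-tuple keys carry an
    # inner multiplication index that is deliberately ignored, so both
    # shapes route to the same grid cell.
    if len(key) == 2:
        row, col = key
    elif len(key) == 3:
        row, _, col = key
    else:
        raise ValueError(f"Unrecognized key. key: {key}")
    return block_id(row, col)

column_major = lambda r, c: c * 10 + r  # toy grid with 10 block-rows
print(get_partition((3, 4), column_major))     # 43
print(get_partition((3, 7, 4), column_major))  # 43, inner index 7 ignored
```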
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23582380 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala --- @@ -0,0 +1,242 @@ [...] + * @param numRowBlocks Number of blocks that form the rows of this matrix --- End diff -- Could it be derived from `nRows` and `rowsPerBlock`? Having `nRows`, `numRowBlocks`, and `rowsPerBlock` would leave space for inconsistent inputs.
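The derivation suggested above is a one-liner. A small sketch (Python, assuming the usual convention that the trailing block holds the remainder and may be smaller than `rowsPerBlock`):

```python
import math

def num_blocks(n, per_block):
    # e.g. numRowBlocks = ceil(nRows / rowsPerBlock); the last block
    # covers the remainder, so it may be smaller than per_block.
    return math.ceil(n / per_block)

print(num_blocks(25, 10))  # 3: two full 10-row blocks plus one 5-row block
print(num_blocks(30, 10))  # 3: divides evenly, all blocks full
```

Deriving the block counts this way removes the possibility of a caller passing an `nRows`/`numRowBlocks`/`rowsPerBlock` triple that contradicts itself.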
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23582374 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala --- @@ -0,0 +1,242 @@ [...] +// Gives the number of blocks that need to be in each partition +val partitionRatio = math.ceil(totalBlocks * 1.0 / numPartitions).toInt --- End diff -- `partitionRatio` -> `targetNumBlocksPerPartition`?
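Whatever the final name, `ceil(totalBlocks / numPartitions)` is the target number of blocks per partition, and a valid partition id can be derived from it directly. Below is a sketch of one simple assignment that always stays within `[0, numPartitions)`; it produces column-major stripes rather than the square neighborhoods the original aims for, so it is an illustration of the bound, not the fix that was eventually merged:

```python
import math

def get_block_id_fixed(block_row, block_col, num_row_blocks, num_col_blocks, num_parts):
    # Column-major linear block index divided by the target number of
    # blocks per partition; the result is always in [0, numPartitions).
    num_partitions = min(num_parts, num_row_blocks * num_col_blocks)
    blocks_per_partition = math.ceil(num_row_blocks * num_col_blocks / num_partitions)
    linear = block_col * num_row_blocks + block_row
    return linear // blocks_per_partition

# Same 10x20 / 10-partition case that breaks the quoted version:
ids = {get_block_id_fixed(r, c, 10, 20, 10) for r in range(10) for c in range(20)}
print(sorted(ids))  # exactly the ids 0 through 9
```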
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23582369 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala --- @@ -0,0 +1,242 @@ [...] +private[mllib] class GridPartitioner( +val numRowBlocks: Int, +val numColBlocks: Int, +val numParts: Int) extends Partitioner { --- End diff -- Add doc. Remove `val`. Otherwise, `numParts` is a public field of `GridPartitioner`. We may also rename it to `suggestedNumPartitions` so users know that it might change.
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-70945119

[Test build #25923 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25923/consoleFull) for PR 3200 at commit [`f9d664b`](https://github.com/apache/spark/commit/f9d664b7ebae288d778c7694dcc380f52e6cde30).

* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-70945130

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25923/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-70936456

[Test build #25923 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25923/consoleFull) for PR 3200 at commit [`f9d664b`](https://github.com/apache/spark/commit/f9d664b7ebae288d778c7694dcc380f52e6cde30).

* This patch merges cleanly.
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23319026

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---
+private[mllib] class GridPartitioner(
+    val numRowBlocks: Int,
+    val numColBlocks: Int,
+    val rowsPerBlock: Int,
--- End diff --

Why do we use the partitioner to decide whether two matrices can be multiplied? That should be done by the matrix itself, which knows the dims, numRowsPerBlock, and numColsPerBlock. That information is sufficient to decide whether two matrices are compatible.
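mengxr's point, that multiply-compatibility can be decided from the matrices' own metadata without consulting the partitioner, can be sketched as follows. `BlockMatrixMeta` and `canMultiply` are hypothetical names introduced for illustration, not MLlib API.

```scala
// Hypothetical container for the metadata a BlockMatrix already knows about itself.
case class BlockMatrixMeta(
    numRowBlocks: Int,
    numColBlocks: Int,
    rowsPerBlock: Int,
    colsPerBlock: Int)

// A * B is well formed when A's block columns line up with B's block rows,
// both in block counts and in per-block sizes. No partitioner is needed.
def canMultiply(a: BlockMatrixMeta, b: BlockMatrixMeta): Boolean =
  a.numColBlocks == b.numRowBlocks && a.colsPerBlock == b.rowsPerBlock
```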
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23318868

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---
+import org.apache.spark.mllib.rdd.RDDFunctions._
--- End diff --

Oh, I didn't see `mllib` ...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23319267

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---
+  def this(
+      numRowBlocks: Int,
+      numColBlocks: Int,
+      rdd: RDD[((Int, Int), Matrix)],
+      rowsPerBlock: Int,
+      colsPerBlock: Int) = {
--- End diff --

We cannot make such assumption a
Github user brkyvz commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23271456

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---
+  def this(
+      numRowBlocks: Int,
+      numColBlocks: Int,
+      rdd: RDD[((Int, Int), Matrix)],
+      rowsPerBlock: Int,
+      colsPerBlock: Int) = {
--- End diff --

Will it really be the case that
Github user brkyvz commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23271303

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---
+  def this(
+      numRowBlocks: Int,
+      numColBlocks: Int,
+      rdd: RDD[((Int, Int), Matrix)],
+      rowsPerBlock: Int,
+      colsPerBlock: Int) = {
+    this(numRowBlocks, numColBlocks, rdd)
+    val p
Github user brkyvz commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23271097

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---
+private[mllib] class GridPartitioner(
+    val numRowBlocks: Int,
+    val numColBlocks: Int,
+    val rowsPerBlock: Int,
--- End diff --

That is what the partitioners were initially used for. If we want to check whether the partitioning is appropriate for `multiply` or `add`, then the partitioner needs to know `rowsPerBlock` and `colsPerBlock`.
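brkyvz's rationale can be illustrated with a toy comparison: two partitionings with the same partition count but different block sizes are not interchangeable for `add` or `multiply`, so the equality check has to include the block dimensions. `GridSpec` below is an illustrative stand-in introduced for this sketch, not the PR's `GridPartitioner`.

```scala
// Illustrative stand-in for the fields that GridPartitioner's equals compares.
case class GridSpec(numPartitions: Int, rowsPerBlock: Int, colsPerBlock: Int) {
  // Same partition count alone is not enough: mismatched block sizes would
  // pair up blocks that cover different ranges of the underlying matrix.
  def compatibleWith(other: GridSpec): Boolean =
    numPartitions == other.numPartitions &&
      rowsPerBlock == other.rowsPerBlock &&
      colsPerBlock == other.colsPerBlock
}
```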
Github user brkyvz commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23271028

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---
+import org.apache.spark.mllib.rdd.RDDFunctions._
--- End diff --

There is `treeAggregate` in `getDim`.
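For context on the `treeAggregate` reference: `getDim` has to reduce over all blocks to find the overall matrix dimensions, which is exactly the shape of computation `RDDFunctions.treeAggregate` performs in a tree pattern. A rough sketch of that reduction, using a plain `Seq` of `((blockRow, blockCol), (rows, cols))` entries in place of the RDD; the function name reuse, field layout, and uniform-grid assumption are this sketch's, not necessarily the PR's.

```scala
// Each entry: ((blockRowIndex, blockColIndex), (rows, cols)) of a local matrix.
// Assumes blocks sit on a uniform grid of rowsPerBlock x colsPerBlock, with
// possibly smaller blocks on the trailing row/column of the grid.
def getDim(
    blocks: Seq[((Int, Int), (Int, Int))],
    rowsPerBlock: Int,
    colsPerBlock: Int): (Long, Long) =
  blocks.foldLeft((0L, 0L)) { case ((maxRows, maxCols), ((bi, bj), (m, n))) =>
    // A block's offset plus its local size bounds the matrix dimension.
    (math.max(maxRows, bi.toLong * rowsPerBlock + m),
     math.max(maxCols, bj.toLong * colsPerBlock + n))
  }
```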
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23267239

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---
+import org.apache.spark._
--- End diff --

Try to be more explicit on the imports.
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23267274

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---
+  type SubMatrix = ((Int, Int), Matrix) // ((blockRowIndex, blockColIndex), matrix)
--- End diff --

private?
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23267287

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---
@@ -0,0 +1,215 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.linalg.distributed
+
+import breeze.linalg.{DenseMatrix => BDM}
+
+import org.apache.spark._
+import org.apache.spark.mllib.linalg._
+import org.apache.spark.mllib.rdd.RDDFunctions._
+import org.apache.spark.rdd.RDD
+import org.apache.spark.storage.StorageLevel
+import org.apache.spark.util.Utils
+
+/**
+ * A grid partitioner, which stores every block in a separate partition.
+ *
+ * @param numRowBlocks Number of blocks that form the rows of the matrix.
+ * @param numColBlocks Number of blocks that form the columns of the matrix.
+ * @param rowsPerBlock Number of rows that make up each block.
+ * @param colsPerBlock Number of columns that make up each block.
+ */
+private[mllib] class GridPartitioner(
+    val numRowBlocks: Int,
+    val numColBlocks: Int,
+    val rowsPerBlock: Int,
+    val colsPerBlock: Int,
+    override val numPartitions: Int) extends Partitioner {
+
+  /**
+   * Returns the index of the partition the SubMatrix belongs to.
+   *
+   * @param key The key for the SubMatrix. Can be its position in the grid (its column major index)
+   *            or a tuple of three integers that are the final row index after the multiplication,
+   *            the index of the block to multiply with, and the final column index after the
+   *            multiplication.
+   * @return The index of the partition, which the SubMatrix belongs to.
+   */
+  override def getPartition(key: Any): Int = {
+    key match {
+      case (rowIndex: Int, colIndex: Int) =>
+        Utils.nonNegativeMod(rowIndex + colIndex * numRowBlocks, numPartitions)
+      case (rowIndex: Int, innerIndex: Int, colIndex: Int) =>
+        Utils.nonNegativeMod(rowIndex + colIndex * numRowBlocks, numPartitions)
+      case _ =>
+        throw new IllegalArgumentException("Unrecognized key")
+    }
+  }
+
+  /** Checks whether the partitioners have the same characteristics */
+  override def equals(obj: Any): Boolean = {
+    obj match {
+      case r: GridPartitioner =>
+        (this.numPartitions == r.numPartitions) && (this.rowsPerBlock == r.rowsPerBlock) &&
+          (this.colsPerBlock == r.colsPerBlock)
+      case _ =>
+        false
+    }
+  }
+}
+
+/**
+ * Represents a distributed matrix in blocks of local matrices.
+ *
+ * @param numRowBlocks Number of blocks that form the rows of this matrix
+ * @param numColBlocks Number of blocks that form the columns of this matrix
+ * @param rdd The RDD of SubMatrices (local matrices) that form this matrix
+ */
+class BlockMatrix(
+    val numRowBlocks: Int,
+    val numColBlocks: Int,
+    val rdd: RDD[((Int, Int), Matrix)]) extends DistributedMatrix with Logging {
+
+  type SubMatrix = ((Int, Int), Matrix) // ((blockRowIndex, blockColIndex), matrix)
+
+  /**
+   * Alternate constructor for BlockMatrix without the input of a partitioner. Will use a Grid
+   * Partitioner by default.
+   *
+   * @param numRowBlocks Number of blocks that form the rows of this matrix
+   * @param numColBlocks Number of blocks that form the columns of this matrix
+   * @param rdd The RDD of SubMatrices (local matrices) that form this matrix
+   * @param rowsPerBlock Number of rows that make up each block.
+   * @param colsPerBlock Number of columns that make up each block.
+   */
+  def this(
+      numRowBlocks: Int,
+      numColBlocks: Int,
+      rdd: RDD[((Int, Int), Matrix)],
+      rowsPerBlock: Int,
+      colsPerBlock: Int) = {
+    this(numRowBlocks, numColBlocks, rdd)
+    val p
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23267256

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---

In `GridPartitioner.getPartition` (same quoted context as above):

    case (rowIndex: Int, colIndex: Int) =>
      Utils.nonNegativeMod(rowIndex + colIndex * numRowBlocks, numPartitions)

--- End diff --

This is a row-major partitioning. Do we want to have block partitioning?

~~~
1 5 9 13
2 6 10 14
3 7 11 15
4 8 12 16
~~~

The current implementation would have ((1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15), (4, 8, 12, 16)). A better partitioning scheme would be ((1, 2, 5, 6), (9, 10, 13, 14), (3, 4, 7, 8), (11, 12, 15, 16)).
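[Editor's note] The two schemes contrasted above can be sketched numerically for a 4x4 grid of blocks and 4 partitions. This is an illustration only, not the PR's `GridPartitioner`; `nonNegativeMod` is re-implemented here as a plain floor-mod, and the 2x2 super-block layout in `blocked` is an assumed arrangement:

```scala
// Toy comparison of the two partitioning schemes for a 4x4 block grid.
object PartitioningSketch {
  val numRowBlocks = 4
  val numColBlocks = 4
  val numPartitions = 4

  // Stand-in for org.apache.spark.util.Utils.nonNegativeMod.
  def nonNegativeMod(x: Int, mod: Int): Int = ((x % mod) + mod) % mod

  // Scheme in the PR: linearize (blockRow, blockCol) and mod by the
  // partition count. With these sizes, every block of one grid row
  // lands in the same partition.
  def current(row: Int, col: Int): Int =
    nonNegativeMod(row + col * numRowBlocks, numPartitions)

  // Block (2D) partitioning: cut the grid into 2x2 super-blocks so that
  // neighboring blocks share a partition (assumed layout for this sketch).
  def blocked(row: Int, col: Int): Int = {
    val rowsPerPart = 2
    val colsPerPart = 2
    (row / rowsPerPart) * (numColBlocks / colsPerPart) + col / colsPerPart
  }
}
```

Under `current`, blocks (0,0) and (0,3) (the 1 and 13 in the grid above) share a partition; under `blocked`, (0,0) is instead grouped with its neighbor (1,1), matching the ((1, 2, 5, 6), ...) grouping suggested in the comment.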
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23267266

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---

In `GridPartitioner.getPartition` (same quoted context as above):

    case _ =>
      throw new IllegalArgumentException("Unrecognized key")

--- End diff --

Put the key inside the error message.
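[Editor's note] The suggested fix is a one-line change using string interpolation. A minimal sketch (the function name and the stubbed-out partition math are hypothetical, not the PR's code):

```scala
// Sketch: include the offending key in the exception message so the
// failure is diagnosable from the stack trace alone.
def getPartitionSketch(key: Any): Int = key match {
  case (rowIndex: Int, colIndex: Int) =>
    0 // real partition arithmetic elided
  case _ =>
    throw new IllegalArgumentException(s"Unrecognized key: $key")
}
```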
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23267245

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---

In the `GridPartitioner` constructor (same quoted context as above):

    private[mllib] class GridPartitioner(
        val numRowBlocks: Int,
        val numColBlocks: Int,
        val rowsPerBlock: Int,

--- End diff --

Does the partitioner need to know `rowsPerBlock` and `colsPerBlock`?
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23267293

--- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrixSuite.scala ---

In the imports (on the blank line after):

    package org.apache.spark.mllib.linalg.distributed

    import org.scalatest.FunSuite

--- End diff --

remove this line. `scalatest` counts as a 3rd-party import but not scala import.
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23267280

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---

In the alternate `BlockMatrix` constructor (same quoted context as above):

    def this(
        numRowBlocks: Int,
        numColBlocks: Int,
        rdd: RDD[((Int, Int), Matrix)],
        rowsPerBlock: Int,
        colsPerBlock: Int) = {

--- End diff --

The list of arguments cannot pro
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23267268

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---

In `GridPartitioner.equals` (same quoted context as above):

    case r: GridPartitioner =>
      (this.numPartitions == r.numPartitions) && (this.rowsPerBlock == r.rowsPerBlock) &&
        (this.colsPerBlock == r.colsPerBlock)

--- End diff --

Compare numRowBlocks and numColBlocks instead?
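[Editor's note] One way to act on this suggestion, sketched without the Spark dependency (the real class extends `org.apache.spark.Partitioner`; this standalone version also adds the matching `hashCode`, which the quoted diff does not show but which Scala's equals contract requires):

```scala
// Standalone sketch of an equals/hashCode pair that compares grid
// dimensions, per the review comment above.
class GridPartitionerSketch(
    val numRowBlocks: Int,
    val numColBlocks: Int,
    val numPartitions: Int) {

  override def equals(obj: Any): Boolean = obj match {
    case r: GridPartitionerSketch =>
      this.numPartitions == r.numPartitions &&
        this.numRowBlocks == r.numRowBlocks &&
        this.numColBlocks == r.numColBlocks
    case _ => false
  }

  // Equal partitioners must hash equally, or hash-based collections
  // (and Spark's partitioner comparisons) misbehave.
  override def hashCode: Int =
    (numRowBlocks, numColBlocks, numPartitions).hashCode
}
```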
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23267249

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---

In `GridPartitioner.getPartition` (same quoted context as above):

    case (rowIndex: Int, colIndex: Int) =>

--- End diff --

Is it the row index or row block index? `rowBlockIndex` would make the code clearer.
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23267240

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---

In the imports (same quoted context as above):

    import org.apache.spark.mllib.rdd.RDDFunctions._

--- End diff --

This is not necessary.
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-70739754 [Test build #25838 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25838/consoleFull) for PR 3200 at commit [`1a63b20`](https://github.com/apache/spark/commit/1a63b204a59601734de986964d09515a0b5ab082). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-70739766 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25838/ Test FAILed.
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-70729927 [Test build #25838 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25838/consoleFull) for PR 3200 at commit [`1a63b20`](https://github.com/apache/spark/commit/1a63b204a59601734de986964d09515a0b5ab082). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-70724262 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25830/ Test PASSed.
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-70724245 [Test build #25830 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25830/consoleFull) for PR 3200 at commit [`1e8bb2a`](https://github.com/apache/spark/commit/1e8bb2a592de611f3796c28798356878c6c89541). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-70712379 [Test build #25830 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25830/consoleFull) for PR 3200 at commit [`1e8bb2a`](https://github.com/apache/spark/commit/1e8bb2a592de611f3796c28798356878c6c89541). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user brkyvz commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23241236 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala --- @@ -0,0 +1,223 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.linalg.distributed + +import breeze.linalg.{DenseMatrix => BDM} + +import org.apache.spark._ +import org.apache.spark.mllib.linalg._ +import org.apache.spark.mllib.rdd.RDDFunctions._ +import org.apache.spark.rdd.RDD +import org.apache.spark.storage.StorageLevel +import org.apache.spark.util.Utils + +/** + * A grid partitioner, which stores every block in a separate partition. + * + * @param numRowBlocks Number of blocks that form the rows of the matrix. + * @param numColBlocks Number of blocks that form the columns of the matrix. + * @param rowsPerBlock Number of rows that make up each block. + * @param colsPerBlock Number of columns that make up each block. 
+ */ +private[mllib] class GridPartitioner( +val numRowBlocks: Int, +val numColBlocks: Int, +val rowsPerBlock: Int, +val colsPerBlock: Int, +override val numPartitions: Int) extends Partitioner { + + /** + * Returns the index of the partition the SubMatrix belongs to. + * + * @param key The key for the SubMatrix. Can be its position in the grid (its column major index) + *or a tuple of three integers that are the final row index after the multiplication, + *the index of the block to multiply with, and the final column index after the + *multiplication. + * @return The index of the partition, which the SubMatrix belongs to. + */ + override def getPartition(key: Any): Int = { +key match { + case (rowIndex: Int, colIndex: Int) => +Utils.nonNegativeMod(rowIndex + colIndex * numRowBlocks, numPartitions) + case (rowIndex: Int, innerIndex: Int, colIndex: Int) => +Utils.nonNegativeMod(rowIndex + colIndex * numRowBlocks, numPartitions) + case _ => +throw new IllegalArgumentException("Unrecognized key") +} + } + + /** Checks whether the partitioners have the same characteristics */ + override def equals(obj: Any): Boolean = { +obj match { + case r: GridPartitioner => +(this.numPartitions == r.numPartitions) && (this.rowsPerBlock == r.rowsPerBlock) && + (this.colsPerBlock == r.colsPerBlock) + case _ => +false +} + } +} + +/** + * Represents a distributed matrix in blocks of local matrices. + * + * @param numRowBlocks Number of blocks that form the rows of this matrix + * @param numColBlocks Number of blocks that form the columns of this matrix + * @param rdd The RDD of SubMatrices (local matrices) that form this matrix + */ +class BlockMatrix( +val numRowBlocks: Int, +val numColBlocks: Int, +val rdd: RDD[((Int, Int), Matrix)]) extends DistributedMatrix with Logging { --- End diff -- Hi @mbhynes. Great question. We did have such a class, it was called `BlockPartition`, which later was renamed to `SubMatrix`. I did try what you suggested, but here was the catch. 
Even though the partition ids matched, the partitions didn't necessarily end up on the same executors. Spark distributes partitions deterministically, so in theory they should land on the same executors, but in practice an executor lags, "something" happens, and the partitions get shuffled around between executors. We couldn't guarantee (Spark doesn't guarantee) that corresponding partitions end up on the same machine, for fault-tolerance reasons (it's how the scheduler works). Therefore we needed to have indices as above (which the class `SubMatrix` had). To ensure that we add the correct blocks with each other, calling a `.join` was inevitable. Instead of storing the index inside both
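The join-based blockwise addition described above can be sketched with plain Scala maps standing in for `RDD[((Int, Int), Matrix)]`; all names below are illustrative stand-ins, not the PR's actual code:

```scala
object BlockAdd {
  type BlockKey = (Int, Int) // (blockRowIndex, blockColIndex), as in SubMatrix

  // Pair blocks by grid index -- the role `.join` plays on the RDDs -- and add
  // them elementwise; a block present in only one matrix passes through as-is.
  def add(a: Map[BlockKey, Array[Double]],
          b: Map[BlockKey, Array[Double]]): Map[BlockKey, Array[Double]] =
    (a.keySet ++ b.keySet).iterator.map { k =>
      (a.get(k), b.get(k)) match {
        case (Some(x), Some(y)) => k -> x.zip(y).map { case (u, v) => u + v }
        case (Some(x), None)    => k -> x
        case (None, y)          => k -> y.getOrElse(Array.empty[Double])
      }
    }.toMap
}
```

Keying on the block index makes the addition correct regardless of which executor each partition landed on, at the cost of the shuffle a join may incur.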
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user mbhynes commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23226987 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala --- [quoted diff context identical to the diff quoted above] --- End diff -- @brkyvz @rezazadeh I'm wondering why you are using (Int, Int) as the block index, as opposed to a class that implements hashCode() to partition the matrix in lexicographic order. Then, the call rdd.partitionBy(new HashPartitioner(numRowBlocks * numColBlocks)) would ensure the same partitioning between two distributed matrices. When performing addition or elementwise multiplication with another distributed matrix, the operation A_ij + B_ij would be local, without any shuffle. Could you please explain the use of (Int, Int) over an Index class?
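For reference, the column-major partition index computed by the PR's `GridPartitioner.getPartition` can be reproduced in a self-contained sketch (with `Utils.nonNegativeMod` re-implemented locally so the snippet runs without Spark):

```scala
object GridIndex {
  // Equivalent of Spark's Utils.nonNegativeMod: a mod that is never negative.
  def nonNegativeMod(x: Int, mod: Int): Int = ((x % mod) + mod) % mod

  // Block (rowIndex, colIndex) maps to its column-major position in a grid with
  // numRowBlocks row-blocks, folded into numPartitions partitions.
  def partition(rowIndex: Int, colIndex: Int,
                numRowBlocks: Int, numPartitions: Int): Int =
    nonNegativeMod(rowIndex + colIndex * numRowBlocks, numPartitions)
}
```

When two matrices share grid dimensions and partition counts, equal block indices map to equal partition ids, which is the co-partitioning property the `HashPartitioner` suggestion above also aims at.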
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-70547665 [Test build #25766 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25766/consoleFull) for PR 3200 at commit [`239ab4b`](https://github.com/apache/spark/commit/239ab4b47d910428460160311165f620afb605fa). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-70547676 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25766/ Test FAILed.
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-70540877 [Test build #25766 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25766/consoleFull) for PR 3200 at commit [`239ab4b`](https://github.com/apache/spark/commit/239ab4b47d910428460160311165f620afb605fa). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23128478 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala --- @@ -0,0 +1,217 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.linalg.distributed + +import breeze.linalg.{DenseMatrix => BDM} + +import org.apache.spark._ +import org.apache.spark.mllib.linalg._ +import org.apache.spark.mllib.rdd.RDDFunctions._ +import org.apache.spark.rdd.RDD +import org.apache.spark.storage.StorageLevel +import org.apache.spark.util.Utils + +/** + * A grid partitioner, which stores every block in a separate partition. + * + * @param numRowBlocks Number of blocks that form the rows of the matrix. + * @param numColBlocks Number of blocks that form the columns of the matrix. + * @param rowPerBlock Number of rows that make up each block. + * @param colPerBlock Number of columns that make up each block. 
+ */ +private[mllib] class GridPartitioner( +val numRowBlocks: Int, +val numColBlocks: Int, +val rowPerBlock: Int, +val colPerBlock: Int, +override val numPartitions: Int) extends Partitioner { + + /** + * Returns the index of the partition the SubMatrix belongs to. + * + * @param key The key for the SubMatrix. Can be its position in the grid (its column major index) + *or a tuple of three integers that are the final row index after the multiplication, + *the index of the block to multiply with, and the final column index after the + *multiplication. + * @return The index of the partition, which the SubMatrix belongs to. + */ + override def getPartition(key: Any): Int = { +key match { + case ind: (Int, Int) => +Utils.nonNegativeMod(ind._1 + ind._2 * numRowBlocks, numPartitions) + case indices: (Int, Int, Int) => +Utils.nonNegativeMod(indices._1 + indices._3 * numRowBlocks, numPartitions) + case _ => +throw new IllegalArgumentException("Unrecognized key") +} + } + + /** Checks whether the partitioners have the same characteristics */ + override def equals(obj: Any): Boolean = { +obj match { + case r: GridPartitioner => +(this.numPartitions == r.numPartitions) && (this.rowPerBlock == r.rowPerBlock) && + (this.colPerBlock == r.colPerBlock) + case _ => +false +} + } +} + +/** + * Represents a distributed matrix in blocks of local matrices. + * + * @param numRowBlocks Number of blocks that form the rows of this matrix + * @param numColBlocks Number of blocks that form the columns of this matrix + * @param rdd The RDD of SubMatrices (local matrices) that form this matrix + */ +class BlockMatrix( +val numRowBlocks: Int, +val numColBlocks: Int, +val rdd: RDD[((Int, Int), Matrix)]) extends DistributedMatrix with Logging { + + type SubMatrix = ((Int, Int), Matrix) // ((blockRowIndex, blockColIndex), matrix) + + /** + * Alternate constructor for BlockMatrix without the input of a partitioner. Will use a Grid + * Partitioner by default. 
+ * + * @param numRowBlocks Number of blocks that form the rows of this matrix + * @param numColBlocks Number of blocks that form the columns of this matrix + * @param rdd The RDD of SubMatrices (local matrices) that form this matrix + * @param rowPerBlock Number of rows that make up each block. + * @param colPerBlock Number of columns that make up each block. + */ + def this( + numRowBlocks: Int, + numColBlocks: Int, + rdd: RDD[((Int, Int), Matrix)], + rowPerBlock: Int, + colPerBlock: Int) = { +this(numRowBlocks, numColBlocks, rdd) +val part = new GridPartitioner(numRowBlocks, numColBl
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user brkyvz commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23118561 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala --- [quoted diff context identical to the diff quoted above]
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-70317456 @brkyvz Those are my only comments for now, except that the tests could also check that the partitioner has the right info. Also, I wonder if it would be better to force the RDD's partitioner to be a grid partitioner and to use that partitioner, rather than storing a new instance of the partitioner (which must be kept in sync with the actual partitioner). Users could construct block matrices via methods we provide, which would set up the RDD with the right partitioner.
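The suggestion above can be illustrated with a small Spark-free sketch (the `FakeRdd` type and names are hypothetical stand-ins; in Spark, `RDD#partitioner` is an `Option[Partitioner]`): derive the partitioner from the RDD itself and fail fast if it is missing or of the wrong type, instead of carrying a second field that can drift out of sync.

```scala
trait Partitioner { def numPartitions: Int }

case class GridPartitioner(numRowBlocks: Int, numColBlocks: Int) extends Partitioner {
  // One partition per block in this simplified version.
  val numPartitions: Int = numRowBlocks * numColBlocks
}

// Stand-in for an RDD that may or may not carry a partitioner.
case class FakeRdd(partitioner: Option[Partitioner])

object BlockMatrixCheck {
  // Single source of truth: read the partitioner off the RDD, reject others.
  def gridPartitionerOf(rdd: FakeRdd): GridPartitioner = rdd.partitioner match {
    case Some(g: GridPartitioner) => g
    case _ =>
      throw new IllegalArgumentException("BlockMatrix requires its RDD to use a GridPartitioner")
  }
}
```

Constructing block matrices only through provided factory methods that call `partitionBy` with a `GridPartitioner` would then make the invalid state unrepresentable.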
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23106797 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala --- [quoted diff context identical to the diff quoted above]
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23106790 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala --- [quoted diff context identical to the diff quoted above]
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23106791

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---
@@ -0,0 +1,217 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.linalg.distributed
+
+import breeze.linalg.{DenseMatrix => BDM}
+
+import org.apache.spark._
+import org.apache.spark.mllib.linalg._
+import org.apache.spark.mllib.rdd.RDDFunctions._
+import org.apache.spark.rdd.RDD
+import org.apache.spark.storage.StorageLevel
+import org.apache.spark.util.Utils
+
+/**
+ * A grid partitioner, which stores every block in a separate partition.
+ *
+ * @param numRowBlocks Number of blocks that form the rows of the matrix.
+ * @param numColBlocks Number of blocks that form the columns of the matrix.
+ * @param rowPerBlock Number of rows that make up each block.
+ * @param colPerBlock Number of columns that make up each block.
+ */
+private[mllib] class GridPartitioner(
+    val numRowBlocks: Int,
+    val numColBlocks: Int,
+    val rowPerBlock: Int,
+    val colPerBlock: Int,
+    override val numPartitions: Int) extends Partitioner {
+
+  /**
+   * Returns the index of the partition the SubMatrix belongs to.
+   *
+   * @param key The key for the SubMatrix. Can be its position in the grid (its column major index)
+   *            or a tuple of three integers that are the final row index after the multiplication,
+   *            the index of the block to multiply with, and the final column index after the
+   *            multiplication.
+   * @return The index of the partition, which the SubMatrix belongs to.
+   */
+  override def getPartition(key: Any): Int = {
+    key match {
+      case ind: (Int, Int) =>
+        Utils.nonNegativeMod(ind._1 + ind._2 * numRowBlocks, numPartitions)
+      case indices: (Int, Int, Int) =>
+        Utils.nonNegativeMod(indices._1 + indices._3 * numRowBlocks, numPartitions)
+      case _ =>
+        throw new IllegalArgumentException("Unrecognized key")
+    }
+  }
+
+  /** Checks whether the partitioners have the same characteristics */
+  override def equals(obj: Any): Boolean = {
+    obj match {
+      case r: GridPartitioner =>
+        (this.numPartitions == r.numPartitions) && (this.rowPerBlock == r.rowPerBlock) &&
+          (this.colPerBlock == r.colPerBlock)
+      case _ =>
+        false
+    }
+  }
+}
+
+/**
+ * Represents a distributed matrix in blocks of local matrices.
+ *
+ * @param numRowBlocks Number of blocks that form the rows of this matrix
+ * @param numColBlocks Number of blocks that form the columns of this matrix
+ * @param rdd The RDD of SubMatrices (local matrices) that form this matrix
+ */
+class BlockMatrix(
+    val numRowBlocks: Int,
+    val numColBlocks: Int,
+    val rdd: RDD[((Int, Int), Matrix)]) extends DistributedMatrix with Logging {
+
+  type SubMatrix = ((Int, Int), Matrix) // ((blockRowIndex, blockColIndex), matrix)
+
+  /**
+   * Alternate constructor for BlockMatrix without the input of a partitioner. Will use a
+   * GridPartitioner by default.
+   *
+   * @param numRowBlocks Number of blocks that form the rows of this matrix
+   * @param numColBlocks Number of blocks that form the columns of this matrix
+   * @param rdd The RDD of SubMatrices (local matrices) that form this matrix
+   * @param rowPerBlock Number of rows that make up each block.
+   * @param colPerBlock Number of columns that make up each block.
+   */
+  def this(
+      numRowBlocks: Int,
+      numColBlocks: Int,
+      rdd: RDD[((Int, Int), Matrix)],
+      rowPerBlock: Int,
+      colPerBlock: Int) = {
+    this(numRowBlocks, numColBlocks, rdd)
+    val part = new GridPartitioner(numRowBlocks, numColBl
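The column-major block-to-partition mapping used by `getPartition` in the diff above can be sketched in isolation as follows. This is a standalone approximation: `nonNegativeMod` is reimplemented here to mirror `org.apache.spark.util.Utils.nonNegativeMod` rather than imported from Spark, and `GridPartitionSketch` is an illustrative name, not part of the PR.

```scala
// Standalone sketch of the grid partitioner's column-major indexing.
object GridPartitionSketch {
  // Mirrors org.apache.spark.util.Utils.nonNegativeMod: a mod that never
  // returns a negative result, even for negative inputs.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val rawMod = x % mod
    rawMod + (if (rawMod < 0) mod else 0)
  }

  // A block at (blockRow, blockCol) in a grid with numRowBlocks rows of
  // blocks gets the column-major index blockRow + blockCol * numRowBlocks,
  // which is then folded into the available partitions.
  def partition(blockRow: Int, blockCol: Int, numRowBlocks: Int, numPartitions: Int): Int =
    nonNegativeMod(blockRow + blockCol * numRowBlocks, numPartitions)

  def main(args: Array[String]): Unit = {
    // A 2 x 3 grid of blocks with one partition per block (6 partitions):
    // every block lands in its own partition, in column-major order.
    val indices = for (c <- 0 until 3; r <- 0 until 2) yield partition(r, c, 2, 6)
    println(indices.mkString(","))
  }
}
```

With fewer partitions than blocks, the same mapping simply wraps around, so blocks sharing a column-major residue share a partition.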
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23106803 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23106793 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23106801 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23106792 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23106795 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23106782 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23106799 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23104441

--- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrixSuite.scala ---
@@ -0,0 +1,69 @@
+package org.apache.spark.mllib.linalg.distributed
+
+import org.scalatest.FunSuite
+
+import breeze.linalg.{DenseMatrix => BDM}
+
+import org.apache.spark.mllib.linalg.{DenseMatrix, Matrices, Matrix}
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+
+class BlockMatrixSuite extends FunSuite with MLlibTestSparkContext {
+
+  val m = 5
--- End diff --

I would put these fixed values in a private BlockMatrixSuite object and then import them inside the class.
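The layout jkbradley suggests — hoisting the fixed test values into a private object and importing them inside the suite class — can be sketched without the ScalaTest scaffolding. Only `m = 5` appears in the quoted diff; the other constants and the `BlockMatrixSuiteSketch` name are illustrative assumptions, not the PR's final code.

```scala
// Sketch of the suggested layout: shared fixture constants live in a
// private top-level object and are imported into the suite class, so the
// test bodies read as if the values were locals.
class BlockMatrixSuiteSketch {
  import BlockMatrixSuiteSketch._ // bring m, n, rowPerBlock, colPerBlock into scope

  // Stand-in for a test body that uses the shared constants: the number of
  // blocks in the grid for an m x n matrix with the given block sizes.
  def gridSize: Int = (m / rowPerBlock) * (n / colPerBlock)
}

private object BlockMatrixSuiteSketch {
  val m = 5           // from the quoted diff
  val n = 4           // illustrative assumption
  val rowPerBlock = 1 // illustrative assumption
  val colPerBlock = 2 // illustrative assumption
}
```

Keeping the constants in an object rather than as class fields also means they are evaluated once, not once per suite instance.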
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23104439

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---
```
@@ -0,0 +1,217 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.linalg.distributed
+
+import breeze.linalg.{DenseMatrix => BDM}
+
+import org.apache.spark._
+import org.apache.spark.mllib.linalg._
+import org.apache.spark.mllib.rdd.RDDFunctions._
+import org.apache.spark.rdd.RDD
+import org.apache.spark.storage.StorageLevel
+import org.apache.spark.util.Utils
+
+/**
+ * A grid partitioner, which stores every block in a separate partition.
+ *
+ * @param numRowBlocks Number of blocks that form the rows of the matrix.
+ * @param numColBlocks Number of blocks that form the columns of the matrix.
+ * @param rowPerBlock Number of rows that make up each block.
+ * @param colPerBlock Number of columns that make up each block.
+ */
+private[mllib] class GridPartitioner(
+    val numRowBlocks: Int,
+    val numColBlocks: Int,
+    val rowPerBlock: Int,
+    val colPerBlock: Int,
+    override val numPartitions: Int) extends Partitioner {
+
+  /**
+   * Returns the index of the partition the SubMatrix belongs to.
+   *
+   * @param key The key for the SubMatrix. Can be its position in the grid (its column major index)
+   *            or a tuple of three integers that are the final row index after the multiplication,
+   *            the index of the block to multiply with, and the final column index after the
+   *            multiplication.
+   * @return The index of the partition, which the SubMatrix belongs to.
+   */
+  override def getPartition(key: Any): Int = {
+    key match {
+      case ind: (Int, Int) =>
+        Utils.nonNegativeMod(ind._1 + ind._2 * numRowBlocks, numPartitions)
+      case indices: (Int, Int, Int) =>
+        Utils.nonNegativeMod(indices._1 + indices._3 * numRowBlocks, numPartitions)
+      case _ =>
+        throw new IllegalArgumentException("Unrecognized key")
+    }
+  }
+
+  /** Checks whether the partitioners have the same characteristics */
+  override def equals(obj: Any): Boolean = {
+    obj match {
+      case r: GridPartitioner =>
+        (this.numPartitions == r.numPartitions) && (this.rowPerBlock == r.rowPerBlock) &&
+          (this.colPerBlock == r.colPerBlock)
+      case _ =>
+        false
+    }
+  }
+}
+
+/**
+ * Represents a distributed matrix in blocks of local matrices.
+ *
+ * @param numRowBlocks Number of blocks that form the rows of this matrix
+ * @param numColBlocks Number of blocks that form the columns of this matrix
+ * @param rdd The RDD of SubMatrices (local matrices) that form this matrix
+ */
+class BlockMatrix(
+    val numRowBlocks: Int,
+    val numColBlocks: Int,
+    val rdd: RDD[((Int, Int), Matrix)]) extends DistributedMatrix with Logging {
+
+  type SubMatrix = ((Int, Int), Matrix) // ((blockRowIndex, blockColIndex), matrix)
+
+  /**
+   * Alternate constructor for BlockMatrix without the input of a partitioner. Will use a Grid
+   * Partitioner by default.
+   *
+   * @param numRowBlocks Number of blocks that form the rows of this matrix
+   * @param numColBlocks Number of blocks that form the columns of this matrix
+   * @param rdd The RDD of SubMatrices (local matrices) that form this matrix
+   * @param rowPerBlock Number of rows that make up each block.
+   * @param colPerBlock Number of columns that make up each block.
+   */
+  def this(
+      numRowBlocks: Int,
+      numColBlocks: Int,
+      rdd: RDD[((Int, Int), Matrix)],
+      rowPerBlock: Int,
+      colPerBlock: Int) = {
+    this(numRowBlocks, numColBlocks, rdd)
+    val part = new GridPartitioner(numRowBlocks, numColBl
```
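As an aside on the quoted `GridPartitioner`: it overrides `equals`, and on the JVM an `equals` override is normally paired with a matching `hashCode` so that equal partitioners hash equally (hash-based collections, and code that caches partitioners, rely on this). A minimal stand-in with a hypothetical class name and the same identity fields, showing the pairing:

```scala
// Stand-in for the quoted partitioner's equality contract; not the MLlib code.
class GridPartitionerSketch(val rowPerBlock: Int, val colPerBlock: Int,
                            val numPartitions: Int) {
  override def equals(obj: Any): Boolean = obj match {
    case r: GridPartitionerSketch =>
      numPartitions == r.numPartitions && rowPerBlock == r.rowPerBlock &&
        colPerBlock == r.colPerBlock
    case _ => false
  }
  // Without this override, two equal partitioners could land in different
  // hash buckets, e.g. when stored in a HashSet.
  override def hashCode: Int = (rowPerBlock, colPerBlock, numPartitions).##
}
```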
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23104436 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala --- (same quoted diff context as in the comment above; the inline remark is truncated in this archive)
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23104431

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala --- (same quoted diff context as in the comment above, ending at `val rowPerBlock: Int,`)

--- End diff --

Rename "rowPerBlock" --> "rowsPerBlock"? Same for "colPerBlock" and in other places in this file.
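The partition mapping that the quoted `GridPartitioner.getPartition` computes can be shown standalone. `gridPartition` and `nonNegativeMod` below are local stand-ins for the quoted logic and Spark's `Utils.nonNegativeMod`, not the library API:

```scala
// Inlined equivalent of Spark's Utils.nonNegativeMod: a modulus that never
// returns a negative result, even for negative inputs.
def nonNegativeMod(x: Int, mod: Int): Int = {
  val raw = x % mod
  if (raw < 0) raw + mod else raw
}

// A block at grid position (blockRow, blockCol) is assigned the partition of
// its column-major linear index, wrapped into the available partitions.
def gridPartition(blockRow: Int, blockCol: Int,
                  numRowBlocks: Int, numPartitions: Int): Int =
  nonNegativeMod(blockRow + blockCol * numRowBlocks, numPartitions)
```

For example, with 3 row-blocks and 6 partitions, block (1, 1) maps to `nonNegativeMod(1 + 1 * 3, 6)`, i.e. partition 4; column-major ordering keeps the blocks of one block-column on consecutive partitions.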
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23104435 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala --- (same quoted diff context as in the comment above; the inline remark is truncated in this archive)
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r23104432

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala --- (same quoted diff context as in the comment above, ending at `class BlockMatrix(`)

--- End diff --

I assume this constructor should be private:
```
class BlockMatrix private (...
```
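The private-constructor pattern jkbradley proposes looks roughly like this in Scala. The class below is a simplified, hypothetical sketch; the real `BlockMatrix` carries an RDD of `((Int, Int), Matrix)` blocks and a partitioner rather than a label:

```scala
// A private primary constructor forces callers through the public auxiliary
// constructor, so derived state (here a stand-in for the default
// GridPartitioner) is always initialized consistently.
class BlockMatrixSketch private (
    val numRowBlocks: Int,
    val numColBlocks: Int,
    val partitionerLabel: String) {

  // Public auxiliary constructor: computes the extra state, then delegates.
  def this(numRowBlocks: Int, numColBlocks: Int) =
    this(numRowBlocks, numColBlocks, s"grid-${numRowBlocks}x${numColBlocks}")
}
```

With the primary constructor private, `new BlockMatrixSketch(2, 3, "bogus")` does not compile outside the class, which is exactly the guarantee the review comment is after.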
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-69873803
[Test build #25512 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25512/consoleFull) for PR 3200 at commit [`ba414d2`](https://github.com/apache/spark/commit/ba414d2c6a16de987c9aa456dafa193d191014d5).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-69873811
Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25512/
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-69869900
[Test build #25512 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25512/consoleFull) for PR 3200 at commit [`ba414d2`](https://github.com/apache/spark/commit/ba414d2c6a16de987c9aa456dafa193d191014d5).
* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-69866191
Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25507/
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-69866190
[Test build #25507 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25507/consoleFull) for PR 3200 at commit [`ab6cde0`](https://github.com/apache/spark/commit/ab6cde0d90b917e89f97c86ef3c84dcdc64a9b57).
* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-69866135
[Test build #25507 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25507/consoleFull) for PR 3200 at commit [`ab6cde0`](https://github.com/apache/spark/commit/ab6cde0d90b917e89f97c86ef3c84dcdc64a9b57).
* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-63883302
Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23679/
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-63883289
[Test build #23679 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23679/consoleFull) for PR 3200 at commit [`9ae85aa`](https://github.com/apache/spark/commit/9ae85aa1ebabdc099d7f655bc1d9021d34d2910f).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class SubMatrix(blockRowIndex: Int, blockColIndex: Int, mat: DenseMatrix) extends Serializable`
  * `class GridPartitioner(`
  * `class RowBasedPartitioner(`
  * `class ColumnBasedPartitioner(`
  * `class BlockMatrix(`
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-63870038
[Test build #23679 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23679/consoleFull) for PR 3200 at commit [`9ae85aa`](https://github.com/apache/spark/commit/9ae85aa1ebabdc099d7f655bc1d9021d34d2910f).
* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-63154793
Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23402/
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-63154789
[Test build #23402 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23402/consoleFull) for PR 3200 at commit [`d033861`](https://github.com/apache/spark/commit/d033861d5a2f88b223f601feb4445308399901e8).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class SubMatrix(blockRowIndex: Int, blockColIndex: Int, mat: DenseMatrix) extends Serializable`
  * `class GridPartitioner(`
  * `class RowBasedPartitioner(`
  * `class ColumnBasedPartitioner(`
  * `class BlockMatrix(`
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-63150236
[Test build #23402 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23402/consoleFull) for PR 3200 at commit [`d033861`](https://github.com/apache/spark/commit/d033861d5a2f88b223f601feb4445308399901e8).
* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-63138048
[Test build #23383 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23383/consoleFull) for PR 3200 at commit [`49b9586`](https://github.com/apache/spark/commit/49b9586a18cf7338a46c88a51ed23890914d3be4).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class SubMatrix(blockRowIndex: Int, blockColIndex: Int, mat: DenseMatrix) extends Serializable`
  * `case class SubMatrixInfo(`
  * `abstract class BlockMatrixPartitioner(`
  * `class GridPartitioner(`
  * `class RowBasedPartitioner(`
  * `class ColumnBasedPartitioner(`
  * `class BlockMatrix(`
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-63138057

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23383/
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-63130191

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23382/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-63130184

[Test build #23382 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23382/consoleFull) for PR 3200 at commit [`b05aabb`](https://github.com/apache/spark/commit/b05aabbdd5c3f5db69bbaff3582139b691d696fa).

* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-63126875

[Test build #23383 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23383/consoleFull) for PR 3200 at commit [`49b9586`](https://github.com/apache/spark/commit/49b9586a18cf7338a46c88a51ed23890914d3be4).

* This patch merges cleanly.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-63123607

[Test build #23382 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23382/consoleFull) for PR 3200 at commit [`b05aabb`](https://github.com/apache/spark/commit/b05aabbdd5c3f5db69bbaff3582139b691d696fa).

* This patch merges cleanly.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-63123158

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23379/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-63123151

[Test build #23379 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23379/consoleFull) for PR 3200 at commit [`19c17e8`](https://github.com/apache/spark/commit/19c17e8d1594a3f7bd5a973a09b341de3a1c857a).

* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-63122272

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23378/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-63122251

[Test build #23378 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23378/consoleFull) for PR 3200 at commit [`589fbb6`](https://github.com/apache/spark/commit/589fbb65478851d88ea5a7f5bf54c1fa8d53f055).

* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class SubMatrix(blockIdRow: Int, blockIdCol: Int, mat: DenseMatrix) extends Serializable`
  * `case class SubMatrixInfo(`
  * `abstract class BlockMatrixPartitioner(`
  * `class GridPartitioner(`
  * `class RowBasedPartitioner(`
  * `class ColumnBasedPartitioner(`
  * `class BlockMatrix(`
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-63116214

[Test build #23379 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23379/consoleFull) for PR 3200 at commit [`19c17e8`](https://github.com/apache/spark/commit/19c17e8d1594a3f7bd5a973a09b341de3a1c857a).

* This patch merges cleanly.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-63115145

[Test build #23378 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23378/consoleFull) for PR 3200 at commit [`589fbb6`](https://github.com/apache/spark/commit/589fbb65478851d88ea5a7f5bf54c1fa8d53f055).

* This patch merges cleanly.
Github user brkyvz commented on a diff in the pull request: https://github.com/apache/spark/pull/3200#discussion_r20378789

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---

```diff
@@ -0,0 +1,331 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.linalg.distributed
+
+import breeze.linalg.{DenseMatrix => BDM}
+
+import org.apache.spark._
+import org.apache.spark.mllib.linalg.DenseMatrix
+import org.apache.spark.rdd.RDD
+import org.apache.spark.storage.StorageLevel
+import org.apache.spark.util.Utils
+
+/**
+ * Represents a local matrix that makes up one block of a distributed BlockMatrix
+ *
+ * @param blockIdRow The row index of this block
+ * @param blockIdCol The column index of this block
+ * @param mat The underlying local matrix
+ */
+case class BlockPartition(blockIdRow: Int, blockIdCol: Int, mat: DenseMatrix) extends Serializable
+
+/**
+ * Information about the BlockMatrix maintained on the driver
+ *
+ * @param partitionId The id of the partition the block is found in
+ * @param blockIdRow The row index of this block
+ * @param blockIdCol The column index of this block
+ * @param startRow The starting row index with respect to the distributed BlockMatrix
+ * @param numRows The number of rows in this block
+ * @param startCol The starting column index with respect to the distributed BlockMatrix
+ * @param numCols The number of columns in this block
+ */
+case class BlockPartitionInfo(
+    partitionId: Int,
+    blockIdRow: Int,
+    blockIdCol: Int,
+    startRow: Long,
```

--- End diff --

That's a good question. I left it in for the following reason: I'm not sure it will ever be required, but I thought it would be good to cover corner cases such as irregular grids. Assume you have a matrix A with dimensions 280 x d, where each `SubMatrix` is 30 x (d/3); the last block row will then consist of SubMatrices of size 10 x (d/3). If you then vertically append a matrix B with dimensions n x d, you're left with an irregular grid. Vertical concatenation may not be as common as horizontal concatenation, but being ready to support such operations seems beneficial for users.
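The irregular-grid scenario described above can be sketched with plain block-size arithmetic. This is only an illustration; `IrregularGridSketch` and `blockRowSizes` are hypothetical names, not part of the PR's API:

```scala
// Illustrative sketch of the irregular grid from the comment above.
// blockRowSizes is a made-up helper, not part of the PR.
object IrregularGridSketch {
  /** Split numRows into blocks of rowsPerBlock; the last block may be smaller. */
  def blockRowSizes(numRows: Int, rowsPerBlock: Int): Seq[Int] = {
    val full = numRows / rowsPerBlock
    val rem  = numRows % rowsPerBlock
    Seq.fill(full)(rowsPerBlock) ++ (if (rem > 0) Seq(rem) else Seq.empty)
  }

  def main(args: Array[String]): Unit = {
    // A is 280 x d with 30-row blocks: nine full blocks plus a 10-row remainder.
    val aBlocks = blockRowSizes(280, 30)
    println(aBlocks)  // 30 repeated nine times, then 10

    // Vertically appending B (n x d, here n = 90) keeps B's own block sizes,
    // so the stacked matrix's block row boundaries are no longer uniform.
    val stacked = aBlocks ++ blockRowSizes(90, 30)
    println(stacked)
  }
}
```

The point of the `startRow` field in `BlockPartitionInfo` is exactly that, after such a concatenation, a block's starting row can no longer be derived as `blockIdRow * rowsPerBlock`.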
Github user brkyvz commented on the pull request: https://github.com/apache/spark/pull/3200#issuecomment-62982228

@mengxr

> If we have two block matrices, A and B, and A's column block partitioning matches B's row block partitioning, can we take advantage of this fact in computing A * B? I support having only one block matrix partitioner implementation. Then we do the following:
>
> ```scala
> if (A.partitioner.colBlockPartitioner == B.partitioner.rowBlockPartitioner) {
>   // zip ...
> } else {
>   ...
> }
> ```

By `partitioner.rowBlockPartitioner` and `partitioner.colBlockPartitioner`, do you mean checking that the number of blocks along that dimension and the number of rows per block match? One problem with zip was that I couldn't guarantee data locality. I tried to force it, but the best way to force it turned out to be a join...
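The dispatch being discussed can be sketched with a minimal, made-up model of one dimension of a block partitioning. `BlockDim` and `multiplyPlan` are hypothetical names used only for illustration, not the PR's actual types:

```scala
// Hypothetical model of one dimension of a block partitioning: the number of
// blocks and the (uniform) size of each block. Not the PR's actual types.
case class BlockDim(numBlocks: Int, blockSize: Int)

object MultiplyPlanSketch {
  /** Choose a plan for A * B: zip when A's column blocking equals B's row
    * blocking (the block RDDs can be co-partitioned and traversed in lockstep),
    * otherwise fall back to a join, which shuffles blocks so they line up. */
  def multiplyPlan(aColBlocking: BlockDim, bRowBlocking: BlockDim): String =
    if (aColBlocking == bRowBlocking) "zip" else "join"

  def main(args: Array[String]): Unit = {
    println(multiplyPlan(BlockDim(4, 25), BlockDim(4, 25)))  // zip
    println(multiplyPlan(BlockDim(4, 25), BlockDim(5, 20)))  // join
  }
}
```

As the comment notes, the catch is that equality of the partitionings is necessary but not sufficient for a cheap zip in practice: unless Spark also co-locates the matching partitions, the zip still moves data, which is why a join ended up being the reliable way to align blocks.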