[GitHub] spark pull request #21695: Maintining an order

2018-07-02 Thread nagpall
Github user nagpall closed the pull request at:

https://github.com/apache/spark/pull/21695


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8614) Row order preservation for operations on MLlib IndexedRowMatrix

2018-07-02 Thread Anuj Nagpall (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-8614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16529567#comment-16529567
 ] 

Anuj Nagpall edited comment on SPARK-8614 at 7/2/18 10:41 AM:
--

[~tygert] Can you take a look at the following PR 
https://github.com/apache/spark/pull/21695


was (Author: nagpall):
[~tygert] Can you take a look at the following PR 

[PR|[https://github.com/apache/spark/pull/21695|http://example.com/]]

> Row order preservation for operations on MLlib IndexedRowMatrix
> ---
>
>                 Key: SPARK-8614
>                 URL: https://issues.apache.org/jira/browse/SPARK-8614
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>            Reporter: Jan Luts
>            Priority: Major
>
> In both IndexedRowMatrix.computeSVD and IndexedRowMatrix.multiply indices are 
> dropped before calling the methods from RowMatrix. For example for 
> IndexedRowMatrix.computeSVD:
>
>     val svd = toRowMatrix().computeSVD(k, computeU, rCond)
>
> and for IndexedRowMatrix.multiply:
>
>     val mat = toRowMatrix().multiply(B)
>
> After computing these results, they are zipped with the original indices, 
> e.g. for IndexedRowMatrix.computeSVD:
>
>     val indexedRows = indices.zip(svd.U.rows).map { case (i, v) =>
>       IndexedRow(i, v)
>     }
>
> and for IndexedRowMatrix.multiply:
>
>     val indexedRows = rows.map(_.index).zip(mat.rows).map { case (i, v) =>
>       IndexedRow(i, v)
>     }
>
> I have observed that for IndexedRowMatrix.computeSVD().U and 
> IndexedRowMatrix.multiply() (which both depend on RowMatrix.multiply), row 
> indices can get mixed up when running Spark jobs with multiple 
> executors/machines, i.e. the vectors and indices of the result no longer 
> seem to correspond. 
> To me it looks like this is caused by zipping RDDs that have a different 
> ordering?
> For the IndexedRowMatrix.multiply I have observed that ordering within 
> partitions is preserved, but that it seems to get mixed up between 
> partitions. For example, for:
> part1Index1 part1Vector1
> part1Index2 part1Vector2
> part2Index1 part2Vector1
> part2Index2 part2Vector2
> I got:
> part2Index1 part1Vector1
> part2Index2 part1Vector2
> part1Index1 part2Vector1
> part1Index2 part2Vector2
> Another observation is that the mapPartitions in RowMatrix.multiply:
>
>     val AB = rows.mapPartitions { iter =>
>
> had a "preservesPartitioning = true" argument in version 1.0, but this is no 
> longer there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8614) Row order preservation for operations on MLlib IndexedRowMatrix

2018-07-02 Thread Anuj Nagpall (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-8614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16529567#comment-16529567
 ] 

Anuj Nagpall edited comment on SPARK-8614 at 7/2/18 10:40 AM:
--

[~tygert] Can you take a look at the following PR 
[https://github.com/apache/spark/pull/21695|http://example.com/]



was (Author: nagpall):
[~tygert] Can you take a look at the following PR 
[https://github.com/apache/spark/pull/21695|http://example.com]







[jira] [Comment Edited] (SPARK-8614) Row order preservation for operations on MLlib IndexedRowMatrix

2018-07-02 Thread Anuj Nagpall (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-8614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16529567#comment-16529567
 ] 

Anuj Nagpall edited comment on SPARK-8614 at 7/2/18 10:40 AM:
--

[~tygert] Can you take a look at the following PR 

[PR|[https://github.com/apache/spark/pull/21695|http://example.com/]]


was (Author: nagpall):
[~tygert] Can you take a look at the following PR 
[https://github.com/apache/spark/pull/21695|http://example.com/]








[jira] [Commented] (SPARK-8614) Row order preservation for operations on MLlib IndexedRowMatrix

2018-07-02 Thread Anuj Nagpall (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-8614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16529567#comment-16529567
 ] 

Anuj Nagpall commented on SPARK-8614:
-

[~tygert] Can you take a look at the following PR 
[https://github.com/apache/spark/pull/21695|http://example.com]







[GitHub] spark pull request #21695: Maintining an order

2018-07-02 Thread nagpall
GitHub user nagpall opened a pull request:

https://github.com/apache/spark/pull/21695

Maintining an order

## What is the problem?
In both IndexedRowMatrix.computeSVD and IndexedRowMatrix.multiply indices 
are dropped before calling the methods from RowMatrix.
For the IndexedRowMatrix.multiply I have observed that ordering within 
partitions is preserved, but that it seems to get mixed up between partitions. 
For example, for:

part1Index1 part1Vector1
part1Index2 part1Vector2
part2Index1 part2Vector1
part2Index2 part2Vector2

I got:

part2Index1 part1Vector1
part2Index2 part1Vector2
part1Index1 part2Vector1
part1Index2 part2Vector2

You can find more details here:
https://issues.apache.org/jira/browse/SPARK-8614

## What changes were proposed in this pull request?
Instead of converting the IndexedRowMatrix to a RowMatrix and losing the 
indices, we keep it as an IndexedRowMatrix: for each row we take its index and 
its vector, multiply the vector by the matrix, and store the result under the 
same index.
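
The idea can be illustrated with a minimal plain-Scala sketch. It is not the code from this pull request and it does not use Spark: `IndexedRow` here is a local stand-in for the MLlib class, and `multiplyRow` and `multiply` are hypothetical helper names chosen for illustration. The point it shows is that the index travels with its row through the multiplication, so no positional zip is needed afterwards.

```scala
object IndexedMultiply {
  // Local stand-in for MLlib's IndexedRow (assumption: index plus dense values).
  final case class IndexedRow(index: Long, vector: Vector[Double])

  // Multiply a 1 x n row vector by an n x m matrix given as a vector of rows.
  def multiplyRow(row: Vector[Double], b: Vector[Vector[Double]]): Vector[Double] = {
    require(b.nonEmpty && row.length == b.length, "dimension mismatch")
    b.head.indices.map { j =>
      row.indices.map(i => row(i) * b(i)(j)).sum
    }.toVector
  }

  // Each result keeps the index of the row it was computed from.
  def multiply(rows: Seq[IndexedRow], b: Vector[Vector[Double]]): Seq[IndexedRow] =
    rows.map(r => IndexedRow(r.index, multiplyRow(r.vector, b)))
}
```

Because every output row is built from a single input row, the pairing of index and vector does not depend on the order in which rows or partitions are visited.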

## How was this patch tested?
With these changes, all unit tests pass for the mllib module.

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/nagpall/spark patch-spark-8614

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21695.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21695


commit d833d1e2020dd45e063aeb56f7649f766a4a1635
Author: Anuj Nagpal 
Date:   2018-07-02T08:57:12Z

Maintining an order




---




[jira] [Created] (SPARK-24693) Row order preservation for operations on MLlib IndexedRowMatrix

2018-06-29 Thread Anuj Nagpall (JIRA)
Anuj Nagpall created SPARK-24693:


 Summary: Row order preservation for operations on MLlib 
IndexedRowMatrix
 Key: SPARK-24693
 URL: https://issues.apache.org/jira/browse/SPARK-24693
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Anuj Nagpall


In both IndexedRowMatrix.computeSVD and IndexedRowMatrix.multiply indices are 
dropped before calling the methods from RowMatrix. For example for 
IndexedRowMatrix.computeSVD:

    val svd = toRowMatrix().computeSVD(k, computeU, rCond)

and for IndexedRowMatrix.multiply:

    val mat = toRowMatrix().multiply(B)

After computing these results, they are zipped with the original indices, e.g. 
for IndexedRowMatrix.computeSVD:

    val indexedRows = indices.zip(svd.U.rows).map { case (i, v) =>
      IndexedRow(i, v)
    }

and for IndexedRowMatrix.multiply:

    val indexedRows = rows.map(_.index).zip(mat.rows).map { case (i, v) =>
      IndexedRow(i, v)
    }

I have observed that for IndexedRowMatrix.computeSVD().U and 
IndexedRowMatrix.multiply() (which both depend on RowMatrix.multiply), row 
indices can get mixed up when running Spark jobs with multiple 
executors/machines, i.e. the vectors and indices of the result no longer seem 
to correspond. 

To me it looks like this is caused by zipping RDDs that have a different 
ordering?

For the IndexedRowMatrix.multiply I have observed that ordering within 
partitions is preserved, but that it seems to get mixed up between partitions. 
For example, for:

part1Index1 part1Vector1
part1Index2 part1Vector2
part2Index1 part2Vector1
part2Index2 part2Vector2

I got:

part2Index1 part1Vector1
part2Index2 part1Vector2
part1Index1 part2Vector1
part1Index2 part2Vector2
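
The listing above can be reproduced with a small plain-Scala model. This is only a sketch under assumptions and does not use Spark: each inner `Seq` stands in for one partition, and `flatten` with its `order` argument is a hypothetical helper that fixes the order in which partitions are visited. It does not show why the two sides would be visited in different orders, only what a positional zip produces if they are.

```scala
object ZipMisalignment {
  // Concatenate partitions in the given visiting order (hypothetical helper).
  def flatten[A](partitions: Seq[Seq[A]], order: Seq[Int]): Seq[A] =
    order.flatMap(i => partitions(i))

  val indexPartitions: Seq[Seq[String]] = Seq(
    Seq("part1Index1", "part1Index2"),
    Seq("part2Index1", "part2Index2"))

  val vectorPartitions: Seq[Seq[String]] = Seq(
    Seq("part1Vector1", "part1Vector2"),
    Seq("part2Vector1", "part2Vector2"))

  // Assumed scenario: the index side is seen with its two partitions swapped,
  // while the vector side is seen in the original order.
  val zipped: Seq[(String, String)] =
    flatten(indexPartitions, Seq(1, 0)).zip(flatten(vectorPartitions, Seq(0, 1)))

  def main(args: Array[String]): Unit =
    zipped.foreach { case (i, v) => println(s"$i $v") }
}
```

In this model the order within each partition is untouched, yet every index ends up paired with a vector from the other partition, which matches the pattern reported above.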

Another observation is that the mapPartitions in RowMatrix.multiply:

    val AB = rows.mapPartitions { iter =>

had a "preservesPartitioning = true" argument in version 1.0, but this is no 
longer there.









