[jira] [Commented] (SPARK-18406) Race between end-of-task and completion iterator read lock release

2017-09-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16169661#comment-16169661
 ] 

Hadoop QA commented on SPARK-18406:
-----------------------------------


[ 
https://issues-test.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167181#comment-16167181
 ] 

Wenchen Fan commented on SPARK-18406:
-------------------------------------

https://github.com/apache/spark/pull/18099 is the PR that backported the fix to 2.1.




> Race between end-of-task and completion iterator read lock release
> ------------------------------------------------------------------
>
> Key: SPARK-18406
> URL: https://issues.apache.org/jira/browse/SPARK-18406
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
> Affects Versions: 2.0.0, 2.0.1
> Reporter: Josh Rosen
> Assignee: Jiang Xingbo
> Fix For: 2.0.3, 2.1.2, 2.2.0
>
>
> The following log comes from a production streaming job where executors periodically die due to uncaught exceptions during block release:
> {code}
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7921
> 16/11/07 17:11:06 INFO Executor: Running task 0.0 in stage 2390.0 (TID 7921)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7922
> 16/11/07 17:11:06 INFO Executor: Running task 1.0 in stage 2390.0 (TID 7922)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7923
> 16/11/07 17:11:06 INFO Executor: Running task 2.0 in stage 2390.0 (TID 7923)
> 16/11/07 17:11:06 INFO TorrentBroadcast: Started reading broadcast variable 2721
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7924
> 16/11/07 17:11:06 INFO Executor: Running task 3.0 in stage 2390.0 (TID 7924)
> 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721_piece0 stored as bytes in memory (estimated size 5.0 KB, free 4.9 GB)
> 16/11/07 17:11:06 INFO TorrentBroadcast: Reading broadcast variable 2721 took 3 ms
> 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721 stored as values in memory (estimated size 9.4 KB, free 4.9 GB)
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_1 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_3 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_2 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_4 locally
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 2, boot = -566, init = 567, finish = 1
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 7, boot = -540, init = 541, finish = 6
> 16/11/07 17:11:06 INFO Executor: Finished task 2.0 in stage 2390.0 (TID 7923). 1429 bytes result sent to driver
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 8, boot = -532, init = 533, finish = 7
> 16/11/07 17:11:06 INFO Executor: Finished task 3.0 in stage 2390.0 (TID 7924). 1429 bytes result sent to driver
> 16/11/07 17:11:06 ERROR Executor: Exception in task 0.0 in stage 2390.0 (TID 7921)
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84)
>   at org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66)
>   at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:362)
>   at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:361)
>   at scala.Option.foreach(Option.scala:236)
>   at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361)
>   at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:356)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:356)
>   at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:646)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 16/11/07 17:11:06 INFO 
> {code}
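
Editor's note: the stack trace above shows the end-of-task cleanup path (releaseAllLocksForTask) tripping the readerCount invariant. Below is a minimal, hypothetical Scala sketch of that invariant and of the double-release race; it is a simplified illustration, not Spark's actual BlockInfoManager code.

{code}
object ReadLockRaceSketch {
  // Simplified, hypothetical stand-in for org.apache.spark.storage.BlockInfo.
  final class BlockInfo {
    private var _readerCount = 0
    def readerCount: Int = _readerCount
    def readerCount_=(c: Int): Unit = { _readerCount = c; checkInvariants() }
    // Mirrors the failing check: the reader count may never go negative.
    private def checkInvariants(): Unit = assert(_readerCount >= 0)
  }

  def main(args: Array[String]): Unit = {
    val info = new BlockInfo
    info.readerCount += 1 // the task takes a read lock on the cached block
    info.readerCount -= 1 // the completion iterator releases it when reading ends
    info.readerCount -= 1 // end-of-task cleanup races in and releases it again:
                          // java.lang.AssertionError, as in the log above
  }
}
{code}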

[jira] [Commented] (SPARK-18406) Race between end-of-task and completion iterator read lock release

2017-09-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16169586#comment-16169586
 ] 

Hadoop QA commented on SPARK-18406:
-----------------------------------


[ 
https://issues-test.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167180#comment-16167180
 ] 

Yongqin Xiao commented on SPARK-18406:
--------------------------------------

[~cloud_fan], I see there are three check-ins for this issue, touching multiple files. You mentioned the fix will be backported to Spark 2.1.0. Can you let me know which single submission in Spark 2.1.0 will address the issue?
The reason I am asking is that my company may not move to Spark 2.2 very soon, so I will have to port your fix to our company's versions of Spark 2.1.0 and 2.0.1. I cannot just use the latest Spark 2.1.0 even after you backport the fix, because we have other patches on top of Spark 2.1.0, some of which we fixed ourselves.
Thanks for your help.





[jira] [Commented] (SPARK-17400) MinMaxScaler.transform() outputs DenseVector by default, which causes poor performance

2016-09-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15473487#comment-15473487
 ] 

Hadoop QA commented on SPARK-17400:
-----------------------------------


[ 
https://issues.apache.org/jira/browse/SPARK-17400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15464360#comment-15464360
 ] 

Nick Pentreath commented on SPARK-17400:
----------------------------------------

Can you comment more on the performance issue - are you actually seeing this in 
practice? From the comment, it seems in most cases zeros in the input vector 
would be transformed to non-zeros, so I wonder how much benefit is gained from 
a sparse representation?

In any case, it seems like a fairly easy possible win to use 
`SparseVector.compressed` here (e.g. see 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Binarizer.scala#L91)
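
Editor's note: a short sketch of the suggestion above, assuming the org.apache.spark.ml.linalg API. Vector.compressed returns whichever of the dense or sparse encodings uses less storage, so the caller does not have to count non-zeros itself.

{code}
import org.apache.spark.ml.linalg.{Vector, Vectors}

// A mostly-zero result, such as a scaled feature vector.
val scaled: Vector = Vectors.dense(Array(0.0, 0.3, 0.0, 0.9, 0.0))

// Returns a SparseVector here because the sparse encoding is smaller;
// for mostly non-zero data it would keep the dense representation.
val out: Vector = scaled.compressed
{code}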







> MinMaxScaler.transform() outputs DenseVector by default, which causes poor performance
> ---------------------------------------------------------------------------------------
>
> Key: SPARK-17400
> URL: https://issues.apache.org/jira/browse/SPARK-17400
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
> Affects Versions: 1.6.1, 1.6.2, 2.0.0
> Reporter: Frank Dai
>
>
> MinMaxScaler.transform() outputs DenseVector by default, which causes poor performance and consumes a lot of memory.
> The most important line of code is the following:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala#L195
> I suggest that the code calculate the number of non-zero elements in advance: if fewer than half of the elements in the vector are non-zero, use SparseVector; otherwise use DenseVector.
> Or we can make it configurable by adding a parameter to MinMaxScaler.transform(), for example MinMaxScaler.transform(isDense: Boolean), so that users can decide whether their output is dense or sparse.
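
Editor's note: a hypothetical sketch of the heuristic proposed in the description, written against the org.apache.spark.ml.linalg API. The helper name pickRepresentation is made up for illustration; in practice Vector.compressed already applies an equivalent size-based rule.

{code}
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Count the non-zero entries first, then choose the cheaper representation.
def pickRepresentation(values: Array[Double]): Vector = {
  val nnz = values.count(_ != 0.0)
  if (nnz < values.length / 2.0) {
    // Fewer than half of the entries are non-zero: build a SparseVector.
    val active = values.zipWithIndex.collect { case (v, i) if v != 0.0 => (i, v) }
    Vectors.sparse(values.length, active)
  } else {
    Vectors.dense(values)
  }
}

// e.g. pickRepresentation(Array(0.0, 0.3, 0.0, 0.9, 0.0)) yields a SparseVector
{code}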



[jira] [Commented] (SPARK-17400) MinMaxScaler.transform() outputs DenseVector by default, which causes poor performance

2016-09-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15473488#comment-15473488
 ] 

Hadoop QA commented on SPARK-17400:
-----------------------------------


[ 
https://issues.apache.org/jira/browse/SPARK-17400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15464360#comment-15464360
 ] 

Nick Pentreath edited comment on SPARK-17400 at 9/5/16 7:42 AM:
----------------------------------------------------------------

Can you comment more on the performance issue - are you actually seeing this in 
practice? From the comment, it seems in most cases zeros in the input vector 
would be transformed to non-zeros, so I wonder how much benefit is gained from 
a sparse representation?

In any case, it seems like a fairly easy possible win to use 
{{SparseVector.compressed}} here (e.g. see 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Binarizer.scala#L91)


was (Author: mlnick):
Can you comment more on the performance issue - are you actually seeing this in 
practice? From the comment, it seems in most cases zeros in the input vector 
would be transformed to non-zeros, so I wonder how much benefit is gained from 
a sparse representation?

In any case, it seems like a fairly easy possible win to use 
`SparseVector.compressed` here (e.g. see 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Binarizer.scala#L91)




[jira] [Commented] (SPARK-17339) Fix SparkR tests on Windows

2016-09-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15472681#comment-15472681
 ] 

Hadoop QA commented on SPARK-17339:
-----------------------------------


[ 
https://issues.apache.org/jira/browse/SPARK-17339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15464236#comment-15464236
 ] 

Hyukjin Kwon commented on SPARK-17339:
--------------------------------------

[~sarutak] [~shivaram] Please cc me if any of you submit a PR so that I can run 
the build automation (as it is not merged yet). Otherwise, I can do this if you 
tell me which one is preferred.







> Fix SparkR tests on Windows
> ---------------------------
>
> Key: SPARK-17339
> URL: https://issues.apache.org/jira/browse/SPARK-17339
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, Tests
> Reporter: Shivaram Venkataraman
> Assignee: Hyukjin Kwon
> Fix For: 2.1.0
>
>
> A number of SparkR tests are currently failing when run on Windows, as discussed in https://github.com/apache/spark/pull/14743
> The list of tests that fail right now is at https://gist.github.com/shivaram/7693df7bd54dc81e2e7d1ce296c41134
> A full log from a build and test on AppVeyor is at https://ci.appveyor.com/project/HyukjinKwon/spark/build/46-test123



[jira] [Commented] (SPARK-17339) Fix SparkR tests on Windows

2016-09-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15472604#comment-15472604
 ] 

Hadoop QA commented on SPARK-17339:
-----------------------------------


[ 
https://issues.apache.org/jira/browse/SPARK-17339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15464244#comment-15464244
 ] 

Shivaram Venkataraman commented on SPARK-17339:
-----------------------------------------------

Thanks [~hyukjin.kwon] -- it would be great if you could try the `Utils.resolveURI` change as a PR and run it through the build automation tool.

Also, the reason I was trying to debug this today is that I feel it would be better to make the build green before merging the automation -- otherwise it might confuse other contributors.




[jira] [Commented] (SPARK-17339) Fix SparkR tests on Windows

2016-09-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15472549#comment-15472549
 ] 

Hadoop QA commented on SPARK-17339:
-----------------------------------


[ 
https://issues.apache.org/jira/browse/SPARK-17339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15464247#comment-15464247
 ] 

Kousuke Saruta commented on SPARK-17339:
----------------------------------------

[~hyukjin.kwon] Go ahead and submit a PR. Thanks!




[jira] [Commented] (SPARK-17339) Fix SparkR tests on Windows

2016-09-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15472523#comment-15472523
 ] 

Hadoop QA commented on SPARK-17339:
-----------------------------------


[ 
https://issues.apache.org/jira/browse/SPARK-17339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15464250#comment-15464250
 ] 

Hyukjin Kwon commented on SPARK-17339:
--------------------------------------

Yep, I totally agree. Thank you both! I will submit a PR later today.




[jira] [Commented] (SPARK-17400) MinMaxScaler.transform() outputs DenseVector by default, which causes poor performance

2016-09-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15472413#comment-15472413
 ] 

Hadoop QA commented on SPARK-17400:
-----------------------------------

Frank Dai created SPARK-17400:
-------------------------------------

 Summary: MinMaxScaler.transform() outputs DenseVector by default, which causes poor performance
 Key: SPARK-17400
 URL: https://issues.apache.org/jira/browse/SPARK-17400
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 2.0.0, 1.6.2, 1.6.1
Reporter: Frank Dai


MinMaxScaler.transform() outputs DenseVector by default, which causes poor performance and consumes a lot of memory.

The most important line of code is the following:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala#L195

I suggest that the code calculate the number of non-zero elements in advance: if fewer than half of the elements in the vector are non-zero, use SparseVector; otherwise use DenseVector.



[jira] [Commented] (SPARK-17400) MinMaxScaler.transform() outputs DenseVector by default, which causes poor performance

2016-09-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15472381#comment-15472381
 ] 

Hadoop QA commented on SPARK-17400:
-----------------------------------


 [ 
https://issues.apache.org/jira/browse/SPARK-17400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank Dai updated SPARK-17400:
------------------------------
Description: 
MinMaxScaler.transform() outputs DenseVector by default, which causes poor performance and consumes a lot of memory.

The most important line of code is the following:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala#L195

I suggest that the code calculate the number of non-zero elements in advance: if fewer than half of the elements in the vector are non-zero, use SparseVector; otherwise use DenseVector.

Or we can make it configurable by adding a parameter to 

  was:
MinMaxScaler.transform() outputs DenseVector by default, which causes poor performance and consumes a lot of memory.

The most important line of code is the following:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala#L195

I suggest that the code calculate the number of non-zero elements in advance: if fewer than half of the elements in the vector are non-zero, use SparseVector; otherwise use DenseVector.





[jira] [Commented] (SPARK-17400) MinMaxScaler.transform() outputs DenseVector by default, which causes poor performance

2016-09-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15472372#comment-15472372
 ] 

Hadoop QA commented on SPARK-17400:
-----------------------------------


 [ 
https://issues.apache.org/jira/browse/SPARK-17400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank Dai updated SPARK-17400:
------------------------------
Description: 
MinMaxScaler.transform() outputs DenseVector by default, which causes poor performance and consumes a lot of memory.

The most important line of code is the following:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala#L195

I suggest that the code calculate the number of non-zero elements in advance: if fewer than half of the elements in the vector are non-zero, use SparseVector; otherwise use DenseVector.

Or we can make it configurable by adding a parameter to MinMaxScaler.transform(), for example MinMaxScaler.transform(isDense: Boolean), so that users can decide whether their output is dense or sparse.

  was:
MinMaxScaler.transform() outputs DenseVector by default, which causes poor performance and consumes a lot of memory.

The most important line of code is the following:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala#L195

I suggest that the code calculate the number of non-zero elements in advance: if fewer than half of the elements in the vector are non-zero, use SparseVector; otherwise use DenseVector.

Or we can make it configurable by adding a parameter to 





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


