[jira] [Commented] (SPARK-18406) Race between end-of-task and completion iterator read lock release
[ https://issues.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16169661#comment-16169661 ]

Hadoop QA commented on SPARK-18406:
-----------------------------------

[ https://issues-test.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167181#comment-16167181 ]

Wenchen Fan commented on SPARK-18406:
-------------------------------------

https://github.com/apache/spark/pull/18099 is the PR that backported the fix to 2.1

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

> Race between end-of-task and completion iterator read lock release
> ------------------------------------------------------------------
>
>                 Key: SPARK-18406
>                 URL: https://issues.apache.org/jira/browse/SPARK-18406
>             Project: Spark
>          Issue Type: Bug
>          Components: Block Manager, Spark Core
>    Affects Versions: 2.0.0, 2.0.1
>            Reporter: Josh Rosen
>            Assignee: Jiang Xingbo
>             Fix For: 2.0.3, 2.1.2, 2.2.0
>
>
> The following log comes from a production streaming job where executors periodically die due to uncaught exceptions during block release:
> {code}
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7921
> 16/11/07 17:11:06 INFO Executor: Running task 0.0 in stage 2390.0 (TID 7921)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7922
> 16/11/07 17:11:06 INFO Executor: Running task 1.0 in stage 2390.0 (TID 7922)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7923
> 16/11/07 17:11:06 INFO Executor: Running task 2.0 in stage 2390.0 (TID 7923)
> 16/11/07 17:11:06 INFO TorrentBroadcast: Started reading broadcast variable 2721
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7924
> 16/11/07 17:11:06 INFO Executor: Running task 3.0 in stage 2390.0 (TID 7924)
> 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721_piece0 stored as bytes in memory (estimated size 5.0 KB, free 4.9 GB)
> 16/11/07 17:11:06 INFO TorrentBroadcast: Reading broadcast variable 2721 took 3 ms
> 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721 stored as values in memory (estimated size 9.4 KB, free 4.9 GB)
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_1 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_3 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_2 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_4 locally
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 2, boot = -566, init = 567, finish = 1
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 7, boot = -540, init = 541, finish = 6
> 16/11/07 17:11:06 INFO Executor: Finished task 2.0 in stage 2390.0 (TID 7923). 1429 bytes result sent to driver
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 8, boot = -532, init = 533, finish = 7
> 16/11/07 17:11:06 INFO Executor: Finished task 3.0 in stage 2390.0 (TID 7924). 1429 bytes result sent to driver
> 16/11/07 17:11:06 ERROR Executor: Exception in task 0.0 in stage 2390.0 (TID 7921)
> java.lang.AssertionError: assertion failed
>     at scala.Predef$.assert(Predef.scala:165)
>     at org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84)
>     at org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66)
>     at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:362)
>     at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:361)
>     at scala.Option.foreach(Option.scala:236)
>     at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361)
>     at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:356)
>     at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>     at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:356)
>     at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:646)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> 16/11/07 17:11:06 INFO
[jira] [Commented] (SPARK-18406) Race between end-of-task and completion iterator read lock release
[ https://issues.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16169586#comment-16169586 ]

Hadoop QA commented on SPARK-18406:
-----------------------------------

[ https://issues-test.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167180#comment-16167180 ]

Yongqin Xiao commented on SPARK-18406:
--------------------------------------

[~cloud_fan], I see there are 3 check-ins for this issue, touching multiple files. You mentioned the fix will be backported to Spark 2.1.0. Can you let me know which single submission in Spark 2.1.0 will address the issue?
The reason I am asking is that my company may not update to Spark 2.2 very soon, so I will have to port your fix to our company's versions of Spark 2.1.0 and 2.0.1. I cannot just use the latest Spark 2.1.0 even after you backport the fix, because we have other patches on top of Spark 2.1.0, some of which are our own fixes.
Thanks for your help.
[jira] [Commented] (SPARK-17400) MinMaxScaler.transform() outputs DenseVector by default, which causes poor performance
[ https://issues.apache.org/jira/browse/SPARK-17400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15473487#comment-15473487 ]

Hadoop QA commented on SPARK-17400:
-----------------------------------

[ https://issues.apache.org/jira/browse/SPARK-17400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15464360#comment-15464360 ]

Nick Pentreath commented on SPARK-17400:
----------------------------------------

Can you comment more on the performance issue - are you actually seeing this in practice? From the comment, it seems in most cases zeros in the input vector would be transformed to non-zeros, so I wonder how much benefit is gained from a sparse representation?

In any case, it seems like a fairly easy possible win to use `SparseVector.compressed` here (e.g. see https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Binarizer.scala#L91)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

> MinMaxScaler.transform() outputs DenseVector by default, which causes poor performance
> --------------------------------------------------------------------------------------
>
>                 Key: SPARK-17400
>                 URL: https://issues.apache.org/jira/browse/SPARK-17400
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>    Affects Versions: 1.6.1, 1.6.2, 2.0.0
>            Reporter: Frank Dai
>
> MinMaxScaler.transform() outputs DenseVector by default, which will cause poor performance and consume a lot of memory.
> The most important line of code is the following:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala#L195
> I suggest that the code should calculate the number of non-zero elements in advance, if the number of non-zero elements is less than half of the total elements in the matrix, use SparseVector, otherwise use DenseVector
> Or we can make it configurable by adding a parameter to MinMaxScaler.transform(), for example MinMaxScaler.transform(isDense: Boolean), so that users can decide whether their output result is dense or sparse.
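The question raised in the comment, whether zeros in the input survive min-max rescaling, can be illustrated with a minimal sketch in plain Python. This is not Spark code. It assumes the rescaling formula given in the MinMaxScaler documentation, (x - E_min) / (E_max - E_min) * (max - min) + min, with a constant column mapped to 0.5 * (max + min). The helper name `min_max_scale` and its parameters are hypothetical and exist only for this illustration.

```python
def min_max_scale(values, e_min, e_max, lo=0.0, hi=1.0):
    """Rescale one feature column to [lo, hi] (hypothetical helper, not Spark API)."""
    span = e_max - e_min
    if span == 0.0:
        # Constant column: assumed to map to the midpoint of the target range.
        return [0.5 * (hi + lo) for _ in values]
    return [(x - e_min) / span * (hi - lo) + lo for x in values]


# Column whose observed minimum is 0: zeros stay zero, sparsity is preserved.
non_negative = min_max_scale([0.0, 0.0, 4.0, 2.0], e_min=0.0, e_max=4.0)

# Column with a negative minimum: every zero becomes a non-zero value.
signed = min_max_scale([-2.0, 0.0, 0.0, 2.0], e_min=-2.0, e_max=2.0)

print(non_negative)
print(signed)
```

Under these assumptions, a sparse output only pays off for features whose observed minimum is zero and whose target range starts at zero; for other features the zeros are shifted and the result is effectively dense.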
[jira] [Commented] (SPARK-17400) MinMaxScaler.transform() outputs DenseVector by default, which causes poor performance
[ https://issues.apache.org/jira/browse/SPARK-17400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15473488#comment-15473488 ]

Hadoop QA commented on SPARK-17400:
-----------------------------------

[ https://issues.apache.org/jira/browse/SPARK-17400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15464360#comment-15464360 ]

Nick Pentreath edited comment on SPARK-17400 at 9/5/16 7:42 AM:
----------------------------------------------------------------

Can you comment more on the performance issue - are you actually seeing this in practice? From the comment, it seems in most cases zeros in the input vector would be transformed to non-zeros, so I wonder how much benefit is gained from a sparse representation?

In any case, it seems like a fairly easy possible win to use {{SparseVector.compressed}} here (e.g. see https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Binarizer.scala#L91)

was (Author: mlnick):
Can you comment more on the performance issue - are you actually seeing this in practice? From the comment, it seems in most cases zeros in the input vector would be transformed to non-zeros, so I wonder how much benefit is gained from a sparse representation?

In any case, it seems like a fairly easy possible win to use `SparseVector.compressed` here (e.g. see https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Binarizer.scala#L91)
[jira] [Commented] (SPARK-17339) Fix SparkR tests on Windows
[ https://issues.apache.org/jira/browse/SPARK-17339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472681#comment-15472681 ]

Hadoop QA commented on SPARK-17339:
-----------------------------------

[ https://issues.apache.org/jira/browse/SPARK-17339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15464236#comment-15464236 ]

Hyukjin Kwon commented on SPARK-17339:
--------------------------------------

[~sarutak] [~shivaram] Please cc me if any of you submit a PR so that I can run the build automation (as it is not merged yet). Otherwise, I can do this if you tell me which one is preferred.

> Fix SparkR tests on Windows
> ---------------------------
>
>                 Key: SPARK-17339
>                 URL: https://issues.apache.org/jira/browse/SPARK-17339
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR, Tests
>            Reporter: Shivaram Venkataraman
>            Assignee: Hyukjin Kwon
>             Fix For: 2.1.0
>
>
> A number of SparkR tests are currently failing when run on Windows as discussed in https://github.com/apache/spark/pull/14743
> The list of tests that fail right now is at https://gist.github.com/shivaram/7693df7bd54dc81e2e7d1ce296c41134
> A full log from a build and test on AppVeyor is at https://ci.appveyor.com/project/HyukjinKwon/spark/build/46-test123
[jira] [Commented] (SPARK-17339) Fix SparkR tests on Windows
[ https://issues.apache.org/jira/browse/SPARK-17339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472604#comment-15472604 ]

Hadoop QA commented on SPARK-17339:
-----------------------------------

[ https://issues.apache.org/jira/browse/SPARK-17339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15464244#comment-15464244 ]

Shivaram Venkataraman commented on SPARK-17339:
-----------------------------------------------

Thanks [~hyukjin.kwon] -- It will be great if you can try the `Utils.resolveURI` change as a PR and run that through the build automation tool.

Also the reason I was trying to debug this today is I feel like it would be better to make the build green before merging the automation -- otherwise it might confuse other contributors etc.
[jira] [Commented] (SPARK-17339) Fix SparkR tests on Windows
[ https://issues.apache.org/jira/browse/SPARK-17339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472549#comment-15472549 ]

Hadoop QA commented on SPARK-17339:
-----------------------------------

[ https://issues.apache.org/jira/browse/SPARK-17339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15464247#comment-15464247 ]

Kousuke Saruta commented on SPARK-17339:
----------------------------------------

[~hyukjin.kwon] Go ahead and submit a PR. Thanks!
[jira] [Commented] (SPARK-17339) Fix SparkR tests on Windows
[ https://issues.apache.org/jira/browse/SPARK-17339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472523#comment-15472523 ]

Hadoop QA commented on SPARK-17339:
-----------------------------------

[ https://issues.apache.org/jira/browse/SPARK-17339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15464250#comment-15464250 ]

Hyukjin Kwon commented on SPARK-17339:
--------------------------------------

Yeap, I totally agree. Thank you both! Will submit a PR within today.
[jira] [Commented] (SPARK-17400) MinMaxScaler.transform() outputs DenseVector by default, which causes poor performance
[ https://issues.apache.org/jira/browse/SPARK-17400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472413#comment-15472413 ] Hadoop QA commented on SPARK-17400: --- Frank Dai created SPARK-17400: - Summary: MinMaxScaler.transform() outputs DenseVector by default, which causes poor performance Key: SPARK-17400 URL: https://issues.apache.org/jira/browse/SPARK-17400 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 2.0.0, 1.6.2, 1.6.1 Reporter: Frank Dai MinMaxScaler.transform() outputs DenseVector by default, which will cause poor performance and consume a lot of memory. The most important line of code is the following: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala#L195 I suggest that the code calculate the number of non-zero elements in advance: if the number of non-zero elements is less than half of the total elements in the matrix, use SparseVector; otherwise use DenseVector. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org > MinMaxScaler.transform() outputs DenseVector by default, which causes poor > performance > -- > > Key: SPARK-17400 > URL: https://issues.apache.org/jira/browse/SPARK-17400 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Frank Dai > > MinMaxScaler.transform() outputs DenseVector by default, which will cause > poor performance and consume a lot of memory. 
> The most important line of code is the following: > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala#L195 > I suggest that the code calculate the number of non-zero elements in > advance: if the number of non-zero elements is less than half of the total > elements in the matrix, use SparseVector; otherwise use DenseVector. > Or we can make it configurable by adding a parameter to > MinMaxScaler.transform(), for example MinMaxScaler.transform(isDense: > Boolean), so that users can decide whether their output result is dense or > sparse. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
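The two proposals in the description (pick the representation from the non-zero count, or let the caller decide) can be illustrated with a small sketch. This is plain Python, not Spark code: the function name `compress`, the `is_dense` argument, and the `(size, indices, values)` tuple standing in for a sparse vector are all hypothetical and chosen only for illustration. Note that the count has to be taken on the rescaled values, since min-max rescaling only keeps a zero at zero when the feature minimum and the target minimum are both 0.

```python
from typing import List, Optional, Tuple, Union

Dense = List[float]
# Hypothetical stand-in for a sparse vector: (size, indices, values).
Sparse = Tuple[int, List[int], List[float]]


def compress(values: List[float],
             is_dense: Optional[bool] = None) -> Union[Dense, Sparse]:
    """Return a dense list or a sparse (size, indices, values) tuple.

    If is_dense is None, use the heuristic from the issue description:
    go sparse when fewer than half of the entries are non-zero.
    Otherwise honour the caller's explicit choice.
    """
    # Positions of the non-zero entries in the (already rescaled) row.
    indices = [i for i, v in enumerate(values) if v != 0.0]
    if is_dense is None:
        # Dense unless strictly fewer than half of the entries are non-zero.
        is_dense = 2 * len(indices) >= len(values)
    if is_dense:
        return list(values)
    return (len(values), indices, [values[i] for i in indices])
```

With this sketch, a row with one non-zero entry out of four comes back in the sparse form, while a row that is exactly half non-zero stays dense; passing `is_dense=True` or `is_dense=False` overrides the heuristic, which corresponds to the configurable variant suggested in the issue.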
[jira] [Commented] (SPARK-17400) MinMaxScaler.transform() outputs DenseVector by default, which causes poor performance
[ https://issues.apache.org/jira/browse/SPARK-17400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472381#comment-15472381 ] Hadoop QA commented on SPARK-17400: --- [ https://issues.apache.org/jira/browse/SPARK-17400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Frank Dai updated SPARK-17400: -- Description: MinMaxScaler.transform() outputs DenseVector by default, which will cause poor performance and consume a lot of memory. The most important line of code is the following: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala#L195 I suggest that the code calculate the number of non-zero elements in advance: if the number of non-zero elements is less than half of the total elements in the matrix, use SparseVector; otherwise use DenseVector. Or we can make it configurable by adding a parameter to was: MinMaxScaler.transform() outputs DenseVector by default, which will cause poor performance and consume a lot of memory. 
The most important line of code is the following: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala#L195 I suggest that the code calculate the number of non-zero elements in advance: if the number of non-zero elements is less than half of the total elements in the matrix, use SparseVector; otherwise use DenseVector. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org > MinMaxScaler.transform() outputs DenseVector by default, which causes poor > performance > -- > > Key: SPARK-17400 > URL: https://issues.apache.org/jira/browse/SPARK-17400 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Frank Dai > > MinMaxScaler.transform() outputs DenseVector by default, which will cause > poor performance and consume a lot of memory. > The most important line of code is the following: > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala#L195 > I suggest that the code calculate the number of non-zero elements in > advance: if the number of non-zero elements is less than half of the total > elements in the matrix, use SparseVector; otherwise use DenseVector. > Or we can make it configurable by adding a parameter to > MinMaxScaler.transform(), for example MinMaxScaler.transform(isDense: > Boolean), so that users can decide whether their output result is dense or > sparse. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17400) MinMaxScaler.transform() outputs DenseVector by default, which causes poor performance
[ https://issues.apache.org/jira/browse/SPARK-17400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472372#comment-15472372 ] Hadoop QA commented on SPARK-17400: --- [ https://issues.apache.org/jira/browse/SPARK-17400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Frank Dai updated SPARK-17400: -- Description: MinMaxScaler.transform() outputs DenseVector by default, which will cause poor performance and consume a lot of memory. The most important line of code is the following: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala#L195 I suggest that the code calculate the number of non-zero elements in advance: if the number of non-zero elements is less than half of the total elements in the matrix, use SparseVector; otherwise use DenseVector. Or we can make it configurable by adding a parameter to MinMaxScaler.transform(), for example MinMaxScaler.transform(isDense: Boolean), so that users can decide whether their output result is dense or sparse. was: MinMaxScaler.transform() outputs DenseVector by default, which will cause poor performance and consume a lot of memory. 
The most important line of code is the following: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala#L195 I suggest that the code calculate the number of non-zero elements in advance: if the number of non-zero elements is less than half of the total elements in the matrix, use SparseVector; otherwise use DenseVector. Or we can make it configurable by adding a parameter to -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org > MinMaxScaler.transform() outputs DenseVector by default, which causes poor > performance > -- > > Key: SPARK-17400 > URL: https://issues.apache.org/jira/browse/SPARK-17400 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Frank Dai > > MinMaxScaler.transform() outputs DenseVector by default, which will cause > poor performance and consume a lot of memory. > The most important line of code is the following: > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala#L195 > I suggest that the code calculate the number of non-zero elements in > advance: if the number of non-zero elements is less than half of the total > elements in the matrix, use SparseVector; otherwise use DenseVector. > Or we can make it configurable by adding a parameter to > MinMaxScaler.transform(), for example MinMaxScaler.transform(isDense: > Boolean), so that users can decide whether their output result is dense or > sparse. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org