[jira] [Commented] (SPARK-14850) VectorUDT/MatrixUDT should take primitive arrays without boxing
[ https://issues.apache.org/jira/browse/SPARK-14850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15418526#comment-15418526 ]

胡振宇 commented on SPARK-14850:
-------------------------------

I tried to run your code on Spark 1.6.1, but I found that "toDF" cannot be used in this example. Here is my code:

    object Example {
      def main(args: Array[String]) {
        case class Test(num: Int, vector: Vector)
        val conf = new SparkConf().setAppName("Example")
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._
        val temp = sqlContext.sparkContext.parallelize(0 until 1e4.toInt, 1)
          .map(i => Test(i, Vectors.dense(Array.fill(1e6.toInt)(1.0))))
          .toDF() // at this step toDF cannot be used
      }
    }

I also tried the original snippet with the SparkContext directly, but toDF cannot be used there either:

    sc.parallelize(0 until 1e4.toInt, 1).map { i =>
      (i, Vectors.dense(Array.fill(1e6.toInt)(1.0)))
    }.toDF.rdd.count()

Do you have a solution for running the example on Spark 1.6.1? Thank you.

> VectorUDT/MatrixUDT should take primitive arrays without boxing
> ---------------------------------------------------------------
>
>                 Key: SPARK-14850
>                 URL: https://issues.apache.org/jira/browse/SPARK-14850
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, SQL
>    Affects Versions: 1.5.2, 1.6.1, 2.0.0
>            Reporter: Xiangrui Meng
>            Assignee: Wenchen Fan
>            Priority: Critical
>             Fix For: 2.0.0
>
>
> In SPARK-9390, we switched to using GenericArrayData to store indices and
> values in vector/matrix UDTs. However, GenericArrayData is not specialized
> for primitive types. This might hurt MLlib performance badly. We should
> consider either specializing GenericArrayData or using a different container.
> cc: [~cloud_fan] [~yhuai]

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
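The issue description quoted above says that GenericArrayData is not specialized for primitive types. As a rough, Spark-free illustration of what that means on the JVM, the sketch below contrasts a primitive `Array[Double]` with an `Array[Any]` holding the same values. The object name `BoxingSketch` is hypothetical, and the assumption that a generic container behaves like the `Array[Any]` here is mine, not something this thread states.

```scala
object BoxingSketch {
  // Primitive container: backed by a JVM double[], values are stored inline.
  val primitive: Array[Double] = Array.fill(4)(1.0)

  // Generic container: backed by Object[], so every element is a separately
  // allocated java.lang.Double (this is the boxing the issue title refers to).
  val generic: Array[Any] = Array.fill[Any](4)(1.0)

  def main(args: Array[String]): Unit = {
    println(primitive.getClass.getComponentType) // element type of the primitive array
    println(generic.getClass.getComponentType)   // element type of the generic array
    println(generic(0).getClass)                 // runtime class of one generic element
  }
}
```

With a million-element vector per row, as in the benchmark in this thread, the generic layout implies one boxed object per value, which is a plausible source of the serialization cost being discussed.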
[jira] [Commented] (SPARK-14850) VectorUDT/MatrixUDT should take primitive arrays without boxing
[ https://issues.apache.org/jira/browse/SPARK-14850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15418547#comment-15418547 ]

胡振宇 commented on SPARK-14850:
-------------------------------

    /* code is for Spark 1.6.1 */
    object Example {
      def main(args: Array[String]) {
        val conf = new SparkConf().setAppName("Example")
        val sc = new SparkContext(conf)
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._
        val count = sqlContext.sparkContext.parallelize(0 until 1e4.toInt, 1).map { i =>
          Test(i, Vectors.dense(Array.fill(1e6.toInt)(1.0)))
        }.toDF().rdd.count() // at this step toDF cannot be used on Spark 1.6.1
      }
    }

So I am not able to test the simple serialization example.
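The thread does not record a reply to this question. As a hedged sketch only: a frequent reason for `toDF` being unavailable in Spark 1.x is a case class declared inside the method that uses it, or a missing `import sqlContext.implicits._`. Whether either applies here is an assumption. The version below moves `Test` to the top level and spells out the imports; the `org.apache.spark.mllib.linalg` package is assumed because the original snippet shows no imports, and the master URL is assumed to come from `spark-submit`. It has not been run against a 1.6.1 cluster.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.sql.SQLContext

// Declared at the top level (not inside main) so that the implicit
// conversion behind toDF can derive a schema for it.
case class Test(num: Int, vector: Vector)

object Example {
  def main(args: Array[String]): Unit = {
    // The master URL is expected to be supplied by spark-submit.
    val conf = new SparkConf().setAppName("Example")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // Brings the RDD-to-DataFrame conversions (toDF) into scope.
    import sqlContext.implicits._

    // Same workload as the benchmark in this thread: 1e4 rows,
    // each carrying a dense vector of 1e6 doubles, in one partition.
    val count = sc.parallelize(0 until 1e4.toInt, 1).map { i =>
      Test(i, Vectors.dense(Array.fill(1e6.toInt)(1.0)))
    }.toDF().rdd.count()

    println(s"rows: $count")
    sc.stop()
  }
}
```

In `spark-shell`, where `sc` and `sqlContext` already exist, the tuple-based one-liner from the benchmark should only need `import org.apache.spark.mllib.linalg.Vectors` first; that is likewise an assumption about the reporter's setup.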
[jira] [Commented] (SPARK-14850) VectorUDT/MatrixUDT should take primitive arrays without boxing
[ https://issues.apache.org/jira/browse/SPARK-14850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15418534#comment-15418534 ]

Wenchen Fan commented on SPARK-14850:
-------------------------------------

format your code please, it's unreadable
[jira] [Commented] (SPARK-14850) VectorUDT/MatrixUDT should take primitive arrays without boxing
[ https://issues.apache.org/jira/browse/SPARK-14850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15265330#comment-15265330 ]

Apache Spark commented on SPARK-14850:
--------------------------------------

User 'tedyu' has created a pull request for this issue:
https://github.com/apache/spark/pull/12814
[jira] [Commented] (SPARK-14850) VectorUDT/MatrixUDT should take primitive arrays without boxing
[ https://issues.apache.org/jira/browse/SPARK-14850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15265182#comment-15265182 ]

Apache Spark commented on SPARK-14850:
--------------------------------------

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/12805
[jira] [Commented] (SPARK-14850) VectorUDT/MatrixUDT should take primitive arrays without boxing
[ https://issues.apache.org/jira/browse/SPARK-14850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255282#comment-15255282 ]

Apache Spark commented on SPARK-14850:
--------------------------------------

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/12640

> VectorUDT/MatrixUDT should take primitive arrays without boxing
> ---------------------------------------------------------------
>
>                 Key: SPARK-14850
>                 URL: https://issues.apache.org/jira/browse/SPARK-14850
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, SQL
>    Affects Versions: 1.5.2, 1.6.1, 2.0.0
>            Reporter: Xiangrui Meng
>            Priority: Blocker
[jira] [Commented] (SPARK-14850) VectorUDT/MatrixUDT should take primitive arrays without boxing
[ https://issues.apache.org/jira/browse/SPARK-14850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15254538#comment-15254538 ]

Xiangrui Meng commented on SPARK-14850:
---------------------------------------

Ran the following code with different Spark versions:

{code}
sc.parallelize(0 until 1e4.toInt, 1).map { i =>
  (i, Vectors.dense(Array.fill(1e6.toInt)(1.0)))
}.toDF.rdd.count()
{code}

Durations:
* 1.4.1: 22s
* 1.5.2: 282s
* 1.6.0: 360s
* 1.6.1: 340s

So there is about a 15x slowdown in serialization.

> VectorUDT/MatrixUDT should take primitive arrays without boxing
> ---------------------------------------------------------------
>
>                 Key: SPARK-14850
>                 URL: https://issues.apache.org/jira/browse/SPARK-14850
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, SQL
>    Affects Versions: 1.6.1, 2.0.0
>            Reporter: Xiangrui Meng
>            Priority: Critical
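The "about 15x" figure in the comment above can be checked against the listed durations with a few lines of arithmetic. The sketch below is plain Scala with no Spark dependency; the object name `SlowdownSketch` is hypothetical, and the numbers are copied from the comment rather than re-measured.

```scala
object SlowdownSketch {
  // Durations in seconds, as reported in the comment (not re-measured).
  val durations: Seq[(String, Double)] = Seq(
    "1.4.1" -> 22.0,
    "1.5.2" -> 282.0,
    "1.6.0" -> 360.0,
    "1.6.1" -> 340.0
  )

  // Spark 1.4.1 is taken as the baseline.
  val baseline: Double = durations.head._2

  // Ratio of each version's duration to the baseline.
  val slowdown: Seq[(String, Double)] =
    durations.map { case (version, seconds) => version -> seconds / baseline }

  def main(args: Array[String]): Unit =
    slowdown.foreach { case (version, ratio) => println(f"$version%-6s $ratio%.1fx") }
}
```

Relative to 1.4.1, the ratios come out at roughly 12.8x for 1.5.2, 16.4x for 1.6.0, and 15.5x for 1.6.1, which is consistent with the rounded figure in the comment.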