[jira] [Commented] (SPARK-19131) Support "alter table drop partition [if exists]"
[ https://issues.apache.org/jira/browse/SPARK-19131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15810768#comment-15810768 ] lichenglin commented on SPARK-19131: Yes > Support "alter table drop partition [if exists]" > > > Key: SPARK-19131 > URL: https://issues.apache.org/jira/browse/SPARK-19131 > Project: Spark > Issue Type: New Feature >Affects Versions: 2.1.0 >Reporter: lichenglin > > {code} > val parts = client.getPartitions(hiveTable, s.asJava).asScala > if (parts.isEmpty && !ignoreIfNotExists) { > throw new AnalysisException( > s"No partition is dropped. One partition spec '$s' does not exist > in table '$table' " + > s"database '$db'") > } > parts.map(_.getValues) > {code} > Until 2.1.0,drop partition will throw a exception when no partition to drop. > I notice there is a param named ignoreIfNotExists. > But I don't know how to set it. > May be we can implement "alter table drop partition [if exists] " -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19131) Support "alter table drop partition [if exists]"
lichenglin created SPARK-19131: -- Summary: Support "alter table drop partition [if exists]" Key: SPARK-19131 URL: https://issues.apache.org/jira/browse/SPARK-19131 Project: Spark Issue Type: New Feature Affects Versions: 2.1.0 Reporter: lichenglin {code} val parts = client.getPartitions(hiveTable, s.asJava).asScala if (parts.isEmpty && !ignoreIfNotExists) { throw new AnalysisException( s"No partition is dropped. One partition spec '$s' does not exist in table '$table' " + s"database '$db'") } parts.map(_.getValues) {code} As of 2.1.0, drop partition throws an exception when there is no partition to drop. I notice there is a param named ignoreIfNotExists, but I don't know how to set it. Maybe we can implement "alter table drop partition [if exists]". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
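A minimal sketch of the request, assuming a SparkSession `spark` with Hive support and a partitioned table `logs` (both names are hypothetical, used only for illustration); the exact keyword placement of the optional IF EXISTS is part of the proposal and mirrors Hive's `ALTER TABLE ... DROP IF EXISTS PARTITION ...` form:

{code}
// Current behaviour in 2.1.0: if the partition spec matches nothing, the drop
// fails with the AnalysisException quoted above ("No partition is dropped. ...").
spark.sql("ALTER TABLE logs DROP PARTITION (dt='1999-01-01')")

// Proposed behaviour: expose the existing ignoreIfNotExists flag through SQL,
// so a missing partition is silently skipped instead of raising an error.
spark.sql("ALTER TABLE logs DROP IF EXISTS PARTITION (dt='1999-01-01')")
{code}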
[jira] [Created] (SPARK-19129) alter table table_name drop partition with a empty string will drop the whole table
lichenglin created SPARK-19129: -- Summary: alter table table_name drop partition with a empty string will drop the whole table Key: SPARK-19129 URL: https://issues.apache.org/jira/browse/SPARK-19129 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: lichenglin {code} val spark = SparkSession .builder .appName("PartitionDropTest") .master("local[2]").enableHiveSupport() .getOrCreate() val sentenceData = spark.createDataFrame(Seq( (0, "a"), (1, "b"), (2, "c"))) .toDF("id", "name") spark.sql("drop table if exists licllocal.partition_table") sentenceData.write.mode(SaveMode.Overwrite).partitionBy("id").saveAsTable("licllocal.partition_table") spark.sql("alter table licllocal.partition_table drop partition(id='')") spark.table("licllocal.partition_table").show() {code} The result is {code} +----+---+ |name| id| +----+---+ +----+---+ {code} Maybe the partition matching goes wrong when the partition value is set to an empty string. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19075) Plz make MinMaxScaler can work with a Number type field
lichenglin created SPARK-19075: -- Summary: Plz make MinMaxScaler can work with a Number type field Key: SPARK-19075 URL: https://issues.apache.org/jira/browse/SPARK-19075 Project: Spark Issue Type: Wish Components: ML Affects Versions: 2.0.2 Reporter: lichenglin Until now, MinMaxScaler can only work with a Vector-type field. Please make it work with double and other numeric types. The same goes for MaxAbsScaler. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
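Until the scalers accept scalar numeric columns directly, a common workaround is to wrap the number in a length-one Vector first. A minimal sketch, assuming a DataFrame `df` with a double column named "value" (the DataFrame and column names are illustrative):

{code}
import org.apache.spark.ml.feature.{MinMaxScaler, VectorAssembler}

// Wrap the scalar column into a Vector column, since MinMaxScaler and
// MaxAbsScaler only accept Vector input in 2.0.x.
val assembler = new VectorAssembler()
  .setInputCols(Array("value"))
  .setOutputCol("valueVec")

val scaler = new MinMaxScaler()
  .setInputCol("valueVec")
  .setOutputCol("valueScaled")

val assembled = assembler.transform(df)
val scaled = scaler.fit(assembled).transform(assembled)
scaled.select("value", "valueScaled").show()
{code}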
[jira] [Commented] (SPARK-18893) Not support "alter table .. add columns .."
[ https://issues.apache.org/jira/browse/SPARK-18893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15753304#comment-15753304 ] lichenglin commented on SPARK-18893: Spark 2.0 has disabled "alter table". [https://issues.apache.org/jira/browse/SPARK-14118] [https://issues.apache.org/jira/browse/SPARK-14130] I think this is a very important feature for data warehouses. Can Spark handle it first? > Not support "alter table .. add columns .." > > > Key: SPARK-18893 > URL: https://issues.apache.org/jira/browse/SPARK-18893 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: zuotingbing > > when we update spark from version 1.5.2 to 2.0.1, all cases we have need > change the table use "alter table add columns " failed, but it is said "All > Hive DDL Functions, including: alter table" in the official document : > http://spark.apache.org/docs/latest/sql-programming-guide.html. > Is there any plan to support sql "alter table .. add/replace columns" ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
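For reference, a sketch of the kind of statement that SPARK-14118/SPARK-14130 left unsupported (the table and column names are hypothetical); on Spark 2.0.x/2.1.x it is rejected by the SQL parser instead of being delegated to Hive:

{code}
// Rejected as an unsupported operation in Spark 2.0.x / 2.1.x
// (see the SPARK-14118 / SPARK-14130 tickets referenced above).
spark.sql("ALTER TABLE warehouse_table ADD COLUMNS (new_col DOUBLE)")
{code}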
[jira] [Comment Edited] (SPARK-14130) [Table related commands] Alter column
[ https://issues.apache.org/jira/browse/SPARK-14130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15753146#comment-15753146 ] lichenglin edited comment on SPARK-14130 at 12/16/16 2:00 AM: -- "TOK_ALTERTABLE_ADDCOLS" is a very important command for data warehouses. Does Spark have any plan to support it, even if it only works in HiveContext with specific file formats? was (Author: licl): "TOK_ALTERTABLE_ADDCOLS" is a very important command for data warehouse. Does spark have any plan to support for it?? > [Table related commands] Alter column > - > > Key: SPARK-14130 > URL: https://issues.apache.org/jira/browse/SPARK-14130 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > Fix For: 2.0.0 > > > For alter column command, we have the following tokens. > TOK_ALTERTABLE_RENAMECOL > TOK_ALTERTABLE_ADDCOLS > TOK_ALTERTABLE_REPLACECOLS > For data source tables, we should throw exceptions. For Hive tables, we > should support them. *For Hive tables, we should check Hive's behavior to see > if there is any file format that does not any of above command*. > https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java > is a good reference for Hive's behavior. > Also, for a Hive table stored in a format, we need to make sure that even if > Spark can read this tables after an alter column operation. If we cannot read > the table, even Hive allows the alter column operation, we should still throw > an exception. For example, if renaming a column of a Hive parquet table > causes the renamed column inaccessible (we cannot read values), we should not > allow this renaming operation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-14130) [Table related commands] Alter column
[ https://issues.apache.org/jira/browse/SPARK-14130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lichenglin updated SPARK-14130: --- Comment: was deleted (was: "TOK_ALTERTABLE_ADDCOLS" is a very important command for data warehouse. Does spark have any plan to support for it?? ) > [Table related commands] Alter column > - > > Key: SPARK-14130 > URL: https://issues.apache.org/jira/browse/SPARK-14130 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > Fix For: 2.0.0 > > > For alter column command, we have the following tokens. > TOK_ALTERTABLE_RENAMECOL > TOK_ALTERTABLE_ADDCOLS > TOK_ALTERTABLE_REPLACECOLS > For data source tables, we should throw exceptions. For Hive tables, we > should support them. *For Hive tables, we should check Hive's behavior to see > if there is any file format that does not any of above command*. > https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java > is a good reference for Hive's behavior. > Also, for a Hive table stored in a format, we need to make sure that even if > Spark can read this tables after an alter column operation. If we cannot read > the table, even Hive allows the alter column operation, we should still throw > an exception. For example, if renaming a column of a Hive parquet table > causes the renamed column inaccessible (we cannot read values), we should not > allow this renaming operation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14130) [Table related commands] Alter column
[ https://issues.apache.org/jira/browse/SPARK-14130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15753147#comment-15753147 ] lichenglin commented on SPARK-14130: "TOK_ALTERTABLE_ADDCOLS" is a very important command for data warehouse. Does spark have any plan to support for it?? > [Table related commands] Alter column > - > > Key: SPARK-14130 > URL: https://issues.apache.org/jira/browse/SPARK-14130 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > Fix For: 2.0.0 > > > For alter column command, we have the following tokens. > TOK_ALTERTABLE_RENAMECOL > TOK_ALTERTABLE_ADDCOLS > TOK_ALTERTABLE_REPLACECOLS > For data source tables, we should throw exceptions. For Hive tables, we > should support them. *For Hive tables, we should check Hive's behavior to see > if there is any file format that does not any of above command*. > https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java > is a good reference for Hive's behavior. > Also, for a Hive table stored in a format, we need to make sure that even if > Spark can read this tables after an alter column operation. If we cannot read > the table, even Hive allows the alter column operation, we should still throw > an exception. For example, if renaming a column of a Hive parquet table > causes the renamed column inaccessible (we cannot read values), we should not > allow this renaming operation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18441) Add Smote in spark mlib and ml
[ https://issues.apache.org/jira/browse/SPARK-18441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15669125#comment-15669125 ] lichenglin commented on SPARK-18441: Thanks, it works now. > Add Smote in spark mlib and ml > -- > > Key: SPARK-18441 > URL: https://issues.apache.org/jira/browse/SPARK-18441 > Project: Spark > Issue Type: Wish > Components: ML, MLlib >Affects Versions: 2.0.1 >Reporter: lichenglin > > PLZ Add Smote in spark mlib and ml in case of the "not balance of train > data" for Classification -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18441) Add Smote in spark mlib and ml
[ https://issues.apache.org/jira/browse/SPARK-18441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15669050#comment-15669050 ] lichenglin commented on SPARK-18441: Thanks for your reply. May I ask which version of Spark this SMOTE implementation uses? I'm using Spark 2.0.1, but an error occurs because of "org.apache.spark.ml.linalg.BLAS": my Eclipse can't recognize this class. > Add Smote in spark mlib and ml > -- > > Key: SPARK-18441 > URL: https://issues.apache.org/jira/browse/SPARK-18441 > Project: Spark > Issue Type: Wish > Components: ML, MLlib >Affects Versions: 2.0.1 >Reporter: lichenglin > > PLZ Add Smote in spark mlib and ml in case of the "not balance of train > data" for Classification -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18441) Add Smote in spark mlib and ml
lichenglin created SPARK-18441: -- Summary: Add Smote in spark mlib and ml Key: SPARK-18441 URL: https://issues.apache.org/jira/browse/SPARK-18441 Project: Spark Issue Type: Wish Components: ML, MLlib Affects Versions: 2.0.1 Reporter: lichenglin Please add SMOTE to Spark MLlib and ML to handle imbalanced training data for classification. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18413) Add a property to control the number of partitions when save a jdbc rdd
[ https://issues.apache.org/jira/browse/SPARK-18413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15662759#comment-15662759 ] lichenglin commented on SPARK-18413: I'm sorry,my network is too bad to download dependencies from maven rep for building spark. I have made a comment on your PR,please check if it is right. Thanks > Add a property to control the number of partitions when save a jdbc rdd > --- > > Key: SPARK-18413 > URL: https://issues.apache.org/jira/browse/SPARK-18413 > Project: Spark > Issue Type: Wish > Components: SQL >Affects Versions: 2.0.1 >Reporter: lichenglin > > {code} > CREATE or replace TEMPORARY VIEW resultview > USING org.apache.spark.sql.jdbc > OPTIONS ( > url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB", > dbtable "result", > user "HIVE", > password "HIVE" > ); > --set spark.sql.shuffle.partitions=200 > insert overwrite table resultview select g,count(1) as count from > tnet.DT_LIVE_INFO group by g > {code} > I'm tring to save a spark sql result to oracle. > And I found spark will create a jdbc connection for each partition. > if the sql create to many partitions , the database can't hold so many > connections and return exception. > In above situation is 200 because of the "group by" and > "spark.sql.shuffle.partitions" > the spark source code JdbcUtil is > {code} > def saveTable( > df: DataFrame, > url: String, > table: String, > properties: Properties) { > val dialect = JdbcDialects.get(url) > val nullTypes: Array[Int] = df.schema.fields.map { field => > getJdbcType(field.dataType, dialect).jdbcNullType > } > val rddSchema = df.schema > val getConnection: () => Connection = createConnectionFactory(url, > properties) > val batchSize = properties.getProperty(JDBC_BATCH_INSERT_SIZE, > "1000").toInt > df.foreachPartition { iterator => > savePartition(getConnection, table, iterator, rddSchema, nullTypes, > batchSize, dialect) > } > } > {code} > May be we can add a property for df.repartition(num).foreachPartition ? > In fact I got an exception "ORA-12519, TNS:no appropriate service handler > found" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18413) Add a property to control the number of partitions when save a jdbc rdd
[ https://issues.apache.org/jira/browse/SPARK-18413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15658921#comment-15658921 ] lichenglin commented on SPARK-18413: Sorry,I can't. I'm a rookie and have a really terrible network... > Add a property to control the number of partitions when save a jdbc rdd > --- > > Key: SPARK-18413 > URL: https://issues.apache.org/jira/browse/SPARK-18413 > Project: Spark > Issue Type: Wish > Components: SQL >Affects Versions: 2.0.1 >Reporter: lichenglin > > {code} > CREATE or replace TEMPORARY VIEW resultview > USING org.apache.spark.sql.jdbc > OPTIONS ( > url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB", > dbtable "result", > user "HIVE", > password "HIVE" > ); > --set spark.sql.shuffle.partitions=200 > insert overwrite table resultview select g,count(1) as count from > tnet.DT_LIVE_INFO group by g > {code} > I'm tring to save a spark sql result to oracle. > And I found spark will create a jdbc connection for each partition. > if the sql create to many partitions , the database can't hold so many > connections and return exception. > In above situation is 200 because of the "group by" and > "spark.sql.shuffle.partitions" > the spark source code JdbcUtil is > {code} > def saveTable( > df: DataFrame, > url: String, > table: String, > properties: Properties) { > val dialect = JdbcDialects.get(url) > val nullTypes: Array[Int] = df.schema.fields.map { field => > getJdbcType(field.dataType, dialect).jdbcNullType > } > val rddSchema = df.schema > val getConnection: () => Connection = createConnectionFactory(url, > properties) > val batchSize = properties.getProperty(JDBC_BATCH_INSERT_SIZE, > "1000").toInt > df.foreachPartition { iterator => > savePartition(getConnection, table, iterator, rddSchema, nullTypes, > batchSize, dialect) > } > } > {code} > May be we can add a property for df.repartition(num).foreachPartition ? > In fact I got an exception "ORA-12519, TNS:no appropriate service handler > found" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-18413) Add a property to control the number of partitions when save a jdbc rdd
[ https://issues.apache.org/jira/browse/SPARK-18413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lichenglin updated SPARK-18413: --- Comment: was deleted (was: Sorry,I can't. I'm a rookie and have a really terrible network... ) > Add a property to control the number of partitions when save a jdbc rdd > --- > > Key: SPARK-18413 > URL: https://issues.apache.org/jira/browse/SPARK-18413 > Project: Spark > Issue Type: Wish > Components: SQL >Affects Versions: 2.0.1 >Reporter: lichenglin > > {code} > CREATE or replace TEMPORARY VIEW resultview > USING org.apache.spark.sql.jdbc > OPTIONS ( > url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB", > dbtable "result", > user "HIVE", > password "HIVE" > ); > --set spark.sql.shuffle.partitions=200 > insert overwrite table resultview select g,count(1) as count from > tnet.DT_LIVE_INFO group by g > {code} > I'm tring to save a spark sql result to oracle. > And I found spark will create a jdbc connection for each partition. > if the sql create to many partitions , the database can't hold so many > connections and return exception. > In above situation is 200 because of the "group by" and > "spark.sql.shuffle.partitions" > the spark source code JdbcUtil is > {code} > def saveTable( > df: DataFrame, > url: String, > table: String, > properties: Properties) { > val dialect = JdbcDialects.get(url) > val nullTypes: Array[Int] = df.schema.fields.map { field => > getJdbcType(field.dataType, dialect).jdbcNullType > } > val rddSchema = df.schema > val getConnection: () => Connection = createConnectionFactory(url, > properties) > val batchSize = properties.getProperty(JDBC_BATCH_INSERT_SIZE, > "1000").toInt > df.foreachPartition { iterator => > savePartition(getConnection, table, iterator, rddSchema, nullTypes, > batchSize, dialect) > } > } > {code} > May be we can add a property for df.repartition(num).foreachPartition ? > In fact I got an exception "ORA-12519, TNS:no appropriate service handler > found" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18413) Add a property to control the number of partitions when save a jdbc rdd
[ https://issues.apache.org/jira/browse/SPARK-18413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15658562#comment-15658562 ] lichenglin edited comment on SPARK-18413 at 11/11/16 11:57 PM: --- {code} CREATE or replace TEMPORARY VIEW resultview USING org.apache.spark.sql.jdbc OPTIONS ( url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB", dbtable "result", user "HIVE", password "HIVE" numPartitions "10" ); {code} May be we can use numPartitions. until now numPartitions is just for reading was (Author: licl): {code} CREATE or replace TEMPORARY VIEW resultview USING org.apache.spark.sql.jdbc OPTIONS ( url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB", dbtable "result", user "HIVE", password "HIVE" numPartitions 10 ); {code} May be we can use numPartitions. util now numPartitions is just for reading > Add a property to control the number of partitions when save a jdbc rdd > --- > > Key: SPARK-18413 > URL: https://issues.apache.org/jira/browse/SPARK-18413 > Project: Spark > Issue Type: Wish > Components: SQL >Affects Versions: 2.0.1 >Reporter: lichenglin > > {code} > CREATE or replace TEMPORARY VIEW resultview > USING org.apache.spark.sql.jdbc > OPTIONS ( > url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB", > dbtable "result", > user "HIVE", > password "HIVE" > ); > --set spark.sql.shuffle.partitions=200 > insert overwrite table resultview select g,count(1) as count from > tnet.DT_LIVE_INFO group by g > {code} > I'm tring to save a spark sql result to oracle. > And I found spark will create a jdbc connection for each partition. > if the sql create to many partitions , the database can't hold so many > connections and return exception. > In above situation is 200 because of the "group by" and > "spark.sql.shuffle.partitions" > the spark source code JdbcUtil is > {code} > def saveTable( > df: DataFrame, > url: String, > table: String, > properties: Properties) { > val dialect = JdbcDialects.get(url) > val nullTypes: Array[Int] = df.schema.fields.map { field => > getJdbcType(field.dataType, dialect).jdbcNullType > } > val rddSchema = df.schema > val getConnection: () => Connection = createConnectionFactory(url, > properties) > val batchSize = properties.getProperty(JDBC_BATCH_INSERT_SIZE, > "1000").toInt > df.foreachPartition { iterator => > savePartition(getConnection, table, iterator, rddSchema, nullTypes, > batchSize, dialect) > } > } > {code} > May be we can add a property for df.repartition(num).foreachPartition ? > In fact I got an exception "ORA-12519, TNS:no appropriate service handler > found" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18413) Add a property to control the number of partitions when save a jdbc rdd
[ https://issues.apache.org/jira/browse/SPARK-18413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15658562#comment-15658562 ] lichenglin commented on SPARK-18413: {code} CREATE or replace TEMPORARY VIEW resultview USING org.apache.spark.sql.jdbc OPTIONS ( url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB", dbtable "result", user "HIVE", password "HIVE" numPartitions 10 ); {code} May be we can use numPartitions. util now numPartitions is just for reading > Add a property to control the number of partitions when save a jdbc rdd > --- > > Key: SPARK-18413 > URL: https://issues.apache.org/jira/browse/SPARK-18413 > Project: Spark > Issue Type: Wish > Components: SQL >Affects Versions: 2.0.1 >Reporter: lichenglin > > {code} > CREATE or replace TEMPORARY VIEW resultview > USING org.apache.spark.sql.jdbc > OPTIONS ( > url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB", > dbtable "result", > user "HIVE", > password "HIVE" > ); > --set spark.sql.shuffle.partitions=200 > insert overwrite table resultview select g,count(1) as count from > tnet.DT_LIVE_INFO group by g > {code} > I'm tring to save a spark sql result to oracle. > And I found spark will create a jdbc connection for each partition. > if the sql create to many partitions , the database can't hold so many > connections and return exception. > In above situation is 200 because of the "group by" and > "spark.sql.shuffle.partitions" > the spark source code JdbcUtil is > {code} > def saveTable( > df: DataFrame, > url: String, > table: String, > properties: Properties) { > val dialect = JdbcDialects.get(url) > val nullTypes: Array[Int] = df.schema.fields.map { field => > getJdbcType(field.dataType, dialect).jdbcNullType > } > val rddSchema = df.schema > val getConnection: () => Connection = createConnectionFactory(url, > properties) > val batchSize = properties.getProperty(JDBC_BATCH_INSERT_SIZE, > "1000").toInt > df.foreachPartition { iterator => > savePartition(getConnection, table, iterator, rddSchema, nullTypes, > batchSize, dialect) > } > } > {code} > May be we can add a property for df.repartition(num).foreachPartition ? > In fact I got an exception "ORA-12519, TNS:no appropriate service handler > found" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
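For contrast, a sketch of where numPartitions already applies in 2.0.x, namely partitioned JDBC reads; the partition column and bounds below are illustrative and assume a numeric key to split ranges on (the URL and credentials are taken from the ticket's example):

{code}
// Read-side use of numPartitions: Spark issues 10 range queries on "id"
// between the given bounds, producing one partition per range.
val readDf = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@10.129.10.111:1521:BKDB")
  .option("dbtable", "result")
  .option("user", "HIVE")
  .option("password", "HIVE")
  .option("partitionColumn", "id")   // assumed numeric column
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "10")
  .load()
{code}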
[jira] [Commented] (SPARK-18413) Add a property to control the number of partitions when save a jdbc rdd
[ https://issues.apache.org/jira/browse/SPARK-18413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15658490#comment-15658490 ] lichenglin commented on SPARK-18413: I'm using spark sql,and how to call repartition with sql??? > Add a property to control the number of partitions when save a jdbc rdd > --- > > Key: SPARK-18413 > URL: https://issues.apache.org/jira/browse/SPARK-18413 > Project: Spark > Issue Type: Wish > Components: SQL >Affects Versions: 2.0.1 >Reporter: lichenglin > > {code} > CREATE or replace TEMPORARY VIEW resultview > USING org.apache.spark.sql.jdbc > OPTIONS ( > url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB", > dbtable "result", > user "HIVE", > password "HIVE" > ); > --set spark.sql.shuffle.partitions=200 > insert overwrite table resultview select g,count(1) as count from > tnet.DT_LIVE_INFO group by g > {code} > I'm tring to save a spark sql result to oracle. > And I found spark will create a jdbc connection for each partition. > if the sql create to many partitions , the database can't hold so many > connections and return exception. > In above situation is 200 because of the "group by" and > "spark.sql.shuffle.partitions" > the spark source code JdbcUtil is > {code} > def saveTable( > df: DataFrame, > url: String, > table: String, > properties: Properties) { > val dialect = JdbcDialects.get(url) > val nullTypes: Array[Int] = df.schema.fields.map { field => > getJdbcType(field.dataType, dialect).jdbcNullType > } > val rddSchema = df.schema > val getConnection: () => Connection = createConnectionFactory(url, > properties) > val batchSize = properties.getProperty(JDBC_BATCH_INSERT_SIZE, > "1000").toInt > df.foreachPartition { iterator => > savePartition(getConnection, table, iterator, rddSchema, nullTypes, > batchSize, dialect) > } > } > {code} > May be we can add a property for df.repartition(num).foreachPartition ? > In fact I got an exception "ORA-12519, TNS:no appropriate service handler > found" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-18413) Add a property to control the number of partitions when save a jdbc rdd
[ https://issues.apache.org/jira/browse/SPARK-18413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lichenglin reopened SPARK-18413: > Add a property to control the number of partitions when save a jdbc rdd > --- > > Key: SPARK-18413 > URL: https://issues.apache.org/jira/browse/SPARK-18413 > Project: Spark > Issue Type: Wish > Components: SQL >Affects Versions: 2.0.1 >Reporter: lichenglin > > {code} > CREATE or replace TEMPORARY VIEW resultview > USING org.apache.spark.sql.jdbc > OPTIONS ( > url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB", > dbtable "result", > user "HIVE", > password "HIVE" > ); > --set spark.sql.shuffle.partitions=200 > insert overwrite table resultview select g,count(1) as count from > tnet.DT_LIVE_INFO group by g > {code} > I'm tring to save a spark sql result to oracle. > And I found spark will create a jdbc connection for each partition. > if the sql create to many partitions , the database can't hold so many > connections and return exception. > In above situation is 200 because of the "group by" and > "spark.sql.shuffle.partitions" > the spark source code JdbcUtil is > {code} > def saveTable( > df: DataFrame, > url: String, > table: String, > properties: Properties) { > val dialect = JdbcDialects.get(url) > val nullTypes: Array[Int] = df.schema.fields.map { field => > getJdbcType(field.dataType, dialect).jdbcNullType > } > val rddSchema = df.schema > val getConnection: () => Connection = createConnectionFactory(url, > properties) > val batchSize = properties.getProperty(JDBC_BATCH_INSERT_SIZE, > "1000").toInt > df.foreachPartition { iterator => > savePartition(getConnection, table, iterator, rddSchema, nullTypes, > batchSize, dialect) > } > } > {code} > May be we can add a property for df.repartition(num).foreachPartition ? > In fact I got an exception "ORA-12519, TNS:no appropriate service handler > found" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18413) Add a property to control the number of partitions when save a jdbc rdd
[ https://issues.apache.org/jira/browse/SPARK-18413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lichenglin updated SPARK-18413: --- Description: {code} CREATE or replace TEMPORARY VIEW resultview USING org.apache.spark.sql.jdbc OPTIONS ( url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB", dbtable "result", user "HIVE", password "HIVE" ); --set spark.sql.shuffle.partitions=200 insert overwrite table resultview select g,count(1) as count from tnet.DT_LIVE_INFO group by g {code} I'm tring to save a spark sql result to oracle. And I found spark will create a jdbc connection for each partition. if the sql create to many partitions , the database can't hold so many connections and return exception. In above situation is 200 because of the "group by" and "spark.sql.shuffle.partitions" the spark source code JdbcUtil is {code} def saveTable( df: DataFrame, url: String, table: String, properties: Properties) { val dialect = JdbcDialects.get(url) val nullTypes: Array[Int] = df.schema.fields.map { field => getJdbcType(field.dataType, dialect).jdbcNullType } val rddSchema = df.schema val getConnection: () => Connection = createConnectionFactory(url, properties) val batchSize = properties.getProperty(JDBC_BATCH_INSERT_SIZE, "1000").toInt df.foreachPartition { iterator => savePartition(getConnection, table, iterator, rddSchema, nullTypes, batchSize, dialect) } } {code} May be we can add a property for df.repartition(num).foreachPartition ? In fact I got an exception "ORA-12519, TNS:no appropriate service handler found" was: {code} CREATE or replace TEMPORARY VIEW resultview USING org.apache.spark.sql.jdbc OPTIONS ( url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB", dbtable "result", user "HIVE", password "HIVE" ); --set spark.sql.shuffle.partitions=200 insert overwrite table resultview select g,count(1) as count from tnet.DT_LIVE_INFO group by g {code} I'm tring to save a spark sql result to oracle. And I found spark will create a jdbc connection for each partition. if the sql create to many partitions , the database can't hold so many connections and return exception. In above situation is 200 because of the "group by" and "spark.sql.shuffle.partitions" the spark source code JdbcUtil is {code} def saveTable( df: DataFrame, url: String, table: String, properties: Properties) { val dialect = JdbcDialects.get(url) val nullTypes: Array[Int] = df.schema.fields.map { field => getJdbcType(field.dataType, dialect).jdbcNullType } val rddSchema = df.schema val getConnection: () => Connection = createConnectionFactory(url, properties) val batchSize = properties.getProperty(JDBC_BATCH_INSERT_SIZE, "1000").toInt df.foreachPartition { iterator => savePartition(getConnection, table, iterator, rddSchema, nullTypes, batchSize, dialect) } } {code} May be we can add a property for df.repartition(num).foreachPartition ? > Add a property to control the number of partitions when save a jdbc rdd > --- > > Key: SPARK-18413 > URL: https://issues.apache.org/jira/browse/SPARK-18413 > Project: Spark > Issue Type: Wish > Components: SQL >Affects Versions: 2.0.1 >Reporter: lichenglin > > {code} > CREATE or replace TEMPORARY VIEW resultview > USING org.apache.spark.sql.jdbc > OPTIONS ( > url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB", > dbtable "result", > user "HIVE", > password "HIVE" > ); > --set spark.sql.shuffle.partitions=200 > insert overwrite table resultview select g,count(1) as count from > tnet.DT_LIVE_INFO group by g > {code} > I'm tring to save a spark sql result to oracle. 
> And I found spark will create a jdbc connection for each partition. > if the sql create to many partitions , the database can't hold so many > connections and return exception. > In above situation is 200 because of the "group by" and > "spark.sql.shuffle.partitions" > the spark source code JdbcUtil is > {code} > def saveTable( > df: DataFrame, > url: String, > table: String, > properties: Properties) { > val dialect = JdbcDialects.get(url) > val nullTypes: Array[Int] = df.schema.fields.map { field => > getJdbcType(field.dataType, dialect).jdbcNullType > } > val rddSchema = df.schema > val getConnection: () => Connection = createConnectionFactory(url, > properties) > val batchSize = properties.getProperty(JDBC_BATCH_INSERT_SIZE, > "1000").toInt > df.foreachPartition { iterator => > savePartition(getConnection, table, iterator, rddSchema, nullTypes, > batchSize, dialect) > } > } > {code} > May be we can add a property for df.repartition(num).foreachPartition ? > In fact I got an exception "ORA-12519,
[jira] [Closed] (SPARK-18413) Add a property to control the number of partitions when save a jdbc rdd
[ https://issues.apache.org/jira/browse/SPARK-18413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lichenglin closed SPARK-18413. -- Resolution: Invalid > Add a property to control the number of partitions when save a jdbc rdd > --- > > Key: SPARK-18413 > URL: https://issues.apache.org/jira/browse/SPARK-18413 > Project: Spark > Issue Type: Wish > Components: SQL >Affects Versions: 2.0.1 >Reporter: lichenglin > > {code} > CREATE or replace TEMPORARY VIEW resultview > USING org.apache.spark.sql.jdbc > OPTIONS ( > url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB", > dbtable "result", > user "HIVE", > password "HIVE" > ); > --set spark.sql.shuffle.partitions=200 > insert overwrite table resultview select g,count(1) as count from > tnet.DT_LIVE_INFO group by g > {code} > I'm tring to save a spark sql result to oracle. > And I found spark will create a jdbc connection for each partition. > if the sql create to many partitions , the database can't hold so many > connections and return exception. > In above situation is 200 because of the "group by" and > "spark.sql.shuffle.partitions" > the spark source code JdbcUtil is > {code} > def saveTable( > df: DataFrame, > url: String, > table: String, > properties: Properties) { > val dialect = JdbcDialects.get(url) > val nullTypes: Array[Int] = df.schema.fields.map { field => > getJdbcType(field.dataType, dialect).jdbcNullType > } > val rddSchema = df.schema > val getConnection: () => Connection = createConnectionFactory(url, > properties) > val batchSize = properties.getProperty(JDBC_BATCH_INSERT_SIZE, > "1000").toInt > df.foreachPartition { iterator => > savePartition(getConnection, table, iterator, rddSchema, nullTypes, > batchSize, dialect) > } > } > {code} > May be we can add a property for df.repartition(num).foreachPartition ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18413) Add a property to control the number of partitions when save a jdbc rdd
lichenglin created SPARK-18413: -- Summary: Add a property to control the number of partitions when save a jdbc rdd Key: SPARK-18413 URL: https://issues.apache.org/jira/browse/SPARK-18413 Project: Spark Issue Type: Wish Components: SQL Affects Versions: 2.0.1 Reporter: lichenglin {code} CREATE or replace TEMPORARY VIEW resultview USING org.apache.spark.sql.jdbc OPTIONS ( url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB", dbtable "result", user "HIVE", password "HIVE" ); --set spark.sql.shuffle.partitions=200 insert overwrite table resultview select g,count(1) as count from tnet.DT_LIVE_INFO group by g {code} I'm trying to save a Spark SQL result to Oracle, and I found Spark will create a JDBC connection for each partition. If the SQL creates too many partitions, the database can't hold so many connections and returns an exception. In the above situation it is 200 because of the "group by" and "spark.sql.shuffle.partitions". The Spark source code in JdbcUtils is {code} def saveTable( df: DataFrame, url: String, table: String, properties: Properties) { val dialect = JdbcDialects.get(url) val nullTypes: Array[Int] = df.schema.fields.map { field => getJdbcType(field.dataType, dialect).jdbcNullType } val rddSchema = df.schema val getConnection: () => Connection = createConnectionFactory(url, properties) val batchSize = properties.getProperty(JDBC_BATCH_INSERT_SIZE, "1000").toInt df.foreachPartition { iterator => savePartition(getConnection, table, iterator, rddSchema, nullTypes, batchSize, dialect) } } {code} Maybe we can add a property for df.repartition(num).foreachPartition? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
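As a workaround while no write-side property exists, the partition count can be capped on the DataFrame itself before the JDBC write. A minimal sketch, assuming the aggregated result is available as a DataFrame `resultDf` (the variable name is illustrative; URL and credentials are taken from the example above):

{code}
import java.util.Properties

val props = new Properties()
props.setProperty("user", "HIVE")
props.setProperty("password", "HIVE")

// Cap the number of concurrent JDBC connections by reducing the number of
// partitions (here 10) before writing; each remaining partition still opens
// its own connection inside saveTable/savePartition.
resultDf
  .coalesce(10)
  .write
  .mode("overwrite")
  .jdbc("jdbc:oracle:thin:@10.129.10.111:1521:BKDB", "result", props)
{code}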
[jira] [Commented] (SPARK-17898) --repositories needs username and password
[ https://issues.apache.org/jira/browse/SPARK-17898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15580857#comment-15580857 ] lichenglin commented on SPARK-17898: I have found a way to declare the username and password: --repositories http://username:passw...@wx.bjdv.com:8081/nexus/content/groups/bigdata/ Maybe we should add this feature to the documentation? > --repositories needs username and password > --- > > Key: SPARK-17898 > URL: https://issues.apache.org/jira/browse/SPARK-17898 > Project: Spark > Issue Type: Wish >Affects Versions: 2.0.1 >Reporter: lichenglin > > My private repositories need username and password to visit. > I can't find a way to declaration the username and password when submit > spark application > {code} > bin/spark-submit --repositories > http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ --packages > com.databricks:spark-csv_2.10:1.2.0 --class > org.apache.spark.examples.SparkPi --master local[8] > examples/jars/spark-examples_2.11-2.0.1.jar 100 > {code} > The rep http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ need username > and password -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17898) --repositories needs username and password
[ https://issues.apache.org/jira/browse/SPARK-17898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573929#comment-15573929 ] lichenglin edited comment on SPARK-17898 at 10/14/16 2:41 AM: -- I know that. But how do I build these dependencies into my jar? Could you give me an example with Gradle? Thanks a lot. I have tried making a jar with mysql-jdbc.jar and changing the manifest's classpath; it failed. was (Author: licl): I know it. But how to build these dependencies into my jar. Could you give me an example with gradle?? Really thanks a lot. > --repositories needs username and password > --- > > Key: SPARK-17898 > URL: https://issues.apache.org/jira/browse/SPARK-17898 > Project: Spark > Issue Type: Wish >Affects Versions: 2.0.1 >Reporter: lichenglin > > My private repositories need username and password to visit. > I can't find a way to declaration the username and password when submit > spark application > {code} > bin/spark-submit --repositories > http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ --packages > com.databricks:spark-csv_2.10:1.2.0 --class > org.apache.spark.examples.SparkPi --master local[8] > examples/jars/spark-examples_2.11-2.0.1.jar 100 > {code} > The rep http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ need username > and password -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17898) --repositories needs username and password
[ https://issues.apache.org/jira/browse/SPARK-17898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573929#comment-15573929 ] lichenglin commented on SPARK-17898: I know it. But how to build these dependencies into my jar. Could you give me an example with gradle?? Really thanks a lot. > --repositories needs username and password > --- > > Key: SPARK-17898 > URL: https://issues.apache.org/jira/browse/SPARK-17898 > Project: Spark > Issue Type: Wish >Affects Versions: 2.0.1 >Reporter: lichenglin > > My private repositories need username and password to visit. > I can't find a way to declaration the username and password when submit > spark application > {code} > bin/spark-submit --repositories > http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ --packages > com.databricks:spark-csv_2.10:1.2.0 --class > org.apache.spark.examples.SparkPi --master local[8] > examples/jars/spark-examples_2.11-2.0.1.jar 100 > {code} > The rep http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ need username > and password -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17898) --repositories needs username and password
lichenglin created SPARK-17898: -- Summary: --repositories needs username and password Key: SPARK-17898 URL: https://issues.apache.org/jira/browse/SPARK-17898 Project: Spark Issue Type: Wish Affects Versions: 2.0.1 Reporter: lichenglin My private repositories need a username and password to access. I can't find a way to declare the username and password when submitting a Spark application. {code} bin/spark-submit --repositories http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ --packages com.databricks:spark-csv_2.10:1.2.0 --class org.apache.spark.examples.SparkPi --master local[8] examples/jars/spark-examples_2.11-2.0.1.jar 100 {code} The repository http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ needs a username and password. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16517) can't add columns on the table create by spark'writer
[ https://issues.apache.org/jira/browse/SPARK-16517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lichenglin updated SPARK-16517: --- Summary: can't add columns on the table create by spark'writer (was: can't add columns on the parquet table ) > can't add columns on the table create by spark'writer > -- > > Key: SPARK-16517 > URL: https://issues.apache.org/jira/browse/SPARK-16517 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2 >Reporter: lichenglin > > {code} > setName("abc"); > HiveContext hive = getHiveContext(); > DataFrame d = hive.createDataFrame( > getJavaSparkContext().parallelize( > > Arrays.asList(RowFactory.create("abc", "abc", 5.0), RowFactory.create("abcd", > "abcd", 5.0))), > DataTypes.createStructType( > > Arrays.asList(DataTypes.createStructField("card_id", DataTypes.StringType, > true), > > DataTypes.createStructField("tag_name", DataTypes.StringType, true), > > DataTypes.createStructField("v", DataTypes.DoubleType, true; > d.write().partitionBy("v").mode(SaveMode.Overwrite).saveAsTable("abc"); > hive.sql("alter table abc add columns(v2 double)"); > hive.refreshTable("abc"); > hive.sql("describe abc").show(); > DataFrame d2 = hive.createDataFrame( > > getJavaSparkContext().parallelize(Arrays.asList(RowFactory.create("abc", > "abc", 3.0, 4.0), > RowFactory.create("abcd", > "abcd", 3.0, 1.0))), > new StructType(new StructField[] { > DataTypes.createStructField("card_id", DataTypes.StringType, true), > > DataTypes.createStructField("tag_name", DataTypes.StringType, true), > > DataTypes.createStructField("v", DataTypes.DoubleType, true), > > DataTypes.createStructField("v2", DataTypes.DoubleType, true) })); > d2.write().partitionBy("v").mode(SaveMode.Append).saveAsTable("abc"); > hive.table("abc").show(); > {code} > spark.sql.parquet.mergeSchema has been set to "true". > The code's exception is here > {code} > ++-+---+ > |col_name|data_type|comment| > ++-+---+ > | card_id| string| | > |tag_name| string| | > | v| double| | > ++-+---+ > 2016-07-13 13:40:43,637 INFO > [org.apache.hadoop.hive.metastore.HiveMetaStore:746] - 0: get_table : > db=default tbl=abc > 2016-07-13 13:40:43,637 INFO > [org.apache.hadoop.hive.metastore.HiveMetaStore.audit:371] - ugi=licl > ip=unknown-ip-addr cmd=get_table : db=default tbl=abc > 2016-07-13 13:40:43,693 INFO [org.apache.spark.storage.BlockManagerInfo:58] - > Removed broadcast_2_piece0 on localhost:50647 in memory (size: 1176.0 B, > free: 1125.7 MB) > 2016-07-13 13:40:43,700 INFO [org.apache.spark.ContextCleaner:58] - Cleaned > accumulator 2 > 2016-07-13 13:40:43,702 INFO [org.apache.spark.storage.BlockManagerInfo:58] - > Removed broadcast_1_piece0 on localhost:50647 in memory (size: 19.4 KB, free: > 1125.7 MB) > 2016-07-13 13:40:43,702 INFO > [org.apache.hadoop.hive.metastore.HiveMetaStore:746] - 0: get_table : > db=default tbl=abc > 2016-07-13 13:40:43,703 INFO > [org.apache.hadoop.hive.metastore.HiveMetaStore.audit:371] - ugi=licl > ip=unknown-ip-addr cmd=get_table : db=default tbl=abc > Exception in thread "main" java.lang.RuntimeException: > Relation[card_id#26,tag_name#27,v#28] ParquetRelation > requires that the query in the SELECT clause of the INSERT INTO/OVERWRITE > statement generates the same number of columns as its schema. 
> at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.execution.datasources.PreInsertCastAndRename$$anonfun$apply$2.applyOrElse(rules.scala:68) > at > org.apache.spark.sql.execution.datasources.PreInsertCastAndRename$$anonfun$apply$2.applyOrElse(rules.scala:58) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:259) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:259) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:258) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:249) > at > org.apache.spark.sql.execution.datasources.PreInsertCastAndRename$.apply
[jira] [Updated] (SPARK-16517) can't add columns on the parquet table
[ https://issues.apache.org/jira/browse/SPARK-16517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lichenglin updated SPARK-16517: --- Summary: can't add columns on the parquet table (was: can't add columns on the table witch column metadata is serializer) > can't add columns on the parquet table > --- > > Key: SPARK-16517 > URL: https://issues.apache.org/jira/browse/SPARK-16517 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2 >Reporter: lichenglin > > {code} > setName("abc"); > HiveContext hive = getHiveContext(); > DataFrame d = hive.createDataFrame( > getJavaSparkContext().parallelize( > > Arrays.asList(RowFactory.create("abc", "abc", 5.0), RowFactory.create("abcd", > "abcd", 5.0))), > DataTypes.createStructType( > > Arrays.asList(DataTypes.createStructField("card_id", DataTypes.StringType, > true), > > DataTypes.createStructField("tag_name", DataTypes.StringType, true), > > DataTypes.createStructField("v", DataTypes.DoubleType, true; > d.write().partitionBy("v").mode(SaveMode.Overwrite).saveAsTable("abc"); > hive.sql("alter table abc add columns(v2 double)"); > hive.refreshTable("abc"); > hive.sql("describe abc").show(); > DataFrame d2 = hive.createDataFrame( > > getJavaSparkContext().parallelize(Arrays.asList(RowFactory.create("abc", > "abc", 3.0, 4.0), > RowFactory.create("abcd", > "abcd", 3.0, 1.0))), > new StructType(new StructField[] { > DataTypes.createStructField("card_id", DataTypes.StringType, true), > > DataTypes.createStructField("tag_name", DataTypes.StringType, true), > > DataTypes.createStructField("v", DataTypes.DoubleType, true), > > DataTypes.createStructField("v2", DataTypes.DoubleType, true) })); > d2.write().partitionBy("v").mode(SaveMode.Append).saveAsTable("abc"); > hive.table("abc").show(); > {code} > spark.sql.parquet.mergeSchema has been set to "true". > The code's exception is here > {code} > ++-+---+ > |col_name|data_type|comment| > ++-+---+ > | card_id| string| | > |tag_name| string| | > | v| double| | > ++-+---+ > 2016-07-13 13:40:43,637 INFO > [org.apache.hadoop.hive.metastore.HiveMetaStore:746] - 0: get_table : > db=default tbl=abc > 2016-07-13 13:40:43,637 INFO > [org.apache.hadoop.hive.metastore.HiveMetaStore.audit:371] - ugi=licl > ip=unknown-ip-addr cmd=get_table : db=default tbl=abc > 2016-07-13 13:40:43,693 INFO [org.apache.spark.storage.BlockManagerInfo:58] - > Removed broadcast_2_piece0 on localhost:50647 in memory (size: 1176.0 B, > free: 1125.7 MB) > 2016-07-13 13:40:43,700 INFO [org.apache.spark.ContextCleaner:58] - Cleaned > accumulator 2 > 2016-07-13 13:40:43,702 INFO [org.apache.spark.storage.BlockManagerInfo:58] - > Removed broadcast_1_piece0 on localhost:50647 in memory (size: 19.4 KB, free: > 1125.7 MB) > 2016-07-13 13:40:43,702 INFO > [org.apache.hadoop.hive.metastore.HiveMetaStore:746] - 0: get_table : > db=default tbl=abc > 2016-07-13 13:40:43,703 INFO > [org.apache.hadoop.hive.metastore.HiveMetaStore.audit:371] - ugi=licl > ip=unknown-ip-addr cmd=get_table : db=default tbl=abc > Exception in thread "main" java.lang.RuntimeException: > Relation[card_id#26,tag_name#27,v#28] ParquetRelation > requires that the query in the SELECT clause of the INSERT INTO/OVERWRITE > statement generates the same number of columns as its schema. 
> at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.execution.datasources.PreInsertCastAndRename$$anonfun$apply$2.applyOrElse(rules.scala:68) > at > org.apache.spark.sql.execution.datasources.PreInsertCastAndRename$$anonfun$apply$2.applyOrElse(rules.scala:58) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:259) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:259) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:258) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:249) > at > org.apache.spark.sql.execution.datasources.PreInsertCastAndRename$.apply(rules.scala:58) >
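The stack trace above suggests that, even after refreshTable, the cached ParquetRelation still carries the original three-column schema, so appending the four-column data fails the column-count check. A hedged workaround sketch (not a fix for the underlying bug): skip "alter table add columns" altogether, pad the existing rows with a null v2, and rewrite the widened table in one pass. It assumes the HiveContext hive and the DataFrame d2 from the report; the target table name abc_wide is made up for illustration.
{code}
import static org.apache.spark.sql.functions.lit;

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.types.DataTypes;

// Workaround sketch only: widen the old data instead of altering the table in place.
DataFrame old = hive.table("abc");                              // card_id, tag_name, v
DataFrame widened = old
    .withColumn("v2", lit(null).cast(DataTypes.DoubleType))     // existing rows get v2 = null
    .unionAll(d2.select("card_id", "tag_name", "v", "v2"));     // positional union with the new rows
widened.write().partitionBy("v").mode(SaveMode.Overwrite)
    .saveAsTable("abc_wide");                                   // write to a new table rather than overwriting "abc" while reading it
{code}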
[jira] [Updated] (SPARK-16517) can't add columns on the table witch column metadata is serializer
[ https://issues.apache.org/jira/browse/SPARK-16517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lichenglin updated SPARK-16517: --- Description: {code} setName("abc"); HiveContext hive = getHiveContext(); DataFrame d = hive.createDataFrame( getJavaSparkContext().parallelize( Arrays.asList(RowFactory.create("abc", "abc", 5.0), RowFactory.create("abcd", "abcd", 5.0))), DataTypes.createStructType( Arrays.asList(DataTypes.createStructField("card_id", DataTypes.StringType, true), DataTypes.createStructField("tag_name", DataTypes.StringType, true), DataTypes.createStructField("v", DataTypes.DoubleType, true; d.write().partitionBy("v").mode(SaveMode.Overwrite).saveAsTable("abc"); hive.sql("alter table abc add columns(v2 double)"); hive.refreshTable("abc"); hive.sql("describe abc").show(); DataFrame d2 = hive.createDataFrame( getJavaSparkContext().parallelize(Arrays.asList(RowFactory.create("abc", "abc", 3.0, 4.0), RowFactory.create("abcd", "abcd", 3.0, 1.0))), new StructType(new StructField[] { DataTypes.createStructField("card_id", DataTypes.StringType, true), DataTypes.createStructField("tag_name", DataTypes.StringType, true), DataTypes.createStructField("v", DataTypes.DoubleType, true), DataTypes.createStructField("v2", DataTypes.DoubleType, true) })); d2.write().partitionBy("v").mode(SaveMode.Append).saveAsTable("abc"); hive.table("abc").show(); {code} spark.sql.parquet.mergeSchema has been set to "true". The code's exception is here {code} ++-+---+ |col_name|data_type|comment| ++-+---+ | card_id| string| | |tag_name| string| | | v| double| | ++-+---+ 2016-07-13 13:40:43,637 INFO [org.apache.hadoop.hive.metastore.HiveMetaStore:746] - 0: get_table : db=default tbl=abc 2016-07-13 13:40:43,637 INFO [org.apache.hadoop.hive.metastore.HiveMetaStore.audit:371] - ugi=licl ip=unknown-ip-addr cmd=get_table : db=default tbl=abc 2016-07-13 13:40:43,693 INFO [org.apache.spark.storage.BlockManagerInfo:58] - Removed broadcast_2_piece0 on localhost:50647 in memory (size: 1176.0 B, free: 1125.7 MB) 2016-07-13 13:40:43,700 INFO [org.apache.spark.ContextCleaner:58] - Cleaned accumulator 2 2016-07-13 13:40:43,702 INFO [org.apache.spark.storage.BlockManagerInfo:58] - Removed broadcast_1_piece0 on localhost:50647 in memory (size: 19.4 KB, free: 1125.7 MB) 2016-07-13 13:40:43,702 INFO [org.apache.hadoop.hive.metastore.HiveMetaStore:746] - 0: get_table : db=default tbl=abc 2016-07-13 13:40:43,703 INFO [org.apache.hadoop.hive.metastore.HiveMetaStore.audit:371] - ugi=licl ip=unknown-ip-addr cmd=get_table : db=default tbl=abc Exception in thread "main" java.lang.RuntimeException: Relation[card_id#26,tag_name#27,v#28] ParquetRelation requires that the query in the SELECT clause of the INSERT INTO/OVERWRITE statement generates the same number of columns as its schema. 
at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.execution.datasources.PreInsertCastAndRename$$anonfun$apply$2.applyOrElse(rules.scala:68) at org.apache.spark.sql.execution.datasources.PreInsertCastAndRename$$anonfun$apply$2.applyOrElse(rules.scala:58) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:259) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:259) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:258) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:249) at org.apache.spark.sql.execution.datasources.PreInsertCastAndRename$.apply(rules.scala:58) at org.apache.spark.sql.execution.datasources.PreInsertCastAndRename$.apply(rules.scala:57) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80) at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111) at scala.collection.immutable.List.foldLeft(List.scala:84) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun
[jira] [Created] (SPARK-16517) can't add columns on the table witch column metadata is serializer
lichenglin created SPARK-16517: -- Summary: can't add columns on the table witch column metadata is serializer Key: SPARK-16517 URL: https://issues.apache.org/jira/browse/SPARK-16517 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.2 Reporter: lichenglin {code} setName("abc"); HiveContext hive = getHiveContext(); DataFrame d = hive.createDataFrame( getJavaSparkContext().parallelize( Arrays.asList(RowFactory.create("abc", "abc", 5.0), RowFactory.create("abcd", "abcd", 5.0))), DataTypes.createStructType( Arrays.asList(DataTypes.createStructField("card_id", DataTypes.StringType, true), DataTypes.createStructField("tag_name", DataTypes.StringType, true), DataTypes.createStructField("v", DataTypes.DoubleType, true; d.write().partitionBy("v").mode(SaveMode.Overwrite).saveAsTable("abc"); hive.sql("alter table abc add columns(v2 double)"); hive.refreshTable("abc"); hive.sql("describe abc").show(); DataFrame d2 = hive.createDataFrame( getJavaSparkContext().parallelize(Arrays.asList(RowFactory.create("abc", "abc", 3.0, 4.0), RowFactory.create("abcd", "abcd", 3.0, 1.0))), new StructType(new StructField[] { DataTypes.createStructField("card_id", DataTypes.StringType, true), DataTypes.createStructField("tag_name", DataTypes.StringType, true), DataTypes.createStructField("v", DataTypes.DoubleType, true), DataTypes.createStructField("v2", DataTypes.DoubleType, true) })); d2.write().partitionBy("v").mode(SaveMode.Append).saveAsTable("abc"); hive.table("abc").show(); {code} spark.sql.parquet.mergeSchema has been set to "true". The code's exception is here {code} ++-+---+ |col_name|data_type|comment| ++-+---+ | card_id| string| | |tag_name| string| | | v| double| | ++-+---+ 2016-07-13 13:40:43,637 INFO [org.apache.hadoop.hive.metastore.HiveMetaStore:746] - 0: get_table : db=default tbl=abc 2016-07-13 13:40:43,637 INFO [org.apache.hadoop.hive.metastore.HiveMetaStore.audit:371] - ugi=licl ip=unknown-ip-addr cmd=get_table : db=default tbl=abc 2016-07-13 13:40:43,693 INFO [org.apache.spark.storage.BlockManagerInfo:58] - Removed broadcast_2_piece0 on localhost:50647 in memory (size: 1176.0 B, free: 1125.7 MB) 2016-07-13 13:40:43,700 INFO [org.apache.spark.ContextCleaner:58] - Cleaned accumulator 2 2016-07-13 13:40:43,702 INFO [org.apache.spark.storage.BlockManagerInfo:58] - Removed broadcast_1_piece0 on localhost:50647 in memory (size: 19.4 KB, free: 1125.7 MB) 2016-07-13 13:40:43,702 INFO [org.apache.hadoop.hive.metastore.HiveMetaStore:746] - 0: get_table : db=default tbl=abc 2016-07-13 13:40:43,703 INFO [org.apache.hadoop.hive.metastore.HiveMetaStore.audit:371] - ugi=licl ip=unknown-ip-addr cmd=get_table : db=default tbl=abc Exception in thread "main" java.lang.RuntimeException: Relation[card_id#26,tag_name#27,v#28] ParquetRelation requires that the query in the SELECT clause of the INSERT INTO/OVERWRITE statement generates the same number of columns as its schema. 
at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.execution.datasources.PreInsertCastAndRename$$anonfun$apply$2.applyOrElse(rules.scala:68) at org.apache.spark.sql.execution.datasources.PreInsertCastAndRename$$anonfun$apply$2.applyOrElse(rules.scala:58) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:259) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:259) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:258) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:249) at org.apache.spark.sql.execution.datasources.PreInsertCastAndRename$.apply(rules.scala:58) at org.apache.spark.sql.execution.datasources.PreInsertCastAndRename$.apply(rules.scala:57) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83) at o
[jira] [Commented] (SPARK-16361) It takes a long time for gc when building cube with many fields
[ https://issues.apache.org/jira/browse/SPARK-16361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15362180#comment-15362180 ] lichenglin commented on SPARK-16361: I think a cube with just about 10 fields is familiar in OLAP system. Is there anything that we can do to reduce the memory for building cube?? > It takes a long time for gc when building cube with many fields > > > Key: SPARK-16361 > URL: https://issues.apache.org/jira/browse/SPARK-16361 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.2 >Reporter: lichenglin > > I'm using spark to build cube on a dataframe with 1m data. > I found that when I add too many fields (about 8 or above) > the worker takes a lot of time for GC. > I try to increase the memory of each worker but it not work well. > but I don't know why,sorry. > here is my simple code and monitoring > Cuber is a util class for building cube. > {code:title=Bar.java|borderStyle=solid} > sqlContext.udf().register("jidu", (Integer f) -> { > return (f - 1) / 3 + 1; > } , DataTypes.IntegerType); > DataFrame d = > sqlContext.table("dw.dw_cust_info").selectExpr("*", "cast (CUST_AGE as > double) as c_age", > "month(day) as month", "year(day) as year", > "cast ((datediff(now(),INTIME)/365+1) as int ) as zwsc", > "jidu(month(day)) as jidu"); > Bucketizer b = new > Bucketizer().setInputCol("c_age").setSplits(new double[] { > Double.NEGATIVE_INFINITY, 0, 10, > 20, 30, 40, 50, 60, 70, 80, 90, 100, > Double.POSITIVE_INFINITY }).setOutputCol("age"); > DataFrame cube = new Cuber(b.transform(d)) > .addFields("day", "AREA_CODE", "CUST_TYPE", > "age", "zwsc", "month", "jidu", "year","SUBTYPE").max("age") > .min("age").sum("zwsc").count().buildcube(); > > cube.write().mode(SaveMode.Overwrite).saveAsTable("dt.cuberdemo"); > {code} > Summary Metrics for 12 Completed Tasks > MetricMin 25th percentile Median 75th percentile Max > Duration 2.6 min 2.7 min 2.7 min 2.7 min 2.7 min > GC Time 1.6 min 1.6 min 1.6 min 1.6 min 1.6 min > Shuffle Read Size / Records 728.4 KB / 21886736.6 KB / 22258 > 738.7 KB / 22387746.6 KB / 22542748.6 KB / 22783 > Shuffle Write Size / Records 74.3 MB / 1926282 75.8 MB / 1965860 > 76.2 MB / 1976004 76.4 MB / 1981516 77.9 MB / 2021142 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
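A full cube over n dimensions expands every input row into 2^n grouping-set combinations before aggregation, so memory pressure grows quickly with each added field no matter how large the executors are. A hedged sketch of one alternative, assuming the bucketized DataFrame from the report (b.transform(d)) and using Spark SQL's built-in cube() so all measures are computed in a single aggregation pass; whether this actually lowers GC time for ~10 dimensions would need to be measured.
{code}
import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SaveMode;

// Sketch: built-in CUBE with all measures in one pass, instead of a custom Cuber.
DataFrame bucketized = b.transform(d);  // `b` and `d` as defined in the report
DataFrame cube = bucketized
    .cube("day", "AREA_CODE", "CUST_TYPE", "age", "zwsc", "month", "jidu", "year", "SUBTYPE")
    .agg(max("age"), min("age"), sum("zwsc"), count(lit(1)).alias("cnt"));
cube.write().mode(SaveMode.Overwrite).saveAsTable("dt.cuberdemo");
{code}
If only a few of the 2^9 = 512 combinations are actually queried, computing those grouping sets explicitly (or a rollup) instead of the full cube shrinks the intermediate data further.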
[jira] [Issue Comment Deleted] (SPARK-16361) It takes a long time for gc when building cube with many fields
[ https://issues.apache.org/jira/browse/SPARK-16361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lichenglin updated SPARK-16361: --- Comment: was deleted (was: I have set master url in java application. here is a copy from spark master's ui. Completed Applications Application ID NameCores Memory per Node Submitted Time UserState Duration app-20160704175221-0090 no-name 12 40.0 GB 2016/07/04 17:52:21 root KILLED 2.1 h app-20160704174141-0089 no-name 12 20.0 GB 2016/07/04 17:41:41 root KILLED 10 min I add another two fields into the cube. the jobs both crash down.) > It takes a long time for gc when building cube with many fields > > > Key: SPARK-16361 > URL: https://issues.apache.org/jira/browse/SPARK-16361 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.2 >Reporter: lichenglin > > I'm using spark to build cube on a dataframe with 1m data. > I found that when I add too many fields (about 8 or above) > the worker takes a lot of time for GC. > I try to increase the memory of each worker but it not work well. > but I don't know why,sorry. > here is my simple code and monitoring > Cuber is a util class for building cube. > {code:title=Bar.java|borderStyle=solid} > sqlContext.udf().register("jidu", (Integer f) -> { > return (f - 1) / 3 + 1; > } , DataTypes.IntegerType); > DataFrame d = > sqlContext.table("dw.dw_cust_info").selectExpr("*", "cast (CUST_AGE as > double) as c_age", > "month(day) as month", "year(day) as year", > "cast ((datediff(now(),INTIME)/365+1) as int ) as zwsc", > "jidu(month(day)) as jidu"); > Bucketizer b = new > Bucketizer().setInputCol("c_age").setSplits(new double[] { > Double.NEGATIVE_INFINITY, 0, 10, > 20, 30, 40, 50, 60, 70, 80, 90, 100, > Double.POSITIVE_INFINITY }).setOutputCol("age"); > DataFrame cube = new Cuber(b.transform(d)) > .addFields("day", "AREA_CODE", "CUST_TYPE", > "age", "zwsc", "month", "jidu", "year","SUBTYPE").max("age") > .min("age").sum("zwsc").count().buildcube(); > > cube.write().mode(SaveMode.Overwrite).saveAsTable("dt.cuberdemo"); > {code} > Summary Metrics for 12 Completed Tasks > MetricMin 25th percentile Median 75th percentile Max > Duration 2.6 min 2.7 min 2.7 min 2.7 min 2.7 min > GC Time 1.6 min 1.6 min 1.6 min 1.6 min 1.6 min > Shuffle Read Size / Records 728.4 KB / 21886736.6 KB / 22258 > 738.7 KB / 22387746.6 KB / 22542748.6 KB / 22783 > Shuffle Write Size / Records 74.3 MB / 1926282 75.8 MB / 1965860 > 76.2 MB / 1976004 76.4 MB / 1981516 77.9 MB / 2021142 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16361) It takes a long time for gc when building cube with many fields
[ https://issues.apache.org/jira/browse/SPARK-16361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15361869#comment-15361869 ] lichenglin commented on SPARK-16361: I have set master url in java application. here is a copy from spark master's ui. Completed Applications Application ID NameCores Memory per Node Submitted Time UserState Duration app-20160704175221-0090 no-name 12 40.0 GB 2016/07/04 17:52:21 root KILLED 2.1 h app-20160704174141-0089 no-name 12 20.0 GB 2016/07/04 17:41:41 root KILLED 10 min I add another two fields into the cube. the jobs both crash down. > It takes a long time for gc when building cube with many fields > > > Key: SPARK-16361 > URL: https://issues.apache.org/jira/browse/SPARK-16361 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.2 >Reporter: lichenglin > > I'm using spark to build cube on a dataframe with 1m data. > I found that when I add too many fields (about 8 or above) > the worker takes a lot of time for GC. > I try to increase the memory of each worker but it not work well. > but I don't know why,sorry. > here is my simple code and monitoring > Cuber is a util class for building cube. > {code:title=Bar.java|borderStyle=solid} > sqlContext.udf().register("jidu", (Integer f) -> { > return (f - 1) / 3 + 1; > } , DataTypes.IntegerType); > DataFrame d = > sqlContext.table("dw.dw_cust_info").selectExpr("*", "cast (CUST_AGE as > double) as c_age", > "month(day) as month", "year(day) as year", > "cast ((datediff(now(),INTIME)/365+1) as int ) as zwsc", > "jidu(month(day)) as jidu"); > Bucketizer b = new > Bucketizer().setInputCol("c_age").setSplits(new double[] { > Double.NEGATIVE_INFINITY, 0, 10, > 20, 30, 40, 50, 60, 70, 80, 90, 100, > Double.POSITIVE_INFINITY }).setOutputCol("age"); > DataFrame cube = new Cuber(b.transform(d)) > .addFields("day", "AREA_CODE", "CUST_TYPE", > "age", "zwsc", "month", "jidu", "year","SUBTYPE").max("age") > .min("age").sum("zwsc").count().buildcube(); > > cube.write().mode(SaveMode.Overwrite).saveAsTable("dt.cuberdemo"); > {code} > Summary Metrics for 12 Completed Tasks > MetricMin 25th percentile Median 75th percentile Max > Duration 2.6 min 2.7 min 2.7 min 2.7 min 2.7 min > GC Time 1.6 min 1.6 min 1.6 min 1.6 min 1.6 min > Shuffle Read Size / Records 728.4 KB / 21886736.6 KB / 22258 > 738.7 KB / 22387746.6 KB / 22542748.6 KB / 22783 > Shuffle Write Size / Records 74.3 MB / 1926282 75.8 MB / 1965860 > 76.2 MB / 1976004 76.4 MB / 1981516 77.9 MB / 2021142 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16361) It takes a long time for gc when building cube with many fields
[ https://issues.apache.org/jira/browse/SPARK-16361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15361117#comment-15361117 ] lichenglin commented on SPARK-16361: GenerateUnsafeProjection: Code generated in 4.012162 ms The worker has a lot of log like this Is it important?? > It takes a long time for gc when building cube with many fields > > > Key: SPARK-16361 > URL: https://issues.apache.org/jira/browse/SPARK-16361 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.2 >Reporter: lichenglin > > I'm using spark to build cube on a dataframe with 1m data. > I found that when I add too many fields (about 8 or above) > the worker takes a lot of time for GC. > I try to increase the memory of each worker but it not work well. > but I don't know why,sorry. > here is my simple code and monitoring > Cuber is a util class for building cube. > {code:title=Bar.java|borderStyle=solid} > sqlContext.udf().register("jidu", (Integer f) -> { > return (f - 1) / 3 + 1; > } , DataTypes.IntegerType); > DataFrame d = > sqlContext.table("dw.dw_cust_info").selectExpr("*", "cast (CUST_AGE as > double) as c_age", > "month(day) as month", "year(day) as year", > "cast ((datediff(now(),INTIME)/365+1) as int ) as zwsc", > "jidu(month(day)) as jidu"); > Bucketizer b = new > Bucketizer().setInputCol("c_age").setSplits(new double[] { > Double.NEGATIVE_INFINITY, 0, 10, > 20, 30, 40, 50, 60, 70, 80, 90, 100, > Double.POSITIVE_INFINITY }).setOutputCol("age"); > DataFrame cube = new Cuber(b.transform(d)) > .addFields("day", "AREA_CODE", "CUST_TYPE", > "age", "zwsc", "month", "jidu", "year","SUBTYPE").max("age") > .min("age").sum("zwsc").count().buildcube(); > > cube.write().mode(SaveMode.Overwrite).saveAsTable("dt.cuberdemo"); > {code} > Summary Metrics for 12 Completed Tasks > MetricMin 25th percentile Median 75th percentile Max > Duration 2.6 min 2.7 min 2.7 min 2.7 min 2.7 min > GC Time 1.6 min 1.6 min 1.6 min 1.6 min 1.6 min > Shuffle Read Size / Records 728.4 KB / 21886736.6 KB / 22258 > 738.7 KB / 22387746.6 KB / 22542748.6 KB / 22783 > Shuffle Write Size / Records 74.3 MB / 1926282 75.8 MB / 1965860 > 76.2 MB / 1976004 76.4 MB / 1981516 77.9 MB / 2021142 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
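"Code generated in ... ms" is routine INFO output from the Catalyst expression code generator, not an error by itself. If the lines clutter the logs, the driver-side level can be raised; this is only a sketch and only affects the driver JVM (executors follow the log4j.properties shipped with the application), and jsc stands for an assumed JavaSparkContext.
{code}
// Sketch: silence INFO-level noise such as codegen timing lines on the driver.
jsc.setLogLevel("WARN");
{code}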
[jira] [Commented] (SPARK-16361) It takes a long time for gc when building cube with many fields
[ https://issues.apache.org/jira/browse/SPARK-16361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15361112#comment-15361112 ] lichenglin commented on SPARK-16361: Here is my whole setting {code} spark.local.dir /home/sparktmp #spark.executor.cores4 spark.sql.parquet.cacheMetadata false spark.port.maxRetries5000 spark.kryoserializer.buffer.max 1024M spark.kryoserializer.buffer 5M spark.master spark://agent170:7077 spark.eventLog.enabled true spark.eventLog.dir hdfs://agent170:9000/sparklog spark.serializer org.apache.spark.serializer.KryoSerializer spark.executor.memory4g spark.driver.memory 2g #spark.executor.extraJavaOptions=-XX:ReservedCodeCacheSize=256m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps spark.executor.extraClassPath=/home/hadoop/spark-1.6.1-bin-hadoop2.6/extlib/mysql-connector-java-5.1.34.jar:/home/hadoop/spark-1.6.1-bin-hadoop2.6/extlib/oracle-driver.jar:/home/hadoop/spark-1.6.1-bin-hadoop2.6/extlib/phoenix-4.6.0-HBase-1.1-client-whithoutlib-thriserver-fastxml.jar:/home/hadoop/spark-1.6.1-bin-hadoop2.6/extlib/phoenix-spark-4.6.0-HBase-1.1.jar:/home/hadoop/spark-1.6.1-bin-hadoop2.6/extlib/spark-csv_2.10-1.3.0.jar spark.driver.extraClassPath=/home/hadoop/spark-1.6.1-bin-hadoop2.6/extlib/mysql-connector-java-5.1.34.jar:/home/hadoop/spark-1.6.1-bin-hadoop2.6/extlib/oracle-driver.jar:/home/hadoop/spark-1.6.1-bin-hadoop2.6/extlib/phoenix-4.6.0-HBase-1.1-client-whithoutlib-thriserver-fastxml.jar:/home/hadoop/spark-1.6.1-bin-hadoop2.6/extlib/phoenix-spark-4.6.0-HBase-1.1.jar:/home/hadoop/spark-1.6.1-bin-hadoop2.6/extlib/spark-csv_2.10-1.3.0.jar {code} {code} export SPARK_WORKER_MEMORY=50g export SPARK_MASTER_OPTS=-Xmx4096m export HADOOP_CONF_DIR=/home/hadoop/hadoop-2.6.0 export HADOOP_HOME=/home/hadoop/hadoop-2.6.0 export SPARK_HISTORY_OPTS=-Dspark.history.fs.logDirectory=hdfs://agent170:9000/sparklog {code} here is my command /home/hadoop/spark-1.6.1-bin-hadoop2.6/bin/spark-submit --executor-memory 40g --executor-cores 12 --class com.bjdv.spark.job.cube.CubeDemo /home/hadoop/lib/licl/sparkjob.jar 2016-07-01 > It takes a long time for gc when building cube with many fields > > > Key: SPARK-16361 > URL: https://issues.apache.org/jira/browse/SPARK-16361 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.2 >Reporter: lichenglin > > I'm using spark to build cube on a dataframe with 1m data. > I found that when I add too many fields (about 8 or above) > the worker takes a lot of time for GC. > I try to increase the memory of each worker but it not work well. > but I don't know why,sorry. > here is my simple code and monitoring > Cuber is a util class for building cube. 
> {code:title=Bar.java|borderStyle=solid} > sqlContext.udf().register("jidu", (Integer f) -> { > return (f - 1) / 3 + 1; > } , DataTypes.IntegerType); > DataFrame d = > sqlContext.table("dw.dw_cust_info").selectExpr("*", "cast (CUST_AGE as > double) as c_age", > "month(day) as month", "year(day) as year", > "cast ((datediff(now(),INTIME)/365+1) as int ) as zwsc", > "jidu(month(day)) as jidu"); > Bucketizer b = new > Bucketizer().setInputCol("c_age").setSplits(new double[] { > Double.NEGATIVE_INFINITY, 0, 10, > 20, 30, 40, 50, 60, 70, 80, 90, 100, > Double.POSITIVE_INFINITY }).setOutputCol("age"); > DataFrame cube = new Cuber(b.transform(d)) > .addFields("day", "AREA_CODE", "CUST_TYPE", > "age", "zwsc", "month", "jidu", "year","SUBTYPE").max("age") > .min("age").sum("zwsc").count().buildcube(); > > cube.write().mode(SaveMode.Overwrite).saveAsTable("dt.cuberdemo"); > {code} > Summary Metrics for 12 Completed Tasks > MetricMin 25th percentile Median 75th percentile Max > Duration 2.6 min 2.7 min 2.7 min 2.7 min 2.7 min > GC Time 1.6 min 1.6 min 1.6 min 1.6 min 1.6 min > Shuffle Read Size / Records 728.4 KB / 21886736.6 KB / 22258 > 738.7 KB / 22387746.6 KB / 22542748.6 KB / 22783 > Shuffle Write Size / Records 74.3 MB / 1926282 75.8 MB / 1965860 > 76.2 MB / 1976004 76.4 MB / 1981516 77.9 MB / 2021142 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
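Since each task above spends roughly 1.6 of its 2.7 minutes in GC with a single 40 GB, 12-core executor, one thing that may be worth trying is several smaller executors plus G1GC, so each heap stays modest. The values below are illustrative only, not tuned recommendations.
{code}
/home/hadoop/spark-1.6.1-bin-hadoop2.6/bin/spark-submit \
  --executor-memory 10g --executor-cores 3 \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  --class com.bjdv.spark.job.cube.CubeDemo /home/hadoop/lib/licl/sparkjob.jar 2016-07-01
{code}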
[jira] [Commented] (SPARK-16361) It takes a long time for gc when building cube with many fields
[ https://issues.apache.org/jira/browse/SPARK-16361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15361073#comment-15361073 ] lichenglin commented on SPARK-16361: The data'size is 1 million. I'm sure that 40 GB memory is faster than the job with 20 GB. But building a cube with 1 million data need more than 40 GB memory to reduce the GC TIME. It's really not cool. > It takes a long time for gc when building cube with many fields > > > Key: SPARK-16361 > URL: https://issues.apache.org/jira/browse/SPARK-16361 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.2 >Reporter: lichenglin > > I'm using spark to build cube on a dataframe with 1m data. > I found that when I add too many fields (about 8 or above) > the worker takes a lot of time for GC. > I try to increase the memory of each worker but it not work well. > but I don't know why,sorry. > here is my simple code and monitoring > Cuber is a util class for building cube. > {code:title=Bar.java|borderStyle=solid} > sqlContext.udf().register("jidu", (Integer f) -> { > return (f - 1) / 3 + 1; > } , DataTypes.IntegerType); > DataFrame d = > sqlContext.table("dw.dw_cust_info").selectExpr("*", "cast (CUST_AGE as > double) as c_age", > "month(day) as month", "year(day) as year", > "cast ((datediff(now(),INTIME)/365+1) as int ) as zwsc", > "jidu(month(day)) as jidu"); > Bucketizer b = new > Bucketizer().setInputCol("c_age").setSplits(new double[] { > Double.NEGATIVE_INFINITY, 0, 10, > 20, 30, 40, 50, 60, 70, 80, 90, 100, > Double.POSITIVE_INFINITY }).setOutputCol("age"); > DataFrame cube = new Cuber(b.transform(d)) > .addFields("day", "AREA_CODE", "CUST_TYPE", > "age", "zwsc", "month", "jidu", "year","SUBTYPE").max("age") > .min("age").sum("zwsc").count().buildcube(); > > cube.write().mode(SaveMode.Overwrite).saveAsTable("dt.cuberdemo"); > {code} > Summary Metrics for 12 Completed Tasks > MetricMin 25th percentile Median 75th percentile Max > Duration 2.6 min 2.7 min 2.7 min 2.7 min 2.7 min > GC Time 1.6 min 1.6 min 1.6 min 1.6 min 1.6 min > Shuffle Read Size / Records 728.4 KB / 21886736.6 KB / 22258 > 738.7 KB / 22387746.6 KB / 22542748.6 KB / 22783 > Shuffle Write Size / Records 74.3 MB / 1926282 75.8 MB / 1965860 > 76.2 MB / 1976004 76.4 MB / 1981516 77.9 MB / 2021142 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16361) It takes a long time for gc when building cube with many fields
[ https://issues.apache.org/jira/browse/SPARK-16361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15361039#comment-15361039 ] lichenglin edited comment on SPARK-16361 at 7/4/16 9:03 AM: "A long time" means the gctime/Duration of each task. you can find it in the monitoring server in some stage. Every fields I add, this percent increase too ,until stuck the whole job. My data's size is about 1 million, 1 node with 16 cores and 64 GB memory. I have increase the memory of executor from 20 GB to 40 GB but not work well. was (Author: licl): "A long time" means the gctime/Duration of each task. you can find it in the monitoring server. Every fields I add, this percent increase too ,until stuck the whole job. My data's size is about 1 million, 1 node with 16 cores and 64 GB memory. I have increase the memory of executor from 20 GB to 40 GB but not work well. > It takes a long time for gc when building cube with many fields > > > Key: SPARK-16361 > URL: https://issues.apache.org/jira/browse/SPARK-16361 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.2 >Reporter: lichenglin > > I'm using spark to build cube on a dataframe with 1m data. > I found that when I add too many fields (about 8 or above) > the worker takes a lot of time for GC. > I try to increase the memory of each worker but it not work well. > but I don't know why,sorry. > here is my simple code and monitoring > Cuber is a util class for building cube. > {code:title=Bar.java|borderStyle=solid} > sqlContext.udf().register("jidu", (Integer f) -> { > return (f - 1) / 3 + 1; > } , DataTypes.IntegerType); > DataFrame d = > sqlContext.table("dw.dw_cust_info").selectExpr("*", "cast (CUST_AGE as > double) as c_age", > "month(day) as month", "year(day) as year", > "cast ((datediff(now(),INTIME)/365+1) as int ) as zwsc", > "jidu(month(day)) as jidu"); > Bucketizer b = new > Bucketizer().setInputCol("c_age").setSplits(new double[] { > Double.NEGATIVE_INFINITY, 0, 10, > 20, 30, 40, 50, 60, 70, 80, 90, 100, > Double.POSITIVE_INFINITY }).setOutputCol("age"); > DataFrame cube = new Cuber(b.transform(d)) > .addFields("day", "AREA_CODE", "CUST_TYPE", > "age", "zwsc", "month", "jidu", "year","SUBTYPE").max("age") > .min("age").sum("zwsc").count().buildcube(); > > cube.write().mode(SaveMode.Overwrite).saveAsTable("dt.cuberdemo"); > {code} > Summary Metrics for 12 Completed Tasks > MetricMin 25th percentile Median 75th percentile Max > Duration 2.6 min 2.7 min 2.7 min 2.7 min 2.7 min > GC Time 1.6 min 1.6 min 1.6 min 1.6 min 1.6 min > Shuffle Read Size / Records 728.4 KB / 21886736.6 KB / 22258 > 738.7 KB / 22387746.6 KB / 22542748.6 KB / 22783 > Shuffle Write Size / Records 74.3 MB / 1926282 75.8 MB / 1965860 > 76.2 MB / 1976004 76.4 MB / 1981516 77.9 MB / 2021142 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16361) It takes a long time for gc when building cube with many fields
[ https://issues.apache.org/jira/browse/SPARK-16361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15361039#comment-15361039 ] lichenglin commented on SPARK-16361: "A long time" means the GC time / duration ratio of each task; you can find it in the monitoring UI. With every field I add, this ratio increases as well, until the whole job gets stuck. My data size is about 1 million rows, on 1 node with 16 cores and 64 GB of memory. I have increased the executor memory from 20 GB to 40 GB, but it does not help much. > It takes a long time for gc when building cube with many fields > > > Key: SPARK-16361 > URL: https://issues.apache.org/jira/browse/SPARK-16361 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.2 >Reporter: lichenglin > > I'm using spark to build cube on a dataframe with 1m data. > I found that when I add too many fields (about 8 or above) > the worker takes a lot of time for GC. > I try to increase the memory of each worker but it not work well. > but I don't know why,sorry. > here is my simple code and monitoring > Cuber is a util class for building cube. > {code:title=Bar.java|borderStyle=solid} > sqlContext.udf().register("jidu", (Integer f) -> { > return (f - 1) / 3 + 1; > } , DataTypes.IntegerType); > DataFrame d = > sqlContext.table("dw.dw_cust_info").selectExpr("*", "cast (CUST_AGE as > double) as c_age", > "month(day) as month", "year(day) as year", > "cast ((datediff(now(),INTIME)/365+1) as int ) as zwsc", > "jidu(month(day)) as jidu"); > Bucketizer b = new > Bucketizer().setInputCol("c_age").setSplits(new double[] { > Double.NEGATIVE_INFINITY, 0, 10, > 20, 30, 40, 50, 60, 70, 80, 90, 100, > Double.POSITIVE_INFINITY }).setOutputCol("age"); > DataFrame cube = new Cuber(b.transform(d)) > .addFields("day", "AREA_CODE", "CUST_TYPE", > "age", "zwsc", "month", "jidu", "year","SUBTYPE").max("age") > .min("age").sum("zwsc").count().buildcube(); > > cube.write().mode(SaveMode.Overwrite).saveAsTable("dt.cuberdemo"); > {code} > Summary Metrics for 12 Completed Tasks > MetricMin 25th percentile Median 75th percentile Max > Duration 2.6 min 2.7 min 2.7 min 2.7 min 2.7 min > GC Time 1.6 min 1.6 min 1.6 min 1.6 min 1.6 min > Shuffle Read Size / Records 728.4 KB / 21886736.6 KB / 22258 > 738.7 KB / 22387746.6 KB / 22542748.6 KB / 22783 > Shuffle Write Size / Records 74.3 MB / 1926282 75.8 MB / 1965860 > 76.2 MB / 1976004 76.4 MB / 1981516 77.9 MB / 2021142 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16361) It takes a long time for gc when building cube with many fields
lichenglin created SPARK-16361: -- Summary: It takes a long time for gc when building cube with many fields Key: SPARK-16361 URL: https://issues.apache.org/jira/browse/SPARK-16361 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.6.2 Reporter: lichenglin I'm using Spark to build a cube on a DataFrame with about 1 million rows. I found that when I add too many fields (about 8 or more), the worker spends a lot of time on GC. I tried to increase the memory of each worker, but it did not help much, and I don't know why, sorry. Here are my code and the monitoring figures; Cuber is a utility class for building the cube. {code:title=Bar.java|borderStyle=solid} sqlContext.udf().register("jidu", (Integer f) -> { return (f - 1) / 3 + 1; } , DataTypes.IntegerType); DataFrame d = sqlContext.table("dw.dw_cust_info").selectExpr("*", "cast (CUST_AGE as double) as c_age", "month(day) as month", "year(day) as year", "cast ((datediff(now(),INTIME)/365+1) as int ) as zwsc", "jidu(month(day)) as jidu"); Bucketizer b = new Bucketizer().setInputCol("c_age").setSplits(new double[] { Double.NEGATIVE_INFINITY, 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, Double.POSITIVE_INFINITY }).setOutputCol("age"); DataFrame cube = new Cuber(b.transform(d)) .addFields("day", "AREA_CODE", "CUST_TYPE", "age", "zwsc", "month", "jidu", "year","SUBTYPE").max("age") .min("age").sum("zwsc").count().buildcube(); cube.write().mode(SaveMode.Overwrite).saveAsTable("dt.cuberdemo"); {code} Summary Metrics for 12 Completed Tasks Metric Min 25th percentile Median 75th percentile Max Duration 2.6 min 2.7 min 2.7 min 2.7 min 2.7 min GC Time 1.6 min 1.6 min 1.6 min 1.6 min 1.6 min Shuffle Read Size / Records 728.4 KB / 21886 736.6 KB / 22258 738.7 KB / 22387 746.6 KB / 22542 748.6 KB / 22783 Shuffle Write Size / Records 74.3 MB / 1926282 75.8 MB / 1965860 76.2 MB / 1976004 76.4 MB / 1981516 77.9 MB / 2021142 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
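The growth per added field is roughly what a cube implies: each input row is expanded into one row per grouping set, i.e. 2^n combinations for n dimensions, before partial aggregation (this is how Spark SQL's own CUBE behaves via Expand; whether the custom Cuber does exactly the same is an assumption). A back-of-envelope check:
{code}
// Back-of-envelope only.
long rows = 1000000L;          // ~1M input rows, as in the report
int dims = 9;                  // day, AREA_CODE, CUST_TYPE, age, zwsc, month, jidu, year, SUBTYPE
long expanded = rows << dims;  // 1,000,000 * 2^9 = 512,000,000 intermediate records
System.out.println(expanded);  // 512000000
{code}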
[jira] [Updated] (SPARK-15900) please add a map param on MQTTUtils.createStream for setting MqttConnectOptions
[ https://issues.apache.org/jira/browse/SPARK-15900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lichenglin updated SPARK-15900: --- Summary: please add a map param on MQTTUtils.createStream for setting MqttConnectOptions (was: please add a map param on MQTTUtils.createStreamfor setting MqttConnectOptions ) > please add a map param on MQTTUtils.createStream for setting > MqttConnectOptions > > > Key: SPARK-15900 > URL: https://issues.apache.org/jira/browse/SPARK-15900 > Project: Spark > Issue Type: New Feature > Components: Streaming >Affects Versions: 1.6.1 >Reporter: lichenglin > > I notice that MQTTReceiver will create a connection with the method > (org.eclipse.paho.client.mqttv3.MqttClient) client.connect() > It just means client.connect(new MqttConnectOptions()); > this causes that we have to use the default MqttConnectOptions and can't set > other param like usename and password. > please add a new method at MQTTUtils.createStream like > createStream(jssc.ssc, brokerUrl, topic, map,storageLevel) > in order to make a none-default MqttConnectOptions. > thanks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15900) please add a map param on MQTTUtils.createStreamfor setting MqttConnectOptions
[ https://issues.apache.org/jira/browse/SPARK-15900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lichenglin updated SPARK-15900: --- Summary: please add a map param on MQTTUtils.createStreamfor setting MqttConnectOptions (was: please add a map param on MQTTUtils.create for setting MqttConnectOptions ) > please add a map param on MQTTUtils.createStreamfor setting > MqttConnectOptions > --- > > Key: SPARK-15900 > URL: https://issues.apache.org/jira/browse/SPARK-15900 > Project: Spark > Issue Type: New Feature > Components: Streaming >Affects Versions: 1.6.1 >Reporter: lichenglin > > I notice that MQTTReceiver will create a connection with the method > (org.eclipse.paho.client.mqttv3.MqttClient) client.connect() > It just means client.connect(new MqttConnectOptions()); > this causes that we have to use the default MqttConnectOptions and can't set > other param like usename and password. > please add a new method at MQTTUtils.createStream like > createStream(jssc.ssc, brokerUrl, topic, map,storageLevel) > in order to make a none-default MqttConnectOptions. > thanks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15900) please add a map param on MQTTUtils.create for setting MqttConnectOptions
lichenglin created SPARK-15900: -- Summary: please add a map param on MQTTUtils.create for setting MqttConnectOptions Key: SPARK-15900 URL: https://issues.apache.org/jira/browse/SPARK-15900 Project: Spark Issue Type: New Feature Components: Streaming Affects Versions: 1.6.1 Reporter: lichenglin I notice that MQTTReceiver creates its connection with the method (org.eclipse.paho.client.mqttv3.MqttClient) client.connect(), which is equivalent to client.connect(new MqttConnectOptions()); this means we have to use the default MqttConnectOptions and cannot set other parameters such as username and password. Please add a new method to MQTTUtils.createStream like createStream(jssc.ssc, brokerUrl, topic, map, storageLevel) in order to build a non-default MqttConnectOptions. Thanks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
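For reference, the settings that the receiver currently cannot expose are ordinary Paho options; set directly on the Paho client they look like the sketch below (plain Paho API, not Spark; brokerUrl is assumed to be defined, and the credentials are placeholders).
{code}
import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttConnectOptions;

// Plain Paho sketch of the options a map parameter on createStream would need to carry.
MqttClient client = new MqttClient(brokerUrl, MqttClient.generateClientId());
MqttConnectOptions options = new MqttConnectOptions();
options.setUserName("user");                  // credentials the default options cannot carry
options.setPassword("secret".toCharArray());
options.setCleanSession(true);
client.connect(options);
{code}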
[jira] [Created] (SPARK-15497) DecisionTreeClassificationModel can't be saved within in Pipeline caused by not implement Writable
lichenglin created SPARK-15497: -- Summary: DecisionTreeClassificationModel can't be saved within in Pipeline caused by not implement Writable Key: SPARK-15497 URL: https://issues.apache.org/jira/browse/SPARK-15497 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.6.1 Reporter: lichenglin Here is my code {code} SQLContext sqlContext = getSQLContext(); DataFrame data = sqlContext.read().format("libsvm").load("file:///E:/workspace-mars/bigdata/sparkjob/data/mllib/sample_libsvm_data.txt"); // Index labels, adding metadata to the label column. // Fit on whole dataset to include all labels in index. StringIndexerModel labelIndexer = new StringIndexer() .setInputCol("label") .setOutputCol("indexedLabel") .fit(data); // Automatically identify categorical features, and index them. VectorIndexerModel featureIndexer = new VectorIndexer() .setInputCol("features") .setOutputCol("indexedFeatures") .setMaxCategories(4) // features with > 4 distinct values are treated as continuous .fit(data); // Split the data into training and test sets (30% held out for testing) DataFrame[] splits = data.randomSplit(new double[]{0.7, 0.3}); DataFrame trainingData = splits[0]; DataFrame testData = splits[1]; // Train a DecisionTree model. DecisionTreeClassifier dt = new DecisionTreeClassifier() .setLabelCol("indexedLabel") .setFeaturesCol("indexedFeatures"); // Convert indexed labels back to original labels. IndexToString labelConverter = new IndexToString() .setInputCol("prediction") .setOutputCol("predictedLabel") .setLabels(labelIndexer.labels()); // Chain indexers and tree in a Pipeline Pipeline pipeline = new Pipeline() .setStages(new PipelineStage[]{labelIndexer, featureIndexer, dt, labelConverter}); // Train model. This also runs the indexers. PipelineModel model = pipeline.fit(trainingData); model.save("file:///e:/tmpmodel"); {code} and here is the exception {code} Exception in thread "main" java.lang.UnsupportedOperationException: Pipeline write will fail on this Pipeline because it contains a stage which does not implement Writable. Non-Writable stage: dtc_7bdeae1c4fb8 of type class org.apache.spark.ml.classification.DecisionTreeClassificationModel at org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$validateStages$1.apply(Pipeline.scala:218) at org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$validateStages$1.apply(Pipeline.scala:215) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at org.apache.spark.ml.Pipeline$SharedReadWrite$.validateStages(Pipeline.scala:215) at org.apache.spark.ml.PipelineModel$PipelineModelWriter.(Pipeline.scala:325) at org.apache.spark.ml.PipelineModel.write(Pipeline.scala:309) at org.apache.spark.ml.util.MLWritable$class.save(ReadWrite.scala:131) at org.apache.spark.ml.PipelineModel.save(Pipeline.scala:280) at com.bjdv.spark.job.Testjob.main(Testjob.java:142) {code} sample_libsvm_data.txt is included in the 1.6.1 release tar -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
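As the error message says, the fitted DecisionTreeClassificationModel does not implement MLWritable in 1.6.x, so Pipeline.write rejects the whole pipeline. A small diagnostic sketch, assuming the model from the code above, that lists the offending stages before attempting save():
{code}
import org.apache.spark.ml.Transformer;
import org.apache.spark.ml.util.MLWritable;

// Diagnostic sketch: report stages that cannot be persisted in this version.
for (Transformer stage : model.stages()) {
    if (!(stage instanceof MLWritable)) {
        System.out.println("Stage without save support: " + stage.uid()
            + " (" + stage.getClass().getSimpleName() + ")");
    }
}
{code}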
[jira] [Commented] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually
[ https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15297540#comment-15297540 ] lichenglin commented on SPARK-15044: This exception is caused by “the HiveContext cache the metadata of the table” ; When you delete the hdfs, you must refresh table by using sqlContext.refreshTable(tableName); I think it's a bug too. especially when using hiveserver to query the table that has been delete by java application or other way,this exception occur too. > spark-sql will throw "input path does not exist" exception if it handles a > partition which exists in hive table, but the path is removed manually > - > > Key: SPARK-15044 > URL: https://issues.apache.org/jira/browse/SPARK-15044 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: huangyu > > spark-sql will throw "input path not exist" exception if it handles a > partition which exists in hive table, but the path is removed manually.The > situation is as follows: > 1) Create a table "test". "create table test (n string) partitioned by (p > string)" > 2) Load some data into partition(p='1') > 3)Remove the path related to partition(p='1') of table test manually. "hadoop > fs -rmr /warehouse//test/p=1" > 4)Run spark sql, spark-sql -e "select n from test where p='1';" > Then it throws exception: > {code} > org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: > ./test/p=1 > at > org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285) > at > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228) > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at 
org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > {code} > The bug is in spark 1.6.1, if I use spark 1.4.0, It is OK > I think spark-sql should ignore the path, just like hive or it dose in early > versions, rather than throw an exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
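A sketch of the workaround described in the comment, assuming a HiveContext named sqlContext and the table from the report; whether Spark should also tolerate the missing path without an explicit refresh is exactly what the ticket asks for.
{code}
import org.apache.spark.sql.DataFrame;

// Sketch: refresh cached metadata after the partition directory was removed
// outside of Spark, then query again.
sqlContext.refreshTable("test");
DataFrame rows = sqlContext.sql("select n from test where p = '1'");
rows.show();
{code}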
[jira] [Closed] (SPARK-15478) LogisticRegressionModel coefficients() returns an empty vector
[ https://issues.apache.org/jira/browse/SPARK-15478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lichenglin closed SPARK-15478. -- Resolution: Not A Problem > LogisticRegressionModel coefficients() returns an empty vector > -- > > Key: SPARK-15478 > URL: https://issues.apache.org/jira/browse/SPARK-15478 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.1 >Reporter: lichenglin > > I'm not sure this is a bug. > I'm runing the sample code like this > {code} > public static void main(String[] args) { > SQLContext sqlContext = getSQLContext(); > DataFrame training = sqlContext.read().format("libsvm") > > .load("file:///E:/workspace-mars/bigdata/sparkjob/data/mllib/sample_libsvm_data.txt"); > LogisticRegression lr = new > LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8); > // Fit the model > LogisticRegressionModel lrModel = lr.fit(training); > // Print the coefficients and intercept for logistic regression > System.out.println("Coefficients: " + lrModel.coefficients() + > " Intercept: " + lrModel.intercept()); > } > {code} > but the Coefficients always return a empty vector. > {code} > Coefficients: (692,[],[]) Intercept: 2.751535313041949 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15478) LogisticRegressionModel coefficients() returns an empty vector
[ https://issues.apache.org/jira/browse/SPARK-15478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15296332#comment-15296332 ] lichenglin commented on SPARK-15478: Sorry,I made a mistake with wrong example data > LogisticRegressionModel coefficients() returns an empty vector > -- > > Key: SPARK-15478 > URL: https://issues.apache.org/jira/browse/SPARK-15478 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.1 >Reporter: lichenglin > > I'm not sure this is a bug. > I'm runing the sample code like this > {code} > public static void main(String[] args) { > SQLContext sqlContext = getSQLContext(); > DataFrame training = sqlContext.read().format("libsvm") > > .load("file:///E:/workspace-mars/bigdata/sparkjob/data/mllib/sample_libsvm_data.txt"); > LogisticRegression lr = new > LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8); > // Fit the model > LogisticRegressionModel lrModel = lr.fit(training); > // Print the coefficients and intercept for logistic regression > System.out.println("Coefficients: " + lrModel.coefficients() + > " Intercept: " + lrModel.intercept()); > } > {code} > but the Coefficients always return a empty vector. > {code} > Coefficients: (692,[],[]) Intercept: 2.751535313041949 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
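Whatever was wrong with the example data, an all-zero sparse vector printed as (692,[],[]) is also the expected outcome when the penalty dominates: regParam 0.3 with elasticNetParam 0.8 puts a strong L1 term on every coefficient. A hedged sketch reusing the training DataFrame from the report, with deliberately weaker (illustrative) regularization:
{code}
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.classification.LogisticRegressionModel;

// Sketch: weaker regularization should yield non-empty coefficients on the same data.
LogisticRegression weakLr = new LogisticRegression()
    .setMaxIter(10)
    .setRegParam(0.01)        // much weaker penalty than 0.3
    .setElasticNetParam(0.0); // pure L2 keeps the coefficient vector dense
LogisticRegressionModel weakModel = weakLr.fit(training);
System.out.println("Coefficients: " + weakModel.coefficients());
{code}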
[jira] [Updated] (SPARK-15478) LogisticRegressionModel 's Coefficients always return a empty vector
[ https://issues.apache.org/jira/browse/SPARK-15478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lichenglin updated SPARK-15478: --- Description: I'm not sure this is a bug. I'm runing the sample code like this {code} public static void main(String[] args) { SQLContext sqlContext = getSQLContext(); DataFrame training = sqlContext.read().format("libsvm") .load("file:///E:/workspace-mars/bigdata/sparkjob/data/mllib/sample_libsvm_data.txt"); LogisticRegression lr = new LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8); // Fit the model LogisticRegressionModel lrModel = lr.fit(training); // Print the coefficients and intercept for logistic regression System.out.println("Coefficients: " + lrModel.coefficients() + " Intercept: " + lrModel.intercept()); } {code} but the Coefficients always return a empty vector. {code} Coefficients: (692,[],[]) Intercept: 2.751535313041949 {code} was: I don't know if this is a bug. I'm runing the sample code like this {code} public static void main(String[] args) { SQLContext sqlContext = getSQLContext(); DataFrame training = sqlContext.read().format("libsvm") .load("file:///E:/workspace-mars/bigdata/sparkjob/data/mllib/sample_libsvm_data.txt"); LogisticRegression lr = new LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8); // Fit the model LogisticRegressionModel lrModel = lr.fit(training); // Print the coefficients and intercept for logistic regression System.out.println("Coefficients: " + lrModel.coefficients() + " Intercept: " + lrModel.intercept()); } {code} but the Coefficients always return a empty vector. {code} Coefficients: (692,[],[]) Intercept: 2.751535313041949 {code} > LogisticRegressionModel 's Coefficients always return a empty vector > -- > > Key: SPARK-15478 > URL: https://issues.apache.org/jira/browse/SPARK-15478 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.1 >Reporter: lichenglin > > I'm not sure this is a bug. > I'm runing the sample code like this > {code} > public static void main(String[] args) { > SQLContext sqlContext = getSQLContext(); > DataFrame training = sqlContext.read().format("libsvm") > > .load("file:///E:/workspace-mars/bigdata/sparkjob/data/mllib/sample_libsvm_data.txt"); > LogisticRegression lr = new > LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8); > // Fit the model > LogisticRegressionModel lrModel = lr.fit(training); > // Print the coefficients and intercept for logistic regression > System.out.println("Coefficients: " + lrModel.coefficients() + > " Intercept: " + lrModel.intercept()); > } > {code} > but the Coefficients always return a empty vector. > {code} > Coefficients: (692,[],[]) Intercept: 2.751535313041949 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15478) LogisticRegressionModel 's Coefficients always return a empty vector
[ https://issues.apache.org/jira/browse/SPARK-15478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lichenglin updated SPARK-15478: --- Summary: LogisticRegressionModel 's Coefficients always return a empty vector (was: LogisticRegressionModel 's Coefficients always be a empty vector) > LogisticRegressionModel 's Coefficients always return a empty vector > -- > > Key: SPARK-15478 > URL: https://issues.apache.org/jira/browse/SPARK-15478 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.1 >Reporter: lichenglin > > I don't know if this is a bug. > I'm runing the sample code like this > {code} > public static void main(String[] args) { > SQLContext sqlContext = getSQLContext(); > DataFrame training = sqlContext.read().format("libsvm") > > .load("file:///E:/workspace-mars/bigdata/sparkjob/data/mllib/sample_libsvm_data.txt"); > LogisticRegression lr = new > LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8); > // Fit the model > LogisticRegressionModel lrModel = lr.fit(training); > // Print the coefficients and intercept for logistic regression > System.out.println("Coefficients: " + lrModel.coefficients() + > " Intercept: " + lrModel.intercept()); > } > {code} > but the Coefficients always return a empty vector. > {code} > Coefficients: (692,[],[]) Intercept: 2.751535313041949 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15478) LogisticRegressionModel 's Coefficients always be a empty vector
[ https://issues.apache.org/jira/browse/SPARK-15478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lichenglin updated SPARK-15478: --- Description: I don't know if this is a bug. I'm runing the sample code like this {code} public static void main(String[] args) { SQLContext sqlContext = getSQLContext(); DataFrame training = sqlContext.read().format("libsvm") .load("file:///E:/workspace-mars/bigdata/sparkjob/data/mllib/sample_libsvm_data.txt"); LogisticRegression lr = new LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8); // Fit the model LogisticRegressionModel lrModel = lr.fit(training); // Print the coefficients and intercept for logistic regression System.out.println("Coefficients: " + lrModel.coefficients() + " Intercept: " + lrModel.intercept()); } {code} but the Coefficients always return a empty vector. {code} Coefficients: (692,[],[]) Intercept: 2.751535313041949 {code} was: I don't know if this is a bug. I'm runing the sample code like this public static void main(String[] args) { SQLContext sqlContext = getSQLContext(); DataFrame training = sqlContext.read().format("libsvm") .load("file:///E:/workspace-mars/bigdata/sparkjob/data/mllib/sample_libsvm_data.txt"); LogisticRegression lr = new LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8); // Fit the model LogisticRegressionModel lrModel = lr.fit(training); // Print the coefficients and intercept for logistic regression System.out.println("Coefficients: " + lrModel.coefficients() + " Intercept: " + lrModel.intercept()); } but the Coefficients always return a empty vector. Coefficients: (692,[],[]) Intercept: 2.751535313041949 > LogisticRegressionModel 's Coefficients always be a empty vector > -- > > Key: SPARK-15478 > URL: https://issues.apache.org/jira/browse/SPARK-15478 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.1 >Reporter: lichenglin > > I don't know if this is a bug. > I'm runing the sample code like this > {code} > public static void main(String[] args) { > SQLContext sqlContext = getSQLContext(); > DataFrame training = sqlContext.read().format("libsvm") > > .load("file:///E:/workspace-mars/bigdata/sparkjob/data/mllib/sample_libsvm_data.txt"); > LogisticRegression lr = new > LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8); > // Fit the model > LogisticRegressionModel lrModel = lr.fit(training); > // Print the coefficients and intercept for logistic regression > System.out.println("Coefficients: " + lrModel.coefficients() + > " Intercept: " + lrModel.intercept()); > } > {code} > but the Coefficients always return a empty vector. > {code} > Coefficients: (692,[],[]) Intercept: 2.751535313041949 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15478) LogisticRegressionModel 's Coefficients always be a empty vector
lichenglin created SPARK-15478: -- Summary: LogisticRegressionModel 's Coefficients always be a empty vector Key: SPARK-15478 URL: https://issues.apache.org/jira/browse/SPARK-15478 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.6.1 Reporter: lichenglin I don't know whether this is a bug. I'm running the sample code like this: {code} public static void main(String[] args) { SQLContext sqlContext = getSQLContext(); DataFrame training = sqlContext.read().format("libsvm") .load("file:///E:/workspace-mars/bigdata/sparkjob/data/mllib/sample_libsvm_data.txt"); LogisticRegression lr = new LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8); // Fit the model LogisticRegressionModel lrModel = lr.fit(training); // Print the coefficients and intercept for logistic regression System.out.println("Coefficients: " + lrModel.coefficients() + " Intercept: " + lrModel.intercept()); } {code} but the coefficients always come back as an empty vector: {code} Coefficients: (692,[],[]) Intercept: 2.751535313041949 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
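For the record, an all-zero coefficients vector such as (692,[],[]) is the expected outcome of this configuration rather than a Spark bug: with elasticNetParam = 0.8 the penalty is mostly L1, and regParam = 0.3 is strong enough on the small sample_libsvm_data.txt set to shrink every weight to exactly zero, leaving only the intercept. A minimal Scala sketch that weakens the penalty to obtain non-empty coefficients is shown below; it assumes a SparkSession named spark and the sample file shipped with the Spark distribution (on 1.6 the same calls work against an SQLContext).
{code}
import org.apache.spark.ml.classification.LogisticRegression

// Assumes a SparkSession named `spark` and the sample libsvm file bundled with Spark.
val training = spark.read.format("libsvm")
  .load("data/mllib/sample_libsvm_data.txt")

// With heavy L1 (elasticNetParam = 0.8), regParam = 0.3 zeroes every coefficient;
// a much weaker penalty keeps some of them non-zero.
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)
  .setElasticNetParam(0.8)

val lrModel = lr.fit(training)
// The sparse vector should now contain non-zero entries instead of (692,[],[]).
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")
{code}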
[jira] [Updated] (SPARK-14886) RankingMetrics.ndcgAt throw java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-14886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lichenglin updated SPARK-14886: --- Description: {code} @Since("1.2.0") def ndcgAt(k: Int): Double = { require(k > 0, "ranking position k should be positive") predictionAndLabels.map { case (pred, lab) => val labSet = lab.toSet if (labSet.nonEmpty) { val labSetSize = labSet.size val n = math.min(math.max(pred.length, labSetSize), k) var maxDcg = 0.0 var dcg = 0.0 var i = 0 while (i < n) { val gain = 1.0 / math.log(i + 2) if (labSet.contains(pred(i))) { dcg += gain } if (i < labSetSize) { maxDcg += gain } i += 1 } dcg / maxDcg } else { logWarning("Empty ground truth set, check input data") 0.0 } }.mean() } {code} "if (labSet.contains(pred(i)))" will throw ArrayIndexOutOfBoundsException when pred's size less then k. That meas the true relevant documents has less size then the param k. just try this with sample_movielens_data.txt for example set pred.size to 5,labSetSize to 10,k to 20,then the n is 10. pred[10] not exists; precisionAt is ok just because it has val n = math.min(pred.length, k) was: {code} @Since("1.2.0") def ndcgAt(k: Int): Double = { require(k > 0, "ranking position k should be positive") predictionAndLabels.map { case (pred, lab) => val labSet = lab.toSet if (labSet.nonEmpty) { val labSetSize = labSet.size val n = math.min(math.max(pred.length, labSetSize), k) var maxDcg = 0.0 var dcg = 0.0 var i = 0 while (i < n) { val gain = 1.0 / math.log(i + 2) if (labSet.contains(pred(i))) { dcg += gain } if (i < labSetSize) { maxDcg += gain } i += 1 } dcg / maxDcg } else { logWarning("Empty ground truth set, check input data") 0.0 } }.mean() } {code} "if (labSet.contains(pred(i)))" will throw ArrayIndexOutOfBoundsException when pred's size less then k. That meas the true relevant documents has less size then the param k. just try this with sample_movielens_data.txt precisionAt is ok just because it has val n = math.min(pred.length, k) > RankingMetrics.ndcgAt throw java.lang.ArrayIndexOutOfBoundsException > -- > > Key: SPARK-14886 > URL: https://issues.apache.org/jira/browse/SPARK-14886 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.1 >Reporter: lichenglin >Priority: Minor > > {code} > @Since("1.2.0") > def ndcgAt(k: Int): Double = { > require(k > 0, "ranking position k should be positive") > predictionAndLabels.map { case (pred, lab) => > val labSet = lab.toSet > if (labSet.nonEmpty) { > val labSetSize = labSet.size > val n = math.min(math.max(pred.length, labSetSize), k) > var maxDcg = 0.0 > var dcg = 0.0 > var i = 0 > while (i < n) { > val gain = 1.0 / math.log(i + 2) > if (labSet.contains(pred(i))) { > dcg += gain > } > if (i < labSetSize) { > maxDcg += gain > } > i += 1 > } > dcg / maxDcg > } else { > logWarning("Empty ground truth set, check input data") > 0.0 > } > }.mean() > } > {code} > "if (labSet.contains(pred(i)))" will throw ArrayIndexOutOfBoundsException > when pred's size less then k. > That meas the true relevant documents has less size then the param k. > just try this with sample_movielens_data.txt > for example set pred.size to 5,labSetSize to 10,k to 20,then the n is 10. > pred[10] not exists; > precisionAt is ok just because it has > val n = math.min(pred.length, k) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14886) RankingMetrics.ndcgAt throw java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-14886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lichenglin updated SPARK-14886: --- Description: @Since("1.2.0") def ndcgAt(k: Int): Double = { require(k > 0, "ranking position k should be positive") predictionAndLabels.map { case (pred, lab) => val labSet = lab.toSet if (labSet.nonEmpty) { val labSetSize = labSet.size val n = math.min(math.max(pred.length, labSetSize), k) var maxDcg = 0.0 var dcg = 0.0 var i = 0 while (i < n) { val gain = 1.0 / math.log(i + 2) if (labSet.contains(pred(i))) { dcg += gain } if (i < labSetSize) { maxDcg += gain } i += 1 } dcg / maxDcg } else { logWarning("Empty ground truth set, check input data") 0.0 } }.mean() } "if (labSet.contains(pred(i)))" will throw ArrayIndexOutOfBoundsException when pred's size less then k. That meas the true relevant documents has less size then the param k. just try this with sample_movielens_data.txt precisionAt is ok just because it has val n = math.min(pred.length, k) was: @Since("1.2.0") def ndcgAt(k: Int): Double = { require(k > 0, "ranking position k should be positive") predictionAndLabels.map { case (pred, lab) => val labSet = lab.toSet if (labSet.nonEmpty) { val labSetSize = labSet.size val n = math.min(math.max(pred.length, labSetSize), k) var maxDcg = 0.0 var dcg = 0.0 var i = 0 while (i < n) { val gain = 1.0 / math.log(i + 2) if (labSet.contains(pred(i))) { dcg += gain } if (i < labSetSize) { maxDcg += gain } i += 1 } dcg / maxDcg } else { logWarning("Empty ground truth set, check input data") 0.0 } }.mean() } if (labSet.contains(pred(i))) will throw ArrayIndexOutOfBoundsException when the true relevant documents has less size the the param k. just try this with sample_movielens_data.txt precisionAt is ok just because it has val n = math.min(pred.length, k) > RankingMetrics.ndcgAt throw java.lang.ArrayIndexOutOfBoundsException > -- > > Key: SPARK-14886 > URL: https://issues.apache.org/jira/browse/SPARK-14886 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.1 >Reporter: lichenglin > > @Since("1.2.0") > def ndcgAt(k: Int): Double = { > require(k > 0, "ranking position k should be positive") > predictionAndLabels.map { case (pred, lab) => > val labSet = lab.toSet > if (labSet.nonEmpty) { > val labSetSize = labSet.size > val n = math.min(math.max(pred.length, labSetSize), k) > var maxDcg = 0.0 > var dcg = 0.0 > var i = 0 > while (i < n) { > val gain = 1.0 / math.log(i + 2) > if (labSet.contains(pred(i))) { > dcg += gain > } > if (i < labSetSize) { > maxDcg += gain > } > i += 1 > } > dcg / maxDcg > } else { > logWarning("Empty ground truth set, check input data") > 0.0 > } > }.mean() > } > "if (labSet.contains(pred(i)))" will throw ArrayIndexOutOfBoundsException > when pred's size less then k. > That meas the true relevant documents has less size then the param k. > just try this with sample_movielens_data.txt > precisionAt is ok just because it has > val n = math.min(pred.length, k) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14886) RankingMetrics.ndcgAt throw java.lang.ArrayIndexOutOfBoundsException
lichenglin created SPARK-14886: -- Summary: RankingMetrics.ndcgAt throw java.lang.ArrayIndexOutOfBoundsException Key: SPARK-14886 URL: https://issues.apache.org/jira/browse/SPARK-14886 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.6.1 Reporter: lichenglin {code} @Since("1.2.0") def ndcgAt(k: Int): Double = { require(k > 0, "ranking position k should be positive") predictionAndLabels.map { case (pred, lab) => val labSet = lab.toSet if (labSet.nonEmpty) { val labSetSize = labSet.size val n = math.min(math.max(pred.length, labSetSize), k) var maxDcg = 0.0 var dcg = 0.0 var i = 0 while (i < n) { val gain = 1.0 / math.log(i + 2) if (labSet.contains(pred(i))) { dcg += gain } if (i < labSetSize) { maxDcg += gain } i += 1 } dcg / maxDcg } else { logWarning("Empty ground truth set, check input data") 0.0 } }.mean() } {code} The check if (labSet.contains(pred(i))) will throw an ArrayIndexOutOfBoundsException when the set of true relevant documents is smaller than the param k. Just try this with sample_movielens_data.txt. precisionAt is fine only because it uses val n = math.min(pred.length, k). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
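The root cause, as the later description refinements above spell out, is that n = math.min(math.max(pred.length, labSetSize), k) is only capped by k, not by pred.length, so pred(i) is indexed past the end whenever fewer than k items were predicted. Below is a minimal standalone sketch of one possible guard, written for a single (pred, lab) pair of Int arrays rather than Spark's RDD of pairs; it illustrates the idea and is not necessarily the patch that was eventually merged.
{code}
// Standalone sketch of a guarded nDCG@k for one (pred, lab) pair.
def ndcgAt(pred: Array[Int], lab: Array[Int], k: Int): Double = {
  require(k > 0, "ranking position k should be positive")
  val labSet = lab.toSet
  if (labSet.isEmpty) {
    0.0
  } else {
    val labSetSize = labSet.size
    // n can still exceed pred.length, which is why the access below is guarded.
    val n = math.min(math.max(pred.length, labSetSize), k)
    var maxDcg = 0.0
    var dcg = 0.0
    var i = 0
    while (i < n) {
      val gain = 1.0 / math.log(i + 2)
      // Guard: only look up pred(i) while i is a valid index into pred.
      if (i < pred.length && labSet.contains(pred(i))) {
        dcg += gain
      }
      if (i < labSetSize) {
        maxDcg += gain
      }
      i += 1
    }
    dcg / maxDcg
  }
}
{code}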
[jira] [Created] (SPARK-13999) Run 'group by' before building cube
lichenglin created SPARK-13999: -- Summary: Run 'group by' before building cube Key: SPARK-13999 URL: https://issues.apache.org/jira/browse/SPARK-13999 Project: Spark Issue Type: Improvement Reporter: lichenglin When I tried to build a cube on a data set which has about 1 billion rows, with 7 dimensions, it took a whole day to finish the job with 16 cores. Then I ran 'select count(1) from table group by A,B,C,D,E,F,G' first and built the cube on the 'group by' result data set, using the same dimensions and summing the 'count' column. It only needed 45 minutes. The group by reduces the data set from billions of rows to millions, depending on the number of dimensions. We could try this in a new version. Averaging is more complex: the group by step would have to keep both the sum and the count. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
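The speed-up comes from collapsing the billion raw rows down to one row per distinct dimension combination before cubing, and then summing the pre-computed counts. A rough Scala sketch of that idea follows; the table name events, the dimension columns A through G and the alias cnt are hypothetical, and a SparkSession named spark is assumed.
{code}
import org.apache.spark.sql.functions._

// Hypothetical source table with dimension columns A..G.
val raw = spark.table("events")

// Step 1: a plain GROUP BY collapses ~1 billion rows to one row per distinct combination.
val pre = raw.groupBy("A", "B", "C", "D", "E", "F", "G")
  .agg(count(lit(1)).as("cnt"))

// Step 2: build the cube on the much smaller pre-aggregate, summing the partial counts.
val cubed = pre.cube("A", "B", "C", "D", "E", "F", "G")
  .agg(sum("cnt").as("cnt"))

// For an average, the pre-aggregate would have to carry both sum(x) and count(x),
// and the cube step would divide the summed sums by the summed counts.
cubed.write.mode("overwrite").saveAsTable("events_cube")
{code}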
[jira] [Closed] (SPARK-13907) Imporvement the cube with the Fast Cubing In apache Kylin
[ https://issues.apache.org/jira/browse/SPARK-13907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lichenglin closed SPARK-13907. -- > Imporvement the cube with the Fast Cubing In apache Kylin > - > > Key: SPARK-13907 > URL: https://issues.apache.org/jira/browse/SPARK-13907 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.1 >Reporter: lichenglin > > I tried to build a cube on a 100 million row data set. > When I set 9 fields to build the cube with 10 cores, > it took nearly a whole day to finish the job. > At the same time, it generated almost 1 TB of data in the /tmp folder. > Could we refer to the "fast cubing" algorithm in Apache Kylin > to make the cube builder faster? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13907) Imporvement the cube with the Fast Cubing In apache Kylin
[ https://issues.apache.org/jira/browse/SPARK-13907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lichenglin updated SPARK-13907: --- Description: I tried to build a cube on a 100 million data set. When I set 9 fields to build the cube with 10 cores. It nearly coast me a whole day to finish the job. At the same time, it generate almost 1”TB“ data in the "/tmp“ folder. Could we refer to the ”fast cube“ algorithm in apache Kylin To make the cube builder more quickly??? was: I tried to build a cube on a 100 million data set. When I set 9 fields to build the cube with 10 cores. It nearly coast me a whole day to finish the job. At the same time, it generate almost 1”TB“ data in the "/tmp“ folder. Could we refer to the ”fast cube“ algorithm in apache Kylin To make the cube builder more quickly??? For example group(A,B,C)'s result can be create by the groupdata(A,B) > Imporvement the cube with the Fast Cubing In apache Kylin > - > > Key: SPARK-13907 > URL: https://issues.apache.org/jira/browse/SPARK-13907 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.1 >Reporter: lichenglin > > I tried to build a cube on a 100 million data set. > When I set 9 fields to build the cube with 10 cores. > It nearly coast me a whole day to finish the job. > At the same time, it generate almost 1”TB“ data in the "/tmp“ folder. > Could we refer to the ”fast cube“ algorithm in apache Kylin > To make the cube builder more quickly??? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13907) Imporvement the cube with the Fast Cubing In apache Kylin
[ https://issues.apache.org/jira/browse/SPARK-13907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lichenglin updated SPARK-13907: --- Description: I tried to build a cube on a 100 million data set. When I set 9 fields to build the cube with 10 cores. It nearly coast me a whole day to finish the job. At the same time, it generate almost 1”TB“ data in the "/tmp“ folder. Could we refer to the ”fast cube“ algorithm in apache Kylin To make the cube builder more quickly??? For example group(A,B,C)'s result can be create by the groupdata(A,B) was: I tried to build a cube on a 100 million data set. When I set 9 fields to build the cube with 10 cores. It nearly coast me a whole day to finish the job. At the same time, it generate almost 1”TB“ data in the "/tmp“ folder. Could we refer to the ”fast cube“ algorithm in apache Kylin To make the cube builder more quickly??? > Imporvement the cube with the Fast Cubing In apache Kylin > - > > Key: SPARK-13907 > URL: https://issues.apache.org/jira/browse/SPARK-13907 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.1 >Reporter: lichenglin > > I tried to build a cube on a 100 million data set. > When I set 9 fields to build the cube with 10 cores. > It nearly coast me a whole day to finish the job. > At the same time, it generate almost 1”TB“ data in the "/tmp“ folder. > Could we refer to the ”fast cube“ algorithm in apache Kylin > To make the cube builder more quickly??? > For example > group(A,B,C)'s result can be create by the groupdata(A,B) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13907) Imporvement the cube with the Fast Cubing In apache Kylin
lichenglin created SPARK-13907: -- Summary: Imporvement the cube with the Fast Cubing In apache Kylin Key: SPARK-13907 URL: https://issues.apache.org/jira/browse/SPARK-13907 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.6.1 Reporter: lichenglin I tried to build a cube on a 100 million row data set. When I set 9 fields to build the cube with 10 cores, it took nearly a whole day to finish the job. At the same time, it generated almost 1 TB of data in the /tmp folder. Could we refer to the "fast cubing" algorithm in Apache Kylin to make the cube builder faster? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
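For context, the core idea of Kylin-style cubing is that a child cuboid can be aggregated from an already-computed parent cuboid instead of rescanning the base table: for example, the (A, B) cuboid can be derived from the (A, B, C) cuboid. The sketch below only shows that reuse in DataFrame terms, not Kylin's actual layered or in-memory cubing implementation; the table and column names are hypothetical and a SparkSession named spark is assumed.
{code}
import org.apache.spark.sql.functions._

// Parent cuboid (A, B, C): aggregated once from the base table.
val cuboidABC = spark.table("events")
  .groupBy("A", "B", "C")
  .agg(count(lit(1)).as("cnt"))

// Child cuboid (A, B): derived from the parent's partial counts,
// so the 100-million-row base table is not scanned again.
val cuboidAB = cuboidABC
  .groupBy("A", "B")
  .agg(sum("cnt").as("cnt"))
{code}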
[jira] [Commented] (SPARK-13433) The standalone server should limit the count of cores and memory for running Drivers
[ https://issues.apache.org/jira/browse/SPARK-13433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157801#comment-15157801 ] lichenglin commented on SPARK-13433: I know about the property 'spark.driver.cores'. What I want to limit is the total count, not the count for a single driver. So what should I do if the drivers have used all the cores? There is no core left for any application, and then the drivers will never stop and free their resources. Is that correct? > The standalone server should limit the count of cores and memory for > running Drivers > -- > > Key: SPARK-13433 > URL: https://issues.apache.org/jira/browse/SPARK-13433 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 1.6.0 >Reporter: lichenglin > > I have a 16-core cluster. > A running driver uses at least 1 core, maybe more. > When I submit a lot of jobs to the standalone server in cluster mode, > all the cores may be used for running drivers, > and then there are no cores left to run applications. > The server is stuck. > So I think we should limit the resources (cores and memory) for running drivers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13433) The standalone server should limit the count of cores and memory for running Drivers
[ https://issues.apache.org/jira/browse/SPARK-13433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157121#comment-15157121 ] lichenglin commented on SPARK-13433: What I mean is that we should set a limit on the total cores used for running drivers. For example, if the cluster has 16 cores, the limit might be 10. If an 11th driver comes in, it should be kept in the submitted state, to make sure the remaining 6 cores are used to run applications. > The standalone server should limit the count of cores and memory for > running Drivers > -- > > Key: SPARK-13433 > URL: https://issues.apache.org/jira/browse/SPARK-13433 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 1.6.0 >Reporter: lichenglin > > I have a 16-core cluster. > A running driver uses at least 1 core, maybe more. > When I submit a lot of jobs to the standalone server in cluster mode, > all the cores may be used for running drivers, > and then there are no cores left to run applications. > The server is stuck. > So I think we should limit the resources (cores and memory) for running drivers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13433) The standalone server should limit the count of cores and memory for running Drivers
[ https://issues.apache.org/jira/browse/SPARK-13433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157118#comment-15157118 ] lichenglin commented on SPARK-13433: But when? When something else frees up resources? If all the cores are used for running drivers, the applications will never get any core, so no driver will ever finish and free its resources. > The standalone server should limit the count of cores and memory for > running Drivers > -- > > Key: SPARK-13433 > URL: https://issues.apache.org/jira/browse/SPARK-13433 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 1.6.0 >Reporter: lichenglin > > I have a 16-core cluster. > A running driver uses at least 1 core, maybe more. > When I submit a lot of jobs to the standalone server in cluster mode, > all the cores may be used for running drivers, > and then there are no cores left to run applications. > The server is stuck. > So I think we should limit the resources (cores and memory) for running drivers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13433) The standalone server should limit the count of cores and memory for running Drivers
[ https://issues.apache.org/jira/browse/SPARK-13433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157050#comment-15157050 ] lichenglin commented on SPARK-13433: It's something like a deadlock: the drivers use all the cores -> the applications have no cores and cannot finish -> the drivers cannot finish -> the drivers keep their cores -> the drivers keep using all the cores. If the applications never finish, the drivers never finish either. What I mean is that we should not let drivers use all the cores, so that applications always have at least one core to do their job. > The standalone server should limit the count of cores and memory for > running Drivers > -- > > Key: SPARK-13433 > URL: https://issues.apache.org/jira/browse/SPARK-13433 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 1.6.0 >Reporter: lichenglin > > I have a 16-core cluster. > A running driver uses at least 1 core, maybe more. > When I submit a lot of jobs to the standalone server in cluster mode, > all the cores may be used for running drivers, > and then there are no cores left to run applications. > The server is stuck. > So I think we should limit the resources (cores and memory) for running drivers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13433) The standalone server should limit the count of cores and memory for running Drivers
[ https://issues.apache.org/jira/browse/SPARK-13433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lichenglin updated SPARK-13433: --- Description: I have a 16 cores cluster. A Running driver at least use 1 core may be more. When I submit a lot of job to the standalone server in cluster mode. all the cores may be used for running driver, and then there is no cores to run applications The server is stuck. So I think we should limit the resources(cores and memory) for running driver. was: I have a 16 cores cluster. When I submit a lot of job to the standalone server in cluster mode. A Running driver at least use 1 core may be more. So all the cores may be used for running driver, and then there is no cores to run applications The server is stuck. So I think we should limit the resources for running driver. > The standalone server should limit the count of cores and memory for > running Drivers > -- > > Key: SPARK-13433 > URL: https://issues.apache.org/jira/browse/SPARK-13433 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 1.6.0 >Reporter: lichenglin > > I have a 16 cores cluster. > A Running driver at least use 1 core may be more. > When I submit a lot of job to the standalone server in cluster mode. > all the cores may be used for running driver, > and then there is no cores to run applications > The server is stuck. > So I think we should limit the resources(cores and memory) for running driver. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13433) The standalone server should limit the count of cores and memory for running Drivers
[ https://issues.apache.org/jira/browse/SPARK-13433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lichenglin updated SPARK-13433: --- Description: I have a 16 cores cluster. When I submit a lot of job to the standalone server in cluster mode. A Running driver at least use 1 core may be more. So all the cores may be used for running driver, and then there is no cores to run applications The server is stuck. So I think we should limit the resources for running driver. was: I have a 16 cores cluster. When I submit a lot of job to the standalone server in cluster mode. A Running driver at least use 1 core may be more. So all the cores may be used for running driver, and then there is no cores to run applications The server is stuck. So I think we should limit the cores for running driver. Summary: The standalone server should limit the count of cores and memory for running Drivers (was: The standalone should limit the count of Running Drivers) > The standalone server should limit the count of cores and memory for > running Drivers > -- > > Key: SPARK-13433 > URL: https://issues.apache.org/jira/browse/SPARK-13433 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 1.6.0 >Reporter: lichenglin > > I have a 16 cores cluster. > When I submit a lot of job to the standalone server in cluster mode. > A Running driver at least use 1 core may be more. > So all the cores may be used for running driver, > and then there is no cores to run applications > The server is stuck. > So I think we should limit the resources for running driver. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13433) The standalone should limit the count of Running Drivers
lichenglin created SPARK-13433: -- Summary: The standalone should limit the count of Running Drivers Key: SPARK-13433 URL: https://issues.apache.org/jira/browse/SPARK-13433 Project: Spark Issue Type: Improvement Components: Scheduler Affects Versions: 1.6.0 Reporter: lichenglin I have a 16-core cluster. When I submit a lot of jobs to the standalone server in cluster mode, a running driver uses at least 1 core, maybe more. So all the cores may be used for running drivers, and then there are no cores left to run applications. The server is stuck. So I think we should limit the cores for running drivers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
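For reference, standalone mode already exposes per-driver and per-application limits, but (as the discussion above notes) nothing caps the total cores taken by all drivers together, which is exactly what this issue asks for. The sketch below only lists the existing property names with illustrative values; in cluster mode they are normally passed at submit time (for example via --conf) rather than set inside the application.
{code}
import org.apache.spark.SparkConf

// Existing per-driver / per-application knobs with illustrative values.
// None of these caps the cluster-wide total consumed by all running drivers.
val conf = new SparkConf()
  .setAppName("Abc")
  .set("spark.driver.cores", "1")   // cores for this driver (standalone cluster mode)
  .set("spark.driver.memory", "1g") // memory for this driver
  .set("spark.cores.max", "5")      // total executor cores for this application
{code}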
[jira] [Commented] (SPARK-12963) In cluster mode,spark_local_ip will cause driver exception:Service 'Driver' failed after 16 retries!
[ https://issues.apache.org/jira/browse/SPARK-12963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15121065#comment-15121065 ] lichenglin commented on SPARK-12963: I think the driver's IP should be fixed, like 'localhost', rather than depending on any property. > In cluster mode,spark_local_ip will cause driver exception:Service 'Driver' > failed after 16 retries! > - > > Key: SPARK-12963 > URL: https://issues.apache.org/jira/browse/SPARK-12963 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.6.0 >Reporter: lichenglin >Priority: Critical > > I have a 3-node cluster: namenode, second and data1. > I use this shell command to submit a job on namenode: > bin/spark-submit --deploy-mode cluster --class com.bjdv.spark.job.Abc > --total-executor-cores 5 --master spark://namenode:6066 > hdfs://namenode:9000/sparkjars/spark.jar > The driver may be started on another node, such as data1. > The problem is: > when I set SPARK_LOCAL_IP in conf/spark-env.sh on namenode, > the driver will be started with this param, such as > SPARK_LOCAL_IP=namenode > but if the driver starts on data1, > the driver will try to bind the IP 'namenode' on data1, > so the driver will throw an exception like this: > Service 'Driver' failed after 16 retries! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12963) In cluster mode,spark_local_ip will cause driver exception:Service 'Driver' failed after 16 retries!
lichenglin created SPARK-12963: -- Summary: In cluster mode,spark_local_ip will cause driver exception:Service 'Driver' failed after 16 retries! Key: SPARK-12963 URL: https://issues.apache.org/jira/browse/SPARK-12963 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.6.0 Reporter: lichenglin Priority: Critical I have a 3-node cluster: namenode, second and data1. I use this shell command to submit a job on namenode: bin/spark-submit --deploy-mode cluster --class com.bjdv.spark.job.Abc --total-executor-cores 5 --master spark://namenode:6066 hdfs://namenode:9000/sparkjars/spark.jar The driver may be started on another node, such as data1. The problem is: when I set SPARK_LOCAL_IP in conf/spark-env.sh on namenode, the driver will be started with this param, such as SPARK_LOCAL_IP=namenode. But if the driver starts on data1, the driver will try to bind the IP 'namenode' on data1, so the driver will throw an exception like this: Service 'Driver' failed after 16 retries! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org