[jira] [Commented] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal
[ https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175204#comment-15175204 ]

Xiao Li commented on SPARK-13337:
---------------------------------

To get your results, try using left outer join + right outer join + union distinct. :)

> DataFrame join-on-columns function should support null-safe equal
> -----------------------------------------------------------------
>
>                 Key: SPARK-13337
>                 URL: https://issues.apache.org/jira/browse/SPARK-13337
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Zhong Wang
>            Priority: Minor
>
> Currently, the join-on-columns function:
> {code}
> def join(right: DataFrame, usingColumns: Seq[String], joinType: String): DataFrame
> {code}
> performs a null-unsafe join. It would be great if there were an option for a null-safe join.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
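The workaround suggested above can be sketched in spark-shell terms. This is only a sketch under assumptions: `df1`/`df2` and the key columns `key1`/`key2` are hypothetical names standing in for the two tables being joined, and the join condition uses Spark's null-safe equality `<=>` so that null keys still match:

```scala
// Hypothetical DataFrames df1 and df2 with key columns "key1", "key2".
// A null-safe join condition, so rows with null keys on both sides pair up:
val cond = df1("key1") <=> df2("key1") && df1("key2") <=> df2("key2")

// Left outer join keeps every row of df1; right outer join keeps every row
// of df2; union + distinct merges the two results into a full outer join.
val left   = df1.join(df2, cond, "left_outer")
val right  = df1.join(df2, cond, "right_outer")
val result = left.unionAll(right).distinct()  // 1.6-era API: unionAll + distinct
```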
[jira] [Updated] (SPARK-13614) show() trigger memory leak, why?
[ https://issues.apache.org/jira/browse/SPARK-13614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chillon_m updated SPARK-13614:
------------------------------
    Description:

hot.count() = 599147
ghot.size = 21844

{code}
[bigdata@namenode spark-1.5.2-bin-hadoop2.4]$ bin/spark-shell --driver-class-path /home/bigdata/mysql-connector-java-5.1.38-bin.jar
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_80)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.

scala> val hot = sqlContext.read.format("jdbc").options(Map("url" -> "jdbc:mysql://:/?user=&password=", "dbtable" -> "")).load()
Wed Mar 02 14:22:37 CST 2016 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
hot: org.apache.spark.sql.DataFrame = []

scala> val ghot = hot.groupBy("Num","pNum").count().collect()
Wed Mar 02 14:22:59 CST 2016 WARN: (same SSL warning as above)
ghot: Array[org.apache.spark.sql.Row] = Array([[],[],[], [,42310...

scala> ghot.take(20)
res0: Array[org.apache.spark.sql.Row] = Array([],[],[],[],[],[],[],[])

scala> hot.groupBy("Num","pNum").count().show()
Wed Mar 02 14:26:05 CST 2016 WARN: (same SSL warning as above)
16/03/02 14:26:33 ERROR Executor: Managed memory leak detected; size = 4194304 bytes, TID = 202
+--+-----+-----+
| QQNum|TroopNum|count|
+--+-----+-----+
|1X|38XXX|    1|
|1X| 5XXX|    2|
|1X|26XXX|    6|
|1X|14XXX|    3|
|1X|41XXX|   14|
|1X|48XXX|   18|
|1X|23XXX|    2|
|1X|  XXX|   34|
|1X|52XXX|    1|
|1X|52XXX|    2|
|1X|49XXX|    3|
|1X|42XXX|    3|
|1X|17XXX|   11|
|1X|25XXX|  129|
|1X|13XXX|    2|
|1X|19XXX|    1|
|1X|32XXX|    9|
|1X|38XXX|    6|
|1X|38XXX|   13|
|1X|30XXX|    4|
+--+-----+-----+
only showing top 20 rows
{code}

  was: (the same shell transcript, without the hot.count()/ghot.size figures)
[jira] [Assigned] (SPARK-13543) Support for specifying compression codec for Parquet/ORC via option()
[ https://issues.apache.org/jira/browse/SPARK-13543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-13543:
------------------------------------

    Assignee:     (was: Apache Spark)

> Support for specifying compression codec for Parquet/ORC via option()
> ---------------------------------------------------------------------
>
>                 Key: SPARK-13543
>                 URL: https://issues.apache.org/jira/browse/SPARK-13543
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Hyukjin Kwon
>            Priority: Minor
>
> Like SPARK-12871, SPARK-12872 and SPARK-13503, the compression codec can
> be set via {{option()}} for Parquet and ORC, rather than setting it manually
> in the Hadoop configuration.
[jira] [Assigned] (SPARK-13543) Support for specifying compression codec for Parquet/ORC via option()
[ https://issues.apache.org/jira/browse/SPARK-13543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-13543:
------------------------------------

    Assignee: Apache Spark
[jira] [Commented] (SPARK-13543) Support for specifying compression codec for Parquet/ORC via option()
[ https://issues.apache.org/jira/browse/SPARK-13543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175190#comment-15175190 ]

Apache Spark commented on SPARK-13543:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/11464
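For context, the feature tracked here would make the codec a per-write setting instead of a global one. A sketch of the before/after, under assumptions: `df` and the output paths are hypothetical, and the proposed `option("compression", ...)` call reflects the linked pull request rather than a released API:

```scala
// Today (1.x): the Parquet codec is a global SQL configuration, e.g.
sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")
df.write.parquet("/tmp/out.parquet")

// Proposed: the codec travels with the individual write via option().
df.write.option("compression", "snappy").parquet("/tmp/out.parquet")
df.write.option("compression", "zlib").orc("/tmp/out.orc")
```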
[jira] [Updated] (SPARK-13614) show() trigger memory leak, why?
[ https://issues.apache.org/jira/browse/SPARK-13614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chillon_m updated SPARK-13614:
------------------------------
    Attachment: memory leak.png

> show() trigger memory leak, why?
> --------------------------------
>
>                 Key: SPARK-13614
>                 URL: https://issues.apache.org/jira/browse/SPARK-13614
>             Project: Spark
>          Issue Type: Question
>          Components: SQL
>    Affects Versions: 1.5.2
>            Reporter: chillon_m
>         Attachments: memory leak.png, memory.png
>
[jira] [Updated] (SPARK-13614) show() trigger memory leak, why?
[ https://issues.apache.org/jira/browse/SPARK-13614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chillon_m updated SPARK-13614:
------------------------------
    Attachment:     (was: memory leak.png)
[jira] [Updated] (SPARK-13614) show() trigger memory leak, why?
[ https://issues.apache.org/jira/browse/SPARK-13614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chillon_m updated SPARK-13614:
------------------------------
     Summary: show() trigger memory leak, why?  (was: show() trigger memory leak)
[jira] [Updated] (SPARK-13614) show() trigger memory leak
[ https://issues.apache.org/jira/browse/SPARK-13614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chillon_m updated SPARK-13614:
------------------------------
    Description: (the spark-shell transcript quoted earlier in this digest)
[jira] [Comment Edited] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal
[ https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175154#comment-15175154 ]

Zhong Wang edited comment on SPARK-13337 at 3/2/16 6:50 AM:
------------------------------------------------------------

Suppose we are joining two tables:

TableA
||key1||key2||value1||
|null|k1|v1|
|k2|k3|v2|

TableB
||key1||key2||value2||
|null|k1|v3|
|k4|k5|v4|

The result table I want is:

TableC
||key1||key2||value1||value2||
|null|k1|v1|v3|
|k2|k3|v2|null|
|k4|k5|null|v4|

We cannot use the current join-using-columns interface, because it doesn't support null-safe joins, and we have null values in the first row.

We cannot use join-select with an explicit "<=>" either, because the output table will look like:

||df1.key1||df1.key2||df2.key1||df2.key2||value1||value2||
|null|k1|null|k1|v1|v3|
|k2|k3|null|null|v2|null|
|null|null|k4|k5|null|v4|

It is difficult to get a result like TableC using a select clause, because the null values from the outer join (rows 2 and 3) can appear in both the df1.* columns and the df2.* columns.

Hope this makes sense to you. I'd like to submit a PR if this is a real use case.
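The null-safe full outer join described above can be sketched in 1.6-era DataFrame code. A sketch under assumptions: `df1` stands for TableA and `df2` for TableB from the example, and `coalesce` collapses the two copies of each key column that the `<=>`-based join leaves behind:

```scala
import org.apache.spark.sql.functions.coalesce

// Hypothetical df1 = TableA, df2 = TableB from the example above.
// <=> is Spark's null-safe equality, so the null keys in row 1 still match.
val joined = df1.join(
  df2,
  df1("key1") <=> df2("key1") && df1("key2") <=> df2("key2"),
  "outer")

// The join keeps both sides' key columns; coalesce picks whichever side
// is non-null, which reproduces the single key1/key2 columns of TableC.
val tableC = joined.select(
  coalesce(df1("key1"), df2("key1")).as("key1"),
  coalesce(df1("key2"), df2("key2")).as("key2"),
  df1("value1"),
  df2("value2"))
```

Note the caveat from the comment still applies to row 1: both sides' keys are null there, and `coalesce(null, null)` is null, which is exactly the key TableC wants.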
[jira] [Assigned] (SPARK-13613) Provide ignored tests to export test dataset into CSV format
[ https://issues.apache.org/jira/browse/SPARK-13613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-13613:
------------------------------------

    Assignee: Apache Spark

> Provide ignored tests to export test dataset into CSV format
> ------------------------------------------------------------
>
>                 Key: SPARK-13613
>                 URL: https://issues.apache.org/jira/browse/SPARK-13613
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: Yanbo Liang
>            Assignee: Apache Spark
>            Priority: Minor
>
> Provide ignored tests that export the test datasets into CSV format in
> LinearRegressionSuite, LogisticRegressionSuite, AFTSurvivalRegressionSuite
> and GeneralizedLinearRegressionSuite, so users can validate the training
> accuracy against R's glm, glmnet and survival packages.
[jira] [Assigned] (SPARK-13613) Provide ignored tests to export test dataset into CSV format
[ https://issues.apache.org/jira/browse/SPARK-13613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-13613:
------------------------------------

    Assignee:     (was: Apache Spark)
[jira] [Commented] (SPARK-13613) Provide ignored tests to export test dataset into CSV format
[ https://issues.apache.org/jira/browse/SPARK-13613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175156#comment-15175156 ]

Apache Spark commented on SPARK-13613:
--------------------------------------

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/11463
[jira] [Updated] (SPARK-13614) show() trigger memory leak
[ https://issues.apache.org/jira/browse/SPARK-13614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chillon_m updated SPARK-13614:
------------------------------
    Attachment: memory leak.png
                memory.png
[jira] [Commented] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal
[ https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175154#comment-15175154 ]

Zhong Wang commented on SPARK-13337:
------------------------------------

Suppose we have two tables:

TableA
||key1||key2||value1||
|null|k1|v1|
|k2|k3|v2|

TableB
||key1||key2||value2||
|null|k1|v3|
|k4|k5|v4|

The result table I want is:

TableC
||key1||key2||value1||value2||
|null|k1|v1|v3|
|k2|k3|v2|null|
|k4|k5|null|v4|

We cannot use the current join-using-columns interface, because it doesn't support null-safe joins, and we have null values in the first row.

We cannot use join-select with an explicit "<=>" either, because the output table will look like:

||df1.key1||df1.key2||df2.key1||df2.key2||value1||value2||
|null|k1|null|k1|v1|v3|
|k2|k3|null|null|v2|null|
|null|null|k4|k5|null|v4|

It is difficult to get a result like TableC using a select clause, because the null values from the outer join (rows 2 and 3) can appear in both the df1.* columns and the df2.* columns.

Hope this makes sense to you. I'd like to submit a PR if this is a real use case.
[jira] [Comment Edited] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal
[ https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175154#comment-15175154 ] Zhong Wang edited comment on SPARK-13337 at 3/2/16 6:50 AM: Suppose we have two tables: -- TableA ||key1||key2||value1|| |null|k1|v1| |k2|k3|v2| TableB ||key1||key2||value2|| |null|k1|v3| |k4|k5|v4| The result table I want is: -- TableC ||key1||key2||value1||value2|| |null|k1|v1|v3| |k2|k3|v2|null| |k4|k5|null|v4| We cannot use the current join-using-columns interface, because it doesn't support null-safe joins, and we have null values in the first row. We cannot use join-select with an explicit "<=>" either, because the output table will look like: -- ||df1.key1||df1.key2||df2.key1||df2.key2||value1||value2|| |null|k1|null|k1|v1|v3| |k2|k3|null|null|v2|null| |null|null|k4|k5|null|v4| It is difficult to get a result like TableC using a select clause, because the null values from the outer join (rows 2 & 3) can be in both the df1.* columns and the df2.* columns. Hope this makes sense to you. I'd like to submit a PR if this is a real use case. was (Author: zwang): suppose we have two tables: -- TableA ||key1||key2||value1|| |null|k1|v1| |k2|k3|v2| TableB ||key1||key2||value2|| |null|k1|v3| |k4|k5|v4| The result table I want is: -- TableC ||key1||key2||value1||value2|| |null|k1|v1|v3| |k2|k3|v2|null| |k4|k5|null|v4| We cannot use the current join-using-columns interface, because it doesn't support null-safe joins, and we have null values in the first row We cannot use join-select with explicit "<=>" neither, because the output table will be like: -- ||df1.key1||df1.key2||df2.key1||df2.key2||value1||value2|| |null|k1|null|k1|v1|v3| |k2|k3|null|null|v2|null| null|null|k4|k5|null|v4| it is difficult to get the result like TableC using select cause, because the null values from outer join (row 2 & 3) can be in both df1.* columns and df2.* columns Hope this makes sense to you.
I'd like to submit a pr if this is a real use case > DataFrame join-on-columns function should support null-safe equal > - > > Key: SPARK-13337 > URL: https://issues.apache.org/jira/browse/SPARK-13337 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Zhong Wang >Priority: Minor > > Currently, the join-on-columns function: > {code} > def join(right: DataFrame, usingColumns: Seq[String], joinType: String): > DataFrame > {code} > performs a null-unsafe join. It would be great if there were an option for > null-safe join. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13614) show() trigger memory leak
chillon_m created SPARK-13614: - Summary: show() trigger memory leak Key: SPARK-13614 URL: https://issues.apache.org/jira/browse/SPARK-13614 Project: Spark Issue Type: Question Components: SQL Affects Versions: 1.5.2 Reporter: chillon_m -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-13608) py4j.Py4JException: Method createDirectStream([class org.apache.spark.streaming.api.java.JavaStreamingContext, class java.util.HashMap, class java.util.HashSet, class jav
[ https://issues.apache.org/jira/browse/SPARK-13608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao closed SPARK-13608. --- Resolution: Not A Problem > py4j.Py4JException: Method createDirectStream([class > org.apache.spark.streaming.api.java.JavaStreamingContext, class > java.util.HashMap, class java.util.HashSet, class java.util.HashMap]) does > not exist > - > > Key: SPARK-13608 > URL: https://issues.apache.org/jira/browse/SPARK-13608 > Project: Spark > Issue Type: Bug >Reporter: Avatar Zhang > > py4j.Py4JException: Method createDirectStream([class > org.apache.spark.streaming.api.java.JavaStreamingContext, class > java.util.HashMap, class java.util.HashSet, class java.util.HashMap]) does > not exist > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333) > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342) > at py4j.Gateway.invoke(Gateway.java:252) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:207) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13608) py4j.Py4JException: Method createDirectStream([class org.apache.spark.streaming.api.java.JavaStreamingContext, class java.util.HashMap, class java.util.HashSet, class
[ https://issues.apache.org/jira/browse/SPARK-13608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175148#comment-15175148 ] Avatar Zhang commented on SPARK-13608: -- I used a bad version of spark-streaming-kafka-assembly. Thanks; problem resolved. > py4j.Py4JException: Method createDirectStream([class > org.apache.spark.streaming.api.java.JavaStreamingContext, class > java.util.HashMap, class java.util.HashSet, class java.util.HashMap]) does > not exist > - > > Key: SPARK-13608 > URL: https://issues.apache.org/jira/browse/SPARK-13608 > Project: Spark > Issue Type: Bug >Reporter: Avatar Zhang > > py4j.Py4JException: Method createDirectStream([class > org.apache.spark.streaming.api.java.JavaStreamingContext, class > java.util.HashMap, class java.util.HashSet, class java.util.HashMap]) does > not exist > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333) > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342) > at py4j.Gateway.invoke(Gateway.java:252) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:207) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13608) py4j.Py4JException: Method createDirectStream([class org.apache.spark.streaming.api.java.JavaStreamingContext, class java.util.HashMap, class java.util.HashSet, class
[ https://issues.apache.org/jira/browse/SPARK-13608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175144#comment-15175144 ] Saisai Shao commented on SPARK-13608: - Hi [~avatarzhang] , would you please elaborate your problem, how you use this API, and do you have a Spark Streaming Kafka assembly jar loaded in the environment? > py4j.Py4JException: Method createDirectStream([class > org.apache.spark.streaming.api.java.JavaStreamingContext, class > java.util.HashMap, class java.util.HashSet, class java.util.HashMap]) does > not exist > - > > Key: SPARK-13608 > URL: https://issues.apache.org/jira/browse/SPARK-13608 > Project: Spark > Issue Type: Bug >Reporter: Avatar Zhang > > py4j.Py4JException: Method createDirectStream([class > org.apache.spark.streaming.api.java.JavaStreamingContext, class > java.util.HashMap, class java.util.HashSet, class java.util.HashMap]) does > not exist > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333) > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342) > at py4j.Gateway.invoke(Gateway.java:252) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:207) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13613) Provide ignored tests to export test dataset into CSV format
Yanbo Liang created SPARK-13613: --- Summary: Provide ignored tests to export test dataset into CSV format Key: SPARK-13613 URL: https://issues.apache.org/jira/browse/SPARK-13613 Project: Spark Issue Type: Improvement Components: ML Reporter: Yanbo Liang Priority: Minor Provide ignored test to export the test dataset into CSV format in LinearRegressionSuite, LogisticRegressionSuite, AFTSurvivalRegressionSuite and GeneralizedLinearRegressionSuite, so users can validate the training accuracy compared with R's glm, glmnet and survival package. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
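The idea is simply to dump each suite's generated dataset as CSV so the same data can be loaded into R for cross-validation. A minimal Python sketch of such an export follows; the column names and rows are made up for illustration (the real tests would export Spark-generated DataFrames):

```python
import csv
import io

# Hypothetical (label, x1, x2) rows standing in for a suite's generated dataset
rows = [(1.0, 0.5, 2.0), (0.0, 1.5, 3.0)]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["label", "x1", "x2"])  # header so R's read.csv picks up column names
writer.writerows(rows)
csv_text = buf.getvalue()
```

On the R side, something like `df <- read.csv("data.csv"); glm(label ~ ., data = df)` would then produce coefficients to compare against the Spark fit.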
[jira] [Commented] (SPARK-13219) Pushdown predicate propagation in SparkSQL with join
[ https://issues.apache.org/jira/browse/SPARK-13219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175107#comment-15175107 ] Xiao Li commented on SPARK-13219: - Hi [~velvia], after a discussion with Michael, he prefers enhancing the existing Constraints framework to resolve this issue. I will reimplement the whole thing based on the new framework. Thanks! > Pushdown predicate propagation in SparkSQL with join > > > Key: SPARK-13219 > URL: https://issues.apache.org/jira/browse/SPARK-13219 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.1, 1.5.2, 1.6.0 > Environment: Spark 1.4 > Datastax Spark connector 1.4 > Cassandra 2.1.12 > Centos 6.6 >Reporter: Abhinav Chawade > > When two or more tables are joined in SparkSQL and the query has an equality clause > on the attributes used to perform the join, it is useful to apply that > clause to the scans of both tables. If this is not done, one of the tables > results in a full scan, which can slow the query dramatically. Consider the > following example with two tables being joined. > {code} > CREATE TABLE assets ( > assetid int PRIMARY KEY, > address text, > propertyname text > ) > CREATE TABLE tenants ( > assetid int PRIMARY KEY, > name text > ) > spark-sql> explain select t.name from tenants t, assets a where a.assetid = > t.assetid and t.assetid='1201'; > WARN 2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to > load native-hadoop library for your platform... 
using builtin-java classes > where applicable > == Physical Plan == > Project [name#14] > ShuffledHashJoin [assetid#13], [assetid#15], BuildRight > Exchange (HashPartitioning 200) >Filter (CAST(assetid#13, DoubleType) = 1201.0) > HiveTableScan [assetid#13,name#14], (MetastoreRelation element, tenants, > Some(t)), None > Exchange (HashPartitioning 200) >HiveTableScan [assetid#15], (MetastoreRelation element, assets, Some(a)), > None > Time taken: 1.354 seconds, Fetched 8 row(s) > {code} > The simple workaround is to add another equality condition for each table but > it becomes cumbersome. It will be helpful if the query planner could improve > filter propagation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
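The requested optimization amounts to propagating an equality filter to every column in the same join-equivalence class, so that both scans receive the predicate. A toy model of that inference (not Catalyst code; column names follow the example query above):

```python
def propagate_filters(filters, join_equalities):
    """Given {column: constant} filters and join conditions (colA = colB),
    extend the filters to every column provably equal to a filtered one."""
    # Build equivalence classes of columns from the join conditions
    classes = []
    for a, b in join_equalities:
        hit = [c for c in classes if a in c or b in c]
        merged = {a, b}.union(*hit) if hit else {a, b}
        classes = [c for c in classes if c not in hit] + [merged]
    propagated = dict(filters)
    for col, const in filters.items():
        for cls in classes:
            if col in cls:
                for other in cls:
                    propagated[other] = const
    return propagated
```

With the example query's predicate, the filter on `t.assetid` would also be applied to `a.assetid`, so neither table needs a full scan.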
[jira] [Commented] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal
[ https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175097#comment-15175097 ] Xiao Li commented on SPARK-13337: - What are the null columns? If you are using full outer joins, any column in the join's result set could contain nulls. > DataFrame join-on-columns function should support null-safe equal > - > > Key: SPARK-13337 > URL: https://issues.apache.org/jira/browse/SPARK-13337 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Zhong Wang >Priority: Minor > > Currently, the join-on-columns function: > {code} > def join(right: DataFrame, usingColumns: Seq[String], joinType: String): > DataFrame > {code} > performs a null-unsafe join. It would be great if there were an option for > null-safe join. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12941) Spark-SQL JDBC Oracle dialect fails to map string datatypes to Oracle VARCHAR datatype
[ https://issues.apache.org/jira/browse/SPARK-12941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175094#comment-15175094 ] Apache Spark commented on SPARK-12941: -- User 'thomastechs' has created a pull request for this issue: https://github.com/apache/spark/pull/11462 > Spark-SQL JDBC Oracle dialect fails to map string datatypes to Oracle VARCHAR > datatype > -- > > Key: SPARK-12941 > URL: https://issues.apache.org/jira/browse/SPARK-12941 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1 > Environment: Apache Spark 1.4.2.2 >Reporter: Jose Martinez Poblete >Assignee: Thomas Sebastian > Fix For: 1.4.2, 1.5.3, 1.6.2, 2.0.0 > > > When exporting data from Spark to Oracle, string datatypes are translated to > TEXT for Oracle, this is leading to the following error > {noformat} > java.sql.SQLSyntaxErrorException: ORA-00902: invalid datatype > {noformat} > As per the following code: > https://github.com/apache/spark/blob/branch-1.4/sql/core/src/main/scala/org/apache/spark/sql/jdbc/jdbc.scala#L144 > See also: > http://stackoverflow.com/questions/31287182/writing-to-oracle-database-using-apache-spark-1-4-0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
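A JDBC dialect's job here is just to translate engine-side types into database column types. Oracle has no TEXT type, so StringType must map to VARCHAR2. The mapping below is an illustrative sketch, not Spark's actual OracleDialect (the sizes and type names on the left are assumptions for the example):

```python
# Simplified model of a JDBC dialect's type mapping for Oracle.
ORACLE_TYPE_MAP = {
    "StringType": "VARCHAR2(255)",
    "BooleanType": "NUMBER(1)",
    "IntegerType": "NUMBER(10)",
    "LongType": "NUMBER(19)",
    "DoubleType": "NUMBER",
}

def oracle_column_type(spark_type, default="TEXT"):
    # Falling back to the generic default is exactly what produced the
    # ORA-00902 "invalid datatype" error described in the report.
    return ORACLE_TYPE_MAP.get(spark_type, default)
```

The linked fix registers an Oracle-specific override so string columns never reach the generic TEXT fallback.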
[jira] [Commented] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame
[ https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175095#comment-15175095 ] Xiao Li commented on SPARK-13393: - Thank you! [~adrian-wang] Sorry, [~srinathsmn] I missed your reply. > Column mismatch issue in left_outer join using Spark DataFrame > -- > > Key: SPARK-13393 > URL: https://issues.apache.org/jira/browse/SPARK-13393 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Varadharajan > > Consider the below snippet: > {code:title=test.scala|borderStyle=solid} > case class Person(id: Int, name: String) > val df = sc.parallelize(List( > Person(1, "varadha"), > Person(2, "nagaraj") > )).toDF > val varadha = df.filter("id = 1") > val errorDF = df.join(varadha, df("id") === varadha("id"), > "left_outer").select(df("id"), varadha("id") as "varadha_id") > val nagaraj = df.filter("id = 2").select(df("id") as "n_id") > val correctDF = df.join(nagaraj, df("id") === nagaraj("n_id"), > "left_outer").select(df("id"), nagaraj("n_id") as "nagaraj_id") > {code} > The `errorDF` dataframe, after the left join is messed up and shows as below: > | id|varadha_id| > | 1| 1| > | 2| 2 (*This should've been null*)| > whereas correctDF has the correct output after the left join: > | id|nagaraj_id| > | 1| null| > | 2| 2| -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
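The workaround embedded in the `correctDF` example — rename the join column on the filtered side so the two sides cannot resolve to the same column — can be mimicked with plain Python rows. This is a sketch of the renaming pattern only, not of Spark's column-resolution logic:

```python
def left_outer(left, right, lkey, rkey):
    """Minimal left outer join over lists of dict rows."""
    out = []
    for lrow in left:
        matches = [r for r in right if r[rkey] == lrow[lkey]]
        if matches:
            out.extend({**lrow, **r} for r in matches)
        else:
            out.append({**lrow, rkey: None})
    return out

people = [{"id": 1, "name": "varadha"}, {"id": 2, "name": "nagaraj"}]
# Rename id -> n_id on the filtered side, as the correctDF snippet does
subset = [{"n_id": row["id"]} for row in people if row["id"] == 2]
joined = left_outer(people, subset, "id", "n_id")
```

Because the filtered side exposes only `n_id`, there is no ambiguity about which "id" each output column refers to, and the unmatched row correctly gets a null.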
[jira] [Commented] (SPARK-13573) Open SparkR APIs (R package) to allow better 3rd party usage
[ https://issues.apache.org/jira/browse/SPARK-13573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175093#comment-15175093 ] Chip Senkbeil commented on SPARK-13573: --- I'd gladly create a PR with the changes if needed. We haven't synced with Spark 1.6.0+ yet, so it'd just take me a little bit to get up to speed. Other than the one new method to enable connecting without creating a Spark Context, it's just exporting functions and switching the RBackend class to be public. > Open SparkR APIs (R package) to allow better 3rd party usage > > > Key: SPARK-13573 > URL: https://issues.apache.org/jira/browse/SPARK-13573 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Chip Senkbeil > > Currently, SparkR's R package does not expose enough of its APIs to be used > flexibly. That I am aware of, SparkR still requires you to create a new > SparkContext by invoking the sparkR.init method (so you cannot connect to a > running one) and there is no way to invoke custom Java methods using the > exposed SparkR API (unlike PySpark). > We currently maintain a fork of SparkR that is used to power the R > implementation of Apache Toree, which is a gateway to use Apache Spark. This > fork provides a connect method (to use an existing Spark Context), exposes > needed methods like invokeJava (to be able to communicate with our JVM to > retrieve code to run, etc), and uses reflection to access > org.apache.spark.api.r.RBackend. > Here is the documentation I recorded regarding changes we need to enable > SparkR as an option for Apache Toree: > https://github.com/apache/incubator-toree/tree/master/sparkr-interpreter/src/main/resources -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13573) Open SparkR APIs (R package) to allow better 3rd party usage
[ https://issues.apache.org/jira/browse/SPARK-13573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175089#comment-15175089 ] Chip Senkbeil commented on SPARK-13573: --- In terms of the JVM class whose methods we are invoking, the majority can be found here: https://github.com/apache/incubator-toree/blob/master/kernel-api/src/main/scala/org/apache/toree/interpreter/broker/BrokerState.scala#L90 We basically maintain an object that acts as a code queue where the SparkR process pulls off code to evaluate and then sends back results as strings. We also had to write a wrapper for the RBackend since it was package protected: https://github.com/apache/incubator-toree/blob/master/sparkr-interpreter/src/main/scala/org/apache/toree/kernel/interpreter/sparkr/ReflectiveRBackend.scala > Open SparkR APIs (R package) to allow better 3rd party usage > > > Key: SPARK-13573 > URL: https://issues.apache.org/jira/browse/SPARK-13573 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Chip Senkbeil > > Currently, SparkR's R package does not expose enough of its APIs to be used > flexibly. That I am aware of, SparkR still requires you to create a new > SparkContext by invoking the sparkR.init method (so you cannot connect to a > running one) and there is no way to invoke custom Java methods using the > exposed SparkR API (unlike PySpark). > We currently maintain a fork of SparkR that is used to power the R > implementation of Apache Toree, which is a gateway to use Apache Spark. This > fork provides a connect method (to use an existing Spark Context), exposes > needed methods like invokeJava (to be able to communicate with our JVM to > retrieve code to run, etc), and uses reflection to access > org.apache.spark.api.r.RBackend. 
> Here is the documentation I recorded regarding changes we need to enable > SparkR as an option for Apache Toree: > https://github.com/apache/incubator-toree/tree/master/sparkr-interpreter/src/main/resources -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13607) Improves compression performance for integer-typed values on cache to reduce GC pressure
[ https://issues.apache.org/jira/browse/SPARK-13607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13607: Assignee: (was: Apache Spark) > Improves compression performance for integer-typed values on cache to reduce > GC pressure > > > Key: SPARK-13607 > URL: https://issues.apache.org/jira/browse/SPARK-13607 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Takeshi Yamamuro > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13607) Improves compression performance for integer-typed values on cache to reduce GC pressure
[ https://issues.apache.org/jira/browse/SPARK-13607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175085#comment-15175085 ] Apache Spark commented on SPARK-13607: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/11461 > Improves compression performance for integer-typed values on cache to reduce > GC pressure > > > Key: SPARK-13607 > URL: https://issues.apache.org/jira/browse/SPARK-13607 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Takeshi Yamamuro > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13607) Improves compression performance for integer-typed values on cache to reduce GC pressure
[ https://issues.apache.org/jira/browse/SPARK-13607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13607: Assignee: Apache Spark > Improves compression performance for integer-typed values on cache to reduce > GC pressure > > > Key: SPARK-13607 > URL: https://issues.apache.org/jira/browse/SPARK-13607 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Takeshi Yamamuro >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13573) Open SparkR APIs (R package) to allow better 3rd party usage
[ https://issues.apache.org/jira/browse/SPARK-13573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175079#comment-15175079 ] Chip Senkbeil commented on SPARK-13573: --- [~sunrui], IIRC, Toree supported SparkR from 1.4.x and 1.5.x. Just a bit of a pain to keep in sync. So, the process by which Toree uses these methods to interact with SparkR is as follows: # We added a SparkR.connect method (https://github.com/apache/incubator-toree/blob/master/sparkr-interpreter/src/main/resources/R/pkg/R/sparkR.R#L220) that uses the EXISTING_SPARKR_BACKEND_PORT to connect to an R backend but does not attempt to initialize the Spark Context # We use the exposed callJStatic to acquire a reference to a Java (well, Scala) object that has additional variables like the Spark Context hanging off of it (https://github.com/apache/incubator-toree/blob/master/sparkr-interpreter/src/main/resources/kernelR/sparkr_runner.R#L50) {code}# Retrieve the bridge used to perform actions on the JVM bridge <- callJStatic( "org.apache.toree.kernel.interpreter.sparkr.SparkRBridge", "sparkRBridge" ) # Retrieve the state used to pull code off the JVM and push results back state <- callJMethod(bridge, "state") # Acquire the kernel API instance to expose kernel <- callJMethod(bridge, "kernel") assign("kernel", kernel, .runnerEnv){code} # We then invoke methods using callJMethod to get the next string of R code to evaluate {code}# Load the container of the code codeContainer <- callJMethod(state, "nextCode") # If not a valid result, wait 1 second and try again if (!class(codeContainer) == "jobj") { Sys.sleep(1) next() } # Retrieve the code id (for response) and code codeId <- callJMethod(codeContainer, "codeId") code <- callJMethod(codeContainer, "code"){code} # Finally, we evaluate the acquired code string and send the results back to our running JVM (which represents a Jupyter kernel) {code} # Parse the code into an expression to be evaluated codeExpr <- parse(text = code) 
print(paste("Code expr", codeExpr)) tryCatch({ # Evaluate the code provided and capture the result as a string result <- capture.output(eval(codeExpr, envir = .runnerEnv)) print(paste("Result type", class(result), length(result))) print(paste("Success", codeId, result)) # Mark the execution as a success and send back the result # If output is null/empty, ensure that we can send it (otherwise fails) if (is.null(result) || length(result) <= 0) { print("Marking success with no output") callJMethod(state, "markSuccess", codeId) } else { # Clean the result before sending it back cleanedResult <- trimws(flatten(result, shouldTrim = FALSE)) print(paste("Marking success with output:", cleanedResult)) callJMethod(state, "markSuccess", codeId, cleanedResult) } }, error = function(ex) { # Mark the execution as a failure and send back the error print(paste("Failure", codeId, toString(ex))) callJMethod(state, "markFailure", codeId, toString(ex)) }){code} > Open SparkR APIs (R package) to allow better 3rd party usage > > > Key: SPARK-13573 > URL: https://issues.apache.org/jira/browse/SPARK-13573 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Chip Senkbeil > > Currently, SparkR's R package does not expose enough of its APIs to be used > flexibly. That I am aware of, SparkR still requires you to create a new > SparkContext by invoking the sparkR.init method (so you cannot connect to a > running one) and there is no way to invoke custom Java methods using the > exposed SparkR API (unlike PySpark). > We currently maintain a fork of SparkR that is used to power the R > implementation of Apache Toree, which is a gateway to use Apache Spark. This > fork provides a connect method (to use an existing Spark Context), exposes > needed methods like invokeJava (to be able to communicate with our JVM to > retrieve code to run, etc), and uses reflection to access > org.apache.spark.api.r.RBackend. 
> Here is the documentation I recorded regarding changes we need to enable > SparkR as an option for Apache Toree: > https://github.com/apache/incubator-toree/tree/master/sparkr-interpreter/src/main/resources -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
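The code-queue protocol described in these comments — one side pushes code strings, the other pulls them by id, evaluates, and reports success or failure — can be summarized with a small toy class. This is an illustration of the pattern, not Toree's actual BrokerState API (the method names are chosen to mirror the R calls above):

```python
import itertools
import queue

class CodeQueue:
    """One side pushes code strings; the other pulls them, evaluates them,
    and reports a result keyed by the code id."""
    def __init__(self):
        self._pending = queue.Queue()
        self._results = {}
        self._ids = itertools.count()

    def push_code(self, code):
        code_id = next(self._ids)
        self._pending.put((code_id, code))
        return code_id

    def next_code(self):
        # Returns None when nothing is queued, like the R loop's retry case
        return None if self._pending.empty() else self._pending.get()

    def mark_success(self, code_id, output=""):
        self._results[code_id] = ("success", output)

    def mark_failure(self, code_id, error=""):
        self._results[code_id] = ("failure", error)

    def result(self, code_id):
        return self._results.get(code_id)
```

The R runner loop maps onto this directly: `nextCode` with a sleep-and-retry when empty, then `markSuccess`/`markFailure` with the captured output.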
[jira] [Updated] (SPARK-13435) Add Weighted Cohen's kappa to MulticlassMetrics
[ https://issues.apache.org/jira/browse/SPARK-13435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13435: -- Shepherd: (was: Xiangrui Meng) > Add Weighted Cohen's kappa to MulticlassMetrics > --- > > Key: SPARK-13435 > URL: https://issues.apache.org/jira/browse/SPARK-13435 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: zhengruifeng >Priority: Minor > > Add the missing Weighted Cohen's kappa to MulticlassMetrics. > Kappa is widely used in competitions and statistics. > https://en.wikipedia.org/wiki/Cohen's_kappa > Some usage examples: > val metrics = new MulticlassMetrics(predictionAndLabels) > // The default kappa value (unweighted kappa) > val kappa = metrics.kappa > // Three built-in weighting types ("default":unweighted, "linear":linear > weighted, "quadratic":quadratic weighted) > val kappa = metrics.kappa("quadratic") > // User-defined weighting matrix > val matrix = Matrices.dense(n, n, values) > val kappa = metrics.kappa(matrix) > // User-defined weighting function > def getWeight(i: Int, j:Int):Double = { > if (i == j) { > 0.0 > } else { > 1.0 > } > } > val kappa = metrics.kappa(getWeight) // equals the unweighted kappa > The calculation correctness was tested on several small datasets and compared > against two Python packages: sklearn and ml_metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
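For reference, the metric itself is straightforward to compute from a confusion matrix. Below is a from-scratch Python sketch of (weighted) Cohen's kappa covering the weighting types the ticket lists; it is illustrative only, not the proposed MulticlassMetrics API:

```python
def weighted_kappa(y_true, y_pred, labels, weight="default"):
    """Cohen's kappa with "default" (unweighted), "linear", or
    "quadratic" disagreement weights."""
    n = len(labels)
    idx = {label: i for i, label in enumerate(labels)}
    # Observed confusion matrix
    obs = [[0.0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        obs[idx[t]][idx[p]] += 1
    total = float(len(y_true))
    row = [sum(obs[i]) for i in range(n)]                       # true-label marginals
    col = [sum(obs[i][j] for i in range(n)) for j in range(n)]  # predicted marginals

    def w(i, j):
        if weight == "linear":
            return float(abs(i - j))
        if weight == "quadratic":
            return float((i - j) ** 2)
        return 0.0 if i == j else 1.0  # unweighted

    observed = sum(w(i, j) * obs[i][j] for i in range(n) for j in range(n))
    expected = sum(w(i, j) * row[i] * col[j] / total
                   for i in range(n) for j in range(n))
    return 1.0 - observed / expected
```

Perfect agreement gives 1.0 for any weighting, and with the unweighted setting this reduces to the familiar (p_o - p_e) / (1 - p_e) form.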
[jira] [Comment Edited] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API
[ https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175066#comment-15175066 ] Reynold Xin edited comment on SPARK-12177 at 3/2/16 5:32 AM: - This thread is getting too long for me to follow, but my instinct is that maybe we should have two subprojects and support both. Otherwise it is very bad for Kafka 0.8 users when upgrading to Spark 2.0. It's much more difficult to upgrade Kafka, which is a message bus, than just to upgrade Spark. was (Author: rxin): This thread is getting too long for me to follow, but my instinct is that maybe we should have two subprojects and support both. > Update KafkaDStreams to new Kafka 0.9 Consumer API > -- > > Key: SPARK-12177 > URL: https://issues.apache.org/jira/browse/SPARK-12177 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Nikita Tarasenko > Labels: consumer, kafka > > Kafka 0.9 has already been released and it introduces a new consumer API that is > not compatible with the old one. So, I added the new consumer API. I made separate > classes in package org.apache.spark.streaming.kafka.v09 with the changed API. I > didn't remove the old classes, for better backward compatibility. Users will not need > to change their old Spark applications when they upgrade to the new Spark version. > Please review my changes -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API
[ https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175066#comment-15175066 ] Reynold Xin commented on SPARK-12177: - This thread is getting too long for me to follow, but my instinct is that maybe we should have two subprojects and support both. > Update KafkaDStreams to new Kafka 0.9 Consumer API > -- > > Key: SPARK-12177 > URL: https://issues.apache.org/jira/browse/SPARK-12177 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Nikita Tarasenko > Labels: consumer, kafka > > Kafka 0.9 has already been released and it introduces a new consumer API that is > not compatible with the old one. So, I added the new consumer API. I made separate > classes in package org.apache.spark.streaming.kafka.v09 with the changed API. I > didn't remove the old classes, for better backward compatibility. Users will not need > to change their old Spark applications when they upgrade to the new Spark version. > Please review my changes -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13322) AFTSurvivalRegression should support feature standardization
[ https://issues.apache.org/jira/browse/SPARK-13322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13322: -- Shepherd: DB Tsai (was: Xiangrui Meng) > AFTSurvivalRegression should support feature standardization > > > Key: SPARK-13322 > URL: https://issues.apache.org/jira/browse/SPARK-13322 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang > > This bug was reported by Stuti Awasthi. > https://www.mail-archive.com/user@spark.apache.org/msg45643.html > The lossSum can become infinite because we do not standardize the features > before fitting the model; we should support feature standardization. Another > benefit is that standardization will improve the convergence rate. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
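Standardization here just means z-scoring each feature column before optimization. A minimal sketch of the transformation (illustrative only, not the AFTSurvivalRegression implementation):

```python
import math

def standardize_columns(columns):
    """Rescale each feature column to mean 0 and (population) std 1."""
    out = []
    for col in columns:
        mean = sum(col) / len(col)
        std = math.sqrt(sum((x - mean) ** 2 for x in col) / len(col))
        # A constant column has std 0; map it to all zeros instead of dividing
        out.append([(x - mean) / std if std > 0 else 0.0 for x in col])
    return out
```

On a badly scaled column such as [1e6, 2e6, 3e6], optimizing on the raw values can blow up the loss, while the standardized column stays in a small range; the fitted coefficients can then be mapped back to the original scale after training.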
[jira] [Updated] (SPARK-13010) Survival analysis in SparkR
[ https://issues.apache.org/jira/browse/SPARK-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13010: -- Shepherd: yuhao yang (was: Xiangrui Meng) > Survival analysis in SparkR > --- > > Key: SPARK-13010 > URL: https://issues.apache.org/jira/browse/SPARK-13010 > Project: Spark > Issue Type: New Feature > Components: ML, SparkR >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > > Implement a simple wrapper of AFTSurvivalRegression in SparkR to support > survival analysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13008) Make ML Python package all list have one algorithm per line
[ https://issues.apache.org/jira/browse/SPARK-13008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-13008. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10927 [https://github.com/apache/spark/pull/10927] > Make ML Python package all list have one algorithm per line > --- > > Key: SPARK-13008 > URL: https://issues.apache.org/jira/browse/SPARK-13008 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Trivial > Fix For: 2.0.0 > > > This is to fix a long-time annoyance: Whenever we add a new algorithm to > pyspark.ml, we have to add it to the {{__all__}} list at the top. Since we > keep it alphabetized, it often creates a lot more changes than needed. It is > also easy to add the Estimator and forget the Model. I'm going to switch it > to have one algorithm per line. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame
[ https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175052#comment-15175052 ] Varadharajan commented on SPARK-13393: -- [~adrian-wang] Thanks a lot :) > Column mismatch issue in left_outer join using Spark DataFrame > -- > > Key: SPARK-13393 > URL: https://issues.apache.org/jira/browse/SPARK-13393 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Varadharajan > > Consider the below snippet: > {code:title=test.scala|borderStyle=solid} > case class Person(id: Int, name: String) > val df = sc.parallelize(List( > Person(1, "varadha"), > Person(2, "nagaraj") > )).toDF > val varadha = df.filter("id = 1") > val errorDF = df.join(varadha, df("id") === varadha("id"), > "left_outer").select(df("id"), varadha("id") as "varadha_id") > val nagaraj = df.filter("id = 2").select(df("id") as "n_id") > val correctDF = df.join(nagaraj, df("id") === nagaraj("n_id"), > "left_outer").select(df("id"), nagaraj("n_id") as "nagaraj_id") > {code} > The `errorDF` dataframe, after the left join is messed up and shows as below: > | id|varadha_id| > | 1| 1| > | 2| 2 (*This should've been null*)| > whereas correctDF has the correct output after the left join: > | id|nagaraj_id| > | 1| null| > | 2| 2| -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175045#comment-15175045 ] Jeff Zhang commented on SPARK-13587: spark.pyspark.virtualenv.requirements is a local file (which would be distributed to all nodes). Regarding upgrading these to first-class citizens, I would be conservative about that; it needs more feedback from other users. > Support virtualenv in PySpark > - > > Key: SPARK-13587 > URL: https://issues.apache.org/jira/browse/SPARK-13587 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Jeff Zhang > > Currently, it's not easy for user to add third party python packages in > pyspark. > * One way is to using --py-files (suitable for simple dependency, but not > suitable for complicated dependency, especially with transitive dependency) > * Another way is install packages manually on each node (time wasting, and > not easy to switch to different environment) > Python has now 2 different virtualenv implementation. One is native > virtualenv another is through conda. This jira is trying to migrate these 2 > tools to distributed environment -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13612) Multiplication of BigDecimal columns not working as expected
Varadharajan created SPARK-13612: Summary: Multiplication of BigDecimal columns not working as expected Key: SPARK-13612 URL: https://issues.apache.org/jira/browse/SPARK-13612 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.0 Reporter: Varadharajan Please consider the below snippet: {code} case class AM(id: Int, a: BigDecimal) case class AX(id: Int, b: BigDecimal) val x = sc.parallelize(List(AM(1, 10))).toDF val y = sc.parallelize(List(AX(1, 10))).toDF x.join(y, x("id") === y("id")).withColumn("z", x("a") * y("b")).show {code} output: {code} | id| a| id| b| z| | 1|10.00...| 1|10.00...|null| {code} Here the multiplication of the columns ("z") returns null instead of 100. As of now we are using the below workaround, but this definitely looks like a serious issue. {code} x.join(y, x("id") === y("id")).withColumn("z", x("a") / (expr("1") / y("b"))).show {code} {code} | id| a| id| b| z| | 1|10.00...| 1|10.00...|100.0...| {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
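A plausible explanation for the null (an inference, not confirmed in this report): Spark 1.6 maps Scala BigDecimal columns to Decimal(38,18), and multiplying two scale-18 decimals produces an exact result of scale 36 whose total digit count exceeds the 38-digit maximum, so the overflowing value comes back as null. Plain JVM BigDecimal arithmetic shows the precision growth:

```scala
import java.math.BigDecimal

// Two values as Spark 1.6 would store them for a Decimal(38,18) column:
// the literal 10 carried at scale 18.
val a = new BigDecimal("10").setScale(18)
val b = new BigDecimal("10").setScale(18)

// Exact multiplication adds the scales: 18 + 18 = 36 fractional digits.
val product = a.multiply(b)

// The unscaled value of 100 followed by 36 fractional digits is 10^38,
// i.e. 39 significant digits -- one more than Decimal(38, _) can hold.
println(s"scale=${product.scale}, precision=${product.precision}")
```

The division workaround in the report presumably sidesteps this because division is assigned a different (smaller-scale) result type, so the value still fits.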
[jira] [Commented] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame
[ https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175034#comment-15175034 ] Adrian Wang commented on SPARK-13393: - [~srinathsmn] I have identified the issue, and working on this. > Column mismatch issue in left_outer join using Spark DataFrame > -- > > Key: SPARK-13393 > URL: https://issues.apache.org/jira/browse/SPARK-13393 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Varadharajan > > Consider the below snippet: > {code:title=test.scala|borderStyle=solid} > case class Person(id: Int, name: String) > val df = sc.parallelize(List( > Person(1, "varadha"), > Person(2, "nagaraj") > )).toDF > val varadha = df.filter("id = 1") > val errorDF = df.join(varadha, df("id") === varadha("id"), > "left_outer").select(df("id"), varadha("id") as "varadha_id") > val nagaraj = df.filter("id = 2").select(df("id") as "n_id") > val correctDF = df.join(nagaraj, df("id") === nagaraj("n_id"), > "left_outer").select(df("id"), nagaraj("n_id") as "nagaraj_id") > {code} > The `errorDF` dataframe, after the left join is messed up and shows as below: > | id|varadha_id| > | 1| 1| > | 2| 2 (*This should've been null*)| > whereas correctDF has the correct output after the left join: > | id|nagaraj_id| > | 1| null| > | 2| 2| -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame
[ https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175033#comment-15175033 ] Varadharajan commented on SPARK-13393: -- [~rxin] [~marmbrus] Can you share some inputs on this? > Column mismatch issue in left_outer join using Spark DataFrame > -- > > Key: SPARK-13393 > URL: https://issues.apache.org/jira/browse/SPARK-13393 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Varadharajan > > Consider the below snippet: > {code:title=test.scala|borderStyle=solid} > case class Person(id: Int, name: String) > val df = sc.parallelize(List( > Person(1, "varadha"), > Person(2, "nagaraj") > )).toDF > val varadha = df.filter("id = 1") > val errorDF = df.join(varadha, df("id") === varadha("id"), > "left_outer").select(df("id"), varadha("id") as "varadha_id") > val nagaraj = df.filter("id = 2").select(df("id") as "n_id") > val correctDF = df.join(nagaraj, df("id") === nagaraj("n_id"), > "left_outer").select(df("id"), nagaraj("n_id") as "nagaraj_id") > {code} > The `errorDF` dataframe, after the left join is messed up and shows as below: > | id|varadha_id| > | 1| 1| > | 2| 2 (*This should've been null*)| > whereas correctDF has the correct output after the left join: > | id|nagaraj_id| > | 1| null| > | 2| 2| -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175018#comment-15175018 ] Mike Sukmanowsky commented on SPARK-13587: -- One thought that just occurred to me, does {{spark.pyspark.virtualenv.requirements}} point to a path on the master node for a requirements file? It'd make sense if that was the case and then the requirements file was shipped to other nodes instead of assuming that this file existed on all Spark nodes at the same location. Also might be a good idea to upgrade these to first-class citizens of spark-submit by supporting them as optional params instead of config properties. I'd go so far as to say it makes sense to deprecate {{--py-files}} in favour of: * {{--py-venv-type=conda}} * {{--py-venv-bin=/path/to/conda}} * {{--py-venv-requirements=/local/path/to/requirements.txt}} > Support virtualenv in PySpark > - > > Key: SPARK-13587 > URL: https://issues.apache.org/jira/browse/SPARK-13587 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Jeff Zhang > > Currently, it's not easy for user to add third party python packages in > pyspark. > * One way is to using --py-files (suitable for simple dependency, but not > suitable for complicated dependency, especially with transitive dependency) > * Another way is install packages manually on each node (time wasting, and > not easy to switch to different environment) > Python has now 2 different virtualenv implementation. One is native > virtualenv another is through conda. This jira is trying to migrate these 2 > tools to distributed environment -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13609) Support Column Pruning for MapPartitions
[ https://issues.apache.org/jira/browse/SPARK-13609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13609: Assignee: Apache Spark > Support Column Pruning for MapPartitions > > > Key: SPARK-13609 > URL: https://issues.apache.org/jira/browse/SPARK-13609 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Apache Spark > > {code} > case class OtherTuple(_1: String, _2: Int) > val ds = Seq(("a", 1, 3), ("b", 2, 4), ("c", 3, 5)).toDS() > ds.as[OtherTuple].map(identity[OtherTuple]).explain(true) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13609) Support Column Pruning for MapPartitions
[ https://issues.apache.org/jira/browse/SPARK-13609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13609: Assignee: (was: Apache Spark) > Support Column Pruning for MapPartitions > > > Key: SPARK-13609 > URL: https://issues.apache.org/jira/browse/SPARK-13609 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > {code} > case class OtherTuple(_1: String, _2: Int) > val ds = Seq(("a", 1, 3), ("b", 2, 4), ("c", 3, 5)).toDS() > ds.as[OtherTuple].map(identity[OtherTuple]).explain(true) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13609) Support Column Pruning for MapPartitions
[ https://issues.apache.org/jira/browse/SPARK-13609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175011#comment-15175011 ] Apache Spark commented on SPARK-13609: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/11460 > Support Column Pruning for MapPartitions > > > Key: SPARK-13609 > URL: https://issues.apache.org/jira/browse/SPARK-13609 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > {code} > case class OtherTuple(_1: String, _2: Int) > val ds = Seq(("a", 1, 3), ("b", 2, 4), ("c", 3, 5)).toDS() > ds.as[OtherTuple].map(identity[OtherTuple]).explain(true) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13611) import Aggregator doesn't work in Spark Shell
Wenchen Fan created SPARK-13611: --- Summary: import Aggregator doesn't work in Spark Shell Key: SPARK-13611 URL: https://issues.apache.org/jira/browse/SPARK-13611 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan {code} scala> import org.apache.spark.sql.expressions.Aggregator import org.apache.spark.sql.expressions.Aggregator scala> class SumOf[I, N : Numeric](f: I => N) extends Aggregator[I, N, N] with Serializable { | val numeric = implicitly[Numeric[N]] | override def zero: N = numeric.zero | override def reduce(b: N, a: I): N = numeric.plus(b, f(a)) | override def merge(b1: N,b2: N): N = numeric.plus(b1, b2) | override def finish(reduction: N): N = reduction | } :10: error: not found: type Aggregator class SumOf[I, N : Numeric](f: I => N) extends Aggregator[I, N, N] with Serializable { ^ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
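For reference, the zero/reduce/merge/finish logic in the snippet is sound; the shell failure is about resolving the imported Aggregator type, not the class body. The same contract can be sketched in plain Scala (dropping the Spark base class so it runs anywhere):

```scala
// Plain-Scala version of the SumOf class from the report, without extending
// Spark's Aggregator. Same zero/reduce/merge/finish shape, runnable anywhere.
class SumOf[I, N: Numeric](f: I => N) extends Serializable {
  val numeric = implicitly[Numeric[N]]
  def zero: N = numeric.zero                       // identity element
  def reduce(b: N, a: I): N = numeric.plus(b, f(a)) // fold one input into the buffer
  def merge(b1: N, b2: N): N = numeric.plus(b1, b2) // combine partial buffers
  def finish(reduction: N): N = reduction           // final result
}

val sum = new SumOf[(String, Int), Int](_._2)
val data = Seq(("a", 1), ("b", 2), ("c", 3))
// Fold one "partition", then merge with a partial result from another.
val partial = data.foldLeft(sum.zero)(sum.reduce)
val total = sum.finish(sum.merge(partial, 10))
```

A common shell workaround for this class of problem (an assumption here, not a confirmed fix for this ticket) is to reference the type by its fully-qualified name, e.g. `extends org.apache.spark.sql.expressions.Aggregator[I, N, N]`, instead of relying on the import.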
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175010#comment-15175010 ] Mike Sukmanowsky commented on SPARK-13587: -- Gotcha. I might suggest {{spark.pyspark.virtualenv.bin.path}} in that case. > Support virtualenv in PySpark > - > > Key: SPARK-13587 > URL: https://issues.apache.org/jira/browse/SPARK-13587 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Jeff Zhang > > Currently, it's not easy for user to add third party python packages in > pyspark. > * One way is to using --py-files (suitable for simple dependency, but not > suitable for complicated dependency, especially with transitive dependency) > * Another way is install packages manually on each node (time wasting, and > not easy to switch to different environment) > Python has now 2 different virtualenv implementation. One is native > virtualenv another is through conda. This jira is trying to migrate these 2 > tools to distributed environment -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13025) Allow user to specify the initial model when training LogisticRegression
[ https://issues.apache.org/jira/browse/SPARK-13025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175006#comment-15175006 ] Gayathri Murali commented on SPARK-13025: - https://github.com/apache/spark/pull/11459 > Allow user to specify the initial model when training LogisticRegression > > > Key: SPARK-13025 > URL: https://issues.apache.org/jira/browse/SPARK-13025 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: holdenk >Priority: Minor > > Allow the user to set the initial model when training for logistic > regression. Note the method already exists, just change visibility to public. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175003#comment-15175003 ] Jeff Zhang commented on SPARK-13587: Thanks for your feedback [~msukmanowsky]. spark.pyspark.virtualenv.path is not the path where the virtualenv is created; it is the path to the executable for virtualenv/conda which is used for creating the virtualenv (I need to rename it to a more proper name to avoid confusion). In my POC, I create the virtualenv on all the executors, not only the driver. As you said, some python packages depend on C libraries; we cannot guarantee they would work if we compiled them on the driver and distributed them to other nodes. > Support virtualenv in PySpark > - > > Key: SPARK-13587 > URL: https://issues.apache.org/jira/browse/SPARK-13587 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Jeff Zhang > > Currently, it's not easy for user to add third party python packages in > pyspark. > * One way is to using --py-files (suitable for simple dependency, but not > suitable for complicated dependency, especially with transitive dependency) > * Another way is install packages manually on each node (time wasting, and > not easy to switch to different environment) > Python has now 2 different virtualenv implementation. One is native > virtualenv another is through conda. This jira is trying to migrate these 2 > tools to distributed environment -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13610) Create a Transformer to disassemble vectors in DataFrames
Andrew MacKinlay created SPARK-13610: Summary: Create a Transformer to disassemble vectors in DataFrames Key: SPARK-13610 URL: https://issues.apache.org/jira/browse/SPARK-13610 Project: Spark Issue Type: New Feature Components: ML, SQL Affects Versions: 1.6.0 Reporter: Andrew MacKinlay Priority: Minor It is possible to convert a standalone numeric field into a single-item Vector, using VectorAssembler. However the inverse operation of retrieving a single item from a vector and translating it into a field doesn't appear to be possible. The workaround I've found is to leave the raw field value in the DF, but I have found no other way to get a field out of a vector (eg to perform arithmetic on it). Happy to be proved wrong though. Creating a user-defined function doesn't work (in Python at least; it gets a pickle exception). This seems like a simple operation which should be supported for various use cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
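In plain-Scala terms, the requested inverse of VectorAssembler is just indexing into the assembled values. The sketch below is purely illustrative (the names are hypothetical and this is not a Spark Transformer); it only shows the operation being asked for:

```scala
// Hypothetical sketch of "disassembling" a vector column: pull element i of
// each assembled vector back out into its own scalar column. Plain Scala,
// not a Spark ML Transformer.
case class AssembledRow(id: Int, features: Array[Double])

def disassemble(rows: Seq[AssembledRow], i: Int): Seq[(Int, Double)] =
  rows.map(r => (r.id, r.features(i)))

val rows = Seq(
  AssembledRow(1, Array(0.5, 1.5)),
  AssembledRow(2, Array(2.5, 3.5))
)
val col1 = disassemble(rows, 1) // second feature of each row as (id, value)
```

In Spark itself this would presumably be a UDF from Vector to Double applied per row, which is exactly what the reporter found problematic from the Python side.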
[jira] [Created] (SPARK-13609) Support Column Pruning for MapPartitions
Xiao Li created SPARK-13609: --- Summary: Support Column Pruning for MapPartitions Key: SPARK-13609 URL: https://issues.apache.org/jira/browse/SPARK-13609 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.0.0 Reporter: Xiao Li {code} case class OtherTuple(_1: String, _2: Int) val ds = Seq(("a", 1, 3), ("b", 2, 4), ("c", 3, 5)).toDS() ds.as[OtherTuple].map(identity[OtherTuple]).explain(true) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173228#comment-15173228 ] Jeff Zhang edited comment on SPARK-13587 at 3/2/16 4:17 AM: This method creates the virtualenv before the python worker starts, and the virtualenv is application-scoped: after the spark application finishes, the virtualenv is cleaned up. The virtualenvs don't need to be at the same path on each node (in my POC, it is the yarn container working directory). That means users don't need to manually install packages on each node (sometimes you even can't install packages on a cluster due to security reasons). This is the biggest benefit and purpose: users can create a virtualenv on demand without touching each node, even when they are not administrators. The con is the extra cost of installing the required packages before starting the python worker. But if the application runs for several hours, the extra cost can be ignored. I have implemented a POC for this feature. Here's one simple command showing how to use virtualenv in pyspark. {code} bin/spark-submit --master yarn --deploy-mode client --conf "spark.pyspark.virtualenv.enabled=true" --conf "spark.pyspark.virtualenv.type=conda" --conf "spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/conda.txt" --conf "spark.pyspark.virtualenv.path=/Users/jzhang/anaconda/bin/conda" ~/work/virtualenv/spark.py {code} There are 4 properties that need to be set: * spark.pyspark.virtualenv.enabled (flag to enable virtualenv) * spark.pyspark.virtualenv.type (native/conda are supported, default is native) * spark.pyspark.virtualenv.requirements (requirements file for the dependencies) * spark.pyspark.virtualenv.path (path to the executable file for virtualenv/conda which is used for creating the virtualenv) Comments and feedback are welcome about how to improve it and whether it's valuable for users. 
was (Author: zjffdu): This method is trying to create virtualenv before python worker start, and this virtualenv is application scope, after the spark application job finish, the virtualenv will be cleanup. And the virtualenvs don't need to be the same path for each node (In my POC, it is the yarn container working directory). So that means user don't need to manually install packages on each node (sometimes you even can't install packages on cluster due to security reason). This is the biggest benefit and purpose that user can create virtualenv on demand without touching each node even when you are not administrator. The cons is the extra cost for installing the required packages before starting python worker. But if it is an application which will run for several hours then the extra cost can be ignored. I have implemented POC for this features. Here's one simple command for how to use virtualenv in pyspark. {code} bin/spark-submit --master yarn --deploy-mode client --conf "spark.pyspark.virtualenv.enabled=true" --conf "spark.pyspark.virtualenv.type=conda" --conf "spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/conda.txt" --conf "spark.pyspark.virtualenv.path=/Users/jzhang/anaconda/bin/conda" ~/work/virtualenv/spark.py {code} There's 4 properties needs to be set * spark.pyspark.virtualenv.enabled(enable virtualenv) * spark.pyspark.virtualenv.type (default/conda are supported, default is native) * spark.pyspark.virtualenv.requirements (requirement file for the dependencies) * spark.pyspark.virtualenv.path (path to the executable file for for virtualenv/conda which is used for creating virutalenv) Comments and feedback are welcome about how to improve it and whether it's valuable for users. 
> Support virtualenv in PySpark > - > > Key: SPARK-13587 > URL: https://issues.apache.org/jira/browse/SPARK-13587 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Jeff Zhang > > Currently, it's not easy for user to add third party python packages in > pyspark. > * One way is to using --py-files (suitable for simple dependency, but not > suitable for complicated dependency, especially with transitive dependency) > * Another way is install packages manually on each node (time wasting, and > not easy to switch to different environment) > Python has now 2 different virtualenv implementation. One is native > virtualenv another is through conda. This jira is trying to migrate these 2 > tools to distributed environment -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13608) py4j.Py4JException: Method createDirectStream([class org.apache.spark.streaming.api.java.JavaStreamingContext, class java.util.HashMap, class java.util.HashSet, class ja
Avatar Zhang created SPARK-13608: Summary: py4j.Py4JException: Method createDirectStream([class org.apache.spark.streaming.api.java.JavaStreamingContext, class java.util.HashMap, class java.util.HashSet, class java.util.HashMap]) does not exist Key: SPARK-13608 URL: https://issues.apache.org/jira/browse/SPARK-13608 Project: Spark Issue Type: Bug Reporter: Avatar Zhang py4j.Py4JException: Method createDirectStream([class org.apache.spark.streaming.api.java.JavaStreamingContext, class java.util.HashMap, class java.util.HashSet, class java.util.HashMap]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342) at py4j.Gateway.invoke(Gateway.java:252) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173228#comment-15173228 ] Jeff Zhang edited comment on SPARK-13587 at 3/2/16 4:12 AM: This method is trying to create virtualenv before python worker start, and this virtualenv is application scope, after the spark application job finish, the virtualenv will be cleanup. And the virtualenvs don't need to be the same path for each node (In my POC, it is the yarn container working directory). So that means user don't need to manually install packages on each node (sometimes you even can't install packages on cluster due to security reason). This is the biggest benefit and purpose that user can create virtualenv on demand without touching each node even when you are not administrator. The cons is the extra cost for installing the required packages before starting python worker. But if it is an application which will run for several hours then the extra cost can be ignored. I have implemented POC for this features. Here's one simple command for how to use virtualenv in pyspark. {code} bin/spark-submit --master yarn --deploy-mode client --conf "spark.pyspark.virtualenv.enabled=true" --conf "spark.pyspark.virtualenv.type=conda" --conf "spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/conda.txt" --conf "spark.pyspark.virtualenv.path=/Users/jzhang/anaconda/bin/conda" ~/work/virtualenv/spark.py {code} There's 4 properties needs to be set * spark.pyspark.virtualenv.enabled(enable virtualenv) * spark.pyspark.virtualenv.type (default/conda are supported, default is native) * spark.pyspark.virtualenv.requirements (requirement file for the dependencies) * spark.pyspark.virtualenv.path (path to the executable file for for virtualenv/conda which is used for creating virutalenv) Comments and feedback are welcome about how to improve it and whether it's valuable for users. 
was (Author: zjffdu): This method is trying to create virtualenv before python worker start, and this virtualenv is application scope, after the spark application job finish, the virtualenv will be cleanup. And the virtualenvs don't need to be the same path for each node (In my POC, it is the yarn container working directory). So that means user don't need to manually install packages on each node (sometimes you even can't install packages on cluster due to security reason). This is the biggest benefit and purpose that user can create virtualenv on demand without touching each node even when you are not administrator. The cons is the extra cost for installing the required packages before starting python worker. But if it is an application which will run for several hours then the extra cost can be ignored. I have implemented POC for this features. Here's one simple command for how to use virtualenv in pyspark. {code} bin/spark-submit --master yarn --deploy-mode client --conf "spark.pyspark.virtualenv.enabled=true" --conf "spark.pyspark.virtualenv.type=conda" --conf "spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/conda.txt" --conf "spark.pyspark.virtualenv.path=/Users/jzhang/anaconda/bin/conda" ~/work/virtualenv/spark.py {code} There's 4 properties needs to be set * spark.pyspark.virtualenv.enabled(enable virtualenv) * spark.pyspark.virtualenv.type (default/conda are supported, default is native) * spark.pyspark.virtualenv.requirements (requirement file for the dependencies) * spark.pyspark.virtualenv.path (path to the executable for for virtualenv/conda) Comments and feedback are welcome about how to improve it and whether it's valuable for users. > Support virtualenv in PySpark > - > > Key: SPARK-13587 > URL: https://issues.apache.org/jira/browse/SPARK-13587 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Jeff Zhang > > Currently, it's not easy for user to add third party python packages in > pyspark. 
> * One way is to using --py-files (suitable for simple dependency, but not > suitable for complicated dependency, especially with transitive dependency) > * Another way is install packages manually on each node (time wasting, and > not easy to switch to different environment) > Python has now 2 different virtualenv implementation. One is native > virtualenv another is through conda. This jira is trying to migrate these 2 > tools to distributed environment -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15174996#comment-15174996 ] Mike Sukmanowsky commented on SPARK-13587: -- Thanks for letting me know about this [~jeffzhang]. I think in general, I'm +1 on the proposal. virtualenvs are the way to go to install requirements and ensure isolation of dependencies between multiple driver scripts. As you noted though, installing hefty requirements like pandas or numpy (assuming you aren't using Conda), would add a pretty significant overhead to startup which could be amortized if the driver was assumed to run for a long enough period of time. Conda of course would pretty well eliminate that problem as it provides pre-compiled binaries for most OSs. I'd like to offer [PEX|https://pex.readthedocs.org/en/stable/] as an alternative, where spark-submit would build a self-contained virtualenv in a .pex file on the Spark master node and then distribute to all other nodes. However, it turns out PEX doesn't support editable requirements and introduces an assumption that all nodes in a cluster are homogenous so that a Python package with C extensions compiled on the master node would run on worker nodes without issue. The latter assumption may be a leap too far for all Spark users. One thing I'm not entirely sure of is the need for the spark.pyspark.virtualenv.path property. If the virtualenv is temporary, why would this path ever be specified? Wouldn't a temporary path be used and subsequently removed after the Python worker completes? > Support virtualenv in PySpark > - > > Key: SPARK-13587 > URL: https://issues.apache.org/jira/browse/SPARK-13587 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Jeff Zhang > > Currently, it's not easy for user to add third party python packages in > pyspark. 
> * One way is to use --py-files (suitable for simple dependencies, but not > for complicated ones, especially those with transitive dependencies) > * Another way is to install packages manually on each node (time-consuming, and > not easy when switching to a different environment) > Python now has two different virtualenv implementations: one is native > virtualenv, the other is conda. This JIRA aims to bring these two > tools to a distributed environment -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-13025) Allow user to specify the initial model when training LogisticRegression
[ https://issues.apache.org/jira/browse/SPARK-13025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gayathri Murali updated SPARK-13025: Comment: was deleted (was: PR : https://github.com/apache/spark/pull/11458) > Allow user to specify the initial model when training LogisticRegression > > > Key: SPARK-13025 > URL: https://issues.apache.org/jira/browse/SPARK-13025 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: holdenk >Priority: Minor > > Allow the user to set the initial model when training for logistic > regression. Note the method already exists, just change visibility to public. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13025) Allow user to specify the initial model when training LogisticRegression
[ https://issues.apache.org/jira/browse/SPARK-13025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15174983#comment-15174983 ] Gayathri Murali commented on SPARK-13025: - PR : https://github.com/apache/spark/pull/11458 > Allow user to specify the initial model when training LogisticRegression > > > Key: SPARK-13025 > URL: https://issues.apache.org/jira/browse/SPARK-13025 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: holdenk >Priority: Minor > > Allow the user to set the initial model when training for logistic > regression. Note the method already exists, just change visibility to public. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
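The request above amounts to warm-starting the optimizer. As a language-agnostic illustration only (plain Python, invented function names, not Spark's ML code), gradient descent on the logistic loss can accept prior coefficients instead of always starting from zeros:

```python
import math

def train_logistic(data, labels, init_weights=None, lr=0.1, iters=200):
    """Batch gradient descent on logistic loss.

    init_weights mimics the 'initial model' warm start requested in the
    issue: resume from earlier coefficients instead of zeros.
    """
    n_features = len(data[0])
    w = list(init_weights) if init_weights is not None else [0.0] * n_features
    for _ in range(iters):
        grad = [0.0] * n_features
        for x, y in zip(data, labels):
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability
            for j in range(n_features):
                grad[j] += (p - y) * x[j]
        for j in range(n_features):
            w[j] -= lr * grad[j] / len(data)
    return w

# Toy data: label is 1 when the feature sum is positive.
X = [[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]]
y = [1, 1, 0, 0]
cold = train_logistic(X, y)                      # start from zeros
warm = train_logistic(X, y, init_weights=cold)   # resume from a prior model
```

The warm-started call does the same work but skips re-learning what the prior model already captured, which is the practical benefit of exposing the setter.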
[jira] [Commented] (SPARK-13606) Error from python worker: /usr/local/bin/python2.7: undefined symbol: _PyCodec_LookupTextEncoding
[ https://issues.apache.org/jira/browse/SPARK-13606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15174953#comment-15174953 ] Avatar Zhang commented on SPARK-13606: -- /usr/local/bin/python2.7 can launch normally. [root@iZ28x4dqt1oZ ~]# python2.7 Python 2.7.11 (default, Mar 2 2016, 10:20:14) [GCC 4.4.7 20120313 (Red Hat 4.4.7-16)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> > Error from python worker: /usr/local/bin/python2.7: undefined symbol: > _PyCodec_LookupTextEncoding > --- > > Key: SPARK-13606 > URL: https://issues.apache.org/jira/browse/SPARK-13606 > Project: Spark > Issue Type: Bug >Reporter: Avatar Zhang > > Error from python worker: > /usr/local/bin/python2.7: /usr/local/lib/python2.7/lib-dynload/_io.so: > undefined symbol: _PyCodec_LookupTextEncoding > PYTHONPATH was: > > /usr/share/dse/spark/python/lib/pyspark.zip:/usr/share/dse/spark/python/lib/py4j-0.8.2.1-src.zip:/usr/share/dse/spark/lib/spark-core_2.10-1.4.2.2.jar > java.io.EOFException > at java.io.DataInputStream.readInt(DataInputStream.java:392) > at > org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163) > at > org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86) > at > org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62) > at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:130) > at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:73) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:315) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70) > at > 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:70) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6764) Add wheel package support for PySpark
[ https://issues.apache.org/jira/browse/SPARK-6764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15174945#comment-15174945 ] Jeff Zhang commented on SPARK-6764: --- [~msukmanowsky] Can SPARK-13587 solve your issue? I am working on it; any comments are welcome. > Add wheel package support for PySpark > - > > Key: SPARK-6764 > URL: https://issues.apache.org/jira/browse/SPARK-6764 > Project: Spark > Issue Type: Improvement > Components: Deploy, PySpark >Reporter: Takao Magoori >Priority: Minor > Labels: newbie > > We can do _spark-submit_ with one or more Python packages (.egg, .zip and > .jar) via the *--py-files* option. > h4. zip packaging > Spark puts a zip file in its working directory and adds the absolute path to > Python's sys.path. When the user program imports it, > [zipimport|https://docs.python.org/2.7/library/zipimport.html] is > automatically invoked under the hood. That is, data files and dynamic > modules (.pyd, .so) cannot be used, since zipimport supports only .py, .pyc and > .pyo. > h4. egg packaging > Spark puts an egg file in its working directory and adds the absolute path to > Python's sys.path. Unlike zipimport, egg can handle data files and dynamic > modules as long as the author of the package uses the [pkg_resources > API|https://pythonhosted.org/setuptools/formats.html#other-technical-considerations] > properly. But many Python modules do not use the pkg_resources API, which > causes "ImportError" or "No such file" errors. Moreover, creating eggs of > dependencies and their transitive dependencies is a troublesome job. > h4. wheel packaging > Supporting the new Python standard package format > "[wheel|https://wheel.readthedocs.org/en/latest/]" would be nice. With wheel, > we can do spark-submit with complex dependencies simply as follows. > 1. Write a requirements.txt file. > {noformat} > SQLAlchemy > MySQL-python > requests > simplejson>=3.6.0,<=3.6.5 > pydoop > {noformat} > 2. Do wheel packaging with only one command.
All dependencies are wheel-ed. > {noformat} > $ your_pip_dir/pip wheel --wheel-dir /tmp/wheelhouse --requirement > requirements.txt > {noformat} > 3. Do spark-submit > {noformat} > your_spark_home/bin/spark-submit --master local[4] --py-files $(find > /tmp/wheelhouse/ -name "*.whl" -print0 | sed -e 's/\x0/,/g') your_driver.py > {noformat} > If your pyspark driver is a package which consists of many modules, > 1. Write setup.py for your pyspark driver package. > {noformat} > from setuptools import ( > find_packages, > setup, > ) > setup( > name='yourpkg', > version='0.0.1', > packages=find_packages(), > install_requires=[ > 'SQLAlchemy', > 'MySQL-python', > 'requests', > 'simplejson>=3.6.0,<=3.6.5', > 'pydoop', > ], > ) > {noformat} > 2. Do wheel packaging by only one command. Your driver package and all > dependencies are wheel-ed. > {noformat} > your_pip_dir/pip wheel --wheel-dir /tmp/wheelhouse your_driver_package/. > {noformat} > 3. Do spark-submit > {noformat} > your_spark_home/bin/spark-submit --master local[4] --py-files $(find > /tmp/wheelhouse/ -name "*.whl" -print0 | sed -e 's/\x0/,/g') > your_driver_bootstrap.py > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
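The zipimport limitation the description cites can be demonstrated with the standard library alone. The module name `greet` below is invented for the demo; the point is that only pure .py modules inside a zip archive are importable this way, which is why data files and C extensions break under --py-files:

```python
import os
import sys
import tempfile
import zipfile

# Build a zip containing a pure-Python module, the only case zipimport
# fully supports (.py/.pyc/.pyo; data files and .so/.pyd extension
# modules cannot be loaded from a zip, the limitation --py-files inherits).
tmpdir = tempfile.mkdtemp()
archive = os.path.join(tmpdir, "deps.zip")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("greet.py", "def hello():\n    return 'hi from zip'\n")

# Adding the archive itself to sys.path is what Spark does with
# --py-files zip archives; Python's zipimport then handles the import.
sys.path.insert(0, archive)
import greet

print(greet.hello())  # -> hi from zip
```

A wheel, by contrast, is installed (unpacked) rather than imported in place, which is why it can carry compiled extensions and data files.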
[jira] [Commented] (SPARK-13606) Error from python worker: /usr/local/bin/python2.7: undefined symbol: _PyCodec_LookupTextEncoding
[ https://issues.apache.org/jira/browse/SPARK-13606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15174941#comment-15174941 ] Jeff Zhang commented on SPARK-13606: This might be a Python environment issue. Can you launch Python on that machine manually? > Error from python worker: /usr/local/bin/python2.7: undefined symbol: > _PyCodec_LookupTextEncoding > --- > > Key: SPARK-13606 > URL: https://issues.apache.org/jira/browse/SPARK-13606 > Project: Spark > Issue Type: Bug >Reporter: Avatar Zhang > > Error from python worker: > /usr/local/bin/python2.7: /usr/local/lib/python2.7/lib-dynload/_io.so: > undefined symbol: _PyCodec_LookupTextEncoding > PYTHONPATH was: > > /usr/share/dse/spark/python/lib/pyspark.zip:/usr/share/dse/spark/python/lib/py4j-0.8.2.1-src.zip:/usr/share/dse/spark/lib/spark-core_2.10-1.4.2.2.jar > java.io.EOFException > at java.io.DataInputStream.readInt(DataInputStream.java:392) > at > org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163) > at > org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86) > at > org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62) > at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:130) > at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:73) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:315) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:70) > at > 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13073) creating R like summary for logistic Regression in Spark - Scala
[ https://issues.apache.org/jira/browse/SPARK-13073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15174943#comment-15174943 ] Gayathri Murali commented on SPARK-13073: - I can work on this, can you please assign it to me? > creating R like summary for logistic Regression in Spark - Scala > > > Key: SPARK-13073 > URL: https://issues.apache.org/jira/browse/SPARK-13073 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib >Reporter: Samsudhin >Priority: Minor > > Currently Spark ML provides coefficients for logistic regression. To evaluate > the trained model, tests like the Wald test and chi-square test should be run, and their results > summarized and displayed like R's GLM summary -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13607) Improves compression performance for integer-typed values on cache to reduce GC pressure
Takeshi Yamamuro created SPARK-13607: Summary: Improves compression performance for integer-typed values on cache to reduce GC pressure Key: SPARK-13607 URL: https://issues.apache.org/jira/browse/SPARK-13607 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.6.0 Reporter: Takeshi Yamamuro -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
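The ticket above carries no description yet. For background, columnar caches commonly shrink integer columns with schemes such as run-length or delta encoding so that fewer bytes (and fewer objects) reach the heap. A minimal run-length coder in plain Python, purely illustrative and unrelated to Spark's actual columnar compression classes:

```python
def rle_encode(values):
    """Run-length encode a list of ints as (value, count) pairs."""
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1] = (v, out[-1][1] + 1)  # extend the current run
        else:
            out.append((v, 1))             # start a new run
    return out

def rle_decode(pairs):
    """Invert rle_encode."""
    return [v for v, n in pairs for _ in range(n)]

col = [7, 7, 7, 7, 3, 3, 9]
encoded = rle_encode(col)   # [(7, 4), (3, 2), (9, 1)]
assert rle_decode(encoded) == col
```

Columns with long runs (sorted keys, low-cardinality dimensions) compress very well under this scheme, which is the kind of win the issue title is after.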
[jira] [Updated] (SPARK-13581) LibSVM throws MatchError
[ https://issues.apache.org/jira/browse/SPARK-13581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated SPARK-13581: --- Priority: Critical (was: Minor) > LibSVM throws MatchError > > > Key: SPARK-13581 > URL: https://issues.apache.org/jira/browse/SPARK-13581 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Jakob Odersky >Assignee: Jeff Zhang >Priority: Critical > > When running an action on a DataFrame obtained by reading from a libsvm file > a MatchError is thrown, however doing the same on a cached DataFrame works > fine. > {code} > val df = > sqlContext.read.format("libsvm").load("../data/mllib/sample_libsvm_data.txt") > //file is in spark repository > df.select(df("features")).show() //MatchError > df.cache() > df.select(df("features")).show() //OK > {code} > The exception stack trace is the following: > {code} > scala.MatchError: 1.0 (of class java.lang.Double) > [info]at > org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:207) > [info]at > org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:192) > [info]at > org.apache.spark.sql.catalyst.CatalystTypeConverters$UDTConverter.toCatalystImpl(CatalystTypeConverters.scala:142) > [info]at > org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102) > [info]at > org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401) > [info]at > org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:59) > [info]at > org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:56) > {code} > This issue first appeared in commit {{1dac964c1}}, in PR > [#9595|https://github.com/apache/spark/pull/9595] fixing SPARK-11622. > [~jeffzhang], do you have any insight of what could be going on? 
> cc [~iyounus] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
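As background on the data format involved in the report above (a toy sketch, not Spark's actual MLUtils or LibSVM data source code): each line of a libsvm file is a label followed by 1-based `index:value` pairs. A minimal parser:

```python
def parse_libsvm_line(line, num_features):
    """Parse 'label idx:val idx:val ...' (1-based indices) into
    (label, dense list) -- a toy stand-in for Spark's sparse vectors."""
    parts = line.split()
    label = float(parts[0])
    vec = [0.0] * num_features
    for item in parts[1:]:
        idx, val = item.split(":")
        vec[int(idx) - 1] = float(val)  # convert to 0-based position
    return label, vec

label, features = parse_libsvm_line("1.0 1:0.5 3:2.0", num_features=3)
# label == 1.0, features == [0.5, 0.0, 2.0]
```

In Spark the parsed rows become (label: Double, features: Vector) and the reported MatchError arises later, when the vector column is serialized to Catalyst's internal representation.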
[jira] [Resolved] (SPARK-13141) Dataframe created from Hive partitioned tables using HiveContext returns wrong results
[ https://issues.apache.org/jira/browse/SPARK-13141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-13141. Resolution: Not A Problem Hi, this was a bug in CDH 5.5.0/5.5.1, it was fixed in CDH 5.5.2. Sorry about the trouble. > Dataframe created from Hive partitioned tables using HiveContext returns > wrong results > -- > > Key: SPARK-13141 > URL: https://issues.apache.org/jira/browse/SPARK-13141 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 > Environment: CDH 5.5.1 >Reporter: Simone >Priority: Critical > > I get wrong dataframe results using HiveContext with Spark 1.5.0 on CDH 5.5.1 > in yarn-client mode. > The problem occurs with partitioned tables on text delimited HDFS data, both > with Scala and Python. > This an example code: > import org.apache.spark.sql.hive.HiveContext > val hc = new HiveContext(sc) > hc.table("my_db.partition_table").show() > The result is that all values of all rows are NULL, except from the first > column (that contains the whole line of data) and the partitioning columns, > which appears to be correct. > With Hive and Impala I get correct results. > Also with Spark on the same data with a not partitioned table I get correct > results. > I think that similar problems occurs also with Avro data: > https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Pyspark-Table-Dataframe-returning-empty-records-from-Partitioned/td-p/35836 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13511) Add wholestage codegen for limit
[ https://issues.apache.org/jira/browse/SPARK-13511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15174919#comment-15174919 ] Liang-Chi Hsieh commented on SPARK-13511: - [~davies] Can you help update the Assignee field? Thanks! > Add wholestage codegen for limit > > > Key: SPARK-13511 > URL: https://issues.apache.org/jira/browse/SPARK-13511 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > Fix For: 2.0.0 > > > Current limit operator doesn't support wholestage codegen. This issue is open > to add support for it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13174) Add API and options for csv data sources
[ https://issues.apache.org/jira/browse/SPARK-13174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15174912#comment-15174912 ] Apache Spark commented on SPARK-13174: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/11457 > Add API and options for csv data sources > > > Key: SPARK-13174 > URL: https://issues.apache.org/jira/browse/SPARK-13174 > Project: Spark > Issue Type: New Feature > Components: Input/Output >Affects Versions: 2.0.0 >Reporter: Davies Liu > > We should have an API to load a csv data source (with some options as > arguments), similar to json() and jdbc() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
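For context on what "options" mean in practice for a csv reader, here is a rough model in plain Python of the header and delimiter handling such an API needs. This uses the stdlib csv module and invented names; it is not the proposed Spark API:

```python
import csv
import io

def load_csv(text, header=True, delimiter=","):
    """Parse CSV text into (columns, rows) -- a toy model of the kind of
    options (header, delimiter) a DataFrame csv() reader would take."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    rows = list(reader)
    if header:
        return rows[0], rows[1:]
    # No header row: synthesize column names, as Spark does with _c0, _c1, ...
    cols = ["_c%d" % i for i in range(len(rows[0]))]
    return cols, rows

cols, rows = load_csv("name,age\nalice,30\nbob,25\n")
# cols == ['name', 'age'], rows == [['alice', '30'], ['bob', '25']]
```

Beyond this sketch, a real reader also needs schema inference or a user-supplied schema, quote/escape characters, and malformed-line policies, which is why the issue asks for options as arguments rather than a bare path.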
[jira] [Assigned] (SPARK-13174) Add API and options for csv data sources
[ https://issues.apache.org/jira/browse/SPARK-13174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13174: Assignee: (was: Apache Spark) > Add API and options for csv data sources > > > Key: SPARK-13174 > URL: https://issues.apache.org/jira/browse/SPARK-13174 > Project: Spark > Issue Type: New Feature > Components: Input/Output >Affects Versions: 2.0.0 >Reporter: Davies Liu > > We should have an API to load a csv data source (with some options as > arguments), similar to json() and jdbc() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13174) Add API and options for csv data sources
[ https://issues.apache.org/jira/browse/SPARK-13174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13174: Assignee: Apache Spark > Add API and options for csv data sources > > > Key: SPARK-13174 > URL: https://issues.apache.org/jira/browse/SPARK-13174 > Project: Spark > Issue Type: New Feature > Components: Input/Output >Affects Versions: 2.0.0 >Reporter: Davies Liu >Assignee: Apache Spark > > We should have an API to load a csv data source (with some options as > arguments), similar to json() and jdbc() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13606) Error from python worker: /usr/local/bin/python2.7: undefined symbol: _PyCodec_LookupTextEncoding
Avatar Zhang created SPARK-13606: Summary: Error from python worker: /usr/local/bin/python2.7: undefined symbol: _PyCodec_LookupTextEncoding Key: SPARK-13606 URL: https://issues.apache.org/jira/browse/SPARK-13606 Project: Spark Issue Type: Bug Reporter: Avatar Zhang Error from python worker: /usr/local/bin/python2.7: /usr/local/lib/python2.7/lib-dynload/_io.so: undefined symbol: _PyCodec_LookupTextEncoding PYTHONPATH was: /usr/share/dse/spark/python/lib/pyspark.zip:/usr/share/dse/spark/python/lib/py4j-0.8.2.1-src.zip:/usr/share/dse/spark/lib/spark-core_2.10-1.4.2.2.jar java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:392) at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163) at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:130) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:73) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:315) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org 
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13141) Dataframe created from Hive partitioned tables using HiveContext returns wrong results
[ https://issues.apache.org/jira/browse/SPARK-13141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15174895#comment-15174895 ] zhichao-li edited comment on SPARK-13141 at 3/2/16 2:29 AM: Just try, but this cannot be reproduced from the master version by : create table mn.logs (field1 string, field2 string, field3 string) partitioned by (year string, month string , day string, host string) row format delimited fields terminated by ','; insert into logs partition (year="2013", month="07", day="28", host="host1") values ("foo","foo","foo") hc.table("logs").show() as you mentioned, not sure if it's specific to the version of CDH 5.5.1 was (Author: zhichao-li): Just try, but this cannot be reproduced from the master version by the sql: `create table mn.logs (field1 string, field2 string, field3 string) partitioned by (year string, month string , day string, host string) row format delimited fields terminated by ',';` as you mentioned, not sure if it's specific to the version of CDH 5.5.1 > Dataframe created from Hive partitioned tables using HiveContext returns > wrong results > -- > > Key: SPARK-13141 > URL: https://issues.apache.org/jira/browse/SPARK-13141 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 > Environment: CDH 5.5.1 >Reporter: Simone >Priority: Critical > > I get wrong dataframe results using HiveContext with Spark 1.5.0 on CDH 5.5.1 > in yarn-client mode. > The problem occurs with partitioned tables on text delimited HDFS data, both > with Scala and Python. > This an example code: > import org.apache.spark.sql.hive.HiveContext > val hc = new HiveContext(sc) > hc.table("my_db.partition_table").show() > The result is that all values of all rows are NULL, except from the first > column (that contains the whole line of data) and the partitioning columns, > which appears to be correct. > With Hive and Impala I get correct results. 
> Also with Spark on the same data with a not partitioned table I get correct > results. > I think that similar problems occurs also with Avro data: > https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Pyspark-Table-Dataframe-returning-empty-records-from-Partitioned/td-p/35836 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6764) Add wheel package support for PySpark
[ https://issues.apache.org/jira/browse/SPARK-6764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15174899#comment-15174899 ] Mike Sukmanowsky commented on SPARK-6764: - Just bumping this issue up. We use Spark (PySpark) pretty extensively and would love the ability to use wheels in addition to eggs with spark-submit. > Add wheel package support for PySpark > - > > Key: SPARK-6764 > URL: https://issues.apache.org/jira/browse/SPARK-6764 > Project: Spark > Issue Type: Improvement > Components: Deploy, PySpark >Reporter: Takao Magoori >Priority: Minor > Labels: newbie > > We can do _spark-submit_ with one or more Python packages (.egg, .zip and > .jar) via the *--py-files* option. > h4. zip packaging > Spark puts a zip file in its working directory and adds the absolute path to > Python's sys.path. When the user program imports it, > [zipimport|https://docs.python.org/2.7/library/zipimport.html] is > automatically invoked under the hood. That is, data files and dynamic > modules (.pyd, .so) cannot be used, since zipimport supports only .py, .pyc and > .pyo. > h4. egg packaging > Spark puts an egg file in its working directory and adds the absolute path to > Python's sys.path. Unlike zipimport, egg can handle data files and dynamic > modules as long as the author of the package uses the [pkg_resources > API|https://pythonhosted.org/setuptools/formats.html#other-technical-considerations] > properly. But many Python modules do not use the pkg_resources API, which > causes "ImportError" or "No such file" errors. Moreover, creating eggs of > dependencies and their transitive dependencies is a troublesome job. > h4. wheel packaging > Supporting the new Python standard package format > "[wheel|https://wheel.readthedocs.org/en/latest/]" would be nice. With wheel, > we can do spark-submit with complex dependencies simply as follows. > 1. Write a requirements.txt file. 
> {noformat} > SQLAlchemy > MySQL-python > requests > simplejson>=3.6.0,<=3.6.5 > pydoop > {noformat} > 2. Do wheel packaging by only one command. All dependencies are wheel-ed. > {noformat} > $ your_pip_dir/pip wheel --wheel-dir /tmp/wheelhouse --requirement > requirements.txt > {noformat} > 3. Do spark-submit > {noformat} > your_spark_home/bin/spark-submit --master local[4] --py-files $(find > /tmp/wheelhouse/ -name "*.whl" -print0 | sed -e 's/\x0/,/g') your_driver.py > {noformat} > If your pyspark driver is a package which consists of many modules, > 1. Write setup.py for your pyspark driver package. > {noformat} > from setuptools import ( > find_packages, > setup, > ) > setup( > name='yourpkg', > version='0.0.1', > packages=find_packages(), > install_requires=[ > 'SQLAlchemy', > 'MySQL-python', > 'requests', > 'simplejson>=3.6.0,<=3.6.5', > 'pydoop', > ], > ) > {noformat} > 2. Do wheel packaging by only one command. Your driver package and all > dependencies are wheel-ed. > {noformat} > your_pip_dir/pip wheel --wheel-dir /tmp/wheelhouse your_driver_package/. > {noformat} > 3. Do spark-submit > {noformat} > your_spark_home/bin/spark-submit --master local[4] --py-files $(find > /tmp/wheelhouse/ -name "*.whl" -print0 | sed -e 's/\x0/,/g') > your_driver_bootstrap.py > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13141) Dataframe created from Hive partitioned tables using HiveContext returns wrong results
[ https://issues.apache.org/jira/browse/SPARK-13141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15174895#comment-15174895 ] zhichao-li commented on SPARK-13141: I just tried, but this cannot be reproduced on the master branch with the SQL you mentioned: `create table mn.logs (field1 string, field2 string, field3 string) partitioned by (year string, month string , day string, host string) row format delimited fields terminated by ',';` Not sure if it's specific to CDH 5.5.1 > Dataframe created from Hive partitioned tables using HiveContext returns > wrong results > -- > > Key: SPARK-13141 > URL: https://issues.apache.org/jira/browse/SPARK-13141 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 > Environment: CDH 5.5.1 >Reporter: Simone >Priority: Critical > > I get wrong dataframe results using HiveContext with Spark 1.5.0 on CDH 5.5.1 > in yarn-client mode. > The problem occurs with partitioned tables on text delimited HDFS data, both > with Scala and Python. > This an example code: > import org.apache.spark.sql.hive.HiveContext > val hc = new HiveContext(sc) > hc.table("my_db.partition_table").show() > The result is that all values of all rows are NULL, except from the first > column (that contains the whole line of data) and the partitioning columns, > which appears to be correct. > With Hive and Impala I get correct results. > Also with Spark on the same data with a not partitioned table I get correct > results. > I think that similar problems occurs also with Avro data: > https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Pyspark-Table-Dataframe-returning-empty-records-from-Partitioned/td-p/35836 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API
[ https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15174856#comment-15174856 ] Mark Grover commented on SPARK-12177: - Hi [~tdas] and [~rxin], can you help us with your opinion on these questions, so we can unblock this work: 1. Should we support both Kafka 0.8 and 0.9 or just 0.9? The pros and cons are listed [here|https://github.com/apache/spark/pull/11143#issuecomment-182154267] along with what other projects are doing. 2. Should we make a separate project for the implementation using the new kafka consumer API with the same class names (e.g. KafkaRDD, etc.), or create new classes like hadoop did, in the same subproject e.g. NewKafkaRDD, etc. Thanks! > Update KafkaDStreams to new Kafka 0.9 Consumer API > -- > > Key: SPARK-12177 > URL: https://issues.apache.org/jira/browse/SPARK-12177 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Nikita Tarasenko > Labels: consumer, kafka > > Kafka 0.9 has already been released, and it introduces a new consumer API that is not > compatible with the old one. So I added the new consumer API in separate > classes in the package org.apache.spark.streaming.kafka.v09, with the changed API. I > didn't remove the old classes, for backward compatibility. Users will not need > to change their old Spark applications when they upgrade to a new Spark version. > Please review my changes -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
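Whichever consumer API is chosen, the direct stream's core bookkeeping is per-partition, half-open offset ranges, with one range per RDD partition. A plain-Python sketch of that invariant (names are illustrative, not Spark's classes):

```python
from collections import namedtuple

# One RDD partition per Kafka (topic, partition) offset range;
# the range [from_offset, until_offset) is half-open.
OffsetRange = namedtuple("OffsetRange", "topic partition from_offset until_offset")

def record_count(ranges):
    """Total records a batch will read: sum of half-open offset ranges."""
    return sum(r.until_offset - r.from_offset for r in ranges)

batch = [
    OffsetRange("events", 0, 100, 250),
    OffsetRange("events", 1, 90, 140),
]
print(record_count(batch))  # -> 200
```

Because the ranges, not consumer-group state, define each batch, the question of 0.8 vs 0.9 support is mostly about which client fetches the data, not about the recovery semantics.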
[jira] [Comment Edited] (SPARK-7768) Make user-defined type (UDT) API public
[ https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169380#comment-15169380 ] Randall Whitman edited comment on SPARK-7768 at 3/2/16 1:47 AM: Am I missing something? As far as I can see, the @DeveloperApi annotation is still present on class UserDefinedType - https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/UserDefinedType.scala @Experimental annotation is still present on class UserDefinedFunction https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala (corrected as noted by Jaka Jancar). Also, I have not seen any mention of having addressed the design issue of using @SQLUserDefinedType with third-party libraries, that is discussed in this JIRA, 2015/05/21 through 2015/06/12. was (Author: randallwhitman): Am I missing something? As far as I can see, the @Experimental annotation is still present on class UserDefinedFunction - https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala Also, I have not seen any mention of having addressed the design issue of using @SQLUserDefinedType with third-party libraries, that is discussed in this JIRA, 2015/05/21 through 2015/06/12. > Make user-defined type (UDT) API public > --- > > Key: SPARK-7768 > URL: https://issues.apache.org/jira/browse/SPARK-7768 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Xiangrui Meng >Priority: Critical > > As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it > would be nice to make the UDT API public in 1.5. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13167) JDBC data source does not include null value partition columns rows in the result.
[ https://issues.apache.org/jira/browse/SPARK-13167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-13167. - Resolution: Fixed Assignee: Suresh Thalamati Fix Version/s: 2.0.0 > JDBC data source does not include null value partition columns rows in the > result. > -- > > Key: SPARK-13167 > URL: https://issues.apache.org/jira/browse/SPARK-13167 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: Suresh Thalamati >Assignee: Suresh Thalamati > Fix For: 2.0.0 > > > Reading from a JDBC data source using a nullable partition column > can return an incorrect number of rows if there are rows with a null value in the > partition column. > {code} > val emp = > sqlContext.read.jdbc("jdbc:h2:mem:testdb0;user=testUser;password=testPass", > "TEST.EMP", "theid", 0, 4, 3, new Properties) > emp.count() > {code} > The above JDBC read call sets up partitions of the following form. It does > not include a null predicate. > {code} > JDBCPartition(THEID < 1,0),JDBCPartition(THEID >= 1 AND THEID < > 2,1),JDBCPartition(THEID >= 2,2) > {code} > Rows with null values in the partition column are not included in the results > because the partition predicates do not include an IS NULL predicate. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
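Before the fix landed, the missing rows could be recovered with the `jdbc` overload that takes explicit partition predicates, adding an extra IS NULL partition. A sketch (assuming the same H2 test table as in the report, with `sqlContext` in scope as in spark-shell):

```scala
import java.util.Properties

// Explicit partition predicates, one partition per array element.
// The IS NULL partition picks up the rows the range-based split misses.
val url = "jdbc:h2:mem:testdb0;user=testUser;password=testPass"
val predicates = Array(
  "THEID < 1",
  "THEID >= 1 AND THEID < 2",
  "THEID >= 2",
  "THEID IS NULL")

val emp = sqlContext.read.jdbc(url, "TEST.EMP", predicates, new Properties)
emp.count()  // now also counts rows where THEID is null
```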
[jira] [Commented] (SPARK-13230) HashMap.merged not working properly with Spark
[ https://issues.apache.org/jira/browse/SPARK-13230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15174809#comment-15174809 ] Łukasz Gieroń commented on SPARK-13230: --- [~srowen] Can you please assign me to this ticket? I have a pretty strong suspicion as to what is going on here, but would like to confirm it with scala library folks first before I speak here. > HashMap.merged not working properly with Spark > -- > > Key: SPARK-13230 > URL: https://issues.apache.org/jira/browse/SPARK-13230 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 > Environment: Ubuntu 14.04.3, Scala 2.11.7, Spark 1.6.0 >Reporter: Alin Treznai > > Using HashMap.merged with Spark fails with NullPointerException. > {noformat} > import org.apache.spark.{SparkConf, SparkContext} > import scala.collection.immutable.HashMap > object MergeTest { > def mergeFn:(HashMap[String, Long], HashMap[String, Long]) => > HashMap[String, Long] = { > case (m1, m2) => m1.merged(m2){ case (x,y) => (x._1, x._2 + y._2) } > } > def main(args: Array[String]) = { > val input = Seq(HashMap("A" -> 1L), HashMap("A" -> 2L, "B" -> > 3L),HashMap("A" -> 2L, "C" -> 4L)) > val conf = new SparkConf().setAppName("MergeTest").setMaster("local[*]") > val sc = new SparkContext(conf) > val result = sc.parallelize(input).reduce(mergeFn) > println(s"Result=$result") > sc.stop() > } > } > {noformat} > Error message: > org.apache.spark.SparkDriverExecutionException: Execution error > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1169) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1952) > at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1025) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:316) > at org.apache.spark.rdd.RDD.reduce(RDD.scala:1007) > at MergeTest$.main(MergeTest.scala:21) > at MergeTest.main(MergeTest.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > Caused by: java.lang.NullPointerException > at > MergeTest$$anonfun$mergeFn$1$$anonfun$apply$1.apply(MergeTest.scala:12) > at > MergeTest$$anonfun$mergeFn$1$$anonfun$apply$1.apply(MergeTest.scala:12) > at scala.collection.immutable.HashMap$$anon$2.apply(HashMap.scala:148) > at > scala.collection.immutable.HashMap$HashMap1.updated0(HashMap.scala:200) > at > scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:322) > at > scala.collection.immutable.HashMap$HashTrieMap.merge0(HashMap.scala:463) > at scala.collection.immutable.HashMap.merged(HashMap.scala:117) > at MergeTest$$anonfun$mergeFn$1.apply(MergeTest.scala:12) > at MergeTest$$anonfun$mergeFn$1.apply(MergeTest.scala:11) > at > org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1020) > at > org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1017) > at > org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:56) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1165) > at > 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issue
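Whatever the root cause turns out to be, the reported job can be unblocked by expressing the merge without HashMap.merged. A sketch of an equivalent merge function (plain Scala, no Spark required):

```scala
import scala.collection.immutable.HashMap

// Equivalent to the merged-based mergeFn above: fold the second map into
// the first, summing values for keys present in both.
def mergeFn(m1: HashMap[String, Long], m2: HashMap[String, Long]): HashMap[String, Long] =
  m2.foldLeft(m1) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0L) + v) }

val a = HashMap("A" -> 1L)
val b = HashMap("A" -> 2L, "B" -> 3L)
// mergeFn(a, b) == HashMap("A" -> 3L, "B" -> 3L)
```

Passing this `mergeFn` to `sc.parallelize(input).reduce(...)` avoids the closure that throws the NullPointerException in the stack trace above.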
[jira] [Commented] (SPARK-7768) Make user-defined type (UDT) API public
[ https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15174807#comment-15174807 ] Jaka Jancar commented on SPARK-7768: [~randallwhitman] UDT, not UDF: https://github.com/apache/spark/blob/v1.6.0/sql/catalyst/src/main/scala/org/apache/spark/sql/types/UserDefinedType.scala > Make user-defined type (UDT) API public > --- > > Key: SPARK-7768 > URL: https://issues.apache.org/jira/browse/SPARK-7768 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Xiangrui Meng >Priority: Critical > > As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it > would be nice to make the UDT API public in 1.5. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13598) Remove LeftSemiJoinBNL
[ https://issues.apache.org/jira/browse/SPARK-13598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-13598. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11448 [https://github.com/apache/spark/pull/11448] > Remove LeftSemiJoinBNL > -- > > Key: SPARK-13598 > URL: https://issues.apache.org/jira/browse/SPARK-13598 > Project: Spark > Issue Type: Task > Components: SQL >Reporter: Davies Liu > Fix For: 2.0.0 > > > Broadcast left semi join without joining keys is already supported in > BroadcastNestedLoopJoin, it has the same implementation as LeftSemiJoinBNL, > we should remove that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13605) Bean encoder cannot handle nonbean properties - no way to Encode nonbean Java objects with columns
Steven Lewis created SPARK-13605: Summary: Bean encoder cannot handle nonbean properties - no way to Encode nonbean Java objects with columns Key: SPARK-13605 URL: https://issues.apache.org/jira/browse/SPARK-13605 Project: Spark Issue Type: New Feature Components: Java API Affects Versions: 1.6.0 Environment: Any Reporter: Steven Lewis Fix For: 1.6.0 In the current environment, the only way to turn a List or JavaRDD into a Dataset with columns is to use Encoders.bean(MyBean.class). The current implementation fails if a Bean property is not a basic type or a Bean. I would like to see one of the following: 1) Default to JavaSerialization for any Java Object implementing Serializable when using the bean Encoder 2) Allow an encoder which is a Map and looks up entries in encoding classes - an ideal implementation would look for the class, then any interfaces, and then search base classes The following code illustrates the issue /** * This class is a good Java bean but one field holds an object * which is not a bean */ public class MyBean implements Serializable { private int m_count; private String m_Name; private MyUnBean m_UnBean; public MyBean(int count, String name, MyUnBean unBean) { m_count = count; m_Name = name; m_UnBean = unBean; } public int getCount() {return m_count; } public void setCount(int count) {m_count = count;} public String getName() {return m_Name;} public void setName(String name) {m_Name = name;} public MyUnBean getUnBean() {return m_UnBean;} public void setUnBean(MyUnBean unBean) {m_UnBean = unBean;} } /** * This is a Java object which is not a bean * no getters or setters but is serializable */ public class MyUnBean implements Serializable { public final int count; public final String name; public MyUnBean(int count, String name) { this.count = count; this.name = name; } } /** * This code creates a list of objects containing MyBean - * a Java Bean containing one field which is not a bean * It then attempts and fails to use a bean encoder * to make a 
DataSet */ public class DatasetTest { public static final Random RND = new Random(); public static final int LIST_SIZE = 100; public static String makeName() { return Integer.toString(RND.nextInt()); } public static MyUnBean makeUnBean() { return new MyUnBean(RND.nextInt(), makeName()); } public static MyBean makeBean() { return new MyBean(RND.nextInt(), makeName(), makeUnBean()); } /** * Make a list of MyBeans * @return */ public static List makeBeanList() { List holder = new ArrayList(); for (int i = 0; i < LIST_SIZE; i++) { holder.add(makeBean()); } return holder; } public static SQLContext getSqlContext() { SparkConf sparkConf = new SparkConf(); sparkConf.setAppName("BeanTest") ; Option option = sparkConf.getOption("spark.master"); if (!option.isDefined())// use local over nothing sparkConf.setMaster("local[*]"); JavaSparkContext ctx = new JavaSparkContext(sparkConf) ; return new SQLContext(ctx); } public static void main(String[] args) { SQLContext sqlContext = getSqlContext(); Encoder evidence = Encoders.bean(MyBean.class); Encoder evidence2 = Encoders.javaSerialization(MyUnBean.class); List holder = makeBeanList(); // fails at this line with // Exception in thread "main" java.lang.UnsupportedOperationException: no encoder found for com.lordjoe.testing.MyUnBean Dataset beanSet = sqlContext.createDataset( holder, evidence); long count = beanSet.count(); if(count != LIST_SIZE) throw new IllegalStateException("bad count"); } } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
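Until nested non-bean properties are supported, one workaround is to encode the whole object with a serialization-based encoder instead of Encoders.bean. The Dataset then carries a single binary column rather than one column per property, so column operations are lost, but creation and counting work. A sketch (Scala equivalent of the Java above; `sqlContext` and a `Seq[MyBean]` of test beans are assumed in scope):

```scala
import org.apache.spark.sql.Encoders

// Workaround sketch: serialize the entire bean, including its non-bean
// field, instead of mapping bean properties to columns.
implicit val beanEncoder = Encoders.javaSerialization[MyBean]

val beans: Seq[MyBean] =
  (1 to 100).map(i => new MyBean(i, i.toString, new MyUnBean(i, i.toString)))

val beanSet = sqlContext.createDataset(beans)
assert(beanSet.count() == 100)  // no "no encoder found" exception
```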
[jira] [Commented] (SPARK-13573) Open SparkR APIs (R package) to allow better 3rd party usage
[ https://issues.apache.org/jira/browse/SPARK-13573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15174726#comment-15174726 ] Sun Rui commented on SPARK-13573: - [~chipsenkbeil] glad to know Toree is to support SparkR. I tried it and can't figure out how to interact with SparkR. Could you describe how Toree uses the methods to provide interaction with SparkR? > Open SparkR APIs (R package) to allow better 3rd party usage > > > Key: SPARK-13573 > URL: https://issues.apache.org/jira/browse/SPARK-13573 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Chip Senkbeil > > Currently, SparkR's R package does not expose enough of its APIs to be used > flexibly. That I am aware of, SparkR still requires you to create a new > SparkContext by invoking the sparkR.init method (so you cannot connect to a > running one) and there is no way to invoke custom Java methods using the > exposed SparkR API (unlike PySpark). > We currently maintain a fork of SparkR that is used to power the R > implementation of Apache Toree, which is a gateway to use Apache Spark. This > fork provides a connect method (to use an existing Spark Context), exposes > needed methods like invokeJava (to be able to communicate with our JVM to > retrieve code to run, etc), and uses reflection to access > org.apache.spark.api.r.RBackend. > Here is the documentation I recorded regarding changes we need to enable > SparkR as an option for Apache Toree: > https://github.com/apache/incubator-toree/tree/master/sparkr-interpreter/src/main/resources -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13604) Sync worker's state after registering with master
[ https://issues.apache.org/jira/browse/SPARK-13604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-13604: - Description: If Master cannot talk with Worker for a while and then network is back, Worker may leak existing executors and drivers. We should SPARK-13604 was: If Master cannot talk with Worker for a while and then network is back, Worker may leak existing executors and drivers. We should > Sync worker's state after registering with master > - > > Key: SPARK-13604 > URL: https://issues.apache.org/jira/browse/SPARK-13604 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > If Master cannot talk with Worker for a while and then network is back, > Worker may leak existing executors and drivers. We should SPARK-13604 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13604) Sync worker's state after registering with master
[ https://issues.apache.org/jira/browse/SPARK-13604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13604: Assignee: Apache Spark (was: Shixiong Zhu) > Sync worker's state after registering with master > - > > Key: SPARK-13604 > URL: https://issues.apache.org/jira/browse/SPARK-13604 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Shixiong Zhu >Assignee: Apache Spark > > If Master cannot talk with Worker for a while and then network is back, > Worker may leak existing executors and drivers. We should sync worker's state > after registering with master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13604) Sync worker's state after registering with master
[ https://issues.apache.org/jira/browse/SPARK-13604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-13604: - Description: If Master cannot talk with Worker for a while and then network is back, Worker may leak existing executors and drivers. We should sync worker's state after registering with master. was: If Master cannot talk with Worker for a while and then network is back, Worker may leak existing executors and drivers. We should SPARK-13604 > Sync worker's state after registering with master > - > > Key: SPARK-13604 > URL: https://issues.apache.org/jira/browse/SPARK-13604 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > If Master cannot talk with Worker for a while and then network is back, > Worker may leak existing executors and drivers. We should sync worker's state > after registering with master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13604) Sync worker's state after registering with master
[ https://issues.apache.org/jira/browse/SPARK-13604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15174724#comment-15174724 ] Apache Spark commented on SPARK-13604: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/11455 > Sync worker's state after registering with master > - > > Key: SPARK-13604 > URL: https://issues.apache.org/jira/browse/SPARK-13604 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > If Master cannot talk with Worker for a while and then network is back, > Worker may leak existing executors and drivers. We should sync worker's state > after registering with master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-13586) add config to skip generate down time batch when restart StreamingContext
[ https://issues.apache.org/jira/browse/SPARK-13586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jeanlyn closed SPARK-13586. --- Resolution: Invalid > add config to skip generate down time batch when restart StreamingContext > - > > Key: SPARK-13586 > URL: https://issues.apache.org/jira/browse/SPARK-13586 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.6.0 >Reporter: jeanlyn >Priority: Minor > > If we restart streaming, which using checkpoint and has stopped for hours, it > will generate a lot of batch to the queue, and it need to take a while to > handle this batches. So i propose to add a config to control whether generate > down time batch. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13604) Sync worker's state after registering with master
[ https://issues.apache.org/jira/browse/SPARK-13604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-13604: - Description: If Master cannot talk with Worker for a while and then network is back, Worker may leak existing executors and drivers. We should was: If Master cannot talk with Worker for a while and then network is back, Worker may leak existing executors and drivers. > Sync worker's state after registering with master > - > > Key: SPARK-13604 > URL: https://issues.apache.org/jira/browse/SPARK-13604 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > If Master cannot talk with Worker for a while and then network is back, > Worker may leak existing executors and drivers. We should -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13604) Sync worker's state after registering with master
[ https://issues.apache.org/jira/browse/SPARK-13604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13604: Assignee: Shixiong Zhu (was: Apache Spark) > Sync worker's state after registering with master > - > > Key: SPARK-13604 > URL: https://issues.apache.org/jira/browse/SPARK-13604 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > If Master cannot talk with Worker for a while and then network is back, > Worker may leak existing executors and drivers. We should sync worker's state > after registering with master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13604) Sync worker's state after registering with master
Shixiong Zhu created SPARK-13604: Summary: Sync worker's state after registering with master Key: SPARK-13604 URL: https://issues.apache.org/jira/browse/SPARK-13604 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.6.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu If Master cannot talk with Worker for a while and then network is back, Worker may leak existing executors and drivers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13525) SparkR: java.net.SocketTimeoutException: Accept timed out when running any dataframe function
[ https://issues.apache.org/jira/browse/SPARK-13525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15174720#comment-15174720 ] Sun Rui commented on SPARK-13525: - the interactive R session is for your driver, Rscript is needed for launching R workers. > SparkR: java.net.SocketTimeoutException: Accept timed out when running any > dataframe function > - > > Key: SPARK-13525 > URL: https://issues.apache.org/jira/browse/SPARK-13525 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Shubhanshu Mishra > Labels: sparkr > > I am following the code steps from this example: > https://spark.apache.org/docs/1.6.0/sparkr.html > There are multiple issues: > 1. The head and summary and filter methods are not overridden by spark. Hence > I need to call them using `SparkR::` namespace. > 2. When I try to execute the following, I get errors: > {code} > $> $R_HOME/bin/R > R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree" > Copyright (C) 2015 The R Foundation for Statistical Computing > Platform: x86_64-pc-linux-gnu (64-bit) > R is free software and comes with ABSOLUTELY NO WARRANTY. > You are welcome to redistribute it under certain conditions. > Type 'license()' or 'licence()' for distribution details. > Natural language support but running in an English locale > R is a collaborative project with many contributors. > Type 'contributors()' for more information and > 'citation()' on how to cite R or R packages in publications. > Type 'demo()' for some demos, 'help()' for on-line help, or > 'help.start()' for an HTML browser interface to help. > Type 'q()' to quit R. 
> Welcome at Fri Feb 26 16:19:35 2016 > Attaching package: ‘SparkR’ > The following objects are masked from ‘package:base’: > colnames, colnames<-, drop, intersect, rank, rbind, sample, subset, > summary, transform > Launching java with spark-submit command > /content/smishra8/SOFTWARE/spark/bin/spark-submit --driver-memory "50g" > sparkr-shell /tmp/RtmpfBQRg6/backend_portc3bc16f09b1b > > df <- createDataFrame(sqlContext, iris) > Warning messages: > 1: In FUN(X[[i]], ...) : > Use Sepal_Length instead of Sepal.Length as column name > 2: In FUN(X[[i]], ...) : > Use Sepal_Width instead of Sepal.Width as column name > 3: In FUN(X[[i]], ...) : > Use Petal_Length instead of Petal.Length as column name > 4: In FUN(X[[i]], ...) : > Use Petal_Width instead of Petal.Width as column name > > training <- filter(df, df$Species != "setosa") > Error in filter(df, df$Species != "setosa") : > no method for coercing this S4 class to a vector > > training <- SparkR::filter(df, df$Species != "setosa") > > model <- SparkR::glm(Species ~ Sepal_Length + Sepal_Width, data = training, > > family = "binomial") > 16/02/26 16:26:46 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > java.net.SocketTimeoutException: Accept timed out > at java.net.PlainSocketImpl.socketAccept(Native Method) > at > java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398) > at java.net.ServerSocket.implAccept(ServerSocket.java:530) > at java.net.ServerSocket.accept(ServerSocket.java:498) > at org.apache.spark.api.r.RRDD$.createRWorker(RRDD.scala:431) > at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:62) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.s
[jira] [Commented] (SPARK-13073) creating R like summary for logistic Regression in Spark - Scala
[ https://issues.apache.org/jira/browse/SPARK-13073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15174691#comment-15174691 ] Joseph K. Bradley commented on SPARK-13073: --- It sounds reasonable to provide the same printed summary in Scala, Java, and Python as in R. Perhaps it can be provided as a toString method for the LogisticRegressionModel.summary member? > creating R like summary for logistic Regression in Spark - Scala > > > Key: SPARK-13073 > URL: https://issues.apache.org/jira/browse/SPARK-13073 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib >Reporter: Samsudhin >Priority: Minor > > Currently Spark ML provides coefficients for logistic regression. To evaluate > the trained model, tests like the Wald test and chi-square test should be run, and their results > summarized and displayed like R's GLM summary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13030) Change OneHotEncoder to Estimator
[ https://issues.apache.org/jira/browse/SPARK-13030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15174685#comment-15174685 ] Joseph K. Bradley commented on SPARK-13030: --- I agree this is an issue, but I think we need to keep the same number of categories between training & test. A reasonable fix might be to add an option for creating an additional "unknown" bucket during training, and putting all new categories into this bucket during testing. > Change OneHotEncoder to Estimator > - > > Key: SPARK-13030 > URL: https://issues.apache.org/jira/browse/SPARK-13030 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.6.0 >Reporter: Wojciech Jurczyk > > OneHotEncoder should be an Estimator, just like in scikit-learn > (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html). > In its current form, it is impossible to use when number of categories is > different between training dataset and test dataset. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
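The "unknown bucket" idea from the comment above can be sketched in a few lines of plain Scala, independent of the Spark ML API (all names here are hypothetical illustrations, not a proposed interface): categories seen during fitting get indices 0..n-1, and anything unseen at test time maps to a reserved final slot instead of failing.

```scala
// Fit: assign an index to each category seen in training data.
def fitIndex(train: Seq[String]): Map[String, Int] =
  train.distinct.zipWithIndex.toMap

// Transform: one-hot encode, reserving the last slot for unseen values.
def encode(index: Map[String, Int])(value: String): Array[Double] = {
  val size = index.size + 1                       // +1 for the unknown bucket
  val vec = Array.fill(size)(0.0)
  vec(index.getOrElse(value, index.size)) = 1.0   // unseen -> last slot
  vec
}

val idx = fitIndex(Seq("a", "b"))
// encode(idx)("c") puts the 1.0 in the unknown slot: Array(0.0, 0.0, 1.0)
```

This keeps the vector size fixed between training and test, which is the constraint the comment raises.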
[jira] [Commented] (SPARK-13574) Improve parquet dictionary decoding for strings
[ https://issues.apache.org/jira/browse/SPARK-13574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15174656#comment-15174656 ] Apache Spark commented on SPARK-13574: -- User 'nongli' has created a pull request for this issue: https://github.com/apache/spark/pull/11454 > Improve parquet dictionary decoding for strings > --- > > Key: SPARK-13574 > URL: https://issues.apache.org/jira/browse/SPARK-13574 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Nong Li >Priority: Minor > > Currently, the parquet reader will copy the dictionary value for each data > value. This is bad for string columns as we explode the dictionary during > decode. We should instead, have the data values point to the safe backing > memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13603) SQL generation for subquery
[ https://issues.apache.org/jira/browse/SPARK-13603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13603: Assignee: Davies Liu (was: Apache Spark) > SQL generation for subquery > --- > > Key: SPARK-13603 > URL: https://issues.apache.org/jira/browse/SPARK-13603 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > Generate SQL for subquery expressions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13603) SQL generation for subquery
[ https://issues.apache.org/jira/browse/SPARK-13603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15174649#comment-15174649 ]

Apache Spark commented on SPARK-13603:
--------------------------------------

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/11453

> SQL generation for subquery
> ---------------------------
>
>                 Key: SPARK-13603
>                 URL: https://issues.apache.org/jira/browse/SPARK-13603
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Davies Liu
>            Assignee: Davies Liu
>
> Generate SQL for subquery expressions.
[jira] [Assigned] (SPARK-13603) SQL generation for subquery
[ https://issues.apache.org/jira/browse/SPARK-13603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-13603:
------------------------------------

    Assignee: Apache Spark  (was: Davies Liu)

> SQL generation for subquery
> ---------------------------
>
>                 Key: SPARK-13603
>                 URL: https://issues.apache.org/jira/browse/SPARK-13603
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Davies Liu
>            Assignee: Apache Spark
>
> Generate SQL for subquery expressions.
[jira] [Created] (SPARK-13603) SQL generation for subquery
Davies Liu created SPARK-13603:
----------------------------------

             Summary: SQL generation for subquery
                 Key: SPARK-13603
                 URL: https://issues.apache.org/jira/browse/SPARK-13603
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
            Reporter: Davies Liu
            Assignee: Davies Liu


Generate SQL for subquery expressions.
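The task described above (emitting SQL text for subquery expressions) can be pictured with a toy, non-Spark sketch. The class names (`Column`, `InSubquery`, `Select`) are hypothetical and much simpler than Spark's actual expression and LogicalPlan machinery; the point is only that a subquery expression node renders itself as a parenthesized SQL string inside its parent expression.

```python
# Toy SQL generation for an expression tree containing a subquery
# (hypothetical classes; not Spark's SQLBuilder or expression API).

class Column:
    def __init__(self, name):
        self.name = name
    def sql(self):
        return self.name

class InSubquery:
    # Renders as: <child> IN (<subquery SQL>)
    def __init__(self, child, subquery_sql):
        self.child = child
        self.subquery_sql = subquery_sql
    def sql(self):
        return f"{self.child.sql()} IN ({self.subquery_sql})"

class Select:
    def __init__(self, columns, table, where=None):
        self.columns, self.table, self.where = columns, table, where
    def sql(self):
        text = f"SELECT {', '.join(c.sql() for c in self.columns)} FROM {self.table}"
        if self.where is not None:
            text += f" WHERE {self.where.sql()}"
        return text

query = Select([Column("name")], "emp",
               where=InSubquery(Column("dept_id"),
                                "SELECT id FROM dept WHERE region = 'EU'"))
```

Calling `query.sql()` yields one SQL string with the subquery nested in the WHERE clause; the JIRA's actual scope is generating such text from Spark's resolved plans.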
[jira] [Commented] (SPARK-13596) Move misc top-level build files into appropriate subdirs
[ https://issues.apache.org/jira/browse/SPARK-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15174642#comment-15174642 ]

Reynold Xin commented on SPARK-13596:
-------------------------------------

Are those dot files even possible to move?

> Move misc top-level build files into appropriate subdirs
> --------------------------------------------------------
>
>                 Key: SPARK-13596
>                 URL: https://issues.apache.org/jira/browse/SPARK-13596
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Build
>    Affects Versions: 2.0.0
>            Reporter: Sean Owen
>
> I'd like to file away a bunch of misc files that are in the top level of the
> project in order to further tidy the build for 2.0.0. See also SPARK-13529,
> SPARK-13548.
> Some of these may turn out to be difficult or impossible to move.
> I'd ideally like to move these files into {{build/}}:
> - {{.rat-excludes}}
> - {{checkstyle.xml}}
> - {{checkstyle-suppressions.xml}}
> - {{pylintrc}}
> - {{scalastyle-config.xml}}
> - {{tox.ini}}
> - {{project/}} (or does SBT need this in the root?)
> And ideally, these would go under {{dev/}}:
> - {{make-distribution.sh}}
> And remove these:
> - {{sbt/sbt}} (backwards-compatible location of {{build/sbt}}, right?)
> Other files in the top level seem to need to be there, like {{README.md}}.
[jira] [Resolved] (SPARK-13548) Move tags and unsafe modules into common
[ https://issues.apache.org/jira/browse/SPARK-13548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-13548.
---------------------------------

       Resolution: Fixed
    Fix Version/s: 2/

> Move tags and unsafe modules into common
> ----------------------------------------
>
>                 Key: SPARK-13548
>                 URL: https://issues.apache.org/jira/browse/SPARK-13548
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Build
>            Reporter: Reynold Xin
>            Assignee: Reynold Xin
>             Fix For: 2/
>
> Similar to SPARK-13529, this removes two top level directories.
[jira] [Updated] (SPARK-13548) Move tags and unsafe modules into common
[ https://issues.apache.org/jira/browse/SPARK-13548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-13548:
--------------------------------

    Fix Version/s:     (was: 2/)
                   2.0.0

> Move tags and unsafe modules into common
> ----------------------------------------
>
>                 Key: SPARK-13548
>                 URL: https://issues.apache.org/jira/browse/SPARK-13548
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Build
>            Reporter: Reynold Xin
>            Assignee: Reynold Xin
>             Fix For: 2.0.0
>
> Similar to SPARK-13529, this removes two top level directories.