[jira] [Commented] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal

2016-03-01 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175204#comment-15175204
 ] 

Xiao Li commented on SPARK-13337:
-

To get your results, try using left outer join + right outer join + union 
distinct. : )
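
A literal Scala sketch of that suggestion, using the join-on-columns overload 
quoted below (the DataFrame and column names are illustrative assumptions, not 
code from this issue):

{code}
// Hypothetical DataFrames dfA and dfB sharing join columns "key1" and "key2".
val leftJoined  = dfA.join(dfB, Seq("key1", "key2"), "leftouter")
val rightJoined = dfA.join(dfB, Seq("key1", "key2"), "rightouter")

// "union distinct" = unionAll followed by distinct in the 1.6 DataFrame API.
val combined = leftJoined.unionAll(rightJoined).distinct()
{code}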

> DataFrame join-on-columns function should support null-safe equal
> -
>
> Key: SPARK-13337
> URL: https://issues.apache.org/jira/browse/SPARK-13337
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Zhong Wang
>Priority: Minor
>
> Currently, the join-on-columns function:
> {code}
> def join(right: DataFrame, usingColumns: Seq[String], joinType: String): 
> DataFrame
> {code}
> performs a null-unsafe join. It would be great if there were an option for a 
> null-safe join.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13614) show() trigger memory leak,why?

2016-03-01 Thread chillon_m (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chillon_m updated SPARK-13614:
--
Description: 
hot.count()=599147
ghot.size=21844


[bigdata@namenode spark-1.5.2-bin-hadoop2.4]$ bin/spark-shell 
--driver-class-path /home/bigdata/mysql-connector-java-5.1.38-bin.jar 
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.2
  /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_80)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.

scala> val hot=sqlContext.read.format("jdbc").options(Map("url" -> 
"jdbc:mysql://:/?user==","dbtable" -> "")).load()
Wed Mar 02 14:22:37 CST 2016 WARN: Establishing SSL connection without server's 
identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ 
and 5.7.6+ requirements SSL connection must be established by default if 
explicit option isn't set. For compliance with existing applications not using 
SSL the verifyServerCertificate property is set to 'false'. You need either to 
explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide 
truststore for server certificate verification.
hot: org.apache.spark.sql.DataFrame = []

scala> val ghot=hot.groupBy("Num","pNum").count().collect()
Wed Mar 02 14:22:59 CST 2016 WARN: Establishing SSL connection without server's 
identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ 
and 5.7.6+ requirements SSL connection must be established by default if 
explicit option isn't set. For compliance with existing applications not using 
SSL the verifyServerCertificate property is set to 'false'. You need either to 
explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide 
truststore for server certificate verification.
ghot: Array[org.apache.spark.sql.Row] = Array([[],[],[], [,42310...
scala> ghot.take(20)
res0: Array[org.apache.spark.sql.Row] = Array([],[],[],[],[],[],[],[])

scala> hot.groupBy("Num","pNum").count().show()
Wed Mar 02 14:26:05 CST 2016 WARN: Establishing SSL connection without server's 
identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ 
and 5.7.6+ requirements SSL connection must be established by default if 
explicit option isn't set. For compliance with existing applications not using 
SSL the verifyServerCertificate property is set to 'false'. You need either to 
explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide 
truststore for server certificate verification.
16/03/02 14:26:33 ERROR Executor: Managed memory leak detected; size = 4194304 
bytes, TID = 202
+--+-+-+
| QQNum| TroopNum|count|
+--+-+-+
|1X|38XXX|1|
|1X| 5XXX|2|
|1X|26XXX|6|
|1X|14XXX|3|
|1X|41XXX|   14|
|1X|48XXX|   18|
|1X|23XXX|2|
|1X|  XXX|   34|
|1X|52XXX|1|
|1X|52XXX|2|
|1X|49XXX|3|
|1X|42XXX|3|
|1X|17XXX|   11|
|1X|25XXX|  129|
|1X|13XXX|2|
|1X|19XXX|1|
|1X|32XXX|9|
|1X|38XXX|6|
|1X|38XXX|   13|
|1X|30XXX|4|
+--+-+-+
only showing top 20 rows

  was:
[bigdata@namenode spark-1.5.2-bin-hadoop2.4]$ bin/spark-shell 
--driver-class-path /home/bigdata/mysql-connector-java-5.1.38-bin.jar 
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.2
  /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_80)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.

scala> val hot=sqlContext.read.format("jdbc").options(Map("url" -> 
"jdbc:mysql://:/?user==","dbtable" -> "")).load()
Wed Mar 02 14:22:37 CST 2016 WARN: Establishing SSL connection without server's 
identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ 
and 5.7.6+ requirements SSL connection must be established by default if 
explicit option isn't set. For compliance with existing applications not using 
SSL the verifyServerCertificate property is set to 'false'. You need either to 
explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide 
truststore for server certificate verification.
hot: org.apache.spark.sql.DataFrame = []

scala> val ghot=hot.groupBy("Num","pNum").count().collect()
Wed Mar 02 14:22:59 CST 2016 WARN: Establishing SSL connection without server's 
identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ 
and 

[jira] [Assigned] (SPARK-13543) Support for specifying compression codec for Parquet/ORC via option()

2016-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13543:


Assignee: (was: Apache Spark)

> Support for specifying compression codec for Parquet/ORC via option()
> -
>
> Key: SPARK-13543
> URL: https://issues.apache.org/jira/browse/SPARK-13543
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> As with SPARK-12871, SPARK-12872 and SPARK-13503, the compression codec should 
> be settable via {{option()}} for Parquet and ORC, rather than having to set it 
> manually in the Hadoop configuration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13543) Support for specifying compression codec for Parquet/ORC via option()

2016-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13543:


Assignee: Apache Spark

> Support for specifying compression codec for Parquet/ORC via option()
> -
>
> Key: SPARK-13543
> URL: https://issues.apache.org/jira/browse/SPARK-13543
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> As with SPARK-12871, SPARK-12872 and SPARK-13503, the compression codec should 
> be settable via {{option()}} for Parquet and ORC, rather than having to set it 
> manually in the Hadoop configuration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13543) Support for specifying compression codec for Parquet/ORC via option()

2016-03-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175190#comment-15175190
 ] 

Apache Spark commented on SPARK-13543:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/11464

> Support for specifying compression codec for Parquet/ORC via option()
> -
>
> Key: SPARK-13543
> URL: https://issues.apache.org/jira/browse/SPARK-13543
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> As with SPARK-12871, SPARK-12872 and SPARK-13503, the compression codec should 
> be settable via {{option()}} for Parquet and ORC, rather than having to set it 
> manually in the Hadoop configuration.
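
For context, a rough sketch of the usage the description asks for; the option key 
shown here is an assumption based on the issue text, see the linked pull request 
for the final form:

{code}
// Proposed style: choose the codec per write via option() ...
df.write.format("parquet").option("compression", "snappy").save("/tmp/parquet-out")

// ... instead of setting it globally through the SQL/Hadoop configuration, e.g.:
// sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
{code}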



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13614) show() trigger memory leak,why?

2016-03-01 Thread chillon_m (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chillon_m updated SPARK-13614:
--
Attachment: memory leak.png

> show() trigger memory leak,why?
> ---
>
> Key: SPARK-13614
> URL: https://issues.apache.org/jira/browse/SPARK-13614
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: chillon_m
> Attachments: memory leak.png, memory.png
>
>
> [bigdata@namenode spark-1.5.2-bin-hadoop2.4]$ bin/spark-shell 
> --driver-class-path /home/bigdata/mysql-connector-java-5.1.38-bin.jar 
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 1.5.2
>   /_/
> Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_80)
> Type in expressions to have them evaluated.
> Type :help for more information.
> Spark context available as sc.
> SQL context available as sqlContext.
> scala> val hot=sqlContext.read.format("jdbc").options(Map("url" -> 
> "jdbc:mysql://:/?user==","dbtable" -> "")).load()
> Wed Mar 02 14:22:37 CST 2016 WARN: Establishing SSL connection without 
> server's identity verification is not recommended. According to MySQL 
> 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established 
> by default if explicit option isn't set. For compliance with existing 
> applications not using SSL the verifyServerCertificate property is set to 
> 'false'. You need either to explicitly disable SSL by setting useSSL=false, 
> or set useSSL=true and provide truststore for server certificate verification.
> hot: org.apache.spark.sql.DataFrame = []
> scala> val ghot=hot.groupBy("Num","pNum").count().collect()
> Wed Mar 02 14:22:59 CST 2016 WARN: Establishing SSL connection without 
> server's identity verification is not recommended. According to MySQL 
> 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established 
> by default if explicit option isn't set. For compliance with existing 
> applications not using SSL the verifyServerCertificate property is set to 
> 'false'. You need either to explicitly disable SSL by setting useSSL=false, 
> or set useSSL=true and provide truststore for server certificate verification.
> ghot: Array[org.apache.spark.sql.Row] = Array([[],[],[], [,42310...
> scala> ghot.take(20)
> res0: Array[org.apache.spark.sql.Row] = Array([],[],[],[],[],[],[],[])
> scala> hot.groupBy("Num","pNum").count().show()
> Wed Mar 02 14:26:05 CST 2016 WARN: Establishing SSL connection without 
> server's identity verification is not recommended. According to MySQL 
> 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established 
> by default if explicit option isn't set. For compliance with existing 
> applications not using SSL the verifyServerCertificate property is set to 
> 'false'. You need either to explicitly disable SSL by setting useSSL=false, 
> or set useSSL=true and provide truststore for server certificate verification.
> 16/03/02 14:26:33 ERROR Executor: Managed memory leak detected; size = 
> 4194304 bytes, TID = 202
> +--+-+-+
> | QQNum| TroopNum|count|
> +--+-+-+
> |1X|38XXX|1|
> |1X| 5XXX|2|
> |1X|26XXX|6|
> |1X|14XXX|3|
> |1X|41XXX|   14|
> |1X|48XXX|   18|
> |1X|23XXX|2|
> |1X|  XXX|   34|
> |1X|52XXX|1|
> |1X|52XXX|2|
> |1X|49XXX|3|
> |1X|42XXX|3|
> |1X|17XXX|   11|
> |1X|25XXX|  129|
> |1X|13XXX|2|
> |1X|19XXX|1|
> |1X|32XXX|9|
> |1X|38XXX|6|
> |1X|38XXX|   13|
> |1X|30XXX|4|
> +--+-+-+
> only showing top 20 rows



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13614) show() trigger memory leak,why?

2016-03-01 Thread chillon_m (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chillon_m updated SPARK-13614:
--
Attachment: (was: memory leak.png)

> show() trigger memory leak,why?
> ---
>
> Key: SPARK-13614
> URL: https://issues.apache.org/jira/browse/SPARK-13614
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: chillon_m
> Attachments: memory.png
>
>
> [bigdata@namenode spark-1.5.2-bin-hadoop2.4]$ bin/spark-shell 
> --driver-class-path /home/bigdata/mysql-connector-java-5.1.38-bin.jar 
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 1.5.2
>   /_/
> Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_80)
> Type in expressions to have them evaluated.
> Type :help for more information.
> Spark context available as sc.
> SQL context available as sqlContext.
> scala> val hot=sqlContext.read.format("jdbc").options(Map("url" -> 
> "jdbc:mysql://:/?user==","dbtable" -> "")).load()
> Wed Mar 02 14:22:37 CST 2016 WARN: Establishing SSL connection without 
> server's identity verification is not recommended. According to MySQL 
> 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established 
> by default if explicit option isn't set. For compliance with existing 
> applications not using SSL the verifyServerCertificate property is set to 
> 'false'. You need either to explicitly disable SSL by setting useSSL=false, 
> or set useSSL=true and provide truststore for server certificate verification.
> hot: org.apache.spark.sql.DataFrame = []
> scala> val ghot=hot.groupBy("Num","pNum").count().collect()
> Wed Mar 02 14:22:59 CST 2016 WARN: Establishing SSL connection without 
> server's identity verification is not recommended. According to MySQL 
> 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established 
> by default if explicit option isn't set. For compliance with existing 
> applications not using SSL the verifyServerCertificate property is set to 
> 'false'. You need either to explicitly disable SSL by setting useSSL=false, 
> or set useSSL=true and provide truststore for server certificate verification.
> ghot: Array[org.apache.spark.sql.Row] = Array([[],[],[], [,42310...
> scala> ghot.take(20)
> res0: Array[org.apache.spark.sql.Row] = Array([],[],[],[],[],[],[],[])
> scala> hot.groupBy("Num","pNum").count().show()
> Wed Mar 02 14:26:05 CST 2016 WARN: Establishing SSL connection without 
> server's identity verification is not recommended. According to MySQL 
> 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established 
> by default if explicit option isn't set. For compliance with existing 
> applications not using SSL the verifyServerCertificate property is set to 
> 'false'. You need either to explicitly disable SSL by setting useSSL=false, 
> or set useSSL=true and provide truststore for server certificate verification.
> 16/03/02 14:26:33 ERROR Executor: Managed memory leak detected; size = 
> 4194304 bytes, TID = 202
> +--+-+-+
> | QQNum| TroopNum|count|
> +--+-+-+
> |1X|38XXX|1|
> |1X| 5XXX|2|
> |1X|26XXX|6|
> |1X|14XXX|3|
> |1X|41XXX|   14|
> |1X|48XXX|   18|
> |1X|23XXX|2|
> |1X|  XXX|   34|
> |1X|52XXX|1|
> |1X|52XXX|2|
> |1X|49XXX|3|
> |1X|42XXX|3|
> |1X|17XXX|   11|
> |1X|25XXX|  129|
> |1X|13XXX|2|
> |1X|19XXX|1|
> |1X|32XXX|9|
> |1X|38XXX|6|
> |1X|38XXX|   13|
> |1X|30XXX|4|
> +--+-+-+
> only showing top 20 rows



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13614) show() trigger memory leak,why?

2016-03-01 Thread chillon_m (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chillon_m updated SPARK-13614:
--
Summary: show() trigger memory leak,why?  (was: show() trigger memory leak)

> show() trigger memory leak,why?
> ---
>
> Key: SPARK-13614
> URL: https://issues.apache.org/jira/browse/SPARK-13614
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: chillon_m
> Attachments: memory leak.png, memory.png
>
>
> [bigdata@namenode spark-1.5.2-bin-hadoop2.4]$ bin/spark-shell 
> --driver-class-path /home/bigdata/mysql-connector-java-5.1.38-bin.jar 
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 1.5.2
>   /_/
> Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_80)
> Type in expressions to have them evaluated.
> Type :help for more information.
> Spark context available as sc.
> SQL context available as sqlContext.
> scala> val hot=sqlContext.read.format("jdbc").options(Map("url" -> 
> "jdbc:mysql://:/?user==","dbtable" -> "")).load()
> Wed Mar 02 14:22:37 CST 2016 WARN: Establishing SSL connection without 
> server's identity verification is not recommended. According to MySQL 
> 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established 
> by default if explicit option isn't set. For compliance with existing 
> applications not using SSL the verifyServerCertificate property is set to 
> 'false'. You need either to explicitly disable SSL by setting useSSL=false, 
> or set useSSL=true and provide truststore for server certificate verification.
> hot: org.apache.spark.sql.DataFrame = []
> scala> val ghot=hot.groupBy("Num","pNum").count().collect()
> Wed Mar 02 14:22:59 CST 2016 WARN: Establishing SSL connection without 
> server's identity verification is not recommended. According to MySQL 
> 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established 
> by default if explicit option isn't set. For compliance with existing 
> applications not using SSL the verifyServerCertificate property is set to 
> 'false'. You need either to explicitly disable SSL by setting useSSL=false, 
> or set useSSL=true and provide truststore for server certificate verification.
> ghot: Array[org.apache.spark.sql.Row] = Array([[],[],[], [,42310...
> scala> ghot.take(20)
> res0: Array[org.apache.spark.sql.Row] = Array([],[],[],[],[],[],[],[])
> scala> hot.groupBy("Num","pNum").count().show()
> Wed Mar 02 14:26:05 CST 2016 WARN: Establishing SSL connection without 
> server's identity verification is not recommended. According to MySQL 
> 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established 
> by default if explicit option isn't set. For compliance with existing 
> applications not using SSL the verifyServerCertificate property is set to 
> 'false'. You need either to explicitly disable SSL by setting useSSL=false, 
> or set useSSL=true and provide truststore for server certificate verification.
> 16/03/02 14:26:33 ERROR Executor: Managed memory leak detected; size = 
> 4194304 bytes, TID = 202
> +--+-+-+
> | QQNum| TroopNum|count|
> +--+-+-+
> |1X|38XXX|1|
> |1X| 5XXX|2|
> |1X|26XXX|6|
> |1X|14XXX|3|
> |1X|41XXX|   14|
> |1X|48XXX|   18|
> |1X|23XXX|2|
> |1X|  XXX|   34|
> |1X|52XXX|1|
> |1X|52XXX|2|
> |1X|49XXX|3|
> |1X|42XXX|3|
> |1X|17XXX|   11|
> |1X|25XXX|  129|
> |1X|13XXX|2|
> |1X|19XXX|1|
> |1X|32XXX|9|
> |1X|38XXX|6|
> |1X|38XXX|   13|
> |1X|30XXX|4|
> +--+-+-+
> only showing top 20 rows



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13614) show() trigger memory leak

2016-03-01 Thread chillon_m (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chillon_m updated SPARK-13614:
--
Description: 
[bigdata@namenode spark-1.5.2-bin-hadoop2.4]$ bin/spark-shell 
--driver-class-path /home/bigdata/mysql-connector-java-5.1.38-bin.jar 
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.2
  /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_80)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.

scala> val hot=sqlContext.read.format("jdbc").options(Map("url" -> 
"jdbc:mysql://:/?user==","dbtable" -> "")).load()
Wed Mar 02 14:22:37 CST 2016 WARN: Establishing SSL connection without server's 
identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ 
and 5.7.6+ requirements SSL connection must be established by default if 
explicit option isn't set. For compliance with existing applications not using 
SSL the verifyServerCertificate property is set to 'false'. You need either to 
explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide 
truststore for server certificate verification.
hot: org.apache.spark.sql.DataFrame = []

scala> val ghot=hot.groupBy("Num","pNum").count().collect()
Wed Mar 02 14:22:59 CST 2016 WARN: Establishing SSL connection without server's 
identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ 
and 5.7.6+ requirements SSL connection must be established by default if 
explicit option isn't set. For compliance with existing applications not using 
SSL the verifyServerCertificate property is set to 'false'. You need either to 
explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide 
truststore for server certificate verification.
ghot: Array[org.apache.spark.sql.Row] = Array([[],[],[], [,42310...
scala> ghot.take(20)
res0: Array[org.apache.spark.sql.Row] = Array([],[],[],[],[],[],[],[])

scala> hot.groupBy("Num","pNum").count().show()
Wed Mar 02 14:26:05 CST 2016 WARN: Establishing SSL connection without server's 
identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ 
and 5.7.6+ requirements SSL connection must be established by default if 
explicit option isn't set. For compliance with existing applications not using 
SSL the verifyServerCertificate property is set to 'false'. You need either to 
explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide 
truststore for server certificate verification.
16/03/02 14:26:33 ERROR Executor: Managed memory leak detected; size = 4194304 
bytes, TID = 202
+--+-+-+
| QQNum| TroopNum|count|
+--+-+-+
|1X|38XXX|1|
|1X| 5XXX|2|
|1X|26XXX|6|
|1X|14XXX|3|
|1X|41XXX|   14|
|1X|48XXX|   18|
|1X|23XXX|2|
|1X|  XXX|   34|
|1X|52XXX|1|
|1X|52XXX|2|
|1X|49XXX|3|
|1X|42XXX|3|
|1X|17XXX|   11|
|1X|25XXX|  129|
|1X|13XXX|2|
|1X|19XXX|1|
|1X|32XXX|9|
|1X|38XXX|6|
|1X|38XXX|   13|
|1X|30XXX|4|
+--+-+-+
only showing top 20 rows

> show() trigger memory leak
> --
>
> Key: SPARK-13614
> URL: https://issues.apache.org/jira/browse/SPARK-13614
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: chillon_m
> Attachments: memory leak.png, memory.png
>
>
> [bigdata@namenode spark-1.5.2-bin-hadoop2.4]$ bin/spark-shell 
> --driver-class-path /home/bigdata/mysql-connector-java-5.1.38-bin.jar 
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 1.5.2
>   /_/
> Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_80)
> Type in expressions to have them evaluated.
> Type :help for more information.
> Spark context available as sc.
> SQL context available as sqlContext.
> scala> val hot=sqlContext.read.format("jdbc").options(Map("url" -> 
> "jdbc:mysql://:/?user==","dbtable" -> "")).load()
> Wed Mar 02 14:22:37 CST 2016 WARN: Establishing SSL connection without 
> server's identity verification is not recommended. According to MySQL 
> 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established 
> by default if explicit option isn't set. For compliance with existing 
> applications not using SSL the verifyServerCertificate property is set to 
> 'false'. You need either to explicitly disable SSL by 

[jira] [Assigned] (SPARK-13613) Provide ignored tests to export test dataset into CSV format

2016-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13613:


Assignee: Apache Spark

> Provide ignored tests to export test dataset into CSV format
> 
>
> Key: SPARK-13613
> URL: https://issues.apache.org/jira/browse/SPARK-13613
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>Priority: Minor
>
> Provide ignored tests to export the test datasets into CSV format in 
> LinearRegressionSuite, LogisticRegressionSuite, AFTSurvivalRegressionSuite 
> and GeneralizedLinearRegressionSuite, so users can validate the training 
> accuracy against R's glm, glmnet and survival packages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal

2016-03-01 Thread Zhong Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175154#comment-15175154
 ] 

Zhong Wang edited comment on SPARK-13337 at 3/2/16 6:50 AM:


suppose we are joining two tables:
--
TableA
||key1||key2||value1||
|null|k1|v1|
|k2|k3|v2|

TableB
||key1||key2||value2||
|null|k1|v3|
|k4|k5|v4|

The result table I want is:
--
TableC
||key1||key2||value1||value2||
|null|k1|v1|v3|
|k2|k3|v2|null|
|k4|k5|null|v4|

We cannot use the current join-using-columns interface, because it doesn't 
support null-safe joins, and we have null values in the first row

We cannot use join-select with an explicit "<=>" either, because the output table 
will look like:
--
||df1.key1||df1.key2||df2.key1||df2.key2||value1||value2||
|null|k1|null|k1|v1|v3|
|k2|k3|null|null|v2|null|
|null|null|k4|k5|null|v4|

it is difficult to get a result like TableC using a select clause, because the 
null values from the outer join (rows 2 & 3) can be in both the df1.* columns 
and the df2.* columns.

Hope this makes sense to you. I'd like to submit a PR if this is a real use case.
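
To make the join-select form above concrete, a minimal Scala sketch (an editor's 
reading of the approach described here, not code from the issue); every key 
column has to be collapsed by hand, which is exactly the boilerplate a null-safe 
join-on-columns option would avoid:

{code}
import org.apache.spark.sql.functions.coalesce

// Explicit null-safe join condition: both sides keep their own key1/key2 columns.
val joined = df1.join(
  df2,
  df1("key1") <=> df2("key1") && df1("key2") <=> df2("key2"),
  "fullouter")

// Getting something like TableC means manually merging each duplicated key column.
val tableC = joined.select(
  coalesce(df1("key1"), df2("key1")).as("key1"),
  coalesce(df1("key2"), df2("key2")).as("key2"),
  df1("value1"),
  df2("value2"))
{code}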


was (Author: zwang):
suppose we have two tables:
--
TableA
||key1||key2||value1||
|null|k1|v1|
|k2|k3|v2|

TableB
||key1||key2||value2||
|null|k1|v3|
|k4|k5|v4|

The result table I want is:
--
TableC
||key1||key2||value1||value2||
|null|k1|v1|v3|
|k2|k3|v2|null|
|k4|k5|null|v4|

We cannot use the current join-using-columns interface, because it doesn't 
support null-safe joins, and we have null values in the first row

We cannot use join-select with an explicit "<=>" either, because the output table 
will look like:
--
||df1.key1||df1.key2||df2.key1||df2.key2||value1||value2||
|null|k1|null|k1|v1|v3|
|k2|k3|null|null|v2|null|
|null|null|k4|k5|null|v4|

it is difficult to get a result like TableC using a select clause, because the 
null values from the outer join (rows 2 & 3) can be in both the df1.* columns 
and the df2.* columns.

Hope this makes sense to you. I'd like to submit a PR if this is a real use case.

> DataFrame join-on-columns function should support null-safe equal
> -
>
> Key: SPARK-13337
> URL: https://issues.apache.org/jira/browse/SPARK-13337
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Zhong Wang
>Priority: Minor
>
> Currently, the join-on-columns function:
> {code}
> def join(right: DataFrame, usingColumns: Seq[String], joinType: String): 
> DataFrame
> {code}
> performs a null-unsafe join. It would be great if there were an option for a 
> null-safe join.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13613) Provide ignored tests to export test dataset into CSV format

2016-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13613:


Assignee: (was: Apache Spark)

> Provide ignored tests to export test dataset into CSV format
> 
>
> Key: SPARK-13613
> URL: https://issues.apache.org/jira/browse/SPARK-13613
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Priority: Minor
>
> Provide ignored tests to export the test datasets into CSV format in 
> LinearRegressionSuite, LogisticRegressionSuite, AFTSurvivalRegressionSuite 
> and GeneralizedLinearRegressionSuite, so users can validate the training 
> accuracy against R's glm, glmnet and survival packages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13613) Provide ignored tests to export test dataset into CSV format

2016-03-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175156#comment-15175156
 ] 

Apache Spark commented on SPARK-13613:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/11463

> Provide ignored tests to export test dataset into CSV format
> 
>
> Key: SPARK-13613
> URL: https://issues.apache.org/jira/browse/SPARK-13613
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Priority: Minor
>
> Provide ignored tests to export the test datasets into CSV format in 
> LinearRegressionSuite, LogisticRegressionSuite, AFTSurvivalRegressionSuite 
> and GeneralizedLinearRegressionSuite, so users can validate the training 
> accuracy against R's glm, glmnet and survival packages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13614) show() trigger memory leak

2016-03-01 Thread chillon_m (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chillon_m updated SPARK-13614:
--
Attachment: memory leak.png
memory.png

> show() trigger memory leak
> --
>
> Key: SPARK-13614
> URL: https://issues.apache.org/jira/browse/SPARK-13614
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: chillon_m
> Attachments: memory leak.png, memory.png
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal

2016-03-01 Thread Zhong Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175154#comment-15175154
 ] 

Zhong Wang commented on SPARK-13337:


suppose we have two tables:
--
TableA
||key1||key2||value1||
|null|k1|v1|
|k2|k3|v2|

TableB
||key1||key2||value2||
|null|k1|v3|
|k4|k5|v4|

The result table I want is:
--
TableC
||key1||key2||value1||value2||
|null|k1|v1|v3|
|k2|k3|v2|null|
|k4|k5|null|v4|

We cannot use the current join-using-columns interface, because it doesn't 
support null-safe joins, and we have null values in the first row

We cannot use join-select with an explicit "<=>" either, because the output table 
will look like:
--
||df1.key1||df1.key2||df2.key1||df2.key2||value1||value2||
|null|k1|null|k1|v1|v3|
|k2|k3|null|null|v2|null|
|null|null|k4|k5|null|v4|

it is difficult to get a result like TableC using a select clause, because the 
null values from the outer join (rows 2 & 3) can be in both the df1.* columns 
and the df2.* columns.

Hope this makes sense to you. I'd like to submit a PR if this is a real use case.

> DataFrame join-on-columns function should support null-safe equal
> -
>
> Key: SPARK-13337
> URL: https://issues.apache.org/jira/browse/SPARK-13337
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Zhong Wang
>Priority: Minor
>
> Currently, the join-on-columns function:
> {code}
> def join(right: DataFrame, usingColumns: Seq[String], joinType: String): 
> DataFrame
> {code}
> performs a null-unsafe join. It would be great if there were an option for a 
> null-safe join.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal

2016-03-01 Thread Zhong Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175154#comment-15175154
 ] 

Zhong Wang edited comment on SPARK-13337 at 3/2/16 6:50 AM:


suppose we have two tables:
--
TableA
||key1||key2||value1||
|null|k1|v1|
|k2|k3|v2|

TableB
||key1||key2||value2||
|null|k1|v3|
|k4|k5|v4|

The result table I want is:
--
TableC
||key1||key2||value1||value2||
|null|k1|v1|v3|
|k2|k3|v2|null|
|k4|k5|null|v4|

We cannot use the current join-using-columns interface, because it doesn't 
support null-safe joins, and we have null values in the first row

We cannot use join-select with an explicit "<=>" either, because the output table 
will look like:
--
||df1.key1||df1.key2||df2.key1||df2.key2||value1||value2||
|null|k1|null|k1|v1|v3|
|k2|k3|null|null|v2|null|
|null|null|k4|k5|null|v4|

it is difficult to get a result like TableC using a select clause, because the 
null values from the outer join (rows 2 & 3) can be in both the df1.* columns 
and the df2.* columns.

Hope this makes sense to you. I'd like to submit a PR if this is a real use case.


was (Author: zwang):
suppose we have two tables:
--
TableA
||key1||key2||value1||
|null|k1|v1|
|k2|k3|v2|

TableB
||key1||key2||value2||
|null|k1|v3|
|k4|k5|v4|

The result table I want is:
--
TableC
||key1||key2||value1||value2||
|null|k1|v1|v3|
|k2|k3|v2|null|
|k4|k5|null|v4|

We cannot use the current join-using-columns interface, because it doesn't 
support null-safe joins, and we have null values in the first row

We cannot use join-select with an explicit "<=>" either, because the output table 
will look like:
--
||df1.key1||df1.key2||df2.key1||df2.key2||value1||value2||
|null|k1|null|k1|v1|v3|
|k2|k3|null|null|v2|null|
|null|null|k4|k5|null|v4|

it is difficult to get a result like TableC using a select clause, because the 
null values from the outer join (rows 2 & 3) can be in both the df1.* columns 
and the df2.* columns.

Hope this makes sense to you. I'd like to submit a PR if this is a real use case.

> DataFrame join-on-columns function should support null-safe equal
> -
>
> Key: SPARK-13337
> URL: https://issues.apache.org/jira/browse/SPARK-13337
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Zhong Wang
>Priority: Minor
>
> Currently, the join-on-columns function:
> {code}
> def join(right: DataFrame, usingColumns: Seq[String], joinType: String): 
> DataFrame
> {code}
> performs a null-unsafe join. It would be great if there were an option for a 
> null-safe join.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13614) show() trigger memory leak

2016-03-01 Thread chillon_m (JIRA)
chillon_m created SPARK-13614:
-

 Summary: show() trigger memory leak
 Key: SPARK-13614
 URL: https://issues.apache.org/jira/browse/SPARK-13614
 Project: Spark
  Issue Type: Question
  Components: SQL
Affects Versions: 1.5.2
Reporter: chillon_m






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-13608) py4j.Py4JException: Method createDirectStream([class org.apache.spark.streaming.api.java.JavaStreamingContext, class java.util.HashMap, class java.util.HashSet, class jav

2016-03-01 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao closed SPARK-13608.
---
Resolution: Not A Problem

> py4j.Py4JException: Method createDirectStream([class 
> org.apache.spark.streaming.api.java.JavaStreamingContext, class 
> java.util.HashMap, class java.util.HashSet, class java.util.HashMap]) does 
> not exist
> -
>
> Key: SPARK-13608
> URL: https://issues.apache.org/jira/browse/SPARK-13608
> Project: Spark
>  Issue Type: Bug
>Reporter: Avatar Zhang
>
> py4j.Py4JException: Method createDirectStream([class 
> org.apache.spark.streaming.api.java.JavaStreamingContext, class 
> java.util.HashMap, class java.util.HashSet, class java.util.HashMap]) does 
> not exist
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
>   at py4j.Gateway.invoke(Gateway.java:252)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13608) py4j.Py4JException: Method createDirectStream([class org.apache.spark.streaming.api.java.JavaStreamingContext, class java.util.HashMap, class java.util.HashSet, class

2016-03-01 Thread Avatar Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175148#comment-15175148
 ] 

Avatar Zhang commented on SPARK-13608:
--

I used a bad version of the spark-streaming-kafka assembly jar. Thanks, the problem is resolved.

> py4j.Py4JException: Method createDirectStream([class 
> org.apache.spark.streaming.api.java.JavaStreamingContext, class 
> java.util.HashMap, class java.util.HashSet, class java.util.HashMap]) does 
> not exist
> -
>
> Key: SPARK-13608
> URL: https://issues.apache.org/jira/browse/SPARK-13608
> Project: Spark
>  Issue Type: Bug
>Reporter: Avatar Zhang
>
> py4j.Py4JException: Method createDirectStream([class 
> org.apache.spark.streaming.api.java.JavaStreamingContext, class 
> java.util.HashMap, class java.util.HashSet, class java.util.HashMap]) does 
> not exist
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
>   at py4j.Gateway.invoke(Gateway.java:252)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13608) py4j.Py4JException: Method createDirectStream([class org.apache.spark.streaming.api.java.JavaStreamingContext, class java.util.HashMap, class java.util.HashSet, class

2016-03-01 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175144#comment-15175144
 ] 

Saisai Shao commented on SPARK-13608:
-

Hi [~avatarzhang], could you please elaborate on your problem: how are you using 
this API, and do you have a Spark Streaming Kafka assembly jar loaded in the 
environment?

> py4j.Py4JException: Method createDirectStream([class 
> org.apache.spark.streaming.api.java.JavaStreamingContext, class 
> java.util.HashMap, class java.util.HashSet, class java.util.HashMap]) does 
> not exist
> -
>
> Key: SPARK-13608
> URL: https://issues.apache.org/jira/browse/SPARK-13608
> Project: Spark
>  Issue Type: Bug
>Reporter: Avatar Zhang
>
> py4j.Py4JException: Method createDirectStream([class 
> org.apache.spark.streaming.api.java.JavaStreamingContext, class 
> java.util.HashMap, class java.util.HashSet, class java.util.HashMap]) does 
> not exist
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
>   at py4j.Gateway.invoke(Gateway.java:252)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13613) Provide ignored tests to export test dataset into CSV format

2016-03-01 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-13613:
---

 Summary: Provide ignored tests to export test dataset into CSV 
format
 Key: SPARK-13613
 URL: https://issues.apache.org/jira/browse/SPARK-13613
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Yanbo Liang
Priority: Minor


Provide ignored tests to export the test datasets into CSV format in 
LinearRegressionSuite, LogisticRegressionSuite, AFTSurvivalRegressionSuite and 
GeneralizedLinearRegressionSuite, so users can validate the training accuracy 
against R's glm, glmnet and survival packages.
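
A minimal sketch of what one of these ignored export tests could look like (the 
suite, dataset and path names, and the Vector import, are illustrative 
assumptions, not the actual test code):

{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.Row

// Kept as `ignore` so it never runs in CI; flip to `test` locally to dump the
// dataset, then load the CSV from R to compare against glm/glmnet/survival.
ignore("export LinearRegressionSuite test data into CSV format") {
  datasetWithDenseFeature.rdd.map { case Row(label: Double, features: Vector) =>
    s"$label,${features.toArray.mkString(",")}"
  }.repartition(1).saveAsTextFile(
    "target/tmp/LinearRegressionSuite/datasetWithDenseFeature")
}
{code}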



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13219) Pushdown predicate propagation in SparkSQL with join

2016-03-01 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175107#comment-15175107
 ] 

Xiao Li commented on SPARK-13219:
-

Hi [~velvia], after a discussion with Michael, he prefers enhancing the existing 
Constraints framework to resolve this issue. I will reimplement the whole thing 
based on the new framework. Thanks!

> Pushdown predicate propagation in SparkSQL with join
> 
>
> Key: SPARK-13219
> URL: https://issues.apache.org/jira/browse/SPARK-13219
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.2, 1.6.0
> Environment: Spark 1.4
> Datastax Spark connector 1.4
> Cassandra. 2.1.12
> Centos 6.6
>Reporter: Abhinav Chawade
>
> When two or more tables are joined in SparkSQL and the query has an equality 
> clause on the attributes used to perform the join, it is useful to apply that 
> clause to the scans of both tables. If this is not done, one of the tables is 
> fully scanned, which can slow the query down dramatically. Consider the 
> following example with two tables being joined.
> {code}
> CREATE TABLE assets (
> assetid int PRIMARY KEY,
> address text,
> propertyname text
> )
> CREATE TABLE tenants (
> assetid int PRIMARY KEY,
> name text
> )
> spark-sql> explain select t.name from tenants t, assets a where a.assetid = 
> t.assetid and t.assetid='1201';
> WARN  2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to 
> load native-hadoop library for your platform... using builtin-java classes 
> where applicable
> == Physical Plan ==
> Project [name#14]
>  ShuffledHashJoin [assetid#13], [assetid#15], BuildRight
>   Exchange (HashPartitioning 200)
>Filter (CAST(assetid#13, DoubleType) = 1201.0)
> HiveTableScan [assetid#13,name#14], (MetastoreRelation element, tenants, 
> Some(t)), None
>   Exchange (HashPartitioning 200)
>HiveTableScan [assetid#15], (MetastoreRelation element, assets, Some(a)), 
> None
> Time taken: 1.354 seconds, Fetched 8 row(s)
> {code}
> The simple workaround is to add another equality condition for each table, but 
> this becomes cumbersome. It would be helpful if the query planner could improve 
> filter propagation.
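
A sketch of the workaround mentioned above, with the redundant predicate spelled 
out for the second table (same hypothetical schema as in the description):

{code}
// Repeating the equality on a.assetid lets the planner filter the assets scan too.
sqlContext.sql("""
  SELECT t.name
  FROM tenants t, assets a
  WHERE a.assetid = t.assetid
    AND t.assetid = '1201'
    AND a.assetid = '1201'
""").explain()
{code}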



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal

2016-03-01 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175097#comment-15175097
 ] 

Xiao Li commented on SPARK-13337:
-

What are the null columns? If you are using full outer joins, all the columns in 
the join result set could be null columns.

> DataFrame join-on-columns function should support null-safe equal
> -
>
> Key: SPARK-13337
> URL: https://issues.apache.org/jira/browse/SPARK-13337
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Zhong Wang
>Priority: Minor
>
> Currently, the join-on-columns function:
> {code}
> def join(right: DataFrame, usingColumns: Seq[String], joinType: String): 
> DataFrame
> {code}
> performs a null-unsafe join. It would be great if there were an option for a 
> null-safe join.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12941) Spark-SQL JDBC Oracle dialect fails to map string datatypes to Oracle VARCHAR datatype

2016-03-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175094#comment-15175094
 ] 

Apache Spark commented on SPARK-12941:
--

User 'thomastechs' has created a pull request for this issue:
https://github.com/apache/spark/pull/11462

> Spark-SQL JDBC Oracle dialect fails to map string datatypes to Oracle VARCHAR 
> datatype
> --
>
> Key: SPARK-12941
> URL: https://issues.apache.org/jira/browse/SPARK-12941
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
> Environment: Apache Spark 1.4.2.2
>Reporter: Jose Martinez Poblete
>Assignee: Thomas Sebastian
> Fix For: 1.4.2, 1.5.3, 1.6.2, 2.0.0
>
>
> When exporting data from Spark to Oracle, string datatypes are translated to 
> TEXT for Oracle, which leads to the following error:
> {noformat}
> java.sql.SQLSyntaxErrorException: ORA-00902: invalid datatype
> {noformat}
> As per the following code:
> https://github.com/apache/spark/blob/branch-1.4/sql/core/src/main/scala/org/apache/spark/sql/jdbc/jdbc.scala#L144
> See also:
> http://stackoverflow.com/questions/31287182/writing-to-oracle-database-using-apache-spark-1-4-0
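
Until a fix is in place, one possible workaround is to register a custom JDBC 
dialect that maps Catalyst string types to VARCHAR2. A sketch, assuming the 
public JdbcDialects API available in these Spark versions (the VARCHAR2 length 
here is an arbitrary choice):

{code}
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types.{DataType, StringType}

// Map StringType to VARCHAR2 instead of the default TEXT when writing to Oracle.
object OracleStringDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:oracle")
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType => Some(JdbcType("VARCHAR2(255)", Types.VARCHAR))
    case _ => None
  }
}

JdbcDialects.registerDialect(OracleStringDialect)
{code}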



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame

2016-03-01 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175095#comment-15175095
 ] 

Xiao Li commented on SPARK-13393:
-

Thank you, [~adrian-wang]!

Sorry, [~srinathsmn], I missed your reply.

> Column mismatch issue in left_outer join using Spark DataFrame
> --
>
> Key: SPARK-13393
> URL: https://issues.apache.org/jira/browse/SPARK-13393
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Varadharajan
>
> Consider the below snippet:
> {code:title=test.scala|borderStyle=solid}
> case class Person(id: Int, name: String)
> val df = sc.parallelize(List(
>   Person(1, "varadha"),
>   Person(2, "nagaraj")
> )).toDF
> val varadha = df.filter("id = 1")
> val errorDF = df.join(varadha, df("id") === varadha("id"), 
> "left_outer").select(df("id"), varadha("id") as "varadha_id")
> val nagaraj = df.filter("id = 2").select(df("id") as "n_id")
> val correctDF = df.join(nagaraj, df("id") === nagaraj("n_id"), 
> "left_outer").select(df("id"), nagaraj("n_id") as "nagaraj_id")
> {code}
> The `errorDF` dataframe, after the left join, is messed up and shows as below:
> | id|varadha_id|
> |  1| 1|
> |  2| 2 (*This should've been null*)| 
> whereas correctDF has the correct output after the left join:
> | id|nagaraj_id|
> |  1|  null|
> |  2| 2|



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13573) Open SparkR APIs (R package) to allow better 3rd party usage

2016-03-01 Thread Chip Senkbeil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175093#comment-15175093
 ] 

Chip Senkbeil commented on SPARK-13573:
---

I'd gladly create a PR with the changes if needed. We haven't synced with Spark 
1.6.0+ yet, so it'd just take me a little bit to get up to speed. Other than 
the one new method to enable connecting without creating a Spark Context, it's 
just exporting functions and switching the RBackend class to be public.

> Open SparkR APIs (R package) to allow better 3rd party usage
> 
>
> Key: SPARK-13573
> URL: https://issues.apache.org/jira/browse/SPARK-13573
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Chip Senkbeil
>
> Currently, SparkR's R package does not expose enough of its APIs to be used 
> flexibly. As far as I am aware, SparkR still requires you to create a new 
> SparkContext by invoking the sparkR.init method (so you cannot connect to a 
> running one), and there is no way to invoke custom Java methods using the 
> exposed SparkR API (unlike PySpark).
> We currently maintain a fork of SparkR that is used to power the R 
> implementation of Apache Toree, which is a gateway to use Apache Spark. This 
> fork provides a connect method (to use an existing Spark Context), exposes 
> needed methods like invokeJava (to be able to communicate with our JVM to 
> retrieve code to run, etc), and uses reflection to access 
> org.apache.spark.api.r.RBackend.
> Here is the documentation I recorded regarding changes we need to enable 
> SparkR as an option for Apache Toree: 
> https://github.com/apache/incubator-toree/tree/master/sparkr-interpreter/src/main/resources



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13573) Open SparkR APIs (R package) to allow better 3rd party usage

2016-03-01 Thread Chip Senkbeil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175089#comment-15175089
 ] 

Chip Senkbeil commented on SPARK-13573:
---

In terms of the JVM class whose methods we are invoking, the majority can be 
found here: 
https://github.com/apache/incubator-toree/blob/master/kernel-api/src/main/scala/org/apache/toree/interpreter/broker/BrokerState.scala#L90

We basically maintain an object that acts as a code queue where the SparkR 
process pulls off code to evaluate and then sends back results as strings.

We also had to write a wrapper for the RBackend since it was package protected: 
https://github.com/apache/incubator-toree/blob/master/sparkr-interpreter/src/main/scala/org/apache/toree/kernel/interpreter/sparkr/ReflectiveRBackend.scala
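
A highly simplified Scala sketch of the code-queue idea described above, only to 
illustrate the shape of the interaction; it is not the actual Toree BrokerState, 
and the method names simply mirror those used in the R runner snippets quoted 
elsewhere in this thread:

{code}
import scala.collection.mutable

case class CodeRequest(codeId: String, code: String)

// The JVM side queues code; the SparkR process polls nextCode(), evaluates it,
// and reports the outcome back through markSuccess/markFailure.
class SimpleBrokerState {
  private val pending = mutable.Queue.empty[CodeRequest]
  private val results = mutable.Map.empty[String, Either[String, String]]

  def push(codeId: String, code: String): Unit = synchronized {
    pending.enqueue(CodeRequest(codeId, code))
  }

  // Returns null when nothing is queued, mirroring the "not a jobj" check in the R runner.
  def nextCode(): CodeRequest = synchronized {
    if (pending.isEmpty) null else pending.dequeue()
  }

  def markSuccess(codeId: String, output: String = ""): Unit = synchronized {
    results(codeId) = Right(output)
  }

  def markFailure(codeId: String, error: String): Unit = synchronized {
    results(codeId) = Left(error)
  }
}
{code}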

> Open SparkR APIs (R package) to allow better 3rd party usage
> 
>
> Key: SPARK-13573
> URL: https://issues.apache.org/jira/browse/SPARK-13573
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Chip Senkbeil
>
> Currently, SparkR's R package does not expose enough of its APIs to be used 
> flexibly. As far as I am aware, SparkR still requires you to create a new 
> SparkContext by invoking the sparkR.init method (so you cannot connect to a 
> running one), and there is no way to invoke custom Java methods using the 
> exposed SparkR API (unlike PySpark).
> We currently maintain a fork of SparkR that is used to power the R 
> implementation of Apache Toree, which is a gateway to use Apache Spark. This 
> fork provides a connect method (to use an existing Spark Context), exposes 
> needed methods like invokeJava (to be able to communicate with our JVM to 
> retrieve code to run, etc), and uses reflection to access 
> org.apache.spark.api.r.RBackend.
> Here is the documentation I recorded regarding changes we need to enable 
> SparkR as an option for Apache Toree: 
> https://github.com/apache/incubator-toree/tree/master/sparkr-interpreter/src/main/resources



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13607) Improves compression performance for integer-typed values on cache to reduce GC pressure

2016-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13607:


Assignee: (was: Apache Spark)

> Improves compression performance for integer-typed values on cache to reduce 
> GC pressure
> 
>
> Key: SPARK-13607
> URL: https://issues.apache.org/jira/browse/SPARK-13607
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Takeshi Yamamuro
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13607) Improves compression performance for integer-typed values on cache to reduce GC pressure

2016-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13607:


Assignee: Apache Spark

> Improves compression performance for integer-typed values on cache to reduce 
> GC pressure
> 
>
> Key: SPARK-13607
> URL: https://issues.apache.org/jira/browse/SPARK-13607
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13607) Improves compression performance for integer-typed values on cache to reduce GC pressure

2016-03-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175085#comment-15175085
 ] 

Apache Spark commented on SPARK-13607:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/11461

> Improves compression performance for integer-typed values on cache to reduce 
> GC pressure
> 
>
> Key: SPARK-13607
> URL: https://issues.apache.org/jira/browse/SPARK-13607
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Takeshi Yamamuro
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13573) Open SparkR APIs (R package) to allow better 3rd party usage

2016-03-01 Thread Chip Senkbeil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175079#comment-15175079
 ] 

Chip Senkbeil commented on SPARK-13573:
---

[~sunrui], IIRC, Toree supported SparkR from 1.4.x and 1.5.x. Just a bit of a 
pain to keep in sync.

So, the process Toree uses to interact with SparkR is as follows:

# We added a SparkR.connect method 
(https://github.com/apache/incubator-toree/blob/master/sparkr-interpreter/src/main/resources/R/pkg/R/sparkR.R#L220)
 that uses the EXISTING_SPARKR_BACKEND_PORT to connect to an R backend but does 
not attempt to initialize the Spark Context
# We use the exposed callJStatic to acquire a reference to a Java (well, Scala) 
object that has additional variables like the Spark Context hanging off of it 
(https://github.com/apache/incubator-toree/blob/master/sparkr-interpreter/src/main/resources/kernelR/sparkr_runner.R#L50)
 {code}# Retrieve the bridge used to perform actions on the JVM
bridge <- callJStatic(
  "org.apache.toree.kernel.interpreter.sparkr.SparkRBridge", "sparkRBridge"
)

# Retrieve the state used to pull code off the JVM and push results back
state <- callJMethod(bridge, "state")

# Acquire the kernel API instance to expose
kernel <- callJMethod(bridge, "kernel")
assign("kernel", kernel, .runnerEnv){code}
# We then invoke methods using callJMethod to get the next string of R code to 
evaluate {code}# Load the container of the code
  codeContainer <- callJMethod(state, "nextCode")

  # If not valid result, wait 1 second and try again
  if (!class(codeContainer) == "jobj") {
Sys.sleep(1)
next()
  }

  # Retrieve the code id (for response) and code
  codeId <- callJMethod(codeContainer, "codeId")
  code <- callJMethod(codeContainer, "code"){code}
# Finally, we evaluate the acquired code string and send the results back to 
our running JVM (which represents a Jupyter kernel) {code}  # Parse the code 
into an expression to be evaluated
  codeExpr <- parse(text = code)
  print(paste("Code expr", codeExpr))

  tryCatch({
# Evaluate the code provided and capture the result as a string
result <- capture.output(eval(codeExpr, envir = .runnerEnv))
print(paste("Result type", class(result), length(result)))
print(paste("Success", codeId, result))

# Mark the execution as a success and send back the result
# If output is null/empty, ensure that we can send it (otherwise fails)
if (is.null(result) || length(result) <= 0) {
  print("Marking success with no output")
  callJMethod(state, "markSuccess", codeId)
} else {
  # Clean the result before sending it back
  cleanedResult <- trimws(flatten(result, shouldTrim = FALSE))

  print(paste("Marking success with output:", cleanedResult))
  callJMethod(state, "markSuccess", codeId, cleanedResult)
}
  }, error = function(ex) {
# Mark the execution as a failure and send back the error
print(paste("Failure", codeId, toString(ex)))
callJMethod(state, "markFailure", codeId, toString(ex))
  }){code}

> Open SparkR APIs (R package) to allow better 3rd party usage
> 
>
> Key: SPARK-13573
> URL: https://issues.apache.org/jira/browse/SPARK-13573
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Chip Senkbeil
>
> Currently, SparkR's R package does not expose enough of its APIs to be used 
> flexibly. As far as I am aware, SparkR still requires you to create a new 
> SparkContext by invoking the sparkR.init method (so you cannot connect to a 
> running one) and there is no way to invoke custom Java methods using the 
> exposed SparkR API (unlike PySpark).
> We currently maintain a fork of SparkR that is used to power the R 
> implementation of Apache Toree, which is a gateway to use Apache Spark. This 
> fork provides a connect method (to use an existing Spark Context), exposes 
> needed methods like invokeJava (to be able to communicate with our JVM to 
> retrieve code to run, etc), and uses reflection to access 
> org.apache.spark.api.r.RBackend.
> Here is the documentation I recorded regarding changes we need to enable 
> SparkR as an option for Apache Toree: 
> https://github.com/apache/incubator-toree/tree/master/sparkr-interpreter/src/main/resources



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13435) Add Weighted Cohen's kappa to MulticlassMetrics

2016-03-01 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13435:
--
Shepherd:   (was: Xiangrui Meng)

> Add Weighted Cohen's kappa to MulticlassMetrics
> ---
>
> Key: SPARK-13435
> URL: https://issues.apache.org/jira/browse/SPARK-13435
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: zhengruifeng
>Priority: Minor
>
> Add the missing Weighted Cohen's kappa to MulticlassMetrics.
> Kappa is widely used in competitions and statistics.
> https://en.wikipedia.org/wiki/Cohen's_kappa
> Some usage examples:
> val metrics = new MulticlassMetrics(predictionAndLabels)
> // The default kappa value (Unweighted kappa)
> val kappa = metrics.kappa
> // Three built-in weighting types ("default": unweighted, "linear": linear 
> weighted, "quadratic": quadratic weighted)
> val kappa = metrics.kappa("quadratic")
> // User-defined weighting matrix
> val matrix = Matrices.dense(n, n, values)
> val kappa = metrics.kappa(matrix)
> // User-defined weighting function
> def getWeight(i: Int, j:Int):Double = {
>   if (i == j) {
> 0.0
>   } else {
> 1.0
>   }
> }
> val kappa = metrics.kappa(getWeight) // equals to the unweighted kappa
> The calculation correctness was tested on several small datasets and compared 
> against two Python packages: sklearn and ml_metrics.
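
For reference, the unweighted statistic itself is straightforward to compute from a 
confusion matrix. The sketch below is a standalone illustration of the formula 
kappa = (p_o - p_e) / (1 - p_e); it does not use the proposed MulticlassMetrics API 
above, and the helper name is made up for illustration.
{code}
// Unweighted Cohen's kappa from a square confusion matrix (rows = actual, columns = predicted).
def cohensKappa(cm: Array[Array[Double]]): Double = {
  val n = cm.length
  val total = cm.map(_.sum).sum
  val po = (0 until n).map(i => cm(i)(i)).sum / total   // observed agreement
  val pe = (0 until n).map { i =>                       // agreement expected by chance
    (cm(i).sum / total) * (cm.map(_(i)).sum / total)
  }.sum
  (po - pe) / (1 - pe)
}
{code}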



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-03-01 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175066#comment-15175066
 ] 

Reynold Xin edited comment on SPARK-12177 at 3/2/16 5:32 AM:
-

This thread is getting too long for me to follow, but my instinct is that maybe 
we should have two subprojects and support both. Otherwise it is very bad for 
Kafka 0.8 users when upgrading to Spark 2.0.

It's much more difficult to upgrade Kafka, which is a message bus, than to just 
upgrade Spark.


was (Author: rxin):
This thread is getting too long for me to follow, but my instinct is that maybe 
we should have two subprojects and support both.



> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that 
> is not compatible with the old one. So, I added the new consumer API in 
> separate classes in the package org.apache.spark.streaming.kafka.v09. I didn't 
> remove the old classes, for better backward compatibility, so users will not 
> need to change their old Spark applications when they upgrade to the new Spark 
> version.
> Please review my changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-03-01 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175066#comment-15175066
 ] 

Reynold Xin commented on SPARK-12177:
-

This thread is getting too long for me to follow, but my instinct is that maybe 
we should have two subprojects and support both.



> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that 
> is not compatible with the old one. So, I added the new consumer API in 
> separate classes in the package org.apache.spark.streaming.kafka.v09. I didn't 
> remove the old classes, for better backward compatibility, so users will not 
> need to change their old Spark applications when they upgrade to the new Spark 
> version.
> Please review my changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13322) AFTSurvivalRegression should support feature standardization

2016-03-01 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13322:
--
Shepherd: DB Tsai  (was: Xiangrui Meng)

> AFTSurvivalRegression should support feature standardization
> 
>
> Key: SPARK-13322
> URL: https://issues.apache.org/jira/browse/SPARK-13322
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> This bug was reported by Stuti Awasthi:
> https://www.mail-archive.com/user@spark.apache.org/msg45643.html
> The lossSum can become infinite because we do not standardize the features 
> before fitting the model, so we should support feature standardization.
> Another benefit is that standardization will improve the convergence rate.
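
Until the estimator standardizes features internally, one possible interim sketch 
(an assumption on my part, not a confirmed fix; it presumes a DataFrame {{df}} with 
the usual "censor" and "features" columns) is to scale the features explicitly 
before fitting:
{code}
import org.apache.spark.ml.feature.StandardScaler
import org.apache.spark.ml.regression.AFTSurvivalRegression

// Scale features to unit standard deviation before fitting, as a stop-gap for the issue above.
val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
  .setWithStd(true)
val scaled = scaler.fit(df).transform(df)

val aft = new AFTSurvivalRegression()
  .setFeaturesCol("scaledFeatures")
  .setCensorCol("censor")
val model = aft.fit(scaled)
{code}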



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13010) Survival analysis in SparkR

2016-03-01 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13010:
--
Shepherd: yuhao yang  (was: Xiangrui Meng)

> Survival analysis in SparkR
> ---
>
> Key: SPARK-13010
> URL: https://issues.apache.org/jira/browse/SPARK-13010
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> Implement a simple wrapper of AFTSurvivalRegression in SparkR to support 
> survival analysis.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13008) Make ML Python package all list have one algorithm per line

2016-03-01 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-13008.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10927
[https://github.com/apache/spark/pull/10927]

> Make ML Python package all list have one algorithm per line
> ---
>
> Key: SPARK-13008
> URL: https://issues.apache.org/jira/browse/SPARK-13008
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Trivial
> Fix For: 2.0.0
>
>
> This is to fix a long-time annoyance: Whenever we add a new algorithm to 
> pyspark.ml, we have to add it to the {{__all__}} list at the top.  Since we 
> keep it alphabetized, it often creates a lot more changes than needed.  It is 
> also easy to add the Estimator and forget the Model.  I'm going to switch it 
> to have one algorithm per line.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame

2016-03-01 Thread Varadharajan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175052#comment-15175052
 ] 

Varadharajan commented on SPARK-13393:
--

[~adrian-wang] Thanks a lot :)

> Column mismatch issue in left_outer join using Spark DataFrame
> --
>
> Key: SPARK-13393
> URL: https://issues.apache.org/jira/browse/SPARK-13393
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Varadharajan
>
> Consider the below snippet:
> {code:title=test.scala|borderStyle=solid}
> case class Person(id: Int, name: String)
> val df = sc.parallelize(List(
>   Person(1, "varadha"),
>   Person(2, "nagaraj")
> )).toDF
> val varadha = df.filter("id = 1")
> val errorDF = df.join(varadha, df("id") === varadha("id"), 
> "left_outer").select(df("id"), varadha("id") as "varadha_id")
> val nagaraj = df.filter("id = 2").select(df("id") as "n_id")
> val correctDF = df.join(nagaraj, df("id") === nagaraj("n_id"), 
> "left_outer").select(df("id"), nagaraj("n_id") as "nagaraj_id")
> {code}
> The `errorDF` dataframe, after the left join is messed up and shows as below:
> | id|varadha_id|
> |  1| 1|
> |  2| 2 (*This should've been null*)| 
> whereas correctDF has the correct output after the left join:
> | id|nagaraj_id|
> |  1|  null|
> |  2| 2|
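
A possible interim workaround, mirroring what the {{nagaraj}} branch above already 
does (a sketch based only on this snippet, not a fix for the underlying bug): rename 
the join key on the filtered side before joining, so the two {{id}} columns stay 
distinct.
{code}
// Rename the filtered side's key before the self-join to avoid the ambiguous column.
val varadha2 = df.filter("id = 1").select(df("id") as "v_id")
val okDF = df.join(varadha2, df("id") === varadha2("v_id"), "left_outer")
  .select(df("id"), varadha2("v_id") as "varadha_id")
{code}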



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-01 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175045#comment-15175045
 ] 

Jeff Zhang commented on SPARK-13587:


spark.pyspark.virtualenv.requirements is a local file (which would be 
distributed to all nodes). Regarding upgrading these to first-class citizens, I 
would be conservative about that; it needs more feedback from other users. 

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> suitable for complicated dependencies, especially transitive dependencies)
> * Another way is to install packages manually on each node (time-wasting, and 
> not easy to switch to a different environment)
> Python now has 2 different virtualenv implementations: one is native 
> virtualenv, the other is conda. This JIRA is trying to bring these 2 tools to 
> a distributed environment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13612) Multiplication of BigDecimal columns not working as expected

2016-03-01 Thread Varadharajan (JIRA)
Varadharajan created SPARK-13612:


 Summary: Multiplication of BigDecimal columns not working as 
expected
 Key: SPARK-13612
 URL: https://issues.apache.org/jira/browse/SPARK-13612
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: Varadharajan


Please consider the below snippet:

{code}
case class AM(id: Int, a: BigDecimal)
case class AX(id: Int, b: BigDecimal)
val x = sc.parallelize(List(AM(1, 10))).toDF
val y = sc.parallelize(List(AX(1, 10))).toDF
x.join(y, x("id") === y("id")).withColumn("z", x("a") * y("b")).show
{code}

output:

{code}
| id|   a| id|   b|   z|
|  1|10.00...|  1|10.00...|null|
{code}

Here the multiplication of the columns ("z") returns null instead of 100.

As of now we are using the workaround below, but this definitely looks like a 
serious issue.

{code}
x.join(y, x("id") === y("id")).withColumn("z", x("a") / (expr("1") / 
y("b"))).show
{code}

{code}
| id|   a| id|   b|   z|
|  1|10.00...|  1|10.00...|100.0...|
{code}
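
Another possible workaround sketch, assuming (not confirmed in this report) that the 
nulls come from the product overflowing the maximum decimal precision: cast both 
operands to a narrower DecimalType before multiplying. The precision/scale chosen 
below is illustrative only.
{code}
// Reduce the operand precision so the result of the multiplication still fits.
x.join(y, x("id") === y("id"))
  .withColumn("z", x("a").cast("decimal(20,2)") * y("b").cast("decimal(20,2)"))
  .show()
{code}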



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame

2016-03-01 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175034#comment-15175034
 ] 

Adrian Wang commented on SPARK-13393:
-

[~srinathsmn] I have identified the issue, and working on this.

> Column mismatch issue in left_outer join using Spark DataFrame
> --
>
> Key: SPARK-13393
> URL: https://issues.apache.org/jira/browse/SPARK-13393
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Varadharajan
>
> Consider the below snippet:
> {code:title=test.scala|borderStyle=solid}
> case class Person(id: Int, name: String)
> val df = sc.parallelize(List(
>   Person(1, "varadha"),
>   Person(2, "nagaraj")
> )).toDF
> val varadha = df.filter("id = 1")
> val errorDF = df.join(varadha, df("id") === varadha("id"), 
> "left_outer").select(df("id"), varadha("id") as "varadha_id")
> val nagaraj = df.filter("id = 2").select(df("id") as "n_id")
> val correctDF = df.join(nagaraj, df("id") === nagaraj("n_id"), 
> "left_outer").select(df("id"), nagaraj("n_id") as "nagaraj_id")
> {code}
> The `errorDF` dataframe, after the left join is messed up and shows as below:
> | id|varadha_id|
> |  1| 1|
> |  2| 2 (*This should've been null*)| 
> whereas correctDF has the correct output after the left join:
> | id|nagaraj_id|
> |  1|  null|
> |  2| 2|



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame

2016-03-01 Thread Varadharajan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175033#comment-15175033
 ] 

Varadharajan commented on SPARK-13393:
--

[~rxin] [~marmbrus] Can you share some inputs on this?

> Column mismatch issue in left_outer join using Spark DataFrame
> --
>
> Key: SPARK-13393
> URL: https://issues.apache.org/jira/browse/SPARK-13393
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Varadharajan
>
> Consider the below snippet:
> {code:title=test.scala|borderStyle=solid}
> case class Person(id: Int, name: String)
> val df = sc.parallelize(List(
>   Person(1, "varadha"),
>   Person(2, "nagaraj")
> )).toDF
> val varadha = df.filter("id = 1")
> val errorDF = df.join(varadha, df("id") === varadha("id"), 
> "left_outer").select(df("id"), varadha("id") as "varadha_id")
> val nagaraj = df.filter("id = 2").select(df("id") as "n_id")
> val correctDF = df.join(nagaraj, df("id") === nagaraj("n_id"), 
> "left_outer").select(df("id"), nagaraj("n_id") as "nagaraj_id")
> {code}
> The `errorDF` dataframe, after the left join is messed up and shows as below:
> | id|varadha_id|
> |  1| 1|
> |  2| 2 (*This should've been null*)| 
> whereas correctDF has the correct output after the left join:
> | id|nagaraj_id|
> |  1|  null|
> |  2| 2|



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-01 Thread Mike Sukmanowsky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175018#comment-15175018
 ] 

Mike Sukmanowsky commented on SPARK-13587:
--

One thought that just occurred to me, does 
{{spark.pyspark.virtualenv.requirements}} point to a path on the master node 
for a requirements file? It'd make sense if that was the case and then the 
requirements file was shipped to other nodes instead of assuming that this file 
existed on all Spark nodes at the same location.

It might also be a good idea to upgrade these to first-class citizens of 
spark-submit by supporting them as optional params instead of config 
properties. I'd go so far as to say it makes sense to deprecate {{--py-files}} 
in favour of:

* {{--py-venv-type=conda}}
* {{--py-venv-bin=/path/to/conda}}
* {{--py-venv-requirements=/local/path/to/requirements.txt}}


> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> suitable for complicated dependencies, especially transitive dependencies)
> * Another way is to install packages manually on each node (time-wasting, and 
> not easy to switch to a different environment)
> Python now has 2 different virtualenv implementations: one is native 
> virtualenv, the other is conda. This JIRA is trying to bring these 2 tools to 
> a distributed environment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13609) Support Column Pruning for MapPartitions

2016-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13609:


Assignee: Apache Spark

> Support Column Pruning for MapPartitions
> 
>
> Key: SPARK-13609
> URL: https://issues.apache.org/jira/browse/SPARK-13609
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> {code}
> case class OtherTuple(_1: String, _2: Int)
> val ds = Seq(("a", 1, 3), ("b", 2, 4), ("c", 3, 5)).toDS()
> ds.as[OtherTuple].map(identity[OtherTuple]).explain(true)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13609) Support Column Pruning for MapPartitions

2016-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13609:


Assignee: (was: Apache Spark)

> Support Column Pruning for MapPartitions
> 
>
> Key: SPARK-13609
> URL: https://issues.apache.org/jira/browse/SPARK-13609
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> {code}
> case class OtherTuple(_1: String, _2: Int)
> val ds = Seq(("a", 1, 3), ("b", 2, 4), ("c", 3, 5)).toDS()
> ds.as[OtherTuple].map(identity[OtherTuple]).explain(true)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13609) Support Column Pruning for MapPartitions

2016-03-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175011#comment-15175011
 ] 

Apache Spark commented on SPARK-13609:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/11460

> Support Column Pruning for MapPartitions
> 
>
> Key: SPARK-13609
> URL: https://issues.apache.org/jira/browse/SPARK-13609
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> {code}
> case class OtherTuple(_1: String, _2: Int)
> val ds = Seq(("a", 1, 3), ("b", 2, 4), ("c", 3, 5)).toDS()
> ds.as[OtherTuple].map(identity[OtherTuple]).explain(true)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13611) import Aggregator doesn't work in Spark Shell

2016-03-01 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-13611:
---

 Summary: import Aggregator doesn't work in Spark Shell
 Key: SPARK-13611
 URL: https://issues.apache.org/jira/browse/SPARK-13611
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan


{code}
scala> import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.expressions.Aggregator

scala> class SumOf[I, N : Numeric](f: I => N) extends Aggregator[I, N, N] with 
Serializable {
 |   val numeric = implicitly[Numeric[N]]
 |   override def zero: N = numeric.zero
 |   override def reduce(b: N, a: I): N = numeric.plus(b, f(a))
 |   override def merge(b1: N,b2: N): N = numeric.plus(b1, b2)
 |   override def finish(reduction: N): N = reduction
 | }
:10: error: not found: type Aggregator
   class SumOf[I, N : Numeric](f: I => N) extends Aggregator[I, N, N] with 
Serializable {
  ^
{code}
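
A possible workaround sketch while the shell's import handling is investigated (an 
assumption on my part, not verified against this session): refer to Aggregator by 
its fully qualified name so the class resolves through the package rather than the 
shell-wrapped import, or define the class inside :paste mode.
{code}
scala> // Same class as above, but with the fully qualified Aggregator name.
scala> class SumOf[I, N : Numeric](f: I => N)
     |   extends org.apache.spark.sql.expressions.Aggregator[I, N, N] with Serializable {
     |   val numeric = implicitly[Numeric[N]]
     |   override def zero: N = numeric.zero
     |   override def reduce(b: N, a: I): N = numeric.plus(b, f(a))
     |   override def merge(b1: N, b2: N): N = numeric.plus(b1, b2)
     |   override def finish(reduction: N): N = reduction
     | }
{code}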



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-01 Thread Mike Sukmanowsky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175010#comment-15175010
 ] 

Mike Sukmanowsky commented on SPARK-13587:
--

Gotcha. I might suggest {{spark.pyspark.virtualenv.bin.path}} in that case.

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> suitable for complicated dependencies, especially transitive dependencies)
> * Another way is to install packages manually on each node (time-wasting, and 
> not easy to switch to a different environment)
> Python now has 2 different virtualenv implementations: one is native 
> virtualenv, the other is conda. This JIRA is trying to bring these 2 tools to 
> a distributed environment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13025) Allow user to specify the initial model when training LogisticRegression

2016-03-01 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175006#comment-15175006
 ] 

Gayathri Murali commented on SPARK-13025:
-

https://github.com/apache/spark/pull/11459

> Allow user to specify the initial model when training LogisticRegression
> 
>
> Key: SPARK-13025
> URL: https://issues.apache.org/jira/browse/SPARK-13025
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: holdenk
>Priority: Minor
>
> Allow the user to set the initial model when training for logistic 
> regression. Note the method already exists, just change visibility to public.
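
A hypothetical usage sketch of what this could look like once the existing setter is 
made public (the name {{setInitialModel}} and its availability are assumptions based 
on this description, not public API at the time of writing; {{previousModel}} and 
{{training}} are placeholders):
{code}
import org.apache.spark.ml.classification.LogisticRegression

// Warm-start training from a previously fitted model (hypothetical until visibility changes).
val lr = new LogisticRegression().setInitialModel(previousModel)
val model = lr.fit(training)
{code}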



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-01 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175003#comment-15175003
 ] 

Jeff Zhang commented on SPARK-13587:


Thanks for your feedback [~msukmanowsky]. spark.pyspark.virtualenv.path is not 
the path where the virtualenv is created; it is the path to the executable for 
virtualenv/conda that is used to create the virtualenv (I need to rename it to a 
more proper name to avoid confusion). 
In my POC, I create the virtualenv on all the executors, not only the driver. As 
you said, some Python packages depend on C libraries, so we cannot guarantee 
they would work if we compiled them on the driver and distributed them to the 
other nodes. 

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> suitable for complicated dependencies, especially transitive dependencies)
> * Another way is to install packages manually on each node (time-wasting, and 
> not easy to switch to a different environment)
> Python now has 2 different virtualenv implementations: one is native 
> virtualenv, the other is conda. This JIRA is trying to bring these 2 tools to 
> a distributed environment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13610) Create a Transformer to disassemble vectors in DataFrames

2016-03-01 Thread Andrew MacKinlay (JIRA)
Andrew MacKinlay created SPARK-13610:


 Summary: Create a Transformer to disassemble vectors in DataFrames
 Key: SPARK-13610
 URL: https://issues.apache.org/jira/browse/SPARK-13610
 Project: Spark
  Issue Type: New Feature
  Components: ML, SQL
Affects Versions: 1.6.0
Reporter: Andrew MacKinlay
Priority: Minor


It is possible to convert a standalone numeric field into a single-item Vector 
using VectorAssembler. However, the inverse operation of retrieving a single 
item from a vector and turning it back into a field doesn't appear to be 
possible. The workaround I've found is to leave the raw field value in the DF, 
but I have found no other way to get a field out of a vector (e.g. to perform 
arithmetic on it). Happy to be proved wrong, though. Creating a user-defined 
function doesn't work (in Python at least; it gets a pickle exception). This 
seems like a simple operation which should be supported for various use cases. 
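
For Scala users, a user-defined function over the vector type may serve as a 
workaround. The sketch below is an assumption on my part (not verified against the 
reporter's Python setup); {{df}} and the "features" column name are placeholders.
{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.functions.{udf, lit}

// Pull a single element out of a Vector column so it can be used like a plain numeric column.
val elementAt = udf { (v: Vector, i: Int) => v(i) }
val withFirst = df.withColumn("features_0", elementAt(df("features"), lit(0)))
{code}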



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13609) Support Column Pruning for MapPartitions

2016-03-01 Thread Xiao Li (JIRA)
Xiao Li created SPARK-13609:
---

 Summary: Support Column Pruning for MapPartitions
 Key: SPARK-13609
 URL: https://issues.apache.org/jira/browse/SPARK-13609
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xiao Li


{code}
case class OtherTuple(_1: String, _2: Int)
val ds = Seq(("a", 1, 3), ("b", 2, 4), ("c", 3, 5)).toDS()
ds.as[OtherTuple].map(identity[OtherTuple]).explain(true)
{code}





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13587) Support virtualenv in PySpark

2016-03-01 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15173228#comment-15173228
 ] 

Jeff Zhang edited comment on SPARK-13587 at 3/2/16 4:17 AM:


This method creates the virtualenv before the Python worker starts, and the 
virtualenv is application-scoped: after the Spark application finishes, the 
virtualenv is cleaned up. The virtualenvs don't need to be at the same path on 
each node (in my POC, it is the YARN container working directory). That means 
users don't need to manually install packages on each node (sometimes you can't 
even install packages on the cluster due to security reasons). This is the 
biggest benefit and purpose: users can create a virtualenv on demand without 
touching each node, even when they are not the administrator. The con is the 
extra cost of installing the required packages before starting the Python 
worker. But if the application runs for several hours, the extra cost can be 
ignored.

I have implemented a POC for this feature. Here's one simple command showing 
how to use virtualenv in PySpark:
{code}
bin/spark-submit --master yarn --deploy-mode client --conf 
"spark.pyspark.virtualenv.enabled=true" --conf 
"spark.pyspark.virtualenv.type=conda" --conf 
"spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/conda.txt" 
--conf "spark.pyspark.virtualenv.path=/Users/jzhang/anaconda/bin/conda"  
~/work/virtualenv/spark.py
{code}

There are 4 properties that need to be set:
* spark.pyspark.virtualenv.enabled  (flag to enable virtualenv)
* spark.pyspark.virtualenv.type  (native/conda are supported, default is native)
* spark.pyspark.virtualenv.requirements  (requirements file for the dependencies)
* spark.pyspark.virtualenv.path  (path to the executable file for 
virtualenv/conda which is used for creating the virtualenv)

Comments and feedback are welcome about how to improve it and whether it's 
valuable for users. 


was (Author: zjffdu):
This method creates the virtualenv before the Python worker starts, and the 
virtualenv is application-scoped: after the Spark application finishes, the 
virtualenv is cleaned up. The virtualenvs don't need to be at the same path on 
each node (in my POC, it is the YARN container working directory). That means 
users don't need to manually install packages on each node (sometimes you can't 
even install packages on the cluster due to security reasons). This is the 
biggest benefit and purpose: users can create a virtualenv on demand without 
touching each node, even when they are not the administrator. The con is the 
extra cost of installing the required packages before starting the Python 
worker. But if the application runs for several hours, the extra cost can be 
ignored.

I have implemented a POC for this feature. Here's one simple command showing 
how to use virtualenv in PySpark:
{code}
bin/spark-submit --master yarn --deploy-mode client --conf 
"spark.pyspark.virtualenv.enabled=true" --conf 
"spark.pyspark.virtualenv.type=conda" --conf 
"spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/conda.txt" 
--conf "spark.pyspark.virtualenv.path=/Users/jzhang/anaconda/bin/conda"  
~/work/virtualenv/spark.py
{code}

There are 4 properties that need to be set:
* spark.pyspark.virtualenv.enabled  (enable virtualenv)
* spark.pyspark.virtualenv.type  (default/conda are supported, default is 
native)
* spark.pyspark.virtualenv.requirements  (requirements file for the dependencies)
* spark.pyspark.virtualenv.path  (path to the executable file for 
virtualenv/conda which is used for creating the virtualenv)

Comments and feedback are welcome about how to improve it and whether it's 
valuable for users. 

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> suitable for complicated dependencies, especially transitive dependencies)
> * Another way is to install packages manually on each node (time-wasting, and 
> not easy to switch to a different environment)
> Python now has 2 different virtualenv implementations: one is native 
> virtualenv, the other is conda. This JIRA is trying to bring these 2 tools to 
> a distributed environment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13608) py4j.Py4JException: Method createDirectStream([class org.apache.spark.streaming.api.java.JavaStreamingContext, class java.util.HashMap, class java.util.HashSet, class ja

2016-03-01 Thread Avatar Zhang (JIRA)
Avatar Zhang created SPARK-13608:


 Summary: py4j.Py4JException: Method createDirectStream([class 
org.apache.spark.streaming.api.java.JavaStreamingContext, class 
java.util.HashMap, class java.util.HashSet, class java.util.HashMap]) does not 
exist
 Key: SPARK-13608
 URL: https://issues.apache.org/jira/browse/SPARK-13608
 Project: Spark
  Issue Type: Bug
Reporter: Avatar Zhang


py4j.Py4JException: Method createDirectStream([class 
org.apache.spark.streaming.api.java.JavaStreamingContext, class 
java.util.HashMap, class java.util.HashSet, class java.util.HashMap]) does not 
exist

at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)

at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)

at py4j.Gateway.invoke(Gateway.java:252)

at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)

at py4j.commands.CallCommand.execute(CallCommand.java:79)

at py4j.GatewayConnection.run(GatewayConnection.java:207)

at java.lang.Thread.run(Thread.java:745)




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13587) Support virtualenv in PySpark

2016-03-01 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15173228#comment-15173228
 ] 

Jeff Zhang edited comment on SPARK-13587 at 3/2/16 4:12 AM:


This method creates the virtualenv before the Python worker starts, and the 
virtualenv is application-scoped: after the Spark application finishes, the 
virtualenv is cleaned up. The virtualenvs don't need to be at the same path on 
each node (in my POC, it is the YARN container working directory). That means 
users don't need to manually install packages on each node (sometimes you can't 
even install packages on the cluster due to security reasons). This is the 
biggest benefit and purpose: users can create a virtualenv on demand without 
touching each node, even when they are not the administrator. The con is the 
extra cost of installing the required packages before starting the Python 
worker. But if the application runs for several hours, the extra cost can be 
ignored.

I have implemented a POC for this feature. Here's one simple command showing 
how to use virtualenv in PySpark:
{code}
bin/spark-submit --master yarn --deploy-mode client --conf 
"spark.pyspark.virtualenv.enabled=true" --conf 
"spark.pyspark.virtualenv.type=conda" --conf 
"spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/conda.txt" 
--conf "spark.pyspark.virtualenv.path=/Users/jzhang/anaconda/bin/conda"  
~/work/virtualenv/spark.py
{code}

There are 4 properties that need to be set:
* spark.pyspark.virtualenv.enabled  (enable virtualenv)
* spark.pyspark.virtualenv.type  (default/conda are supported, default is 
native)
* spark.pyspark.virtualenv.requirements  (requirements file for the dependencies)
* spark.pyspark.virtualenv.path  (path to the executable file for 
virtualenv/conda which is used for creating the virtualenv)

Comments and feedback are welcome about how to improve it and whether it's 
valuable for users. 


was (Author: zjffdu):
This method creates the virtualenv before the Python worker starts, and the 
virtualenv is application-scoped: after the Spark application finishes, the 
virtualenv is cleaned up. The virtualenvs don't need to be at the same path on 
each node (in my POC, it is the YARN container working directory). That means 
users don't need to manually install packages on each node (sometimes you can't 
even install packages on the cluster due to security reasons). This is the 
biggest benefit and purpose: users can create a virtualenv on demand without 
touching each node, even when they are not the administrator. The con is the 
extra cost of installing the required packages before starting the Python 
worker. But if the application runs for several hours, the extra cost can be 
ignored.

I have implemented a POC for this feature. Here's one simple command showing 
how to use virtualenv in PySpark:
{code}
bin/spark-submit --master yarn --deploy-mode client --conf 
"spark.pyspark.virtualenv.enabled=true" --conf 
"spark.pyspark.virtualenv.type=conda" --conf 
"spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/conda.txt" 
--conf "spark.pyspark.virtualenv.path=/Users/jzhang/anaconda/bin/conda"  
~/work/virtualenv/spark.py
{code}

There are 4 properties that need to be set:
* spark.pyspark.virtualenv.enabled  (enable virtualenv)
* spark.pyspark.virtualenv.type  (default/conda are supported, default is 
native)
* spark.pyspark.virtualenv.requirements  (requirements file for the dependencies)
* spark.pyspark.virtualenv.path  (path to the executable for 
virtualenv/conda)

Comments and feedback are welcome about how to improve it and whether it's 
valuable for users. 

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> suitable for complicated dependencies, especially transitive dependencies)
> * Another way is to install packages manually on each node (time-wasting, and 
> not easy to switch to a different environment)
> Python now has 2 different virtualenv implementations: one is native 
> virtualenv, the other is conda. This JIRA is trying to bring these 2 tools to 
> a distributed environment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-01 Thread Mike Sukmanowsky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174996#comment-15174996
 ] 

Mike Sukmanowsky commented on SPARK-13587:
--

Thanks for letting me know about this [~jeffzhang].

I think in general, I'm +1 on the proposal.

virtualenvs are the way to go to install requirements and ensure isolation of 
dependencies between multiple driver scripts. As you noted though, installing 
hefty requirements like pandas or numpy (assuming you aren't using Conda), 
would add a pretty significant overhead to startup which could be amortized if 
the driver was assumed to run for a long enough period of time. Conda of course 
would pretty well eliminate that problem as it provides pre-compiled binaries 
for most OSs.

I'd like to offer [PEX|https://pex.readthedocs.org/en/stable/] as an 
alternative, where spark-submit would build a self-contained virtualenv in a 
.pex file on the Spark master node and then distribute to all other nodes. 
However, it turns out PEX doesn't support editable requirements and introduces 
an assumption that all nodes in a cluster are homogeneous, so that a Python 
package with C extensions compiled on the master node would run on worker nodes 
without issue. The latter assumption may be a leap too far for all Spark users.

One thing I'm not entirely sure of is the need for the 
spark.pyspark.virtualenv.path property. If the virtualenv is temporary, why 
would this path ever be specified? Wouldn't a temporary path be used and 
subsequently removed after the Python worker completes?

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> suitable for complicated dependencies, especially transitive dependencies)
> * Another way is to install packages manually on each node (time-wasting, and 
> not easy to switch to a different environment)
> Python now has 2 different virtualenv implementations: one is native 
> virtualenv, the other is conda. This JIRA is trying to bring these 2 tools to 
> a distributed environment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-13025) Allow user to specify the initial model when training LogisticRegression

2016-03-01 Thread Gayathri Murali (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gayathri Murali updated SPARK-13025:

Comment: was deleted

(was: PR : https://github.com/apache/spark/pull/11458)

> Allow user to specify the initial model when training LogisticRegression
> 
>
> Key: SPARK-13025
> URL: https://issues.apache.org/jira/browse/SPARK-13025
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: holdenk
>Priority: Minor
>
> Allow the user to set the initial model when training for logistic 
> regression. Note the method already exists, just change visibility to public.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13025) Allow user to specify the initial model when training LogisticRegression

2016-03-01 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174983#comment-15174983
 ] 

Gayathri Murali commented on SPARK-13025:
-

PR : https://github.com/apache/spark/pull/11458

> Allow user to specify the initial model when training LogisticRegression
> 
>
> Key: SPARK-13025
> URL: https://issues.apache.org/jira/browse/SPARK-13025
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: holdenk
>Priority: Minor
>
> Allow the user to set the initial model when training for logistic 
> regression. Note the method already exists, just change visibility to public.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13606) Error from python worker: /usr/local/bin/python2.7: undefined symbol: _PyCodec_LookupTextEncoding

2016-03-01 Thread Avatar Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174953#comment-15174953
 ] 

Avatar Zhang commented on SPARK-13606:
--

/usr/local/bin/python2.7 can launch normally.

[root@iZ28x4dqt1oZ ~]# python2.7
Python 2.7.11 (default, Mar  2 2016, 10:20:14)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-16)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>

> Error from python worker:   /usr/local/bin/python2.7: undefined symbol: 
> _PyCodec_LookupTextEncoding
> ---
>
> Key: SPARK-13606
> URL: https://issues.apache.org/jira/browse/SPARK-13606
> Project: Spark
>  Issue Type: Bug
>Reporter: Avatar Zhang
>
> Error from python worker:
>   /usr/local/bin/python2.7: /usr/local/lib/python2.7/lib-dynload/_io.so: 
> undefined symbol: _PyCodec_LookupTextEncoding
> PYTHONPATH was:
>   
> /usr/share/dse/spark/python/lib/pyspark.zip:/usr/share/dse/spark/python/lib/py4j-0.8.2.1-src.zip:/usr/share/dse/spark/lib/spark-core_2.10-1.4.2.2.jar
> java.io.EOFException
> at java.io.DataInputStream.readInt(DataInputStream.java:392)
> at 
> org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
> at 
> org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
> at 
> org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
> at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:130)
> at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:73)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at 
> org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:315)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6764) Add wheel package support for PySpark

2016-03-01 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174945#comment-15174945
 ] 

Jeff Zhang commented on SPARK-6764:
---

[~msukmanowsky] Can SPARK-13587 solve your issue? I am working on it; any 
comments are welcome. 

> Add wheel package support for PySpark
> -
>
> Key: SPARK-6764
> URL: https://issues.apache.org/jira/browse/SPARK-6764
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, PySpark
>Reporter: Takao Magoori
>Priority: Minor
>  Labels: newbie
>
> We can do _spark-submit_ with one or more Python packages (.egg, .zip and 
> .jar) via the *--py-files* option.
> h4. zip packaging
> Spark puts a zip file in its working directory and adds the absolute path to 
> Python's sys.path. When the user program imports it, 
> [zipimport|https://docs.python.org/2.7/library/zipimport.html] is 
> automatically invoked under the hood. That is, data files and dynamic 
> modules (.pyd, .so) cannot be used, since zipimport supports only .py, .pyc 
> and .pyo.
> h4. egg packaging
> Spark puts an egg file in its working directory and adds the absolute path to 
> Python's sys.path. Unlike zipimport, egg can handle data files and dynamic 
> modules as long as the author of the package uses the [pkg_resources 
> API|https://pythonhosted.org/setuptools/formats.html#other-technical-considerations]
>  properly. But many Python modules do not use the pkg_resources API, which 
> causes "ImportError" or "No such file" errors. Moreover, creating eggs of 
> dependencies and their further dependencies is a troublesome job.
> h4. wheel packaging
> Supporting the new Python standard package format 
> "[wheel|https://wheel.readthedocs.org/en/latest/]" would be nice. With wheel, 
> we can do spark-submit with complex dependencies simply as follows.
> 1. Write requirements.txt file.
> {noformat}
> SQLAlchemy
> MySQL-python
> requests
> simplejson>=3.6.0,<=3.6.5
> pydoop
> {noformat}
> 2. Do wheel packaging by only one command. All dependencies are wheel-ed.
> {noformat}
> $ your_pip_dir/pip wheel --wheel-dir /tmp/wheelhouse --requirement 
> requirements.txt
> {noformat}
> 3. Do spark-submit
> {noformat}
> your_spark_home/bin/spark-submit --master local[4] --py-files $(find 
> /tmp/wheelhouse/ -name "*.whl" -print0 | sed -e 's/\x0/,/g') your_driver.py
> {noformat}
> If your pyspark driver is a package which consists of many modules,
> 1. Write setup.py for your pyspark driver package.
> {noformat}
> from setuptools import (
> find_packages,
> setup,
> )
> setup(
> name='yourpkg',
> version='0.0.1',
> packages=find_packages(),
> install_requires=[
> 'SQLAlchemy',
> 'MySQL-python',
> 'requests',
> 'simplejson>=3.6.0,<=3.6.5',
> 'pydoop',
> ],
> )
> {noformat}
> 2. Do wheel packaging by only one command. Your driver package and all 
> dependencies are wheel-ed.
> {noformat}
> your_pip_dir/pip wheel --wheel-dir /tmp/wheelhouse your_driver_package/.
> {noformat}
> 3. Do spark-submit
> {noformat}
> your_spark_home/bin/spark-submit --master local[4] --py-files $(find 
> /tmp/wheelhouse/ -name "*.whl" -print0 | sed -e 's/\x0/,/g') 
> your_driver_bootstrap.py
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13606) Error from python worker: /usr/local/bin/python2.7: undefined symbol: _PyCodec_LookupTextEncoding

2016-03-01 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174941#comment-15174941
 ] 

Jeff Zhang commented on SPARK-13606:


This might be a Python environment issue. Can you launch Python on that machine 
manually?

> Error from python worker:   /usr/local/bin/python2.7: undefined symbol: 
> _PyCodec_LookupTextEncoding
> ---
>
> Key: SPARK-13606
> URL: https://issues.apache.org/jira/browse/SPARK-13606
> Project: Spark
>  Issue Type: Bug
>Reporter: Avatar Zhang
>
> Error from python worker:
>   /usr/local/bin/python2.7: /usr/local/lib/python2.7/lib-dynload/_io.so: 
> undefined symbol: _PyCodec_LookupTextEncoding
> PYTHONPATH was:
>   
> /usr/share/dse/spark/python/lib/pyspark.zip:/usr/share/dse/spark/python/lib/py4j-0.8.2.1-src.zip:/usr/share/dse/spark/lib/spark-core_2.10-1.4.2.2.jar
> java.io.EOFException
> at java.io.DataInputStream.readInt(DataInputStream.java:392)
> at 
> org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
> at 
> org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
> at 
> org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
> at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:130)
> at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:73)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at 
> org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:315)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13073) creating R like summary for logistic Regression in Spark - Scala

2016-03-01 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174943#comment-15174943
 ] 

Gayathri Murali commented on SPARK-13073:
-

I can work on this; can you please assign it to me?

> creating R like summary for logistic Regression in Spark - Scala
> 
>
> Key: SPARK-13073
> URL: https://issues.apache.org/jira/browse/SPARK-13073
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Samsudhin
>Priority: Minor
>
> Currently Spark ML provides coefficients for logistic regression. To evaluate 
> the trained model, tests such as the Wald test and chi-square test should be run, and their 
> results summarized and displayed like the GLM summary in R.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13607) Improves compression performance for integer-typed values on cache to reduce GC pressure

2016-03-01 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-13607:


 Summary: Improves compression performance for integer-typed values 
on cache to reduce GC pressure
 Key: SPARK-13607
 URL: https://issues.apache.org/jira/browse/SPARK-13607
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.6.0
Reporter: Takeshi Yamamuro






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13581) LibSVM throws MatchError

2016-03-01 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-13581:
---
Priority: Critical  (was: Minor)

> LibSVM throws MatchError
> 
>
> Key: SPARK-13581
> URL: https://issues.apache.org/jira/browse/SPARK-13581
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jakob Odersky
>Assignee: Jeff Zhang
>Priority: Critical
>
> When running an action on a DataFrame obtained by reading from a libsvm file, 
> a MatchError is thrown; however, doing the same on a cached DataFrame works 
> fine.
> {code}
> val df = 
> sqlContext.read.format("libsvm").load("../data/mllib/sample_libsvm_data.txt") 
> //file is in spark repository
> df.select(df("features")).show() //MatchError
> df.cache()
> df.select(df("features")).show() //OK
> {code}
> The exception stack trace is the following:
> {code}
> scala.MatchError: 1.0 (of class java.lang.Double)
> [info]at 
> org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:207)
> [info]at 
> org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:192)
> [info]at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$UDTConverter.toCatalystImpl(CatalystTypeConverters.scala:142)
> [info]at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
> [info]at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
> [info]at 
> org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:59)
> [info]at 
> org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:56)
> {code}
> This issue first appeared in commit {{1dac964c1}}, in PR 
> [#9595|https://github.com/apache/spark/pull/9595] fixing SPARK-11622.
> [~jeffzhang], do you have any insight of what could be going on?
> cc [~iyounus]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13141) Dataframe created from Hive partitioned tables using HiveContext returns wrong results

2016-03-01 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-13141.

Resolution: Not A Problem

Hi, this was a bug in CDH 5.5.0/5.5.1; it was fixed in CDH 5.5.2. Sorry about 
the trouble.

> Dataframe created from Hive partitioned tables using HiveContext returns 
> wrong results
> --
>
> Key: SPARK-13141
> URL: https://issues.apache.org/jira/browse/SPARK-13141
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: CDH 5.5.1
>Reporter: Simone
>Priority: Critical
>
> I get wrong dataframe results using HiveContext with Spark 1.5.0 on CDH 5.5.1 
> in yarn-client mode.
> The problem occurs with partitioned tables on text delimited HDFS data, both 
> with Scala and Python.
> This is an example of the code:
> import org.apache.spark.sql.hive.HiveContext
> val hc = new HiveContext(sc)
> hc.table("my_db.partition_table").show()
> The result is that all values of all rows are NULL, except for the first 
> column (which contains the whole line of data) and the partitioning columns, 
> which appear to be correct.
> With Hive and Impala I get correct results.
> Also, with Spark on the same data but a non-partitioned table, I get correct 
> results.
> I think that similar problems also occur with Avro data:
> https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Pyspark-Table-Dataframe-returning-empty-records-from-Partitioned/td-p/35836



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13511) Add wholestage codegen for limit

2016-03-01 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174919#comment-15174919
 ] 

Liang-Chi Hsieh commented on SPARK-13511:
-

[~davies] Can you help update the Assignee field? Thanks!

> Add wholestage codegen for limit
> 
>
> Key: SPARK-13511
> URL: https://issues.apache.org/jira/browse/SPARK-13511
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> Current limit operator doesn't support wholestage codegen. This issue is open 
> to add support for it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13174) Add API and options for csv data sources

2016-03-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174912#comment-15174912
 ] 

Apache Spark commented on SPARK-13174:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/11457

> Add API and options for csv data sources
> 
>
> Key: SPARK-13174
> URL: https://issues.apache.org/jira/browse/SPARK-13174
> Project: Spark
>  Issue Type: New Feature
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: Davies Liu
>
> We should have an API to load a csv data source (with some options as 
> arguments), similar to json() and jdbc().
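A rough sketch of how such an option-based reader could be used, by analogy with json(); the format name, option keys, and file path below are illustrative assumptions, not the final API:
{code}
// Hypothetical usage of an option-based CSV reader (names are assumptions).
val df = sqlContext.read
  .format("csv")                 // or a dedicated csv(path) shortcut like json(path)
  .option("header", "true")      // treat the first line as column names
  .option("delimiter", ",")      // field separator
  .load("path/to/people.csv")    // illustrative path

df.printSchema()
df.show()
{code}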



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13174) Add API and options for csv data sources

2016-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13174:


Assignee: (was: Apache Spark)

> Add API and options for csv data sources
> 
>
> Key: SPARK-13174
> URL: https://issues.apache.org/jira/browse/SPARK-13174
> Project: Spark
>  Issue Type: New Feature
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: Davies Liu
>
> We should have an API to load a csv data source (with some options as 
> arguments), similar to json() and jdbc().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13174) Add API and options for csv data sources

2016-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13174:


Assignee: Apache Spark

> Add API and options for csv data sources
> 
>
> Key: SPARK-13174
> URL: https://issues.apache.org/jira/browse/SPARK-13174
> Project: Spark
>  Issue Type: New Feature
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> We should have an API to load a csv data source (with some options as 
> arguments), similar to json() and jdbc().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13606) Error from python worker: /usr/local/bin/python2.7: undefined symbol: _PyCodec_LookupTextEncoding

2016-03-01 Thread Avatar Zhang (JIRA)
Avatar Zhang created SPARK-13606:


 Summary: Error from python worker:   /usr/local/bin/python2.7: 
undefined symbol: _PyCodec_LookupTextEncoding
 Key: SPARK-13606
 URL: https://issues.apache.org/jira/browse/SPARK-13606
 Project: Spark
  Issue Type: Bug
Reporter: Avatar Zhang


Error from python worker:
  /usr/local/bin/python2.7: /usr/local/lib/python2.7/lib-dynload/_io.so: 
undefined symbol: _PyCodec_LookupTextEncoding
PYTHONPATH was:
  
/usr/share/dse/spark/python/lib/pyspark.zip:/usr/share/dse/spark/python/lib/py4j-0.8.2.1-src.zip:/usr/share/dse/spark/lib/spark-core_2.10-1.4.2.2.jar
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at 
org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
at 
org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
at 
org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:130)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:73)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:315)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13141) Dataframe created from Hive partitioned tables using HiveContext returns wrong results

2016-03-01 Thread zhichao-li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174895#comment-15174895
 ] 

zhichao-li edited comment on SPARK-13141 at 3/2/16 2:29 AM:


Just tried, but this cannot be reproduced from the master version by: 

create table mn.logs (field1 string, field2 string, field3 string)
partitioned by (year string, month string , day string, host string)
row format delimited fields terminated by ',';

insert into logs partition (year="2013", month="07", day="28", host="host1") 
values ("foo","foo","foo")

hc.table("logs").show()


 as you mentioned, not sure if it's specific to the version of CDH 5.5.1


was (Author: zhichao-li):
Just try, but this cannot be reproduced from the master version by the sql: 
`create table mn.logs (field1 string, field2 string, field3 string)
partitioned by (year string, month string , day string, host string)
row format delimited fields terminated by ',';` as you mentioned, not sure if 
it's specific to the version of CDH 5.5.1

> Dataframe created from Hive partitioned tables using HiveContext returns 
> wrong results
> --
>
> Key: SPARK-13141
> URL: https://issues.apache.org/jira/browse/SPARK-13141
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: CDH 5.5.1
>Reporter: Simone
>Priority: Critical
>
> I get wrong dataframe results using HiveContext with Spark 1.5.0 on CDH 5.5.1 
> in yarn-client mode.
> The problem occurs with partitioned tables on text delimited HDFS data, both 
> with Scala and Python.
> This is an example of the code:
> import org.apache.spark.sql.hive.HiveContext
> val hc = new HiveContext(sc)
> hc.table("my_db.partition_table").show()
> The result is that all values of all rows are NULL, except for the first 
> column (which contains the whole line of data) and the partitioning columns, 
> which appear to be correct.
> With Hive and Impala I get correct results.
> Also, with Spark on the same data but a non-partitioned table, I get correct 
> results.
> I think that similar problems also occur with Avro data:
> https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Pyspark-Table-Dataframe-returning-empty-records-from-Partitioned/td-p/35836



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6764) Add wheel package support for PySpark

2016-03-01 Thread Mike Sukmanowsky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174899#comment-15174899
 ] 

Mike Sukmanowsky commented on SPARK-6764:
-

Just bumping this issue up. We use Spark (PySpark) pretty extensively and would 
love the ability to use wheels in addition to eggs with spark-submit.

> Add wheel package support for PySpark
> -
>
> Key: SPARK-6764
> URL: https://issues.apache.org/jira/browse/SPARK-6764
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, PySpark
>Reporter: Takao Magoori
>Priority: Minor
>  Labels: newbie
>
> We can do _spark-submit_ with one or more Python packages (.egg,.zip and 
> .jar) by *--py-files* option.
> h4. zip packaging
> Spark puts a zip file in its working directory and adds the absolute path to 
> Python's sys.path. When the user program imports it, 
> [zipimport|https://docs.python.org/2.7/library/zipimport.html] is 
> automatically invoked under the hood. That is, data files and dynamic 
> modules (.pyd, .so) cannot be used, since zipimport supports only .py, .pyc and 
> .pyo.
> h4. egg packaging
> Spark puts an egg file in its working directory and adds the absolute path to 
> Python's sys.path. Unlike zipimport, egg can handle data files and dynamic 
> modules as long as the author of the package uses the [pkg_resources 
> API|https://pythonhosted.org/setuptools/formats.html#other-technical-considerations]
>  properly. But many Python modules do not use the pkg_resources API, which 
> causes "ImportError" or "No such file" errors. Moreover, creating eggs of 
> dependencies and their further dependencies is a troublesome job.
> h4. wheel packaging
> Supporting the new Python standard package format 
> [wheel|https://wheel.readthedocs.org/en/latest/] would be nice. With wheel, 
> we can do spark-submit with complex dependencies simply as follows.
> 1. Write requirements.txt file.
> {noformat}
> SQLAlchemy
> MySQL-python
> requests
> simplejson>=3.6.0,<=3.6.5
> pydoop
> {noformat}
> 2. Do wheel packaging by only one command. All dependencies are wheel-ed.
> {noformat}
> $ your_pip_dir/pip wheel --wheel-dir /tmp/wheelhouse --requirement 
> requirements.txt
> {noformat}
> 3. Do spark-submit
> {noformat}
> your_spark_home/bin/spark-submit --master local[4] --py-files $(find 
> /tmp/wheelhouse/ -name "*.whl" -print0 | sed -e 's/\x0/,/g') your_driver.py
> {noformat}
> If your pyspark driver is a package which consists of many modules,
> 1. Write setup.py for your pyspark driver package.
> {noformat}
> from setuptools import (
> find_packages,
> setup,
> )
> setup(
> name='yourpkg',
> version='0.0.1',
> packages=find_packages(),
> install_requires=[
> 'SQLAlchemy',
> 'MySQL-python',
> 'requests',
> 'simplejson>=3.6.0,<=3.6.5',
> 'pydoop',
> ],
> )
> {noformat}
> 2. Do wheel packaging by only one command. Your driver package and all 
> dependencies are wheel-ed.
> {noformat}
> your_pip_dir/pip wheel --wheel-dir /tmp/wheelhouse your_driver_package/.
> {noformat}
> 3. Do spark-submit
> {noformat}
> your_spark_home/bin/spark-submit --master local[4] --py-files $(find 
> /tmp/wheelhouse/ -name "*.whl" -print0 | sed -e 's/\x0/,/g') 
> your_driver_bootstrap.py
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13141) Dataframe created from Hive partitioned tables using HiveContext returns wrong results

2016-03-01 Thread zhichao-li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174895#comment-15174895
 ] 

zhichao-li commented on SPARK-13141:


Just tried, but this cannot be reproduced from the master version by the sql: 
`create table mn.logs (field1 string, field2 string, field3 string)
partitioned by (year string, month string , day string, host string)
row format delimited fields terminated by ',';` as you mentioned, not sure if 
it's specific to the version of CDH 5.5.1

> Dataframe created from Hive partitioned tables using HiveContext returns 
> wrong results
> --
>
> Key: SPARK-13141
> URL: https://issues.apache.org/jira/browse/SPARK-13141
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: CDH 5.5.1
>Reporter: Simone
>Priority: Critical
>
> I get wrong dataframe results using HiveContext with Spark 1.5.0 on CDH 5.5.1 
> in yarn-client mode.
> The problem occurs with partitioned tables on text delimited HDFS data, both 
> with Scala and Python.
> This is an example of the code:
> import org.apache.spark.sql.hive.HiveContext
> val hc = new HiveContext(sc)
> hc.table("my_db.partition_table").show()
> The result is that all values of all rows are NULL, except for the first 
> column (which contains the whole line of data) and the partitioning columns, 
> which appear to be correct.
> With Hive and Impala I get correct results.
> Also, with Spark on the same data but a non-partitioned table, I get correct 
> results.
> I think that similar problems also occur with Avro data:
> https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Pyspark-Table-Dataframe-returning-empty-records-from-Partitioned/td-p/35836



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-03-01 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174856#comment-15174856
 ] 

Mark Grover commented on SPARK-12177:
-

Hi [~tdas] and [~rxin], can you help us with your opinion on these questions, 
so we can unblock this work:
1. Should we support both Kafka 0.8 and 0.9 or just 0.9? The pros and cons are 
listed [here|https://github.com/apache/spark/pull/11143#issuecomment-182154267] 
along with what other projects are doing.
2. Should we make a separate project for the implementation using the new Kafka 
consumer API with the same class names (e.g. KafkaRDD, etc.), or create new 
classes in the same subproject, like Hadoop did (e.g. NewKafkaRDD, etc.)?

Thanks!

> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that is not 
> compatible with the old one. So, I added the new consumer API. I made separate 
> classes in package org.apache.spark.streaming.kafka.v09 with the changed API. I 
> didn't remove the old classes, for better backward compatibility. Users will not need 
> to change their old Spark applications when they upgrade to a new Spark version.
> Please review my changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7768) Make user-defined type (UDT) API public

2016-03-01 Thread Randall Whitman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169380#comment-15169380
 ] 

Randall Whitman edited comment on SPARK-7768 at 3/2/16 1:47 AM:


Am I missing something?

As far as I can see, the @DeveloperApi annotation is still present on class 
UserDefinedType - 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/UserDefinedType.scala
 and the @Experimental annotation is still present on class UserDefinedFunction - 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala
 (corrected as noted by Jaka Jancar).

Also, I have not seen any mention of having addressed the design issue of using 
@SQLUserDefinedType with third-party libraries, that is discussed in this JIRA, 
2015/05/21 through 2015/06/12.


was (Author: randallwhitman):
Am I missing something?

As far as I can see, the @Experimental annotation is still present on class 
UserDefinedFunction - 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala

Also, I have not seen any mention of having addressed the design issue of using 
@SQLUserDefinedType with third-party libraries, that is discussed in this JIRA, 
2015/05/21 through 2015/06/12.

> Make user-defined type (UDT) API public
> ---
>
> Key: SPARK-7768
> URL: https://issues.apache.org/jira/browse/SPARK-7768
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Xiangrui Meng
>Priority: Critical
>
> As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it 
> would be nice to make the UDT API public in 1.5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13167) JDBC data source does not include null value partition columns rows in the result.

2016-03-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-13167.
-
   Resolution: Fixed
 Assignee: Suresh Thalamati
Fix Version/s: 2.0.0

> JDBC data source does not include null value partition columns rows in the 
> result.
> --
>
> Key: SPARK-13167
> URL: https://issues.apache.org/jira/browse/SPARK-13167
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Suresh Thalamati
>Assignee: Suresh Thalamati
> Fix For: 2.0.0
>
>
> Reading from a JDBC data source using a partition column that is nullable 
> can return an incorrect number of rows if there are rows with a null value for 
> the partition column.
> {code}
> val emp = 
> sqlContext.read.jdbc("jdbc:h2:mem:testdb0;user=testUser;password=testPass", 
> "TEST.EMP", "theid", 0, 4, 3, new Properties)
> emp.count()
> {code}
> The above jdbc read call sets up the partitions of the following form. It does 
> not include a null predicate.
> {code}
> JDBCPartition(THEID < 1,0),JDBCPartition(THEID >= 1 AND THEID < 
> 2,1),JDBCPartition(THEID >= 2,2)
> {code}
> Rows with null values in the partition column are not included in the results 
> because the partition predicates do not include an IS NULL predicate.
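One way to picture a fix is to fold the null bucket into one of the generated WHERE clauses. A minimal sketch using the same column and bounds as the example above (illustrative Scala, not the actual JDBCRelation code):
{code}
// Build WHERE clauses for range partitions, folding NULL rows into the first one.
def partitionWhereClauses(column: String, lowerBound: Long, upperBound: Long,
                          numPartitions: Int): Seq[String] = {
  val stride = (upperBound - lowerBound) / numPartitions
  (0 until numPartitions).map { i =>
    val lo = lowerBound + i * stride
    val hi = lo + stride
    if (i == 0) s"$column < $hi OR $column IS NULL"
    else if (i == numPartitions - 1) s"$column >= $lo"
    else s"$column >= $lo AND $column < $hi"
  }
}

// partitionWhereClauses("THEID", 0, 4, 3) ==
//   Seq("THEID < 1 OR THEID IS NULL", "THEID >= 1 AND THEID < 2", "THEID >= 2")
{code}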



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13230) HashMap.merged not working properly with Spark

2016-03-01 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-13230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174809#comment-15174809
 ] 

Łukasz Gieroń commented on SPARK-13230:
---

[~srowen] Can you please assign me to this ticket? I have a pretty strong 
suspicion as to what is going on here, but would like to confirm it with scala 
library folks first before I speak here.

> HashMap.merged not working properly with Spark
> --
>
> Key: SPARK-13230
> URL: https://issues.apache.org/jira/browse/SPARK-13230
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
> Environment: Ubuntu 14.04.3, Scala 2.11.7, Spark 1.6.0
>Reporter: Alin Treznai
>
> Using HashMap.merged with Spark fails with NullPointerException.
> {noformat}
> import org.apache.spark.{SparkConf, SparkContext}
> import scala.collection.immutable.HashMap
> object MergeTest {
>   def mergeFn:(HashMap[String, Long], HashMap[String, Long]) => 
> HashMap[String, Long] = {
> case (m1, m2) => m1.merged(m2){ case (x,y) => (x._1, x._2 + y._2) }
>   }
>   def main(args: Array[String]) = {
> val input = Seq(HashMap("A" -> 1L), HashMap("A" -> 2L, "B" -> 
> 3L),HashMap("A" -> 2L, "C" -> 4L))
> val conf = new SparkConf().setAppName("MergeTest").setMaster("local[*]")
> val sc = new SparkContext(conf)
> val result = sc.parallelize(input).reduce(mergeFn)
> println(s"Result=$result")
> sc.stop()
>   }
> }
> {noformat}
> Error message:
> org.apache.spark.SparkDriverExecutionException: Execution error
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1169)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1952)
> at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1025)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
> at org.apache.spark.rdd.RDD.reduce(RDD.scala:1007)
> at MergeTest$.main(MergeTest.scala:21)
> at MergeTest.main(MergeTest.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> Caused by: java.lang.NullPointerException
> at 
> MergeTest$$anonfun$mergeFn$1$$anonfun$apply$1.apply(MergeTest.scala:12)
> at 
> MergeTest$$anonfun$mergeFn$1$$anonfun$apply$1.apply(MergeTest.scala:12)
> at scala.collection.immutable.HashMap$$anon$2.apply(HashMap.scala:148)
> at 
> scala.collection.immutable.HashMap$HashMap1.updated0(HashMap.scala:200)
> at 
> scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:322)
> at 
> scala.collection.immutable.HashMap$HashTrieMap.merge0(HashMap.scala:463)
> at scala.collection.immutable.HashMap.merged(HashMap.scala:117)
> at MergeTest$$anonfun$mergeFn$1.apply(MergeTest.scala:12)
> at MergeTest$$anonfun$mergeFn$1.apply(MergeTest.scala:11)
> at 
> org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1020)
> at 
> org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1017)
> at 
> org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:56)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1165)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
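As a side note for anyone blocked by this: the same combine semantics can be written without HashMap.merged, which sidesteps the failing code path. A minimal workaround sketch (not a diagnosis of the underlying bug):
{code}
import scala.collection.immutable.HashMap

// Same semantics as mergeFn above (sum values for common keys), built from foldLeft/updated.
def mergeFold(m1: HashMap[String, Long], m2: HashMap[String, Long]): HashMap[String, Long] =
  m2.foldLeft(m1) { case (acc, (k, v)) =>
    acc.updated(k, acc.getOrElse(k, 0L) + v)
  }

// sc.parallelize(input).reduce(mergeFold _) should yield Map(A -> 5, B -> 3, C -> 4)
{code}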



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: 

[jira] [Commented] (SPARK-7768) Make user-defined type (UDT) API public

2016-03-01 Thread Jaka Jancar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174807#comment-15174807
 ] 

Jaka Jancar commented on SPARK-7768:


[~randallwhitman] UDT, not UDF: 
https://github.com/apache/spark/blob/v1.6.0/sql/catalyst/src/main/scala/org/apache/spark/sql/types/UserDefinedType.scala

> Make user-defined type (UDT) API public
> ---
>
> Key: SPARK-7768
> URL: https://issues.apache.org/jira/browse/SPARK-7768
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Xiangrui Meng
>Priority: Critical
>
> As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it 
> would be nice to make the UDT API public in 1.5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13598) Remove LeftSemiJoinBNL

2016-03-01 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13598.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11448
[https://github.com/apache/spark/pull/11448]

> Remove LeftSemiJoinBNL
> --
>
> Key: SPARK-13598
> URL: https://issues.apache.org/jira/browse/SPARK-13598
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Reporter: Davies Liu
> Fix For: 2.0.0
>
>
> Broadcast left semi join without joining keys is already supported in 
> BroadcastNestedLoopJoin; it has the same implementation as LeftSemiJoinBNL, 
> so we should remove the latter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13605) Bean encoder cannot handle nonbean properties - no way to Encode nonbean Java objects with columns

2016-03-01 Thread Steven Lewis (JIRA)
Steven Lewis created SPARK-13605:


 Summary: Bean encoder cannot handle nonbean properties - no way to 
Encode nonbean Java objects with columns
 Key: SPARK-13605
 URL: https://issues.apache.org/jira/browse/SPARK-13605
 Project: Spark
  Issue Type: New Feature
  Components: Java API
Affects Versions: 1.6.0
 Environment: Any
Reporter: Steven Lewis
 Fix For: 1.6.0


In the current environment the only way to turn a List or JavaRDD into a 
Dataset with columns is to use Encoders.bean(MyBean.class). The current 
implementation fails if a Bean property is not a basic type or a Bean.
I would like to see one of the following:
1) Default to JavaSerialization for any Java Object implementing Serializable 
when using the bean Encoder.
2) Allow an encoder which is a Map and look up entries in it when 
encoding classes - an ideal implementation would look for the class, then any 
interfaces, and then search base classes.

The following code illustrates the issue
/**
 * This class is a good Java bean but one field holds an object
 * which is not a bean
 */
public class MyBean  implements Serializable {
private int m_count;
private String m_Name;
private MyUnBean m_UnBean;

public MyBean(int count, String name, MyUnBean unBean) {
m_count = count;
m_Name = name;
m_UnBean = unBean;
}

public int getCount() {return m_count; }
public void setCount(int count) {m_count = count;}
public String getName() {return m_Name;}
public void setName(String name) {m_Name = name;}
public MyUnBean getUnBean() {return m_UnBean;}
public void setUnBean(MyUnBean unBean) {m_UnBean = unBean;}
}
/**
 * This is a Java object which is not a bean
 * no getters or setters but is serializable
 */
public class MyUnBean implements Serializable {
public final int count;
public final String name;

public MyUnBean(int count, String name) {
this.count = count;
this.name = name;
}
}

/**
 * This code creates a list of objects containing MyBean -
 * a Java Bean containing one field which is not bean 
 * It then attempts and fails to use a bean encoder 
 * to make a DataSet
 */
public class DatasetTest {
public static final Random RND = new Random();
public static final int LIST_SIZE = 100;

public static String makeName() {
return Integer.toString(RND.nextInt());
}

public static MyUnBean makeUnBean() {
return new MyUnBean(RND.nextInt(), makeName());
}

public static MyBean makeBean() {
return new MyBean(RND.nextInt(), makeName(), makeUnBean());
}

/**
 * Make a list of MyBeans
 * @return
 */
public static List<MyBean> makeBeanList() {
List<MyBean> holder = new ArrayList<MyBean>();
for (int i = 0; i < LIST_SIZE; i++) {
holder.add(makeBean());
}
return holder;
}

public static SQLContext getSqlContext() {
SparkConf sparkConf = new SparkConf();
sparkConf.setAppName("BeanTest") ;
Option<String> option = sparkConf.getOption("spark.master");
if (!option.isDefined())// use local over nothing
sparkConf.setMaster("local[*]");
JavaSparkContext ctx = new JavaSparkContext(sparkConf) ;
return new SQLContext(ctx);
}


public static void main(String[] args) {
SQLContext sqlContext = getSqlContext();

Encoder<MyBean> evidence = Encoders.bean(MyBean.class);
Encoder<MyUnBean> evidence2 = 
Encoders.javaSerialization(MyUnBean.class);

List<MyBean> holder = makeBeanList();
 // fails at this line with
// Exception in thread "main" java.lang.UnsupportedOperationException: no 
encoder found for com.lordjoe.testing.MyUnBean

Dataset<MyBean> beanSet = sqlContext.createDataset(holder, evidence);

long count = beanSet.count();
if(count != LIST_SIZE)
throw new IllegalStateException("bad count");

}
}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13573) Open SparkR APIs (R package) to allow better 3rd party usage

2016-03-01 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174726#comment-15174726
 ] 

Sun Rui commented on SPARK-13573:
-

[~chipsenkbeil] Glad to know Toree is going to support SparkR. I tried it but can't 
figure out how to interact with SparkR. Could you describe how Toree uses these 
methods to provide interaction with SparkR?

> Open SparkR APIs (R package) to allow better 3rd party usage
> 
>
> Key: SPARK-13573
> URL: https://issues.apache.org/jira/browse/SPARK-13573
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Chip Senkbeil
>
> Currently, SparkR's R package does not expose enough of its APIs to be used 
> flexibly. As far as I am aware, SparkR still requires you to create a new 
> SparkContext by invoking the sparkR.init method (so you cannot connect to a 
> running one) and there is no way to invoke custom Java methods using the 
> exposed SparkR API (unlike PySpark).
> We currently maintain a fork of SparkR that is used to power the R 
> implementation of Apache Toree, which is a gateway to use Apache Spark. This 
> fork provides a connect method (to use an existing Spark Context), exposes 
> needed methods like invokeJava (to be able to communicate with our JVM to 
> retrieve code to run, etc), and uses reflection to access 
> org.apache.spark.api.r.RBackend.
> Here is the documentation I recorded regarding changes we need to enable 
> SparkR as an option for Apache Toree: 
> https://github.com/apache/incubator-toree/tree/master/sparkr-interpreter/src/main/resources



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13604) Sync worker's state after registering with master

2016-03-01 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-13604:
-
Description: 
If Master cannot talk with Worker for a while and then network is back, Worker 
may leak existing executors and drivers. We should SPARK-13604


  was:
If Master cannot talk with Worker for a while and then network is back, Worker 
may leak existing executors and drivers. We should 



> Sync worker's state after registering with master
> -
>
> Key: SPARK-13604
> URL: https://issues.apache.org/jira/browse/SPARK-13604
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> If Master cannot talk with Worker for a while and then network is back, 
> Worker may leak existing executors and drivers. We should SPARK-13604



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13604) Sync worker's state after registering with master

2016-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13604:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Sync worker's state after registering with master
> -
>
> Key: SPARK-13604
> URL: https://issues.apache.org/jira/browse/SPARK-13604
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> If Master cannot talk with Worker for a while and then network is back, 
> Worker may leak existing executors and drivers. We should sync worker's state 
> after registering with master.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13604) Sync worker's state after registering with master

2016-03-01 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-13604:
-
Description: 
If Master cannot talk with Worker for a while and then network is back, Worker 
may leak existing executors and drivers. We should sync worker's state after 
registering with master.


  was:
If Master cannot talk with Worker for a while and then network is back, Worker 
may leak existing executors and drivers. We should SPARK-13604



> Sync worker's state after registering with master
> -
>
> Key: SPARK-13604
> URL: https://issues.apache.org/jira/browse/SPARK-13604
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> If Master cannot talk with Worker for a while and then network is back, 
> Worker may leak existing executors and drivers. We should sync worker's state 
> after registering with master.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13604) Sync worker's state after registering with master

2016-03-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174724#comment-15174724
 ] 

Apache Spark commented on SPARK-13604:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/11455

> Sync worker's state after registering with master
> -
>
> Key: SPARK-13604
> URL: https://issues.apache.org/jira/browse/SPARK-13604
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> If Master cannot talk with Worker for a while and then network is back, 
> Worker may leak existing executors and drivers. We should sync worker's state 
> after registering with master.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-13586) add config to skip generate down time batch when restart StreamingContext

2016-03-01 Thread jeanlyn (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jeanlyn closed SPARK-13586.
---
Resolution: Invalid

> add config to skip generate down time batch when restart StreamingContext
> -
>
> Key: SPARK-13586
> URL: https://issues.apache.org/jira/browse/SPARK-13586
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: jeanlyn
>Priority: Minor
>
> If we restart a streaming application which uses checkpointing and has been stopped 
> for hours, it will generate a lot of batches in the queue, and it takes a while to 
> handle these batches. So I propose adding a config to control whether to generate the 
> down-time batches.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13604) Sync worker's state after registering with master

2016-03-01 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-13604:
-
Description: 
If Master cannot talk with Worker for a while and then network is back, Worker 
may leak existing executors and drivers. We should 


  was:
If Master cannot talk with Worker for a while and then network is back, Worker 
may leak existing executors and drivers.



> Sync worker's state after registering with master
> -
>
> Key: SPARK-13604
> URL: https://issues.apache.org/jira/browse/SPARK-13604
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> If Master cannot talk with Worker for a while and then network is back, 
> Worker may leak existing executors and drivers. We should 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13604) Sync worker's state after registering with master

2016-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13604:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Sync worker's state after registering with master
> -
>
> Key: SPARK-13604
> URL: https://issues.apache.org/jira/browse/SPARK-13604
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> If Master cannot talk with Worker for a while and then network is back, 
> Worker may leak existing executors and drivers. We should sync worker's state 
> after registering with master.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13604) Sync worker's state after registering with master

2016-03-01 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-13604:


 Summary: Sync worker's state after registering with master
 Key: SPARK-13604
 URL: https://issues.apache.org/jira/browse/SPARK-13604
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


If Master cannot talk with Worker for a while and then network is back, Worker 
may leak existing executors and drivers.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13525) SparkR: java.net.SocketTimeoutException: Accept timed out when running any dataframe function

2016-03-01 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174720#comment-15174720
 ] 

Sun Rui commented on SPARK-13525:
-

The interactive R session is for your driver; Rscript is needed for launching R 
workers.

> SparkR: java.net.SocketTimeoutException: Accept timed out when running any 
> dataframe function
> -
>
> Key: SPARK-13525
> URL: https://issues.apache.org/jira/browse/SPARK-13525
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Shubhanshu Mishra
>  Labels: sparkr
>
> I am following the code steps from this example:
> https://spark.apache.org/docs/1.6.0/sparkr.html
> There are multiple issues: 
> 1. The head, summary, and filter methods are not overridden by Spark. Hence 
> I need to call them using the `SparkR::` namespace.
> 2. When I try to execute the following, I get errors:
> {code}
> $> $R_HOME/bin/R
> R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
> Copyright (C) 2015 The R Foundation for Statistical Computing
> Platform: x86_64-pc-linux-gnu (64-bit)
> R is free software and comes with ABSOLUTELY NO WARRANTY.
> You are welcome to redistribute it under certain conditions.
> Type 'license()' or 'licence()' for distribution details.
>   Natural language support but running in an English locale
> R is a collaborative project with many contributors.
> Type 'contributors()' for more information and
> 'citation()' on how to cite R or R packages in publications.
> Type 'demo()' for some demos, 'help()' for on-line help, or
> 'help.start()' for an HTML browser interface to help.
> Type 'q()' to quit R.
> Welcome at Fri Feb 26 16:19:35 2016 
> Attaching package: ‘SparkR’
> The following objects are masked from ‘package:base’:
> colnames, colnames<-, drop, intersect, rank, rbind, sample, subset,
> summary, transform
> Launching java with spark-submit command 
> /content/smishra8/SOFTWARE/spark/bin/spark-submit   --driver-memory "50g" 
> sparkr-shell /tmp/RtmpfBQRg6/backend_portc3bc16f09b1b 
> > df <- createDataFrame(sqlContext, iris)
> Warning messages:
> 1: In FUN(X[[i]], ...) :
>   Use Sepal_Length instead of Sepal.Length  as column name
> 2: In FUN(X[[i]], ...) :
>   Use Sepal_Width instead of Sepal.Width  as column name
> 3: In FUN(X[[i]], ...) :
>   Use Petal_Length instead of Petal.Length  as column name
> 4: In FUN(X[[i]], ...) :
>   Use Petal_Width instead of Petal.Width  as column name
> > training <- filter(df, df$Species != "setosa")
> Error in filter(df, df$Species != "setosa") : 
>   no method for coercing this S4 class to a vector
> > training <- SparkR::filter(df, df$Species != "setosa")
> > model <- SparkR::glm(Species ~ Sepal_Length + Sepal_Width, data = training, 
> > family = "binomial")
> 16/02/26 16:26:46 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.net.SocketTimeoutException: Accept timed out
> at java.net.PlainSocketImpl.socketAccept(Native Method)
> at 
> java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
> at java.net.ServerSocket.implAccept(ServerSocket.java:530)
> at java.net.ServerSocket.accept(ServerSocket.java:498)
> at org.apache.spark.api.r.RRDD$.createRWorker(RRDD.scala:431)
> at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:62)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>  

[jira] [Commented] (SPARK-13073) creating R like summary for logistic Regression in Spark - Scala

2016-03-01 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174691#comment-15174691
 ] 

Joseph K. Bradley commented on SPARK-13073:
---

It sounds reasonable to provide the same printed summary in Scala, Java, and 
Python as in R.  Perhaps it can be provided as a toString method for the 
LogisticRegressionModel.summary member?
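For reference, a rough sketch of where such a printout could hang off the current ML API, assuming a `training` DataFrame with label/features columns; the fields printed are the ones available today, and an R-like toString would extend them with standard errors, Wald/chi-square statistics and p-values:
{code}
import org.apache.spark.ml.classification.{BinaryLogisticRegressionSummary, LogisticRegression}

val lr = new LogisticRegression().setMaxIter(100).setRegParam(0.01)
val model = lr.fit(training)  // training: DataFrame with "label" and "features" columns

val summary = model.summary   // training summary attached to the fitted model
println(s"Coefficients: ${model.coefficients} Intercept: ${model.intercept}")
println(s"Objective history: ${summary.objectiveHistory.mkString(", ")}")

// For binary classification the summary can be downcast for ROC metrics.
val binarySummary = summary.asInstanceOf[BinaryLogisticRegressionSummary]
println(s"Area under ROC: ${binarySummary.areaUnderROC}")
{code}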

> creating R like summary for logistic Regression in Spark - Scala
> 
>
> Key: SPARK-13073
> URL: https://issues.apache.org/jira/browse/SPARK-13073
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Samsudhin
>Priority: Minor
>
> Currently Spark ML provides coefficients for logistic regression. To evaluate 
> the trained model, tests such as the Wald test and chi-square test should be run, and their 
> results summarized and displayed like the GLM summary in R.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13030) Change OneHotEncoder to Estimator

2016-03-01 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174685#comment-15174685
 ] 

Joseph K. Bradley commented on SPARK-13030:
---

I agree this is an issue, but I think we need to keep the same number of 
categories between training & test.  A reasonable fix might be to add an option 
for creating an additional "unknown" bucket during training, and putting all 
new categories into this bucket during testing.
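A toy sketch of the "unknown bucket" idea, independent of the actual OneHotEncoder internals (all names here are made up for illustration): the category-to-index map is frozen at fit time with one extra slot, and unseen test categories land in that slot so the vector size never changes.
{code}
// Fit: learn the index from training categories; the size stays fixed afterwards.
def fitCategoryIndex(trainCategories: Seq[String]): Map[String, Int] =
  trainCategories.distinct.zipWithIndex.toMap

// Transform: unseen categories map to the reserved trailing "unknown" index.
def toIndex(index: Map[String, Int])(category: String): Int =
  index.getOrElse(category, index.size)

val index = fitCategoryIndex(Seq("a", "b", "c"))
toIndex(index)("b")  // 1
toIndex(index)("z")  // 3, the unknown bucket; one-hot vector size stays at 4
{code}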

> Change OneHotEncoder to Estimator
> -
>
> Key: SPARK-13030
> URL: https://issues.apache.org/jira/browse/SPARK-13030
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.6.0
>Reporter: Wojciech Jurczyk
>
> OneHotEncoder should be an Estimator, just like in scikit-learn 
> (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).
> In its current form, it is impossible to use when the number of categories is 
> different between the training dataset and the test dataset.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13574) Improve parquet dictionary decoding for strings

2016-03-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174656#comment-15174656
 ] 

Apache Spark commented on SPARK-13574:
--

User 'nongli' has created a pull request for this issue:
https://github.com/apache/spark/pull/11454

> Improve parquet dictionary decoding for strings
> ---
>
> Key: SPARK-13574
> URL: https://issues.apache.org/jira/browse/SPARK-13574
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Nong Li
>Priority: Minor
>
> Currently, the parquet reader will copy the dictionary value for each data 
> value. This is bad for string columns as we explode the dictionary during 
> decode. We should instead have the data values point to the safe backing 
> memory.
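A tiny illustration of the difference, unrelated to the actual Parquet reader classes: decoding by materializing a copy per data value versus letting each value reference one shared, already-decoded dictionary.
{code}
val dictionary = Array("foo", "bar", "baz")   // decoded once per page
val encodedIds = Array(0, 1, 0, 2, 0, 1)      // dictionary-encoded column data

// Current behaviour described above: one String copy per data value.
val copied: Array[String] = encodedIds.map(id => new String(dictionary(id)))

// Proposed direction: each value just points at the shared backing string.
val referenced: Array[String] = encodedIds.map(id => dictionary(id))

// Same contents either way, but the second form allocates no per-row copies.
assert(copied.toSeq == referenced.toSeq)
{code}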



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13603) SQL generation for subquery

2016-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13603:


Assignee: Davies Liu  (was: Apache Spark)

> SQL generation for subquery
> ---
>
> Key: SPARK-13603
> URL: https://issues.apache.org/jira/browse/SPARK-13603
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Generate SQL for subquery expressions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13603) SQL generation for subquery

2016-03-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174649#comment-15174649
 ] 

Apache Spark commented on SPARK-13603:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/11453

> SQL generation for subquery
> ---
>
> Key: SPARK-13603
> URL: https://issues.apache.org/jira/browse/SPARK-13603
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Generate SQL for subquery expressions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13603) SQL generation for subquery

2016-03-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13603:


Assignee: Apache Spark  (was: Davies Liu)

> SQL generation for subquery
> ---
>
> Key: SPARK-13603
> URL: https://issues.apache.org/jira/browse/SPARK-13603
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> Generate SQL for subquery expressions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13603) SQL generation for subquery

2016-03-01 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13603:
--

 Summary: SQL generation for subquery
 Key: SPARK-13603
 URL: https://issues.apache.org/jira/browse/SPARK-13603
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu


Generate SQL for subquery expressions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13596) Move misc top-level build files into appropriate subdirs

2016-03-01 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174642#comment-15174642
 ] 

Reynold Xin commented on SPARK-13596:
-

Are those dot files even possible to move?


> Move misc top-level build files into appropriate subdirs
> 
>
> Key: SPARK-13596
> URL: https://issues.apache.org/jira/browse/SPARK-13596
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Sean Owen
>
> I'd like to file away a bunch of misc files that are in the top level of the 
> project in order to further tidy the build for 2.0.0. See also SPARK-13529, 
> SPARK-13548.
> Some of these may turn out to be difficult or impossible to move.
> I'd ideally like to move these files into {{build/}}:
> - {{.rat-excludes}}
> - {{checkstyle.xml}}
> - {{checkstyle-suppressions.xml}}
> - {{pylintrc}}
> - {{scalastyle-config.xml}}
> - {{tox.ini}}
> - {{project/}} (or does SBT need this in the root?)
> And ideally, these would go under {{dev/}}
> - {{make-distribution.sh}}
> And remove these
> - {{sbt/sbt}} (backwards-compatible location of {{build/sbt}} right?)
> Other files in the top level seem to need to be there, like {{README.md}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13548) Move tags and unsafe modules into common

2016-03-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-13548.
-
   Resolution: Fixed
Fix Version/s: 2/

> Move tags and unsafe modules into common
> 
>
> Key: SPARK-13548
> URL: https://issues.apache.org/jira/browse/SPARK-13548
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2/
>
>
> Similar to SPARK-13529, this removes two top level directories.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13548) Move tags and unsafe modules into common

2016-03-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-13548:

Fix Version/s: (was: 2/)
   2.0.0

> Move tags and unsafe modules into common
> 
>
> Key: SPARK-13548
> URL: https://issues.apache.org/jira/browse/SPARK-13548
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> Similar to SPARK-13529, this removes two top level directories.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


